@andrew (Admin)
First off, let me apologize for the downtime we saw yesterday. For approximately 4 hours, about 40% of users were unable to reliably access the site.
There were multiple issues that happened at once, resulting in a perfect storm that made debugging quite difficult: once one issue was fixed, another popped up. Here's what happened:
- A configuration issue with our host removed a machine dedicated to sending forum emails to users. Rather than the emails simply going unsent, our main machine (the one that powers the site) took over sending them, which tied up resources that would otherwise be dedicated to running the site smoothly. The forums were temporarily disabled while this was debugged, and are now online again. I'm discussing the issue with our hosting provider to ensure that it does not happen again.
- We experienced a minor denial-of-service attack from a user who repeatedly attempted to upload the same maliciously crafted 5GB file. This user has been banned, and we'll be implementing better protection against this soon.
- A few users seem to have written a script to create tens of thousands of pages on their behalf. The script also refreshed the entire list after each creation, which took significant server resources away from legitimate users. These users have been given the option to export their notebook, but are banned from creating new pages so that people who legitimately use the site get the resources they need.
- There still seems to be some kind of configuration issue with how our host is managing the new Postgres database we introduced a few days ago. I am actively working with them to solve these issues, and fighting slowness on the site along the way.
None of these issues on its own would have caused such long downtime, but all of them together caused problems across the machines, the database, and routing through our hosting provider.
Again, I'm very sorry for this downtime. As a silver lining, the database upgrade we did the other day made it easier to diagnose pieces of each issue and respond more quickly.
The issues seem to be resolved now, and I will continue to monitor server uptime and performance. I will also be bringing aboard an expert in infrastructure to better protect against incidents like this in the future.
Thank you everyone, and happy worldbuilding.
andrew (Our Supreme Lord and Overseer)