Tech:Incidents/2017-10-28-Database

Db2 disk space was critical and eventually reached 0 MB, taking down all of its dependencies (MediaWiki, Puppet) and leading to the site being down (503s). This is the same recurring issue as Tech:Incidents/2017-10-04-Database.

Summary

 * What services were affected?
 * All services dependent on db2 (MediaWiki and Puppet)
 * How long was there a visible outage?
 * 2017-10-28 09:07 UTC until 18:30 UTC (9 hours 23 minutes)
 * What was/were the response times by each Site Reliability Engineering member?
 * Reception123 responded at 09:23 on IRC, emailed staff notifying them of the site being down and posted about the error on Twitter.
 * NDKilla deleted the binary logs on db2, resolving the issue.
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Caused by disk space reaching 0 MB.
 * Was the incident aggravated by human contact, users or investigating?
 * Does not seem to be aggravated in any way.
 * How could response time be improved?
 * Response time was better than in previous incidents, with the error being fixed in under nine and a half hours. It could be improved further by notifying the current sysadmins of downtime more quickly, so that they can act sooner.

Timeline
All times are in UTC.
 * 09:07: The backends become sick and all wikis go down with a 503 Backend Fetch Failed error.
 * 09:23: Reception123 notifies sysadmins via IRC and email about the error.
 * 18:30: NDKilla deletes the binary logs, and the wikis come back up.
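The fix at 18:30 can be sketched with standard MySQL/MariaDB statements; the cutoff interval below is an illustration, not the exact command that was run on db2.

```sql
-- Sketch of the binlog cleanup (run on the db2 server; interval is assumed).
-- First inspect which binary logs exist and how large they are:
SHOW BINARY LOGS;
-- Then purge logs older than a chosen cutoff to reclaim disk space:
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 5 DAY;
```

`PURGE BINARY LOGS` is safer than deleting the files by hand, because the server updates its binlog index and will not confuse replication clients that still reference the removed files.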

Quick facts

 * Db2 disk space was close to critical for a long time, but it suddenly dropped from about 1.5 GB free to 0 very quickly
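A disk that sits close to critical for a long time can be caught before it hits zero with a simple threshold check. The sketch below is a minimal example; the data directory path and the 2 GB threshold are assumptions, not taken from db2's actual configuration.

```shell
#!/bin/sh
# Minimal low-disk check (sketch). DATA_DIR and THRESHOLD_MB are
# hypothetical values, not db2's real configuration.
DATA_DIR="${DATA_DIR:-/srv/mysql}"
THRESHOLD_MB="${THRESHOLD_MB:-2048}"

check_disk() {
    # $1 = free space in MB, $2 = warning threshold in MB
    if [ "$1" -lt "$2" ]; then
        echo "CRITICAL: only ${1} MB free (threshold ${2} MB)"
        return 1
    fi
    echo "OK: ${1} MB free"
}

# From a cron job this would read the real value, e.g.:
#   free_mb=$(df -Pm "$DATA_DIR" | awk 'NR==2 {print $4}')
#   check_disk "$free_mb" "$THRESHOLD_MB" || <alert sysadmins>
```

Hooking a check like this into the existing monitoring would have alerted sysadmins while db2 still had room to act.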

Conclusions

 * The incident could have been prevented if binary logs had been deleted and wikis moved to db3 before db2 reached 0 MB

Reporting

 * What services/sites were used to report the downtime?
 * Twitter, IRC (topic)
 * What other services/sites were available for reporting, but were not used?
 * Facebook

Actionables

 * Permanent fix
 * Allow CreateWiki to create databases on servers other than db2.
 * Store binlogs for a shorter amount of time (✅, changed from 14 to 5 days)
 * Others
 * Improve the response time of Site Reliability Engineering
 * Have more volunteers/operations members able to respond to these situations, and monitor the servers so that they do not reach this point. (✅ added Reception123 as operations)
 * Manually move more wikis to db3 until the first item on this list is resolved (✅ moved a few other large wikis to db3)
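The retention change (14 → 5 days) corresponds to a setting along these lines in the server configuration; the file path and section layout follow common MySQL/MariaDB conventions and are not copied from db2's actual config.

```ini
# /etc/mysql/my.cnf (sketch; the actual path/layout on db2 may differ)
[mysqld]
# Automatically purge binary logs older than 5 days (previously 14)
expire_logs_days = 5
```

With this in place the server expires old binlogs on its own, rather than relying on a sysadmin noticing the disk filling up.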

Meta

 * Who responded to this incident?
 * Reception123, NDKilla
 * What services were affected?
 * All services dependent on db2 (MediaWiki and Puppet)
 * Who, therefore, needs to review this report?
 * All Site Reliability Engineering members
 * Timestamp: ...