Tech:Incidents/2017-10-04-Database

DRAFT Db2's disk space was critical and eventually reached 0 MB, causing all of its dependencies (MediaWiki, Puppet) to fail, which led to the site being down (503s).

Summary

 * What services were affected?
 * All services that were dependent on db2 (MediaWiki and Puppet)
 * How long was there a visible outage?
 * 2017-10-04 09:11 UTC until 13:45 UTC (4.5 hours)
 * What were the response times of each Site Reliability Engineering member?
 * revi responded at 12:40 on IRC and emailed staff notifying them of the site being down
 * revi and NDKilla attempted to find the source of the error
 * Southparkfan investigated the error, concluded that the cause was db2's disk being full, and deleted binary logs to free up space.
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Caused by db2's free disk space dropping to 0 MB.
 * Was the incident aggravated by human contact, users or investigating?
 * It does not appear to have been aggravated in any way.
 * How could response time be improved?
 * Response time was better than in previous incidents, with the error fixed in about 4.5 hours. It could be improved further if the current sysadmins were notified of downtime sooner, so that they could act on it earlier.

Timeline
All times are in UTC.
 * 09:11: The backends are sick and all wikis go down with a 503 Backend Fetch Failed error.
 * 12:40: revi notifies sysadmins via IRC and email about the error.
 * 13:37: Southparkfan concludes that db2's disk is full and that this is causing the errors.
 * 13:44: Southparkfan deletes some binary logs, which frees enough space for the wikis to come back up.

Quick facts

 * Db2 had been close to critical for a long time, but its free space then dropped from about 2 GB to 0 very quickly.
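
A sudden drop like this is hard to catch with a fixed free-space threshold alone. As a rough illustration, the sketch below projects how long the remaining space would last at the recently observed fill rate. The function name and the sample figures are hypothetical and are not measurements from db2; it assumes a simple linear projection between two readings.

```python
def hours_until_full(free_then_mb, free_now_mb, hours_elapsed):
    """Linearly project the hours left until free space reaches zero.

    Returns None when free space is not shrinking or the interval is invalid.
    """
    consumed_mb = free_then_mb - free_now_mb
    if consumed_mb <= 0 or hours_elapsed <= 0:
        return None
    rate_mb_per_hour = consumed_mb / hours_elapsed
    return free_now_mb / rate_mb_per_hour


# Hypothetical readings: 2048 MB free one hour ago, 1024 MB free now.
remaining = hours_until_full(2048, 1024, 1)
```

Alerting on a short projected time to full, in addition to the absolute amount of free space, would give responders some warning when the fill rate suddenly increases.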

Conclusions

 * To be filled in by Site Reliability Engineering

Reporting

 * What services/sites were used to report the downtime?
 * Twitter, IRC (topic)
 * What other services/sites were available for reporting, but were not used?
 * Facebook

Actionables

 * Permanent fix
 * Allow CreateWiki to create databases on servers other than db2.
 * Store binlogs for a shorter amount of time (✅, changed from 14 to 5 days)
 * Others
 * Improve the response time of Site Reliability Engineering
 * Have more volunteers/operations members available to respond to these situations, and monitor the servers so that they do not reach this point.
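
The monitoring actionable above could be approached with a periodic free-space check. The following is a minimal sketch only, not the monitoring actually deployed: the thresholds, the function names and the checked path are all assumptions chosen for illustration.

```python
import shutil

# Hypothetical thresholds; illustrative values, not taken from the incident.
WARN_FREE_BYTES = 10 * 1024**3  # warn below 10 GiB free
CRIT_FREE_BYTES = 2 * 1024**3   # critical below 2 GiB free


def classify_free_space(free_bytes, warn=WARN_FREE_BYTES, crit=CRIT_FREE_BYTES):
    """Return 'OK', 'WARNING' or 'CRITICAL' for a given amount of free space."""
    if free_bytes < crit:
        return "CRITICAL"
    if free_bytes < warn:
        return "WARNING"
    return "OK"


def check_path(path="/"):
    """Check the filesystem containing `path`; return (status, free_bytes)."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free), usage.free


if __name__ == "__main__":
    # "/" is a placeholder; a database host would check its data partition.
    status, free = check_path("/")
    print(f"{status}: {free / 1024**2:.0f} MiB free")
```

Run from cron or an existing monitoring agent, a check like this only helps if a WARNING or CRITICAL result actually notifies someone who can act on it.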

Meta

 * Who responded to this incident?
 * Revi, NDKilla, Southparkfan
 * What services were affected?
 * All services that were dependent on db2 (MediaWiki and Puppet)
 * Who, therefore, needs to review this report?
 * All Site Reliability Engineering members
 * Timestamp: ...