Tech:Incidents/2017-04-20-Database

DRAFT

For Operations to fill in with the exact details

Summary
To be filled in by Operations
 * What services were affected?
   * db2
   * All services that relied on db2 (MediaWiki, Piwik)
 * How long was there a visible outage?
   * From 20/04/2017 1:50 UTC until 16:44 UTC (approximately 15 hours)
 * What were the response times of each Operations member?
   * NDKilla responded at 2:46 UTC and tried unsuccessfully to recover the database
   * John responded at 8:59 UTC and recovered the database for 6 minutes, after which it crashed again
   * Southparkfan responded at 14:25 UTC and successfully fixed the issue
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Was the incident aggravated by human contact, users or investigating?
 * How could response time be improved?

Timeline
All times are in UTC. DRAFT
 * April 20
   * 1:50: Void notices this error: "(Cannot access the database: Cannot access the database: Unknown database 'delete1wiki' (81.4.125.112))" and informs sysadmins via IRC
   * 2:46: PuppyKun sees the error and attempts to resolve it. Details to be provided by sysadmins/Operations
   * 8:59: John recovers the database
   * 9:05: Database connection fails again
   * 14:25: Southparkfan starts investigating the issue on db2
   * 15:58: Southparkfan restores db2 and temporarily makes all wikis read-only
   * 16:49: Recovery is successful and read-only is removed from all wikis
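
The intervals between the timeline entries can be checked with a short script. This is only a sketch for verifying the arithmetic: the `TIMELINE` list, the `at` and `elapsed` helpers and the short event labels are hypothetical names introduced here, and the times are copied from the draft timeline above, so they change if Operations corrects it.

```python
from datetime import datetime, timedelta, timezone

# Timeline entries for 20 April 2017 (UTC), copied from the draft list above.
# The labels are shortened paraphrases, not official wording.
TIMELINE = [
    ("01:50", "error first reported on IRC"),
    ("02:46", "first recovery attempt"),
    ("08:59", "database recovered"),
    ("09:05", "database connection fails again"),
    ("14:25", "investigation on db2 starts"),
    ("15:58", "db2 restored, wikis set read-only"),
    ("16:49", "read-only removed"),
]


def at(hhmm: str) -> datetime:
    """Parse an HH:MM string as a UTC datetime on the incident date."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2017, 4, 20, hour, minute, tzinfo=timezone.utc)


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two HH:MM timestamps on the incident date."""
    return at(end) - at(start)


if __name__ == "__main__":
    # Time from the first report to each later event.
    first = TIMELINE[0][0]
    for time, label in TIMELINE[1:]:
        print(f"{time}  +{elapsed(first, time)}  {label}")
```

Measured from the first report at 1:50 to the removal of read-only at 16:49, the incident spans just under 15 hours, and the first sysadmin response at 2:46 came 56 minutes after the report.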

Quick facts
To be filled in by Operations

Conclusions
To be filled in by Operations

Actionables
To be filled in by Operations

Meta

 * Who responded to this incident?
   * Southparkfan, John, NDKilla
 * What services were affected?
   * All services that rely on the database server
   * MediaWiki servers frequently failed to load
   * Cache proxy servers frequently displayed 503/504 errors due to database downtime or latency
   * Piwik (analytics) was unable to collect/store information in its database during part of the incident
 * Who, therefore, needs to review this report?
   * All Operations members
 * Timestamp: ...