Tech:Incidents/2018-04-26-DataLoss

Paladox accidentally deleted and dropped testwiki.

Summary

 * What services were affected?
 * MediaWiki: a visible farm-wide outage for a period of time.
 * How long was there a visible outage?
 * ~11 minutes
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Human error. Paladox deleted the wrong database (testwiki) at 22:29 and realised the mistake immediately afterwards.
 * Was the incident aggravated by human contact, users or investigating?
 * No, the incident could not have been aggravated.

Timeline
All times are in UTC.


 * April 27
 * [22:28:45] <+paladox>	!log DELETE FROM cw_wikis WHERE wiki_dbname = "testwiki"; on db2
 * [22:29:46] <+paladox>	!log /srv/mediawiki/w/extensions/MirahezeMagic/maintenance/removeDeletedWikis.php --wiki testwiki on mw1
 * [22:29:53] <+paladox>	!log delete db testwiki from db4
 * [22:38:22] 	testwiki should be back on db2 now paladox
 * [22:39:10] <+paladox>	Voidwalker i need to move the db over to db4
 * [22:41:39] <+paladox>	Works now
 * [22:41:41] <+paladox>	Voidwalker ^^

Quick facts

 * Are there any known issues with the service in production?
 * No
 * Was the cause preventable by us?
 * If we had had a backup, we could have restored from it.
 * Have there been any similar incidents?
 * No
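
The point about backups above can be illustrated with a minimal sketch that builds a mysqldump command to run before any database is removed. The helper name, the /srv/backups directory and the file naming scheme are assumptions made for illustration and do not describe an existing tool on the farm. The mysqldump options used (--single-transaction, --result-file, --databases) are standard ones.

```python
from typing import List


def build_backup_command(dbname: str, stamp: str,
                         backup_dir: str = "/srv/backups") -> List[str]:
    """Build the argv for dumping one database to a timestamped file.

    Hypothetical helper: backup_dir and the naming scheme are assumptions.
    """
    outfile = f"{backup_dir}/{dbname}-{stamp}.sql"
    return [
        "mysqldump",
        "--single-transaction",       # consistent snapshot for InnoDB tables
        f"--result-file={outfile}",   # write the dump to this file
        "--databases", dbname,        # include CREATE DATABASE in the dump
    ]


cmd = build_backup_command("testwiki", "20180427T2229Z")
print(" ".join(cmd))
```

The command is only built here, not run. An operator would execute it (for example via subprocess.run(cmd, check=True)) and confirm the dump file exists before issuing any DROP DATABASE.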

Conclusions

 * Was the incident preventable? If so, how?
 * Yes, if the DROP DATABASE command asked for confirmation before executing.
 * Is the issue rooted in our infrastructure design?
 * State any weaknesses and how they can be addressed.
 * Once the DROP DATABASE command is sent, there is no way to cancel, and the wiki is deleted forever.
 * State any strengths and how they prevented or assisted in investigating the incident.
 * Not applicable.
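
Since DROP DATABASE itself offers no confirmation step, one way to approximate the prompt described above is a small wrapper that only produces the statement once the operator has retyped the database name. This is a minimal sketch under assumptions: the function name is hypothetical, it is not part of MediaWiki or MirahezeMagic, and it assumes database names contain only letters, digits and underscores.

```python
import re


def build_drop_statement(dbname: str, typed_confirmation: str) -> str:
    """Return a DROP DATABASE statement only if the name was retyped exactly.

    Hypothetical guard for illustration; not an existing tool.
    """
    # Assumption: database names are alphanumeric plus underscore, which
    # also makes it safe to interpolate the name into the statement.
    if not re.fullmatch(r"[A-Za-z0-9_]+", dbname):
        raise ValueError(f"unexpected database name: {dbname!r}")
    # Refuse to continue unless the operator retyped the exact name.
    if typed_confirmation != dbname:
        raise PermissionError("confirmation did not match; nothing dropped")
    return f"DROP DATABASE `{dbname}`;"


print(build_drop_statement("testwiki", "testwiki"))
```

In interactive use, typed_confirmation would come from a prompt such as input("Retype the database name to confirm: "). A mismatch raises an error before any SQL is sent to the server.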

Actionables

 * No actionable known.

Meta

 * Who responded to this incident?
 * Paladox
 * What services were affected?
 * MediaWiki, with a visible outage for a period of time.
 * testwiki for data loss.
 * Who, therefore, needs to review this report?
 * All Site Reliability Engineering members.
 * Timestamp: ...