Tech:Incidents/2018-04-26-DataLoss

Summary

 * What services were affected?
 * MediaWiki, visible farm-wide outage for a period of time.
 * How long was there a visible outage?
 * ~11 minutes
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Human error. Paladox deleted the wrong database (testwiki) at 22:29 and realised immediately afterwards.
 * Was the incident aggravated by human contact, users or investigating?
 * No, it could not have been aggravated at all.

Timeline
All times are in UTC.


 * April 27
 * [22:28:45] <+paladox>	!log DELETE FROM cw_wikis WHERE wiki_dbname = "testwiki"; on db2
 * [22:29:46] <+paladox>	!log /srv/mediawiki/w/extensions/MirahezeMagic/maintenance/removeDeletedWikis.php --wiki testwiki on mw1
 * [22:29:53] <+paladox>	!log delete db testwiki from db4
 * [22:38:22] 	testwiki should be back on db2 now paladox
 * [22:39:10] <+paladox>	Voidwalker i need to move the db over to db4
 * [22:41:39] <+paladox>	Works now
 * [22:41:41] <+paladox>	Voidwalker ^^

Quick facts

 * Are there any known issues with the service in production?


 * No


 * Was the cause preventable by us?


 * If we had had a backup, we could have restored from it.


 * Have there been any similar incidents?

Conclusions

 * Was the incident preventable? If so, how?


 * Yes, if DROP DATABASE had required a confirmation prompt.


 * Is the issue rooted in our infrastructure design?
 * State any weaknesses and how they can be addressed.
 * State any strengths and how they prevented or assisted in investigating the incident.
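
The confirmation prompt mentioned above could look roughly like the following. This is only a sketch, not something currently deployed: the function names are hypothetical, and it assumes the operator is asked to re-type the database name before any DROP DATABASE statement is produced.

```python
def confirm_drop(dbname: str, typed: str) -> bool:
    """Return True only if the operator re-typed the exact database name."""
    return typed.strip() == dbname


def build_drop_statement(dbname: str, typed: str) -> str:
    """Build a DROP DATABASE statement, refusing unless the name was confirmed.

    Hypothetical helper for illustration; it only builds the SQL text and
    does not connect to any server.
    """
    if not confirm_drop(dbname, typed):
        raise ValueError(
            f"confirmation {typed!r} does not match {dbname!r}; refusing to drop"
        )
    # The name is quoted with backticks, so a backtick inside it is rejected.
    if "`" in dbname:
        raise ValueError("invalid database name")
    return f"DROP DATABASE `{dbname}`;"
```

In interactive use, the second argument would come from something like `input("Type the database name to confirm: ")`, so a mistyped or mistaken name stops the deletion before it reaches the server.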

Actionables

 * Generate a backup.


 * Make an SQL dump of the database you're going to delete before deleting it.
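
A minimal sketch of the dump-before-delete step follows. It assumes `mysqldump` is installed on the database host and that credentials come from the client configuration. The backup directory and function names are hypothetical and not part of the existing tooling.

```python
import subprocess
from pathlib import Path


def dump_command(dbname: str, out_path: Path) -> list[str]:
    """Return the mysqldump invocation for a single database."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={out_path}",
        dbname,
    ]


def dump_before_delete(dbname: str, backup_dir: Path) -> Path:
    """Dump dbname into backup_dir and return the dump path.

    Raises if mysqldump fails or the dump is missing or empty, so the
    caller never proceeds to the deletion without a usable dump.
    """
    out_path = backup_dir / f"{dbname}.sql"
    subprocess.run(dump_command(dbname, out_path), check=True)
    if not out_path.is_file() or out_path.stat().st_size == 0:
        raise RuntimeError(f"dump of {dbname} is missing or empty: {out_path}")
    return out_path
```

The deletion would then only be run after `dump_before_delete` returns without raising, for example `dump_before_delete("testwiki", Path("/path/to/backups"))`, where the directory is a placeholder.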

Meta

 * Who responded to this incident?
 * Paladox
 * What services were affected?
 * MediaWiki, visibly for a period of time.
 * testwiki for data loss.
 * Who, therefore, needs to review this report?
 * Timestamp.