Tech:Incidents/2016-10-18-Database

Summary
A search query crashed MariaDB on October 18, 2016. The crash left the databases corrupted, and MariaDB refused to restart.

Timeline
All times are in UTC.
 * October 18
 * 17:17: MariaDB crashes
 * 17:18 PuppyKun: noticed two DB errors and repeated 503s, notified JohnLewis
 * 17:18 JohnLewis: acknowledges errors and tries to restart MariaDB
 * ~17:30 JohnLewis: reboots db2
 * 19:16 JohnLewis: alerts Southparkfan via e-mail
 * 19:19 Southparkfan joins IRC
 * 19:46 Southparkfan: tries to find out which databases are corrupt using "mysqlcheck --all-databases", but needs to get MariaDB running first
 * 20:24 Southparkfan: rsyncing /srv/mariadb on db2 to another location on the same server, as a backup measure before touching the actual databases
 * 21:10-21:41: JohnLewis and Southparkfan issue a Bacula job to restore the most recent database backup (2 days old) on db1
 * 22:09 Southparkfan: reports db2 backup is complete
 * 22:45: Bacula restore progress is 39%
 * October 19
 * 01:05: Bacula restore is done
 * 01:10 Southparkfan: tries to restart MariaDB on db1 with the Bacula backup, but MariaDB refuses to restart here as well
 * 01:54 Southparkfan goes to sleep
 * 03:58 NDKilla: set InnoDB force recovery to level 6 on db2, MariaDB came alive
 * 05:02 NDKilla: stopped MariaDB on db2
 * 11:01: NDKilla and Southparkfan are having a conversation about level 6 InnoDB recovery
 * 11:29-11:57 Southparkfan: manually running mysqld_safe on db2 with level 6 InnoDB recovery, dumping metawiki on db2 and working on getting scp to work between db2 and db1
 * 11:57 Southparkfan: moved corrupt Bacula backup out of the way, re-installed MariaDB on db1
 * ~11:58 Southparkfan: started importing metawiki dump on db1
 * ~11:58 Southparkfan: changed DB server to db1 on mw1, depooled mw2
 * 12:00: confirmed metawiki was back online
 * 12:08 Southparkfan: committed db1 patch, repooled mw2
 * 12:00-12:30 Southparkfan: recovered several other wikis to check that the method was working
 * 13:09 Southparkfan: starting the first recovery batch. Dumped 267 wikis, transferred them to db1 and started importing them all there
 * From this point on, Southparkfan was running batches continuously (~200-250 wikis per batch).
 * 14:50: 505 wikis imported
 * 14:55 Southparkfan: deleted Bacula backup from db1
 * 16:41: 1168 wikis imported
 * 17:50: all wikis (except All The Tropes) imported
 * 18:00: Operations decided to continue operating in read-only mode; once ATT has been backed up to db1, work will start on transferring db1's SQL files back to db2
 * 18:55 Southparkfan: a compressed version of ATT's database dump has been transferred to db1
 * 19:00-19:34: Southparkfan was not sure whether all databases had been imported; some valuable time was lost here
 * 19:41 Southparkfan: gives John an OK for re-installing db2
 * 19:42-~21:55 JohnLewis: re-installing db2, and importing all files (except ATT) from db1 to db2
 * 21:59 JohnLewis: removed read-only flag for all non-ATT wikis
 * 22:36 JohnLewis: removed read-only flag for ATT and global site notice
 * 22:43: JohnLewis declares 'migration done'
 * 22:43: incident is over
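
The batch recovery described in the timeline (dump a few hundred wikis on db2, transfer them, import them on db1) could be planned with something like the sketch below. This is an illustration only, not the procedure that was actually run: the timeline does not name the dump tool, so mysqldump is an assumption, and the helper names `batches` and `dump_command`, the dump directory and the wiki names are hypothetical. The code only builds the command lines and does not execute anything.

```python
from typing import Iterator, List, Sequence


def batches(wikis: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` database names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(wikis), size):
        yield list(wikis[start:start + size])


def dump_command(db: str, dump_dir: str) -> List[str]:
    """Build (but do not run) a mysqldump command for one database.

    Assumption: mysqldump is the dump tool. Without --databases the dump
    contains no CREATE DATABASE statement, so the target database must
    already exist on the receiving server before the import.
    """
    return ["mysqldump", f"--result-file={dump_dir}/{db}.sql", db]


if __name__ == "__main__":
    # Hypothetical wiki list; the real list would come from the server.
    wikis = [f"wiki{i}" for i in range(267)]
    for number, group in enumerate(batches(wikis, 250), start=1):
        print(f"batch {number}: {len(group)} wikis")
        for db in group[:2]:  # show only the first two commands per batch
            print("  " + " ".join(dump_command(db, "/srv/dumps")))
```

Keeping the batch size as a parameter makes it easy to count completed batches against the full database list, which would help with the uncertainty noted at 19:00-19:34 about whether every database had been imported.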