Tech:Incidents/2016-12-Database

DRAFT

The OOM killer on db2.miraheze.org killed the MariaDB process, triggering (slow) InnoDB recovery. Unknown ETAs, failed migrations, slow hardware, human errors, lack of appropriate VPSs by RamNode and SPOFs in the database infrastructure caused 4 days of downtime and slow sites.

Summary

 * What services were affected?
 * Primarily the MariaDB server on db2.miraheze.org
 * All services that relied on MariaDB (MediaWiki, Phabricator and Piwik)
 * How long was there a visible outage?
 * Sporadic outages/uptime from late 12/5 to afternoon 12/8 (3 days)
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * The MariaDB server on db2.miraheze.org was killed by the OS for using too much memory (Linux OOM killer)
 * MariaDB attempted to restart and recover data (it was killed while actively reading/writing)
 * The fact that db2 is hosted on an SSD-cached server instead of pure-SSD may have resulted in increased recovery time
 * Was the incident aggravated by human contact, users or investigating?
 * The incident was not initially caused by any human actions
 * MariaDB was killed during recovery in an attempt to change how it started, so that data on the server could be copied to db3.miraheze.org (a temporary, emergency server) in an effort to bring the entire website back online faster
 * Users did not aggravate the incident, although disabling ReadOnly on wikis caused the databases to have to be backed up again, this did not contribute to the incident itself.
 * If we did not stop MariaDB during the recovery, it would have taken approximately 2 hours to recover. This would cut down the downtime and issues by multiple days, so our behaviour may have made the problems worse. However, we could not check the status of the recovery process (https://jira.mariadb.org/browse/MDEV-6456), a fix for that was introduced in MariaDB 10.1.1 (https://github.com/MariaDB/server/commit/fa0628a6aca07af8e74be539e55940d891914041).
 * The lack of CVZ nodes in NL forced us to use a node in NYC, while this node has better performance than NL CVZ nodes, the network latency between NL and NYC caused many 503 errors.

Timeline
All times are in UTC.
 * December 5
 * 19:40: MariaDB was killed by the Linux OOM killer; Dec 5 19:40:52 db2 kernel: [12757747.830942] Out of memory in UB 58027: OOM killed process 13705 (mysqld) score 0 vm:4743528kB, rss:3056820kB, swap:376980kB. MariaDB was restarted automatically, and since this was a non-graceful shutdown, InnoDB crash recovery was attempted.
 * 19:41 Southparkfan: noticed MariaDB crashed, found out that the OOM killer was the cause.
 * 20:16: since the recovery of db2 is not expected to finish in a reasonable amount of time, Southparkfan decides that a new database server should be bought to replace db2 temporarily
 * 20:29 Southparkfan: put all wikis in read-only mode
 * 20:33 Southparkfan: stopped the InnoDB recovery, set innodb_force_recovery to 6, began dumping all databases
 * 21:14 John: bought db3.miraheze.org, a 2048MB CVZ in NYC
 * 21:50: NDKilla and John installed db3.miraheze.org, while Southparkfan is still focussing on making a backup of the databases
 * 22:13-23:19: the first batches were being uploaded/imported to db3
 * 23:55: after Southparkfan explained how to proceed from here, he left
 * 23:55 (December 5)-04:00 (December 6): NDKilla imported all databases on db3. MariaDB crashed while dumping allthetropeswiki, so allthetropeswiki was imported without the searchindex and objectcache tables. All wikis were editable again, except allthetropeswiki.


 * December 6
 * 15:00-15:10: Southparkfan was preparing to migrate back to db2
 * 15:20 Southparkfan: put all wikis in read-only mode to migrate back to db2
 * ~18:30 import (except allthetropeswiki) finished
 * 18:38 Southparkfan: restarting MariaDB a few times suddenly made it crash, InnoDB recovery was initiated automatically
 * 18:40: Southparkfan was not able to look into the issue for more time. He decided to switch MediaWiki to db3 again, removed read-only mode for all wikis (this time also for allthetropeswiki), and left (lack of time)


 * December 7
 * ~15:00-15:40: Southparkfan was looking into a performance comparison of db2 and db3. He noticed that db3's disk write speeds are 7x higher than db2's (33.8MB/s vs 4.7MB/s), and opened a ticket with RamNode to get db2 to a faster node. That was no problem, and RamNode started a migration.
 * 18:06: RamNode completed the db2 migration, well over their ETA of 45-60 minutes (it took 200 minutes). Southparkfan was not able to attempt another migration, so we did not try another migration.
 * The disk write speed improved by 1.8x. That is rather a disappointment, but after a conversation with RamNode and a discussion between Site Reliabilty Engineering, we decided to stay with db2.
 * The rest of the day was spent looping over the events of December 5 and December 6 to draw a conclusion regarding the cause of the database crash and failed migration. After analyzing the problems and errors made by Site Reliabilty Engineering, it was agreed Southparkfan would perform another migration the next day.


 * December 8
 * 17:45 Southparkfan: put all wikis in read-only mode for database server migration. He forgot to set the read-only flag in my.cnf, so he quickly killed Phabricator and Piwik by introducing errors in their configuration
 * 19:25: all databases (except allthetropeswiki) were imported
 * 19:50-20:10: allthetropeswiki imported
 * 20:13 Southparkfan: switched MediaWiki to db2. Asked several people to check the data integrity by visiting articles and special pages on multiple wikis
 * 20:35 Southparkfan: disabled read-only mode on wikis and switched Phabricator to db2

Quick facts

 * Regardless of the actual destructive process on db2, OOM killers on db* servers have been a problem since late 2015. Increasing RAM was mostly making the problem occurring less frequent.
 * The cause of the incident could have been prevented by using a db server with more memory or tuning MariaDB better, but we are trying to avoid increasing costs and we actively monitor memory usage on the server.
 * On October 18, a similar crash happened on db2, where recovery had to be applied manually: Tech:Incidents/2016-10-18-Database

Conclusions

 * The only way this issue would have been easily preventable was by getting extra RAM for db2, but the additional expenses were not reasonable.
 * Part of the issue may be choosing to host a database server on an SSD-cached server, but pure-SSD servers are exponentially more expensive for storage space, which we also need
 * Inaccurate estimations of MariaDB memory usage on db2 and/or RAM limit may have been contributing factors
 * The incident could potentially have been cut down to a few hours if we did nothing and let InnoDB recovery do its work, this was not done because we couldn't track the status of the "Reading tablespaces from the .ibd files" part (a feature introduced in MariaDB 10.1). Had we known it would take ~60 more minutes to recover, we could have saved a lot of time, stress and money.
 * An accurate treatment of the data and servers ensured we had zero data loss.

Actionables

 * Investigate how the OOM killer was triggered;
 * Upgrade to MariaDB 10.1;
 * Get a second database server (preferably outside RamNode) for data redundancy and failover purposes;
 * Not everyone in the Site Reliabilty Engineering team is in agreement this should happen.

Meta

 * Who responded to this incident?
 * Southparkfan, John, NDKilla
 * What services were affected?
 * All services that rely on the database server
 * MediaWiki servers frequently failed to load
 * Cache proxy servers frequently displayed 503/504 errors due to database downtime or latency
 * Phabricator was repeatedly/temporarily down throughout the incident
 * Piwik (analytics) was unable to collect/store information in it’s database during part of the incident
 * Who, therefore, needs to review this report?
 * All Site Reliabilty Engineering members
 * Timestamp: ...