Tech:Incidents/2019-02-28-mariadb-outage

Summary
Provide a summary of the incident:
 * What services were affected?
 * Grafana, IcingaWeb2, MediaWiki, MariaDB and Phabricator.
 * How long was there a visible outage?
 * 20 minutes
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Yes. It was a supply/demand issue: the database server (db4) ran out of disk space.
 * Was the incident aggravated by human contact, users or investigating?
 * No

Timeline

 * 15:14 icinga-miraheze: Icinga started reporting that mw* was depooled from cp*
 * 15:16 Voidwalker: Voidwalker starts notifying Paladox about the Icinga reports that mw* is depooled.
 * 15:16 Paladox: I start investigating why mw* is depooled.
 * 15:17 Paladox: I look in the logs and notice MySQL errors. I then realise that the database server has run out of storage.
 * 15:19 Paladox: I stop mysql on db4 and prepare to remove the bin logs to free space.
 * 15:25 Paladox: I remove the bin logs to free space (after searching to see if there was a safe way to do that)
 * 15:41 icinga-miraheze: Icinga reports that the database is back. It took a while to start.
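
For comparison with the manual removal in the timeline, MariaDB also provides a `PURGE BINARY LOGS BEFORE` statement that removes old binary logs while the server is running. The sketch below only builds that statement as a string; it is not what was done during this incident (MySQL was stopped and the files were removed by hand). The function name and the 7-day retention value are assumptions for illustration, and on a server with replicas it would be worth confirming that every replica has already read the logs being purged.

```python
from datetime import datetime, timedelta


def purge_binlogs_statement(now, keep_days=7):
    """Build a PURGE BINARY LOGS statement that keeps the last keep_days days.

    Hypothetical helper: the name and the default retention are assumptions,
    not values taken from the incident report.
    """
    if keep_days < 0:
        raise ValueError("keep_days must not be negative")
    cutoff = now - timedelta(days=keep_days)
    # MariaDB accepts a datetime literal in 'YYYY-MM-DD hh:mm:ss' form here.
    return f"PURGE BINARY LOGS BEFORE '{cutoff:%Y-%m-%d %H:%M:%S}';"


if __name__ == "__main__":
    # Print the statement so it can be reviewed before running it by hand.
    print(purge_binlogs_statement(datetime.now(), keep_days=7))
```

Printing the statement rather than executing it keeps the sketch free of any database driver and leaves the decision to run it with the operator.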

Quick facts
Provide any relevant quick facts that may be relevant:
 * Are there any known issues with the service in production?
 * The database server is low on disk space.


 * Was the cause preventable by us?
 * With more foresight, potentially, by moving the search index to ElasticSearch.


 * Have there been any similar incidents?
 * Yes. It has run out of storage before.
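
Since the server is known to be low on disk space and has filled up before, an early warning on disk usage could give more time to react. The following is a minimal sketch of such a check, assuming thresholds of 80% (warning) and 90% (critical) and a MariaDB data directory at `/var/lib/mysql`; these values and the function names are illustrative assumptions, not the existing Icinga configuration.

```python
import shutil


def classify_usage(used_bytes, total_bytes, warn_pct=80.0, crit_pct=90.0):
    """Return 'OK', 'WARNING' or 'CRITICAL' for a given disk usage.

    The thresholds are assumed values chosen for illustration.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = used_bytes / total_bytes * 100
    if used_pct >= crit_pct:
        return "CRITICAL"
    if used_pct >= warn_pct:
        return "WARNING"
    return "OK"


def check_path(path):
    """Classify the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    # /var/lib/mysql is an assumed data directory; adjust to the real one.
    print(check_path("/var/lib/mysql"))
```

Keeping the classification separate from the `shutil.disk_usage` call makes the thresholds easy to test without touching a real filesystem.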

Conclusions
Provide conclusions that have been drawn from this incident only:
 * Was the incident preventable? If so, how?
 * Yes. By moving the search index to ElasticSearch, we could reduce the space used on the database server.


 * Is the issue rooted in our infrastructure design?
 * No.


 * State any weaknesses and how they can be addressed.
 * None.


 * State any strengths and how they prevented or assisted in investigating the incident.
 * None.

Actionables

 * Move search index to ElasticSearch which is tracked at https://phabricator.miraheze.org/T4024.

Meta

 * Who responded to this incident?
 * Paladox


 * What services were affected?
 * Grafana, IcingaWeb2, MariaDB, MediaWiki and Phabricator.


 * Who, therefore, needs to review this report?
 * Site Reliability Engineering

Paladox (talk) 23:13, 28 February 2019 (UTC)
 * Timestamp.