Tech:Incidents/2019-02-28-mariadb-outage

Summary

Provide a summary of the incident:

  • What services were affected?
    • Grafana, IcingaWeb2, MariaDB, MediaWiki and Phabricator.
  • How long was there a visible outage?
    • 20 minutes
  • Was it caused by human error, supply/demand issues or something unknown currently?
    • Supply/demand issue: the database server (db4) ran out of disk space.
  • Was the incident aggravated by human contact, users or investigating?
    • No

Timeline

  • 15:14 icinga-miraheze: Icinga starts reporting that mw* is depooled from cp*.
  • 15:16 Voidwalker: Voidwalker starts notifying Paladox about the Icinga reports that mw* is depooled.
  • 15:16 Paladox: I start investigating why mw* is depooled.
  • 15:17 Paladox: I look in the logs and notice MySQL errors. I then realise that the db server has run out of storage.
  • 15:19 Paladox: I stop MySQL on db4 and prepare to remove the bin logs to free space.
  • 15:25 Paladox: I remove the bin logs to free space (after searching to see if there was a safe way to do that).
  • 15:41 icinga-miraheze: icinga-miraheze reports that the db is back. It took a while to start.
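
Before removing bin logs to free space, it can help to know how much space they actually occupy. The following is a minimal Python sketch, not the procedure used during this incident. The function name `binlog_usage` and the default file pattern `mysql-bin.[0-9]*` are assumptions for illustration; the real base name and directory depend on the server's `log_bin` configuration.

```python
from pathlib import Path


def binlog_usage(directory, pattern="mysql-bin.[0-9]*"):
    """Return (file_count, total_bytes) for files in `directory` matching `pattern`.

    The default pattern is a hypothetical bin log naming scheme; it matches
    numbered log files and skips the accompanying ".index" file.
    """
    files = [p for p in Path(directory).glob(pattern) if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes
```

For example, `binlog_usage("/path/to/datadir")` returns the number of matching files and their combined size in bytes, where the directory path is a placeholder for wherever the bin logs are stored.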

Quick facts

Provide any quick facts that may be relevant:

  • Are there any known issues with the service in production?
    • The database server is low on disk space.
  • Was the cause preventable by us?
    • Potentially, with more foresight. Moving the search index to Elasticsearch would reduce the space used on the database server.
  • Have there been any similar incidents?
    • Yes. The database server has run out of storage before.
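
Since the database server has run out of storage more than once, an early warning on free space is one possible safeguard. The following is a minimal Python sketch of such a check, not the Icinga check in use. The function names and the 10% threshold are assumptions chosen for illustration.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem containing `path` that is free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path="/", min_free=0.10):
    """Return True when the free fraction drops below `min_free` (hypothetical threshold)."""
    return free_fraction(path) < min_free
```

A monitoring job could call `is_low_on_space()` with the path of the database data directory and alert when it returns `True`, leaving time to free space before the server stops accepting writes.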

Conclusions

Provide conclusions that have been drawn from this incident only:

  • Was the incident preventable? If so, how?
    • Yes. By moving the search index to Elasticsearch, we could reduce the space used on the db server.
  • Is the issue rooted in our infrastructure design?
    • No.
  • State any weaknesses and how they can be addressed.
    • None.
  • State any strengths and how they prevented or assisted in investigating the incident.
    • None.

Actionables

Meta

  • Who responded to this incident?
    • Paladox
  • What services were affected?
    • Grafana, IcingaWeb2, MariaDB, MediaWiki and Phabricator.
  • Who, therefore, needs to review this report?
    • Operations
  • Timestamp.

Paladox (talk) 23:13, 28 February 2019 (UTC)