Tech:Incidents/2017-10-28-Database

Disk space on db2 had been critical and eventually reached 0 MB, taking down all of its dependents (MediaWiki, Puppet) and causing the site to go down (503s). This is a recurrence of the issue described in Tech:Incidents/2017-10-04-Database.

Summary

  • What services were affected?
    • All services that were dependent on db2 (MediaWiki and Puppet)
  • How long was there a visible outage?
    • 2017-10-28 09:07 UTC until 18:30 UTC (9 hours 23 minutes)
  • What was/were the response times by each Operations member?
    • Reception123 responded at 09:23 on IRC, emailed staff notifying them of the site being down and posted about the error on Twitter.
    • NDKilla deleted binary logs on db2 at 18:30, which resolved the issue.
  • Was it caused by human error, supply/demand issues or something unknown currently?
    • Caused by disk space on db2 reaching 0 MB.
  • Was the incident aggravated by human contact, users or investigating?
    • Does not seem to be aggravated in any way.
  • How could response time be improved?
    • Response time was better than in the previous incident, with the error being fixed in about 9 hours. It could be improved further if current sysadmins were notified of the downtime more quickly and were able to act on it promptly.

Timeline

All times are in UTC.

  • 09:07: The backends are marked sick and all wikis go down with a 503 Backend Fetch Failed error.
  • 09:23: Reception123 notifies sysadmins via IRC and email about the error.
  • 18:30: NDKilla deletes binary logs on db2 and the wikis come back up.
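
The outage length and the time to first response can be checked directly from the timestamps in this timeline. The following is a minimal sketch; the helper name `format_duration` is illustrative and not part of any Miraheze tooling.

```python
from datetime import datetime

# Timestamps taken from the timeline (all UTC).
outage_start = datetime(2017, 10, 28, 9, 7)
first_response = datetime(2017, 10, 28, 9, 23)
outage_end = datetime(2017, 10, 28, 18, 30)


def format_duration(start: datetime, end: datetime) -> str:
    """Return the elapsed time between two timestamps as 'H hours M minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


print(format_duration(outage_start, outage_end))      # 9 hours 23 minutes
print(format_duration(outage_start, first_response))  # 0 hours 16 minutes
```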

Quick facts

  • Free disk space on db2 had been close to critical for a long time, then dropped from about 1.5 GB to 0 very quickly.

Conclusions

  • The incident could have been prevented if binary logs had been deleted and wikis had been moved to db3 before db2 reached 0 MB.
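
This report does not record which commands were used to delete the binary logs. As a rough sketch, assuming db2 runs MariaDB or MySQL, old binary logs can be removed with a `PURGE BINARY LOGS BEFORE` statement, and automatic expiry is typically controlled by the `expire_logs_days` server variable. The hypothetical helper below only builds the statement text for a given retention window; it does not connect to any server.

```python
from datetime import datetime, timedelta


def purge_statement(now: datetime, retention_days: int) -> str:
    """Build a PURGE BINARY LOGS statement that keeps `retention_days` of logs."""
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")
    cutoff = now - timedelta(days=retention_days)
    return f"PURGE BINARY LOGS BEFORE '{cutoff:%Y-%m-%d %H:%M:%S}';"


# Illustrative only: a 5-day retention window evaluated at the start of the outage.
print(purge_statement(datetime(2017, 10, 28, 9, 7), retention_days=5))
# PURGE BINARY LOGS BEFORE '2017-10-23 09:07:00';
```

The generated statement would still need to be reviewed and run by someone with access to the database server, and purging logs that a replica has not yet read can break replication.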

Reporting

  • What services/sites were used to report the downtime?
    • Twitter, IRC (topic)
  • What other services/sites were available for reporting, but were not used?
    • Facebook

Actionables

Permanent fix
  • Allow CreateWiki to create databases on servers other than db2.
  • Store binlogs for a shorter amount of time (Done, changed from 14 to 5 days)
Others
  • Improve the response time of Operations
  • Have more volunteers/operations members able to respond to these situations, and monitor the servers so that they do not reach this point. (Done, added Reception123 to Operations)
  • Manually move more wikis to db3 until the CreateWiki item under Permanent fix is resolved (Done, moved a few other large wikis to db3)
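
The monitoring item can be illustrated with a small free-space check. This is a minimal sketch, not the monitoring Miraheze actually runs: the thresholds and function names are assumptions chosen for illustration. The critical threshold is deliberately set above the roughly 1.5 GB level from which db2 dropped to 0, so that an alert would fire with some headroom left.

```python
import shutil

# Hypothetical thresholds; the report does not state what the real alerts use.
WARNING_FREE_MB = 5 * 1024
CRITICAL_FREE_MB = 2 * 1024


def classify_free_space(
    free_mb: float,
    warning_mb: float = WARNING_FREE_MB,
    critical_mb: float = CRITICAL_FREE_MB,
) -> str:
    """Map free disk space in MB to OK, WARNING, or CRITICAL."""
    if free_mb <= critical_mb:
        return "CRITICAL"
    if free_mb <= warning_mb:
        return "WARNING"
    return "OK"


def check_path(path: str = "/") -> str:
    """Report the free-space status of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    free_mb = usage.free / (1024 * 1024)
    return f"{classify_free_space(free_mb)}: {free_mb:.0f} MB free on {path}"


if __name__ == "__main__":
    print(check_path("/"))
```

With these example thresholds, `classify_free_space(1.5 * 1024)` returns `"CRITICAL"`, so the state described under Quick facts would already have raised an alert.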

Meta

  • Who responded to this incident?
    • Reception123, NDKilla
  • What services were affected?
    • All services that were dependent on db2 (MediaWiki and Puppet)
  • Who, therefore, needs to review this report?
    • All Operations members
  • Timestamp: ...