Tech:Incidents/2018-10-26-all-wikis-down


Summary

Provide a summary of the incident:

  • What services were affected?
    • MediaWiki
  • How long was there a visible outage?
    • 101 minutes
  • Was it caused by human error, supply/demand issues or something unknown currently?
    • Human error.
  • Was the incident aggravated by human contact, users or investigating?
    • No.

Timeline

Provide a timeline of everything that happened from the first reports to the resolution of the incident. If the time of the very first incident is known (previous incident, the time the service failed, time a patch was applied), include it as well. Times should be in 24-hour format in the UTC timezone.

[01:12] Paladox: Paladox runs an SQL command to replace a user right. The command has the unintended consequence of removing all user rights assigned to the * group instead.

[01:23] John: John reports the outage on IRC.

[01:25] Paladox: Paladox and John decide to restore mw_permissions.ibd, which had been backed up to bacula1. The backup contained the JSON that was later copied back.

[01:41] John: John notifies Paladox that the file is restored.

[02:05] Paladox: Paladox runs an SQL command that restores the JSON to metawiki.

[02:42] Paladox: Paladox runs a script to restore all user rights on the * group. This run covers public wikis only.

[02:49] Paladox: Paladox runs a script to restore all user rights on the * group. This run covers private wikis only.

[02:49] Paladox: Outage ends


Quick facts

Provide any quick facts that may be relevant:

  • Are there any known issues with the service in production?
    • No.
  • Was the cause preventable by us?
    • Yes.
  • Have there been any similar incidents?
    • No.

Conclusions

Provide conclusions that have been drawn from this incident only:

  • Was the incident preventable? If so, how?
    • Yes, I should have made sure the SQL command was correct.
  • Is the issue rooted in our infrastructure design?
    • No.
  • State any weaknesses and how they can be addressed.
    • No weaknesses.
  • State any strengths and how they prevented or assisted in investigating the incident.
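
The preventability point above (checking the SQL command before running it) can be illustrated with a minimal sketch. It is not the command or tooling used during this incident: the table demo_permissions, its columns, and the helpers guarded_update and make_demo_db are hypothetical, and an in-memory SQLite database stands in for the production database. The idea is to run the statement inside a transaction and commit only if it touched the number of rows that was expected.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run a modifying statement; commit only if it touched expected_rows rows."""
    cur = conn.execute(sql, params)  # sqlite3 opens a transaction implicitly for UPDATE
    if cur.rowcount != expected_rows:
        conn.rollback()  # wrong scope: undo the change
        return False
    conn.commit()
    return True


def make_demo_db():
    """Build a hypothetical permissions table (not the real mw_permissions schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo_permissions (wiki TEXT, grp TEXT, perms TEXT)")
    conn.executemany(
        "INSERT INTO demo_permissions VALUES (?, ?, ?)",
        [
            ("metawiki", "*", '["read"]'),
            ("metawiki", "user", '["read", "edit"]'),
            ("testwiki", "*", '["read"]'),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    # The WHERE clause is too broad: it matches the * group on every wiki,
    # so the row count differs from the expected 1 and the change is rolled back.
    applied = guarded_update(
        conn,
        "UPDATE demo_permissions SET perms = '[]' WHERE grp = ?",
        ("*",),
        expected_rows=1,
    )
    print("applied:", applied)
```

A statement whose WHERE clause matches more rows than intended is rolled back rather than committed, so the mistake surfaces before any wiki is affected.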

Actionables

List all things we can do immediately (or in our current state) to prevent this occurring again. Include links to Phabricator issues which should go into more detail, these should only be one line notes! e.g. "<link|#1>: Monitor service responses with GDNSD and pool/depool servers based on these."

Meta

  • Who responded to this incident?
    • Paladox and John
  • What services were affected?
    • MediaWiki
  • Who, therefore, needs to review this report?
    • John
  • Timestamp.