Tech:Incidents/2018-09-08-all-wikis-down

Summary[edit source]

Provide a summary of the incident:

  • What services were affected?
    • MediaWiki
  • How long was there a visible outage?
    • 5-10 minutes
  • Was it caused by human error, supply/demand issues or something unknown currently?
    • It was caused by human error
  • Was the incident aggravated by human contact, users or investigating?
    • No

Timeline[edit source]

Provide a timeline of everything that happened from the first reports to the resolution of the incident. If the time of the very first incident is known (a previous incident, the time the service failed, the time a patch was applied), include it as well. Times should be in 24-hour format in the UTC timezone.

I (paladox) committed these changes (a sketch of how such a submodule pointer bump is typically produced follows the log excerpt):

  • [18:45:59] <Not-a634> [miraheze/MatomoAnalytics] paladox pushed 1 commit to master [+0/-0/±1] https://git.io/fA2z3
  • [18:46:01] <Not-a634> [miraheze/MatomoAnalytics] paladox baadc3f - Update hidden text to use "matomo"
  • [18:46:34] <Not-a634> [miraheze/mediawiki] paladox pushed 1 commit to REL1_31 [+0/-0/±2] https://git.io/fA2zG
  • [18:46:36] <Not-a634> [miraheze/mediawiki] paladox b1b1c34 - Update MA
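
For context, a minimal sketch of how a submodule pointer bump like the one above is typically produced; the extensions/MatomoAnalytics path and the working-copy layout are assumptions and are not taken from the log:

    # inside a checkout of miraheze/mediawiki (REL1_31)
    cd extensions/MatomoAnalytics        # assumed submodule path
    git fetch origin
    git checkout origin/master           # move the submodule checkout to the new MatomoAnalytics commit
    cd ../..
    git add extensions/MatomoAnalytics   # stage the new commit hash (the "pointer") in the parent repo
    git commit -m "Update MA"
    git push origin REL1_31

Only the recorded commit hash changes in the parent repository; every deployed checkout still has to sync its own working copy (see the sketches further down).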

At 18:55:47 I (paladox) noticed that metawiki was down.

Voidwalker pinged me (paladox) at 18:56:57 about Icinga reports of wikis being down:

  • [18:56:39] <icinga-miraheze> PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 6 datacenters are down: 107.191.126.23/cpweb, 2604:180:0:33b::2/cpweb, 81.4.109.133/cpweb, 2a00:d880:5:8ea::ebc7/cpweb, 172.104.111.8/cpweb, 2400:8902::f03c:91ff:fe07:444e/cpweb
  • [18:56:51] <icinga-miraheze> PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 6 datacenters are down: 107.191.126.23/cpweb, 2604:180:0:33b::2/cpweb, 81.4.109.133/cpweb, 2a00:d880:5:8ea::ebc7/cpweb, 172.104.111.8/cpweb, 2400:8902::f03c:91ff:fe07:444e/cpweb
  • [18:56:57] <+Voidwalker> paladox ^ there you go
  • [18:57:05] <icinga-miraheze> PROBLEM - cp2 Varnish Backends on cp2 is CRITICAL: 3 backends are down. mw1 mw2 mw3
  • [18:57:31] <icinga-miraheze> PROBLEM - cp5 Varnish Backends on cp5 is CRITICAL: 3 backends are down. mw1 mw2 mw3
  • [18:57:47] <icinga-miraheze> PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 3 backends are down. mw1 mw2 mw3
  • [18:58:57] <icinga-miraheze> PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is CRITICAL: CRITICAL - NGINX Error Rate is 82%
  • [18:59:09] <icinga-miraheze> PROBLEM - cp2 HTTP 4xx/5xx ERROR Rate on cp2 is CRITICAL: CRITICAL - NGINX Error Rate is 94%
  • [19:00:43] <+paladox> err it's not comming back up :/
  • [19:01:49] <icinga-miraheze> PROBLEM - cp5 HTTP 4xx/5xx ERROR Rate on cp5 is CRITICAL: CRITICAL - NGINX Error Rate is 67%

At 19:02:24 I (paladox) saw the fatal exception after forcing cp4 to mark mw1 as healthy.
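
Assuming the fatal exception came from the deployed MediaWiki checkouts disagreeing with the submodule pointer recorded on REL1_31 (which is what the conclusions below point at), a quick way to spot that mismatch on an affected server would be something like the following; the submodule path is again an assumption:

    # inside the deployed mediawiki checkout on mw1/mw2/mw3
    git submodule status extensions/MatomoAnalytics
    # a leading '+' in the output means the checked-out submodule commit differs
    # from the commit recorded in the parent repository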

I (paladox) then found the error and knew the fix:

  • [19:03:26] <Not-a634> [miraheze/mediawiki] paladox pushed 1 commit to REL1_31 [+0/-0/±1] https://git.io/fA2z5
  • [19:03:28] <Not-a634> [miraheze/mediawiki] paladox 50c738e - Update MW
  • [19:03:32] <+paladox> it was a stupid mistake i re reverted the change i reverted by SPF|Cloud
  • Wikis then started to recover after that change was committed.

Quick facts[edit source]

Provide any relevant quick facts that may be relevant:

  • Are there any known issues with the service in production?
    • Nope
  • Was the cause preventable by us?
    • Nope
  • Have there been any similar incidents?
    • Nope

Conclusions[edit source]

Provide conclusions that have been drawn from this incident only:

  • Was the incident preventable? If so, how?
    • Yes; I should have run git submodule update after reverting a submodule update.
  • Is the issue rooted in our infrastructure design?
    • Nope.
  • State any weaknesses and how they can be addressed.
    • None.
  • State any strengths and how they prevented or assisted in investigating the incident.
    • None

Actionables[edit source]

List all things we can do immediately (or in our current state) to prevent this occurring again. Include links to Phabricator issues which should go into more detail, these should only be one line notes! e.g. "<link|#1>: Monitor service responses with GDNSD and pool/depool servers based on these."

  • Make sure to run git submodule update after reverting a submodule update (see the sketch below this list).
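
A minimal sketch of that actionable, assuming the extension is a git submodule at extensions/MatomoAnalytics and that mw1-mw3 serve MediaWiki from a git checkout of the REL1_31 branch (both assumptions, not confirmed by this report):

    # revert the bad submodule pointer commit in the parent repository
    git revert --no-edit <bad-pointer-commit>
    git push origin REL1_31
    # then, on every server that serves the code (mw1, mw2, mw3):
    git pull origin REL1_31
    git submodule update --init extensions/MatomoAnalytics   # sync the working copy to the recorded commit
    git submodule status                                      # no leading '+' or '-' means the checkout is in sync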

Meta[edit source]

  • Who responded to this incident?
    • Paladox
  • What services were affected?
    • MediaWiki
  • Who, therefore, needs to review this report?
    • Ops (john)
  • Timestamp.
    • First noticed by me (paladox) at 18:55:47.