Tech:Incidents/2016-02-02-MWExceptions

From Meta
Jump to navigation Jump to search

For an unknown reason mw1 got an invalid localisation cache. Since this server is used in production, wikis depending on various extensions with magic words went down if the requests weren't served by mw2 (which seemed fine).

Description[edit source]

John and I have been working the evening before on making MediaWiki redundant. This process consisted of load balancing traffic between mw1 and mw2, and ensuring we had an NFS mount for mediawiki-static. Due to some small (but unrelated) NFS problems John re-installed mw2, and it was up to me to ensure MediaWiki would work (after installing a MediaWiki server we still manually need to build a localisation cache) on that. Running rebuildLocalisationCache.php on a server that doesn't contain a cache doesn't work (it ends with an error that the localisation cache is missing, ironic), so in the past I always used this small piece of code to have a temporary workaround. After a localisation cache exists I always removed that line afterwards, and this method always worked. However, since it could become annoying in the future to do that, I thought 453830f6961a42ca055a2b7054050932713e4c20 would work nice. In the past I had a lot of trouble with a similar piece of code, but I thought this worked (and in the beginning it seemed that it *did* work).

A few hours later, at 00:00:20, the first MWExceptions joined Miraheze:

2016-02-02 00:00:20 mw1 allthetropeswiki: [da053abd] /wiki/File:4-tails_2841.png   MWException from line 337 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'css'
#0 /srv/mediawiki/w/includes/MagicWord.php(262): MagicWord->load(string)
#1 /srv/mediawiki/w/includes/parser/Parser.php(5212): MagicWord::get(string)
#2 /srv/mediawiki/w/extensions/CSS/CSS.php(124): Parser->setFunctionHook(string, string)
#3 [internal function]: wfCSSParserFirstCallInit(Parser)
#4 /srv/mediawiki/w/includes/Hooks.php(201): call_user_func_array(string, array)
#5 /srv/mediawiki/w/includes/parser/Parser.php(328): Hooks::run(string, array)
#6 /srv/mediawiki/w/includes/cache/MessageCache.php(1035): Parser->firstCallInit()
#7 /srv/mediawiki/w/includes/cache/MessageCache.php(1064): MessageCache->getParser()
#8 /srv/mediawiki/w/includes/Message.php(1110): MessageCache->parse(string, Title, boolean, boolean, Language)
#9 /srv/mediawiki/w/includes/Message.php(792): Message->parseText(string)
#10 /srv/mediawiki/w/includes/Message.php(848): Message->toString()
#11 /srv/mediawiki/w/includes/page/ImagePage.php(403): Message->parse()
#12 /srv/mediawiki/w/includes/page/ImagePage.php(154): ImagePage->openShowImage()
#13 /srv/mediawiki/w/includes/actions/ViewAction.php(44): ImagePage->view()
#14 /srv/mediawiki/w/includes/MediaWiki.php(490): ViewAction->show()
#15 /srv/mediawiki/w/includes/MediaWiki.php(287): MediaWiki->performAction(ImagePage, Title)
#16 /srv/mediawiki/w/includes/MediaWiki.php(714): MediaWiki->performRequest()
#17 /srv/mediawiki/w/includes/MediaWiki.php(508): MediaWiki->main()
#18 /srv/mediawiki/w/index.php(41): MediaWiki->run()
#19 {main}

The following wikis were affected (along with exceptions count):

root@mw1:/var/log/mediawiki/debuglogs# grep -a 'Error: invalid magic word' exception.log | awk '{print $4}' | cut -d":" -f1 | sort | uniq -c | sort -nr
   5910 allthetropeswiki
    878 spiralwiki
    512 boulderwikiwiki
    453 metawiki
    381 poserdazfreebieswiki
    274 aktposwiki
     83 antiguabarbudacalypsowiki
     69 testwiki
     61 specialeducationwiki
     46 tyrolmountainswiki
     29 cssandjsschoolboardwiki
     22 partupwiki
     10 szkwiki
      4 reviwiki
      2 rawdatawiki
      2 quantixwiki
      2 developmentwiki
      1 worldofkirbycraftwiki
      1 webflowwiki
      1 stoutofreachwiki
      1 spiraltestwiki
      1 poserdazfreebiestestwiki
      1 micropediawiki

At 06:48:13 the MWExceptions stopped because I rebuilt the localisation cache. A few hours later I depooled mw1 for troubleshooting.

None of us know what exactly happened, and why it happened. I think that this commit could have been the culprit here.

The sad part about this downtime is that it took 7 hours before someone actually noticed it and fixed the issue. System administrators should get notified if something happens, but since we have no monitoring (neither do we have a Varnish setup that automatically depooled mw1), no one took action unfortunately.

Timeline[edit source]

1 February
2 February
  • 00:00 Southparkfan: first MWExceptions by mw1
  • 06:48 Southparkfan: rebuilt LC cache on mw1 for English language, problems fixed for now
  • 06:52 Southparkfan: reverted potential cause of issues
  • 14:56 Southparkfan: depooled mw1 for troubleshooting

Conclusions[edit source]

  • MediaWiki failed to properly handle localisation cache issues
  • Miraheze visitors failed to notify Miraheze staff during issues
  • Miraheze staff failed to spot the issue in a reasonable time
  • Icinga failed to notify us, because we lack good monitoring
  • Varnish failed to mark mw1 as sick and depool it, because we have no (good) health checks

Actionables[edit source]

  • #351: design a better localisation cache system
  • #77: add health checks to Varnish and ensure they work
  • #78: better monitoring for Icinga
  • Make Miraheze's visitors, volunteers and wiki founders aware how to contact the Miraheze System Administrators during trouble

Meta[edit source]

  • Incident handled by: Southparkfan
  • Affected services: all web requests for the wikis listed above that were routed to mw1
  • Signature: Southparkfan (talk) 17:35, 3 February 2016 (UTC)