Tech:Incidents/2015-12-28-SiteOutage


Comment (February 2016): while HHVM *did* crash, it crashed because of another (unknown) problem. Since December we have had a lot of issues with HHVM, and 98% of the time we are running PHP-FPM. Update (March 2016): the cause of most of the crashes seems to be the Linux OOM killer. We've implemented a cron job that restarts HHVM every two hours, and we'll soon look into whether we can assign more memory to our servers.

Something on mw1 (originally thought to be an HHVM overload) brought HHVM down. Combined with a lack of notifications during the downtime, this resulted in 22 minutes of downtime (502 errors).

Timeline

  • 21:17 HHVM went down, first signs of trouble in the NGINX error log
  • 21:37 Southparkfan: noticed Miraheze was down
  • 21:39 Southparkfan: restarted HHVM. Couldn't initially find out why it crashed, until I found that a lot of traffic was going to mw1 at the moment HHVM crashed.

Conclusions

  • A bot (MJ12Bot) was massively requesting dynamic content (special pages) from mw1 at the time. The initial conclusion was that HHVM could no longer keep up with the load and crashed, but we don't know if this was the case.
  • Icinga failed to notify us when HHVM crashed. It does not have any checks for the HHVM process itself (it only has checks for HTTP/HTTPS). Because nginx was still running, it mistakenly 'thought' Miraheze was still up.
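
The monitoring gap described above (no check on the HHVM process itself) could be closed with a process check. The following is only a minimal sketch of an Icinga/Nagios-style plugin, not the check Miraheze actually deployed. It assumes a Linux host with /proc, and that the process shows up under the command name "hhvm". The function names are hypothetical. The exit codes follow the usual plugin convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```python
"""Sketch of an Icinga/Nagios-style check that a named process is running."""
import os

# Conventional Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def running_process_names(proc_root="/proc"):
    """Return the command names of all running processes (Linux only)."""
    names = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                names.append(fh.read().strip())
        except OSError:
            continue  # process exited while we were scanning
    return names


def check_process(name, names):
    """Turn a list of process names into an (exit_code, status_line) pair."""
    count = sum(1 for n in names if n == name)
    if count == 0:
        return CRITICAL, f"PROCS CRITICAL: no '{name}' process running"
    return OK, f"PROCS OK: {count} '{name}' process(es) running"


def main(argv):
    """Run the check; "hhvm" is an assumed default process name."""
    name = argv[1] if len(argv) > 1 else "hhvm"
    try:
        code, line = check_process(name, running_process_names())
    except OSError as exc:
        code, line = UNKNOWN, f"PROCS UNKNOWN: {exc}"
    print(line)
    return code

# A real plugin would end with: sys.exit(main(sys.argv))
```

A check like this only detects a missing process. An HHVM instance that is still running but no longer answering requests would need a separate check against the backend itself rather than against nginx.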

Actionables

Meta

  • Online during downtime: Southparkfan
  • Affected services: MediaWiki
  • Signature: Southparkfan (talk) 22:26, 28 December 2015 (UTC)