Tech:Incidents/2015-12-28-SiteOutage

An HHVM overload on mw1 brought HHVM down. Combined with a lack of notifications during the outage, this resulted in 22 minutes of downtime (502 errors).

Timeline

 * 21:17 HHVM went down, first signs of trouble in the NGINX error log
 * 21:37 Southparkfan: noticed Miraheze went down
 * 21:39 Southparkfan: restarted HHVM. I couldn't initially find out why it had crashed, until I found that a lot of traffic was going to mw1 at the moment of the crash.

Conclusions

 * A bot (MJ12Bot) was massively requesting dynamic content (special pages) from mw1. At some point HHVM could no longer keep up with the load, so it crashed.
 * Icinga failed to notify us when HHVM crashed. It has no checks for the HHVM process itself (only for HTTP/HTTPS). Because nginx was still running, it mistakenly 'thought' Miraheze was still up.
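
To illustrate the monitoring gap described above, here is a minimal sketch of a process check in the style of a Nagios/Icinga plugin. It is not the check that was actually deployed. The function names, the process name "hhvm" and the message format are assumptions made for illustration; only the exit code convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN) is the standard plugin convention.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def running_process_names():
    """Return the command names of all running processes, as reported by `ps`."""
    result = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def evaluate(process_names, target="hhvm"):
    """Map a list of process names to a (exit_code, message) pair.

    `target` is an assumed process name; adjust it to the real binary name.
    """
    count = sum(1 for name in process_names if name == target)
    if count == 0:
        return CRITICAL, f"PROCS CRITICAL: no '{target}' process running"
    return OK, f"PROCS OK: {count} '{target}' process(es) running"


def main():
    """Run the check once, print the status line and return the exit code."""
    try:
        code, message = evaluate(running_process_names())
    except Exception as exc:  # e.g. `ps` missing or failing
        code, message = UNKNOWN, f"PROCS UNKNOWN: {exc}"
    print(message)
    return code
```

A plugin script would finish with `sys.exit(main())` so that Icinga sees the exit code. Unlike an HTTP check against nginx, this kind of check looks at the backend process itself.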

Actionables

 * Deny (or slow down) robots via robots.txt - ✅ in https://github.com/miraheze/puppet/commit/83fcafec7193d3124863fb670b4f28897713abcf
 * Make sure Icinga notifies us if HHVM isn't running - ✅ in https://github.com/miraheze/puppet/commit/b8d6c682666ab8b569f5060a461aa9dadf913063
 * Deploy additional MediaWiki servers that can serve the traffic until HHVM is restarted by someone - blocked by mw-config/#199
 * Add Terms of Use so that people know how often they can send requests without breaking things.
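
As an illustration of the robots.txt actionable, the sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would interpret rules that slow down one bot and keep it away from special pages. The robots.txt content, the 10 second delay and the example URLs are hypothetical and are not taken from the linked commit. `Crawl-delay` is a non-standard directive that only some crawlers honour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: throttle one bot and keep it off special pages,
# while leaving all other user agents unrestricted.
ROBOTS_TXT = """\
User-agent: MJ12bot
Crawl-delay: 10
Disallow: /wiki/Special:

User-agent: *
Disallow:
"""


def build_parser(text=ROBOTS_TXT):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser()

# Delay (in seconds) the named bot is asked to wait between requests.
bot_delay = parser.crawl_delay("MJ12bot")

# Whether the bot may fetch a dynamic special page or a normal article.
special_allowed = parser.can_fetch(
    "MJ12bot", "https://example.org/wiki/Special:RecentChanges"
)
article_allowed = parser.can_fetch("MJ12bot", "https://example.org/wiki/Main_Page")

# Other user agents fall under the catch-all entry and are not restricted.
other_allowed = parser.can_fetch(
    "SomeBrowser", "https://example.org/wiki/Special:RecentChanges"
)
```

Note that robots.txt is advisory: it only helps against crawlers that choose to respect it, which is why the process check and extra capacity are still listed as separate actionables.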

Meta

 * Online during downtime: Southparkfan
 * Affected services: MediaWiki
 * Signature: Southparkfan (talk) 22:26, 28 December 2015 (UTC)