Tech:Incidents/2015-11-14-SiteOutage

A network outage on NLSVZS1 took down servers misc1, db1, parsoid1 and mw1, causing a site outage of around 10 minutes during which Varnish returned 503 errors.

Timeline

20:20 Southparkfan: Icinga began marking many services on cp1, cp2 and ns1 as CRITICAL
20:24 Southparkfan: noticed Miraheze is down, began investigating. Noticed affected services are all on NLSVZS1
20:25 Southparkfan: not able to log in to SolusVM due to slowness
20:27 Southparkfan: managed to log in to SolusVM
20:28 Southparkfan: Icinga marks all services as OK, site is back up
20:37 Southparkfan: sent email to RamNode (ticket #421602) because this is not the first time NLSVZS1 has experienced issues
20:42 RamNode: replied that they were dealing with an outbound DoS attack

Conclusions

An outbound DoS attack was under way on NLSVZS1. Although RamNode noticed it very quickly, their staff were unable to prevent an outage.

Actionables

While we could not have prevented the outage of NLSVZS1, we should run failover services on other nodes.

Purchase failover servers on an SVZ/CVZ node
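
To illustrate the reasoning behind this actionable, here is a minimal Python sketch of failover selection across host nodes. It is not part of the actual setup: the `Backend` type, the `pick_backend` and `shares_single_node` helpers, the `db2` server and the `OTHER-NODE` node name are all hypothetical and exist only for the example. Only `db1` and `NLSVZS1` come from this report.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Backend:
    name: str      # server name, e.g. "db1"
    node: str      # host node the server runs on
    healthy: bool  # result of the most recent health check


def pick_backend(backends: Iterable[Backend]) -> Optional[Backend]:
    """Return the first healthy backend in preference order, or None."""
    for backend in backends:
        if backend.healthy:
            return backend
    return None


def shares_single_node(backends: Iterable[Backend]) -> bool:
    """True if all backends sit on one host node (a single point of failure)."""
    return len({b.node for b in backends}) <= 1


# Hypothetical layout: db1 on NLSVZS1 (down), db2 on another node (up).
databases = [
    Backend("db1", "NLSVZS1", healthy=False),
    Backend("db2", "OTHER-NODE", healthy=True),  # assumed failover server
]

chosen = pick_backend(databases)
print(chosen.name if chosen else "no healthy backend")  # prints "db2"
```

With every server on the same node, as was the case during this incident, `shares_single_node` returns `True` and `pick_backend` has nothing to fall back to once that node loses connectivity.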

Meta

Online during downtime: Southparkfan
Affected services: MediaWiki, MariaDB, monitoring (icinga/ganglia), Parsoid
Signature: Southparkfan (talk) 20:57, 14 November 2015 (UTC)