A network outage on NLSVZS1 took servers misc1, db1, parsoid1 and mw1 offline, causing a site outage of around 10 minutes during which Varnish returned 503 errors.
Timeline
- 20:20 Southparkfan: Icinga began marking many services on cp1, cp2 and ns1 as CRITICAL
- 20:24 Southparkfan: noticed Miraheze was down and began investigating. Noticed the affected services were all on NLSVZS1
- 20:25 Southparkfan: unable to log in to SolusVM due to slowness
- 20:27 Southparkfan: managed to log in to SolusVM
- 20:28 Southparkfan: Icinga marks all services as OK, site is back up
- 20:37 Southparkfan: sent email to RamNode (ticket #421602) because this is not the first time NLSVZS1 has experienced issues
- 20:42 RamNode: replied that they were dealing with an outbound DoS attack
Conclusions
An outbound DoS attack was ongoing on NLSVZS1. Although RamNode noticed it very quickly, their staff were unable to prevent an outage.
Actionables
While we could not have prevented the outage of NLSVZS1, we should put failover services on other nodes.
- https://github.com/miraheze/puppet/issues/35 - "Purchase failover servers on an SVZ/CVZ node"