Tech:Incidents/2015-11-14-SiteOutage

A network outage on node NLSVZS1 took servers misc1, db1, parsoid1 and mw1 offline, causing a site outage of around 10 minutes during which Varnish returned 503 errors.

Timeline

  • 20:20 Southparkfan: Icinga began marking many services on cp1, cp2 and ns1 as CRITICAL
  • 20:24 Southparkfan: noticed Miraheze was down and began investigating; the affected services were all on NLSVZS1
  • 20:25 Southparkfan: unable to log in to SolusVM due to slowness
  • 20:27 Southparkfan: managed to log in to SolusVM
  • 20:28 Southparkfan: Icinga marked all services as OK; site was back up
  • 20:37 Southparkfan: sent an email to RamNode (ticket #421602) because this was not the first time NLSVZS1 had experienced issues
  • 20:42 RamNode: replied that they had been dealing with an outbound DoS attack

Conclusions

An outbound DoS attack was in progress on NLSVZS1. Although RamNode noticed it very quickly, their staff were unable to prevent an outage.

Actionables

Although we could not have prevented the outage of NLSVZS1, we should place failover services on other nodes.
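
As a rough illustration of what failover across nodes could look like, the sketch below picks the first reachable backend from an ordered list. It is not the setup in use here: the hostnames, the second database server and the helper names (`tcp_alive`, `pick_backend`) are all hypothetical, and a plain TCP connect is only a stand-in for a real health check.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Backend = Tuple[str, int]  # (host, port)


def tcp_alive(backend: Backend, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend succeeds within the timeout."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def pick_backend(
    backends: Sequence[Backend],
    is_alive: Callable[[Backend], bool] = tcp_alive,
) -> Optional[Backend]:
    """Return the first backend that passes the health check, or None if all fail."""
    for backend in backends:
        if is_alive(backend):
            return backend
    return None


# Hypothetical layout: the primary lives on one node, the failover on another,
# so that losing a single node does not take out both. Hostnames are placeholders.
DB_BACKENDS = [
    ("db1.example.org", 3306),  # primary (assumed to be on the affected node)
    ("db2.example.org", 3306),  # failover (assumed to be on a different node)
]
```

Calling `pick_backend(DB_BACKENDS)` would return the primary while it is reachable and fall back to the second entry otherwise. The ordering of the list encodes the preference, which keeps the sketch simple at the cost of not balancing load between nodes.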

Meta

  • Online during downtime: Southparkfan
  • Affected services: MediaWiki, MariaDB, monitoring (Icinga/Ganglia), Parsoid
  • Signature: Southparkfan (talk) 20:57, 14 November 2015 (UTC)