Tech:Incidents/2016-07-10-cp1

Summary

On July 10th, RamNode conducted an emergency reboot of node "NLCVZE5-1". The node contains cp1, which hosts Varnish for European, African and Asian users as well as NFS for static.miraheze.org. The incident started at 22:24 UTC (July 10th) and was resolved at 05:26 (July 11th). Human interaction (John's reboot at 22:24) did not aggravate the incident or the outage, as the incident had already begun affecting puppet and basic Debian systems before then. John's reboot meant the service went down through our own action rather than through an unknown upstream action.

Timeline

July 10th

22:24: John: !log rebooting cp1, NFS will have to manage after
22:25: cp1: server goes down, doesn't boot and fails to respond to pings.
22:27: John: notices cp1 hasn't booted and no status is viewable from RamNode's control panel.
22:30: John: realises the whole node is down and the incident isn't contained to Miraheze.
22:38: Southparkfan: brings site up through nginx config changes and disables puppet.
22:43: John: reboots mw2.
22:46: Southparkfan: manually and forcefully kills mounts to cp1.
22:55: John: disables uploads globally.
22:55: RamNode: responds to the ticket, stating that the node has been rebooted as an emergency action.

July 11th

01:04: staffing: operations coverage ends; a strict no-deploys, no-changes policy is enforced.
03:04: cp1: comes back online after RamNode finishes the reboot and fsck.
05:24: John: remounts static on mw1.
05:26: John: remounts static on mw2.
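
As a sanity check, the durations implied by the timeline can be worked out with a few lines of Python. This is only an illustrative sketch: the variable names are made up here, and the inputs are the UTC timestamps listed in the timeline.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline (UTC); the names are illustrative only.
reboot_logged = datetime.strptime("2016-07-10 22:24", FMT)
cp1_down = datetime.strptime("2016-07-10 22:25", FMT)
cp1_back = datetime.strptime("2016-07-11 03:04", FMT)
last_remount = datetime.strptime("2016-07-11 05:26", FMT)

# Start of the incident to the final remount on mw2.
incident_duration = last_remount - reboot_logged
# Time during which cp1 itself was offline.
cp1_offline = cp1_back - cp1_down
# Time between cp1 returning and static being remounted everywhere.
recovery_gap = last_remount - cp1_back

print("incident duration:", incident_duration)  # 7:02:00
print("cp1 offline:      ", cp1_offline)        # 4:39:00
print("recovery gap:     ", recovery_gap)       # 2:22:00
```

The recovery gap falls entirely inside the window in which operations coverage had ended.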

Quick facts

NFS is not HA-friendly. This is known, and it is bad.
cp1 is mostly a throw-away server as it hosts Varnish, but hosting NFS makes it a critical service. This is bad.
NFS is easy to fail over to an old backup in Bacula if necessary.
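
Because an unreachable NFS server can leave client mounts hanging, a client-side check has to avoid blocking on the mount itself. The following is a minimal sketch of one way to do that, not something that was in place during this incident: it runs `stat` in a child process and gives up after a timeout. The function name, the timeout value and the path `/mnt/static` are assumptions made for illustration.

```python
import subprocess


def mount_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if `stat` on `path` succeeds within `timeout_s` seconds.

    The check runs in a child process so that a hung NFS mount blocks the
    child rather than the caller.
    """
    proc = subprocess.Popen(
        ["stat", "--", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in uninterruptible I/O may
        # linger until the NFS server returns or the mount is forced off.
        proc.kill()
        return False


if __name__ == "__main__":
    # Hypothetical mount point; substitute the real one.
    print(mount_responds("/mnt/static"))
```

A monitoring check built on this would report a missing path and a hung mount the same way (both return False), which is usually enough to decide whether a failover or forced unmount needs to be considered.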

Conclusions

The incident was not caused by Miraheze and was not preventable by Miraheze.
John's reboot may not have been the best action, but it put the situation on our own terms: we knew what had happened and could handle it.

Actionables

T471: Document and Evaluate NFS failover terms

Meta

Who responded to this incident? John, Southparkfan.
What services were affected? cp1 (NFS), mw1/mw2 (MediaWiki serving).
Who, therefore, needs to review this report? John (misc. operations stuff)
Timestamp: 00:09, 11 July 2016 (UTC)