Tech:Incidents/2016-07-10-cp1

Summary
On July 10th, RamNode conducted an emergency reboot of node "NLCVZE5-1". The node contains cp1, which hosts Varnish for European, African and Asian users as well as NFS for static.miraheze.org. The incident started at 22:24 UTC (July 10th) and was resolved at 05:26 UTC (July 11th). Human interaction (John's reboot at 22:24) did not aggravate the incident or the outage, as the incident had already begun affecting puppet and basic Debian systems prior to that. John's reboot brought the service down through human interaction rather than through unknown upstream action.

Timeline
July 10th to July 11th (all times UTC; entries from 01:04 onward are July 11th)
 * 22:24: John: !log rebooting cp1, NFS will have to manage after
 * 22:25: cp1: server goes down, doesn't boot and fails to respond to pings.
 * 22:27: John: notices cp1 hasn't booted and no status is viewable from RamNode's control panel.
 * 22:30: John: realises the whole node is down and the incident isn't contained to Miraheze.
 * 22:38: Southparkfan: brings site up through nginx config changes and disables puppet.
 * 22:43: John: mw2 is rebooted.
 * 22:46: Southparkfan: manually and forcefully kills mounts to cp1.
 * 22:55: John: disables uploads globally.
 * 22:55: RamNode: responds to the ticket with information that the node has been rebooted in an emergency action.
 * 01:04: staffing: operations coverage ends; a strict freeze on deploys and changes is enforced.
 * 03:04: cp1: comes back online after RamNode finishes the reboot and fsck.
 * 05:24: John: remounts static on mw1.
 * 05:26: John: remounts static on mw2.
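
The intervals between the key timeline entries can be worked out with a short script. This is only a sketch: the timestamps are copied from the timeline above, while the event names and the helpers `elapsed` and `fmt` are illustrative and not part of any Miraheze tooling.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline (UTC). The key names are
# illustrative labels, not identifiers used anywhere by Miraheze.
events = {
    "reboot_issued": datetime(2016, 7, 10, 22, 24),
    "site_up_via_nginx": datetime(2016, 7, 10, 22, 38),
    "cp1_back_online": datetime(2016, 7, 11, 3, 4),
    "static_remounted_mw2": datetime(2016, 7, 11, 5, 26),
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two named timeline events."""
    return events[end] - events[start]


def fmt(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes, e.g. 7h02m."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


print("Reboot to site up via nginx:", fmt(elapsed("reboot_issued", "site_up_via_nginx")))
print("Reboot to cp1 back online:  ", fmt(elapsed("reboot_issued", "cp1_back_online")))
print("cp1 online to last remount: ", fmt(elapsed("cp1_back_online", "static_remounted_mw2")))
print("Total incident duration:    ", fmt(elapsed("reboot_issued", "static_remounted_mw2")))
```

By this reckoning the incident lasted 7h02m in total, of which 2h22m fell between cp1 coming back online and the final remount, a gap that overlaps the period in which operations coverage had ended.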

Quick facts

 * NFS is not HA-friendly. This is known, this is bad.
 * cp1 is mostly a throw-away server as it hosts Varnish, but hosting NFS makes it a critical service. This is bad.
 * NFS is easy to fail over to an old backup in Bacula if necessary.

Conclusions

 * The incident was not caused by Miraheze and was not preventable by Miraheze.
 * John's reboot may not have been the best action, but it meant the outage began on our own terms, with us aware of it and able to handle it.

Actionables

 * T471: Document and Evaluate NFS failover terms

Meta

 * Who responded to this incident? John, Southparkfan.
 * What services were affected? cp1 (NFS), mw1/mw2 (MediaWiki serving).
 * Who, therefore, needs to review this report? John (misc. operations stuff)
 * Timestamp: 00:09, 11 July 2016 (UTC)