Tech:Incidents/2016-07-10-cp1

Summary
On July 10th, RamNode conducted an emergency reboot of node "NLCVZE5-1". The node contains cp1, which hosts Varnish for European, African and Asian users as well as NFS for static.miraheze.org. The incident started at 22:24 UTC and resolved at . Human interaction (John's reboot at 22:24) did not aggravate the incident or the outage, as the incident had begun affecting puppet and basic Debian systems before then. John's reboot brought the service down through human interaction rather than through unknown upstream action.

Timeline
July 10th
 * 22:24: John: !log rebooting cp1, NFS will have to manage after
 * 22:25: cp1: server goes down, doesn't boot and fails to respond to pings.
 * 22:27: John: notices cp1 hasn't booted and no status is viewable from RamNode's control panel.
 * 22:30: John: realises the whole node is down and the incident isn't contained to Miraheze.
 * 22:38: Southparkfan: brings site up through undisclosed techniques with puppet disabled.
 * 22:43: John: mw2 is rebooted.
 * 22:46: Southparkfan: manually and forcefully kills mounts to cp1.
 * 22:55: John: disables uploads globally.
 * 22:55: RamNode: responds to ticket with information that the node has been rebooted in an emergency action.
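
The exact commands used for the forced unmount at 22:46 are not recorded in this report. As a minimal sketch only, the following shows one way stale NFS mounts pointing at an unreachable server could be located and turned into unmount commands. The host name cp1.example.org, the paths in the sample, and the helper names are hypothetical; the only assumptions about the real system are the standard /proc/mounts field order and the -f (force) and -l (lazy) options of umount on Linux.

```python
import shlex

# Hypothetical sample in /proc/mounts format:
# device, mount point, filesystem type, options, dump, pass
SAMPLE_MOUNTS = """\
/dev/vda1 / ext4 rw,relatime 0 0
cp1.example.org:/srv/static /mnt/static nfs4 rw,relatime 0 0
proc /proc proc rw 0 0
"""


def nfs_mounts_for_server(mounts_text, server):
    """Return the mount points of NFS filesystems exported by `server`."""
    targets = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type = fields[:3]
        # NFS devices are written as "host:/export/path"
        host = device.split(":", 1)[0]
        if fs_type in ("nfs", "nfs4") and host == server:
            targets.append(mount_point)
    return targets


def umount_commands(mount_points):
    """Build (but do not run) forced, lazy unmount commands."""
    return ["umount -f -l " + shlex.quote(mp) for mp in mount_points]


if __name__ == "__main__":
    for command in umount_commands(
        nfs_mounts_for_server(SAMPLE_MOUNTS, "cp1.example.org")
    ):
        print(command)
```

The sketch only prints the commands so that an operator can review them before running anything; on a live host the input would be the contents of /proc/mounts rather than the sample string.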

Quick facts

 * NFS is not HA-friendly. This is known and this is bad.
 * cp1 is mostly a throw-away server as it hosts Varnish, except that NFS makes it a critical service. This is bad.
 * NFS is easy to failover to an old backup in Bacula if necessary.

Conclusions

 * The incident was not caused by Miraheze and was not preventable by Miraheze.
 * John's reboot may not have been the best action, but it brought the situation under our own control, so that we knew what was happening and could handle it.

Actionables

 * T471: Document and Evaluate NFS failover terms

Meta

 * Who responded to this incident? John, Southparkfan.
 * What services were affected? cp1 (NFS), mw1/mw2 (MediaWiki serving).
 * Who, therefore, needs to review this report? John (misc. operations stuff)
 * Timestamp: 00:09, 11 July 2016 (UTC)