On July 10th, RamNode conducted an emergency reboot of node "NLCVZE5-1". The node contains cp1, which hosts Varnish for European, African and Asian users as well as NFS for static.miraheze.org. The incident started at 22:24 UTC (July 10th) and was resolved at 05:26 UTC (July 11th). Human interaction (John's reboot at 22:24) did not aggravate the incident or the outage, as the incident had already begun affecting puppet and basic Debian systems before then. John's reboot brought the service down through human interaction rather than through unknown upstream action.
- 22:24: John: !log rebooting cp1, NFS will have to manage after
- 22:25: cp1: server goes down, doesn't boot and fails to respond to pings.
- 22:27: John: notices cp1 hasn't booted and no status is viewable from RamNode's control panel.
- 22:30: John: realises the whole node is down and the incident isn't contained to Miraheze.
- 22:38: Southparkfan: brings site up through nginx config changes and disables puppet.
- 22:43: John: mw2 is rebooted.
- 22:46: Southparkfan: manually and forcefully kills mounts to cp1.
- 22:55: John: disables uploads globally.
- 22:55: RamNode: responded to ticket with information that the node has been rebooted in an emergency action.
- 01:04: staffing: operations coverage ends; a strict freeze on deploys and changes is enforced.
- 03:04: cp1: comes back online after RamNode finishes the reboot and fsck.
- 05:24: John: remounts static on mw1.
- 05:26: John: remounts static on mw2.
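The durations implied by the timeline above can be checked with a short calculation. This is only a sketch: the helper name `elapsed` is made up for illustration, the timestamps are copied from the timeline, and all times are assumed to be UTC with the end time falling at most one day after the start.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Elapsed time between two HH:MM clock times.

    Assumes both times are UTC and that an end time earlier than the
    start time means the end falls on the following day.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # the span crossed midnight
    return delta


# Timestamps taken from the timeline above.
incident = elapsed("22:24", "05:26")       # reboot logged -> static remounted on mw2
cp1_down = elapsed("22:25", "03:04")       # cp1 goes down -> cp1 back online
site_restored = elapsed("22:24", "22:38")  # reboot logged -> site up via nginx changes

print(incident, cp1_down, site_restored)
```

On these figures the incident lasted 7 hours 2 minutes in total, cp1 itself was unreachable for 4 hours 39 minutes, and the site was brought back up 14 minutes after the start.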
Quick facts
- NFS is not HA-friendly. This is known, this is bad.
- cp1 is mostly a throw-away server as it hosts Varnish, but hosting NFS makes it a critical service. This is bad.
- NFS is easy to failover to an old backup in Bacula if necessary.
- The incident was not caused by Miraheze and was not preventable by Miraheze.
- John's reboot may not have been the best action, but it brought the situation under our own control in terms of knowing about it and handling it.
- T471: Document and Evaluate NFS failover terms
- Who responded to this incident? John, Southparkfan.
- What services were affected? cp1 (NFS), mw1/mw2 (MediaWiki serving).
- Who, therefore, needs to review this report? John (misc. operations stuff)
- Timestamp: 00:09, 11 July 2016 (UTC)