Tech:Incidents/2016-07-10-cp1

Summary

On July 10th, RamNode conducted an emergency reboot of node "NLCVZE5-1". The node contains cp1, which hosts Varnish for European, African and Asian users as well as NFS for static.miraheze.org. The incident started at 22:24 UTC (July 10th) and was resolved at 05:26 UTC (July 11th). Human interaction (John's reboot at 22:24) did not aggravate the incident or outage, as the incident had begun affecting puppet and basic Debian systems before that. John's reboot brought the service down through human interaction rather than through unknown upstream action.

Timeline

July 10th

  • 22:24: John: !log rebooting cp1, NFS will have to manage after
  • 22:25: cp1: server goes down, doesn't boot and fails to respond to pings.
  • 22:27: John: notices cp1 hasn't booted and no status is viewable from RamNode's control panel.
  • 22:30: John: realises the whole node is down and the incident isn't contained to Miraheze.
  • 22:38: Southparkfan: brings site up through nginx config changes and disables puppet.
  • 22:43: John: mw2 is rebooted.
  • 22:46: Southparkfan: manually and forcefully kills mounts to cp1.
  • 22:55: John: disables uploads globally.
  • 22:55: RamNode: responds to ticket with information that the node has been rebooted in an emergency action.

July 11th

  • 01:04: staffing: operations coverage ends; a strict no-deploys, no-changes policy is enforced.
  • 03:04: cp1: comes back online after RamNode finishes the reboot and fsck.
  • 05:24: John: remounts static on mw1.
  • 05:26: John: remounts static on mw2.
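
The durations implied by the timeline can be worked out from its timestamps. The following is a minimal sketch in Python, assuming all times are UTC and taking the year (2016) from the report's title. The helper `utc` and the variable names are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in July 2016 from a day number and an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2016, 7, day, hour, minute, tzinfo=timezone.utc)


# Timestamps taken from the timeline above.
incident_start = utc(10, "22:24")  # John reboots cp1
cp1_down = utc(10, "22:25")        # cp1 stops responding to pings
cp1_back = utc(11, "03:04")        # cp1 comes back online
incident_end = utc(11, "05:26")    # static remounted on mw2

incident_duration = incident_end - incident_start
cp1_downtime = cp1_back - cp1_down

print(incident_duration)  # 7:02:00
print(cp1_downtime)       # 4:39:00
```

By this calculation the incident lasted 7 hours 2 minutes in total, of which cp1 itself was unreachable for 4 hours 39 minutes.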

Quick facts

  • NFS is not HA-friendly. This is known, this is bad.
  • cp1 is mostly a throw-away server as it hosts Varnish, but hosting NFS makes it a critical service. This is bad.
  • NFS is easy to failover to an old backup in Bacula if necessary.

Conclusions

  • The incident was not caused by Miraheze and was not preventable by Miraheze.
  • John's reboot may not have been the best action, but it put the situation on our own terms, both in knowing about the outage and in handling it.

Actionables

  • T471: Document and Evaluate NFS failover terms

Meta

  • Who responded to this incident? John, Southparkfan.
  • What services were affected? cp1 (NFS), mw1/mw2 (MediaWiki serving).
  • Who, therefore, needs to review this report? John (misc. operations stuff)
  • Timestamp: 00:09, 11 July 2016 (UTC)