Tech:Incidents/2017-06-NFS

DRAFT This draft is incomplete and any people involved with this issue are free to edit and/or add anything to this report

503 Backend Fetch Failed, 504, and 502 errors were reported at various times, and uploads did not work for the entire period described below.

Summary

 * What services were affected?
 * NFS, Tech:MediaWiki appservers, Varnish cache proxies
 * How long was there a visible outage?
 * Uploads did not work from 2017-06-15 14:14 UTC until 2017-06-26 23:20 UTC (about 11 days)
 * What were the response times of each sysadmin?
 * Southparkfan responded at 18:40 UTC and restarted php5-fpm on mw*
 * NDKilla restarted nfs-kernel-server on cp1 on 2017-06-26 at 23:20 UTC
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * It's unknown at this time what caused the NFS server on cp1.miraheze.org to be stopped
 * Was the incident aggravated by human contact, users or investigating?
 * Does not seem to be aggravated in any way.
 * How could response time be improved?
 * The only people with access to resolve the issue were NDKilla and Southparkfan
 * Nobody seemed to understand what the underlying issue was. People only noticed the problems with uploads
 * It wasn't until right before the issue was resolved that NDKilla realized what the underlying issue was
 * As soon as the NFS server was restarted and the mounts were redone, all was well

Timeline
All times are in UTC.


 * June 15
 * 17:12 - Icinga reports that mw2 is down on cp1 (no more uploads were possible after that time)
 * 18:12 - Icinga reports both mw1 and mw2 down on both cp1 and cp2
 * 17:09 Reception123: notices that Miraheze is experiencing frequent 503s and checks all log files
 * 18:25 Reception123: emails staff (mainly SPF) about issues, for immediate response
 * 18:40 Southparkfan: restarted php5-fpm on mw1
 * 18:41 Southparkfan: before previous action, stopped php5-fpm on mw2


 * June 20
 * 16:49 Reception123: sees that /tmp is full on mw2 while investigating and wiped /tmp on mw2


 * June 25
 * 20:07 NDKilla: kill -9'd 3 procs on mw1 and 2 procs on mw2 after puppet hung
 * 20:08 NDKilla: restarted mw* via terminal


 * June 26
 * 23:20 NDKilla: started nfs-kernel-server on cp1

Note: These are the times of actual actions. A lot of time was spent investigating the issue without reaching a resolution.

Conclusions
Something caused the NFS server on cp1 to stop, which in turn caused the static mounts on mw1 and mw2 to stop working.
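Because a mount whose NFS server has gone away tends to hang rather than fail outright, this kind of failure can be hard to spot from the client side. The following is a minimal sketch, not something that was in use at the time, of how an appserver could test whether a mounted directory still answers within a deadline. The function name `mount_is_responsive`, the timeout value, and the example path are all hypothetical.

```python
import subprocess
import sys


def mount_is_responsive(path: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if `path` can be stat'ed within the timeout.

    The stat runs in a child process so that a hung mount blocks the
    child instead of the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means os.stat succeeded; anything else is a failure.
        return child.wait(timeout=timeout_seconds) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a hard NFS mount may
        # ignore the kill until the server comes back.
        child.kill()
        return False


if __name__ == "__main__":
    # Hypothetical mount point; replace with the real static mount path.
    target = sys.argv[1] if len(sys.argv) > 1 else "/mnt/example-static"
    print("OK" if mount_is_responsive(target) else "CRITICAL")
```

Running the check in a child process is a deliberate choice here: a direct `os.stat` call on a hung mount could block the monitoring script itself indefinitely.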

Reporting

 * What services/sites were used to report the downtime?
 * Twitter, IRC (topic), Icinga
 * What other services/sites were available for reporting, but were not used?
 * Facebook (no access apart from Southparkfan and NDKilla)

Actionables

 * Investigate NFS issue further
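
One possible starting point for that investigation, offered only as a sketch, is a probe that checks whether the NFS service on the server is accepting TCP connections at all. It assumes NFS is served over TCP on the standard port 2049; the function name `nfs_port_open` and the default host in the usage section are illustrative, and nothing here reflects a check that Miraheze actually deployed.

```python
import socket
import sys

NFS_PORT = 2049  # standard NFS port; assumed to be what the server uses


def nfs_port_open(host: str, port: int = NFS_PORT,
                  timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Host taken from the report; pass another host as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "cp1.miraheze.org"
    print("OK" if nfs_port_open(host) else "CRITICAL: NFS port unreachable")
```

A successful connection only shows that something is listening on the port, so this complements rather than replaces a check of the mounts on the clients.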

Meta

 * Who responded to this incident?
 * Reception123, NDKilla, Southparkfan, Labster
 * What services were affected?
 * NFS, Tech:MediaWiki appservers, Varnish cache proxies
 * Who, therefore, needs to review this report?
 * Site Reliability Engineering
 * Timestamp: NDKilla, 17:31, 13 August 2017 (UTC)