Tech:Incidents/2017-06-NFS

DRAFT: 503 (Backend Fetch Failed), 504 and 502 errors were reported at various times, and uploads did not work for the entire period described below.

Summary

 * What services were affected?
 * Mw1, Mw2, Cp1, Cp2, NFS
 * How long was there a visible outage?
 * Uploads did not work from 2017-06-15 14:14 UTC until 2017-06-26 3:58 UTC (10 days)
 * What were the response times of each sysadmin?
 * Southparkfan responded at 18:40 UTC and restarted php5-fpm on mw*
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Unknown
 * Was the incident aggravated by human contact, users or investigating?
 * Does not seem to be aggravated in any way.
 * How could response time be improved?
 * 

Timeline
All times are in UTC.


 * June 15
 * 17:09 Reception123: notices that Miraheze is experiencing frequent 503s and checks all log files
 * 17:12 - Icinga reports that mw2 is down on cp1 (no more uploads were possible after that time)
 * 18:12 - Icinga reports both mw1 and mw2 down on both cp1 and cp2
 * 18:25 Reception123: emails staff (mainly SPF) about the issues, asking for an immediate response
 * 18:40 Southparkfan: restarted php5-fpm on mw1
 * 18:41 Southparkfan: stopped php5-fpm on mw2 (done before the previous action)


 * June 20
 * 16:49 Reception123: sees that /tmp is full on mw2 while investigating, and wipes it


 * June 25
 * 20:07 NDKilla: ran kill -9 on 3 processes on mw1 and 2 processes on mw2 after Puppet hung
 * 20:08 NDKilla: restarted mw* via terminal

Note: only the exact actions taken are listed here. Other work, such as investigations that produced no result, was also carried out but is not recorded.

Conclusions
There was, and possibly still is, an issue with the NFS system.

Reporting

 * What services/sites were used to report the downtime?
 * Twitter, IRC (topic), Icinga
 * What other services/sites were available for reporting, but were not used?
 * Facebook (only Southparkfan and NDKilla have access)

Actionables

 * Investigate NFS issue further, and check if there are still any errors
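
As a possible starting point for follow-up checks, the sketch below shows one way to flag a nearly full filesystem, such as the full /tmp found on mw2 on June 20. It is an illustration only and not a check that was in place during the incident: the function names and the 90% threshold are assumptions, and it uses only Python's standard `shutil.disk_usage`.

```python
import shutil


def usage_percent(path):
    """Return the percentage of used space on the filesystem containing path."""
    total, used, _free = shutil.disk_usage(path)
    if total == 0:
        # Some pseudo-filesystems report a total size of zero.
        return 0.0
    return 100.0 * used / total


def is_nearly_full(path, threshold=90.0):
    """Return True if usage is at or above threshold (hypothetical default: 90%)."""
    return usage_percent(path) >= threshold


# Example use (paths are illustrative):
#   for path in ("/tmp",):
#       if is_nearly_full(path):
#           print(f"WARNING: {path} is {usage_percent(path):.1f}% full")
```

The same function could be pointed at the NFS mount point, though a hung NFS mount may cause the call itself to block, so a check like this would not replace the further investigation listed above.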

Meta

 * Who responded to this incident?
 * Reception123, NDKilla, Southparkfan, Labster
 * What services were affected?
 * Mw1, Mw2, Cp1, Cp2, NFS
 * Who, therefore, needs to review this report?
 * All Mw-Admins and Operation members
 * Timestamp: ...