Tech:Incidents/2017-06-NFS

DRAFT: This report is incomplete. Anyone involved with this issue is free to edit it or add to it.

503 Backend Fetch Failed, 504 and 502 errors were reported at various times, and uploads did not work for the entire period described below.

Summary

 * What services were affected?
 * Mw1, Mw2, Cp1, Cp2, NFS
 * How long was there a visible outage?
 * Uploads did not work from 2017-06-15 14:14 UTC until 2017-06-26 03:58 UTC (about 10 days)
 * What were the response times of each sysadmin?
 * Southparkfan responded at 18:40 UTC and restarted php5-fpm on mw*
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Unknown
 * Was the incident aggravated by human contact, users or investigating?
 * Does not seem to be aggravated in any way.
 * How could response time be improved?
 * 

Timeline
All times are in UTC.


 * June 15
 * 17:09 Reception123: notices that Miraheze is experiencing frequent 503s and checks all log files
 * 17:12 Icinga: reports that mw2 is down on cp1 (no uploads were possible after that time)
 * 18:12 Icinga: reports that both mw1 and mw2 are down on both cp1 and cp2
 * 18:25 Reception123: emails staff (mainly Southparkfan) about the issues, asking for an immediate response
 * 18:40 Southparkfan: restarted php5-fpm on mw1
 * 18:41 Southparkfan: before the previous action, stopped php5-fpm on mw2


 * June 20
 * 16:49 Reception123: while investigating, finds that /tmp is full on mw2 and wipes it


 * June 25
 * 20:07 NDKilla: killed 3 processes on mw1 and 2 processes on mw2 with kill -9 after Puppet hung
 * 20:08 NDKilla: restarted mw* via terminal

Note: only the exact actions taken are listed above. Other work, such as investigations that produced no result, was carried out in addition to the actions recorded.
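
The durations quoted in this report can be recomputed from the recorded timestamps. The following is a minimal sketch in Python, assuming the timestamps in the Summary and Timeline are accurate. Taking the 17:12 Icinga alert as the start of the response clock is an assumption made for illustration; the report does not define when the clock starts. The helper name `utc` is hypothetical.

```python
from datetime import datetime, timezone


def utc(text):
    """Parse a 'YYYY-MM-DD HH:MM' string as a timezone-aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Upload outage window as stated in the Summary.
upload_outage = utc("2017-06-26 03:58") - utc("2017-06-15 14:14")

# Time from the first Icinga alert (assumed start) to the first sysadmin action.
first_response = utc("2017-06-15 18:40") - utc("2017-06-15 17:12")

print("Upload outage:", upload_outage)
print("First response after alert:", first_response)
```

Under these assumptions the upload outage comes to 10 days, 13 hours and 44 minutes, and the first recorded sysadmin action came 1 hour and 28 minutes after the first alert.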

Conclusions
There was/is an issue with the NFS system.

Reporting

 * What services/sites were used to report the downtime?
 * Twitter, IRC (topic), Icinga
 * What other services/sites were available for reporting, but were not used?
 * Facebook (no access apart from Southparkfan and NDKilla)

Actionables

 * Investigate NFS issue further, and check if there are still any errors
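
One condition found during the investigation, a full /tmp on mw2, is simple to check for routinely. The following is a rough sketch in Python and not an existing Miraheze check. The function names and the 90% threshold are assumptions chosen for illustration, and the script would have to be run on each server of interest.

```python
import shutil


def usage_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold=90.0):
    """Return True if usage is at or above `threshold` percent (assumed limit)."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    path = "/tmp"  # the directory found full on mw2 on June 20
    state = "WARNING" if is_nearly_full(path) else "OK"
    print(f"{state}: {path} is {usage_percent(path):.1f}% full")
```

This reports only on disk usage. It does not detect a hung or unresponsive NFS mount, which would need a separate check.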

Meta

 * Who responded to this incident?
 * Reception123, NDKilla, Southparkfan, Labster
 * What services were affected?
 * Mw1, Mw2, Cp1, Cp2, NFS
 * Who, therefore, needs to review this report?
 * All Mw-Admins and Operations members
 * Timestamp: ...