Tech:Incidents/2016-05-23-mw1
A crash of php5-fpm on mw1 caused 67 minutes of partial site outage.
Timeline
- 00:00 mw1 php5-fpm crashes due to OOM (see below for dmesg)
- 01:07 revi restarts HHVM, mw1 recovers
Conclusions
- php5-fpm crashed because mw1 ran out of memory and the kernel OOM killer terminated its processes:
May 23 00:00:19 mw1 kernel: [37297076.453605] Out of memory in UB 41075: OOM killed process 25870 (php5-fpm) score 83 vm:767020kB, rss:64936kB, swap:0kB
(...)
May 23 00:01:02 mw1 kernel: [37297119.671165] Out of memory in UB 41075: OOM killed process 27488 (php5-fpm) score 8 vm:676492kB, rss:6328kB, swap:0kB
May 23 00:01:02 mw1 kernel: [37297119.689533] OOM killer in rage, 1 tasks killed
May 23 00:01:02 mw1 kernel: [37297119.690122] Out of memory in UB 41075: OOM killed process 27489 (php5-fpm) score 9 vm:676496kB, rss:6748kB, swap:0kB
- John had removed the Varnish health checks because they were causing issues, and they were never re-enabled. As a result, Varnish did not attempt to depool mw1.
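To see how much memory the killed php5-fpm processes held, the OOM killer lines in the dmesg excerpt above can be summarized with a short script. This is only a minimal sketch: the regular expression is derived from the three lines quoted in this report, and the names `OOM_RE`, `parse_oom_kills` and `SAMPLE` are hypothetical, not part of any existing tooling.

```python
import re

# Pattern derived from the OOM killer lines quoted in this report (assumption:
# other kernels or configurations may log a different format).
OOM_RE = re.compile(
    r"OOM killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"score (?P<score>\d+) "
    r"vm:(?P<vm>\d+)kB, rss:(?P<rss>\d+)kB, swap:(?P<swap>\d+)kB"
)


def parse_oom_kills(log_text):
    """Return one dict per OOM kill found in log_text."""
    kills = []
    for m in OOM_RE.finditer(log_text):
        kills.append({
            "pid": int(m["pid"]),
            "name": m["name"],
            "score": int(m["score"]),
            "vm_kb": int(m["vm"]),
            "rss_kb": int(m["rss"]),
            "swap_kb": int(m["swap"]),
        })
    return kills


# The three kill lines from the excerpt, copied verbatim.
SAMPLE = """\
May 23 00:00:19 mw1 kernel: [37297076.453605] Out of memory in UB 41075: OOM killed process 25870 (php5-fpm) score 83 vm:767020kB, rss:64936kB, swap:0kB
May 23 00:01:02 mw1 kernel: [37297119.671165] Out of memory in UB 41075: OOM killed process 27488 (php5-fpm) score 8 vm:676492kB, rss:6328kB, swap:0kB
May 23 00:01:02 mw1 kernel: [37297119.690122] Out of memory in UB 41075: OOM killed process 27489 (php5-fpm) score 9 vm:676496kB, rss:6748kB, swap:0kB
"""

if __name__ == "__main__":
    for kill in parse_oom_kills(SAMPLE):
        print(kill["pid"], kill["name"], kill["rss_kb"], "kB rss")
```

Running the same function over the full kernel log rather than the excerpt would show every process killed during the incident, which may help with the memory usage actionable.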
Actionables
- Re-add the Varnish health checks
- Investigate how to lower memory usage on the appservers
Meta
- Incident handled by: revi
- Affected services: site (~50%, everything that was not cached in Varnish)
- Signature: Southparkfan (talk) 16:28, 23 May 2016 (UTC)