Tech:Incidents/2016-05-23-mw1

A crash of php5-fpm on mw1 caused a partial site outage lasting 67 minutes.

Timeline

  • 00:00 php5-fpm on mw1 crashes after running out of memory (OOM); see the kernel log under Conclusions
  • 01:07 revi restarts HHVM, mw1 recovers

Conclusions

  • php5-fpm crashed because mw1 ran out of memory and the kernel OOM killer terminated its processes:
May 23 00:00:19 mw1 kernel: [37297076.453605] Out of memory in UB 41075: OOM killed process 25870 (php5-fpm) score 83 vm:767020kB, rss:64936kB, swap:0kB
(...)
May 23 00:01:02 mw1 kernel: [37297119.671165] Out of memory in UB 41075: OOM killed process 27488 (php5-fpm) score 8 vm:676492kB, rss:6328kB, swap:0kB
May 23 00:01:02 mw1 kernel: [37297119.689533] OOM killer in rage, 1 tasks killed
May 23 00:01:02 mw1 kernel: [37297119.690122] Out of memory in UB 41075: OOM killed process 27489 (php5-fpm) score 9 vm:676496kB, rss:6748kB, swap:0kB
  • John had previously removed the Varnish health checks because they were causing issues, and they were never re-enabled. As a result, Varnish did not attempt to depool mw1 when php5-fpm went down.
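
For follow-up analysis, the OOM lines quoted above can be summarised with a short script. This is only a sketch: the function name parse_oom_kills and the SAMPLE list are made up for illustration, and the pattern is written against the exact line format shown in this report, so it may need adjusting for other kernels.

```python
import re

# Matches the "OOM killed process ..." kernel lines quoted in this report.
# The field layout is taken from those lines only (assumption: other kernels
# may format OOM messages differently).
OOM_RE = re.compile(
    r"OOM killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"score (?P<score>\d+) "
    r"vm:(?P<vm_kb>\d+)kB, rss:(?P<rss_kb>\d+)kB, swap:(?P<swap_kb>\d+)kB"
)


def parse_oom_kills(lines):
    """Return one dict per OOM kill found in the given log lines."""
    kills = []
    for line in lines:
        match = OOM_RE.search(line)
        if match is None:
            continue  # e.g. the "OOM killer in rage" line carries no process data
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            "score": int(match.group("score")),
            "vm_kb": int(match.group("vm_kb")),
            "rss_kb": int(match.group("rss_kb")),
            "swap_kb": int(match.group("swap_kb")),
        })
    return kills


# The three kill lines and the "in rage" line from the log excerpt above.
SAMPLE = [
    "May 23 00:00:19 mw1 kernel: [37297076.453605] Out of memory in UB 41075: "
    "OOM killed process 25870 (php5-fpm) score 83 vm:767020kB, rss:64936kB, swap:0kB",
    "May 23 00:01:02 mw1 kernel: [37297119.671165] Out of memory in UB 41075: "
    "OOM killed process 27488 (php5-fpm) score 8 vm:676492kB, rss:6328kB, swap:0kB",
    "May 23 00:01:02 mw1 kernel: [37297119.689533] OOM killer in rage, 1 tasks killed",
    "May 23 00:01:02 mw1 kernel: [37297119.690122] Out of memory in UB 41075: "
    "OOM killed process 27489 (php5-fpm) score 9 vm:676496kB, rss:6748kB, swap:0kB",
]

if __name__ == "__main__":
    for kill in parse_oom_kills(SAMPLE):
        print(kill["pid"], kill["name"], kill["rss_kb"], "kB rss")
```

Feeding the full dmesg output into parse_oom_kills would show how many php5-fpm processes were killed and how much resident memory each held at the time.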

Actionables

  • Re-add the Varnish health checks
  • See how we can lower memory usage on the appservers
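
To illustrate what re-added health checks would have done here: a Varnish backend probe counts a backend as healthy only while enough of its most recent polls succeeded (the probe's window and threshold settings). The sketch below models that rule in Python. It is not Varnish code; the class name ProbeWindow and the values window=5 and threshold=3 are assumptions for illustration, and details such as the probe's initial state are left out.

```python
from collections import deque


class ProbeWindow:
    """Simplified model of a sliding-window health check (illustrative only)."""

    def __init__(self, window=5, threshold=3):
        # window: how many recent polls are remembered (assumed value)
        # threshold: how many of them must have succeeded (assumed value)
        self.threshold = threshold
        self.results = deque(maxlen=window)

    def record(self, ok):
        """Store the outcome of one poll; the oldest outcome drops out."""
        self.results.append(bool(ok))

    def healthy(self):
        """True while at least `threshold` remembered polls succeeded."""
        return sum(self.results) >= self.threshold


if __name__ == "__main__":
    probe = ProbeWindow()
    for _ in range(5):
        probe.record(True)   # backend answering normally
    print("before crash:", probe.healthy())
    for _ in range(3):
        probe.record(False)  # backend stops answering, as mw1 did
    print("after crash:", probe.healthy())
```

With a rule like this in place, a backend that stops answering is marked unhealthy after a few failed polls and no longer receives traffic, instead of staying pooled until someone intervenes manually.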

Meta

  • Incident handled by: revi
  • Affected services: site (~50%, everything that was not cached in Varnish)
  • Signature: Southparkfan (talk) 16:28, 23 May 2016 (UTC)