Tech:Incidents/2019-01-11-redis-down

DRAFT: A systemd configuration change that should have been applied long ago was missing, and an unknown factor finally caused redis to fail. Redis and the services that depend on it were broken for about 35 minutes.

Summary

Provide a summary of the incident:

  • What services were affected?
    • All services dependent on redis. (MediaWiki sessions/login + JobRunner)
  • How long was there a visible outage?
    • From 14:17 until 14:52, about 35 minutes.
  • Was it caused by human error, supply/demand issues or something unknown currently?
    • Initially human error: a change to the systemd unit should have been introduced earlier. However, redis ran fine for months without that change, so something unknown finally triggered the error.
  • Was the incident aggravated by human contact, users or investigating?
    • No.

Timeline

  • 14:15: paladox reboots misc2 to free up RAM, which was full.
  • 14:17: redis cannot save its database to disk (read-only file system) and therefore refuses to save keys.
  • 14:36: Southparkfan notices login is broken
  • 14:43: Southparkfan assumes redis is full and introduces patches to MediaWiki core to decrease memory usage
  • 14:46: Southparkfan notices the actual issue is redis not being able to write to the database
  • 14:52: Southparkfan disables puppet, patches the systemd unit file and restarts redis. Redis is back online with the database more or less intact.

Conclusions

  • A missing configuration change (ReadWriteDirectories in the systemd unit file) should have been applied by June 2018 at the latest, when paladox enabled syncing the database to disk every 60 seconds.
  • For an unknown reason, there were no issues until more than 6 months later, when paladox rebooted misc2 and redis finally refused to save keys to its cache.
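
For illustration, the kind of change described above could look like the following systemd drop-in. This is only a sketch, not the patch that was actually applied: the unit name (redis-server), the drop-in path and the data directory (/var/lib/redis) are assumptions, and it presumes the unit sandboxes redis so that the service sees its data directory as read-only. Redis by default rejects writes after a failed background save (stop-writes-on-bgsave-error yes), which would match the refusal to save keys seen here.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/redis-server.service.d/override.conf
# The unit name and the directory below are assumptions, not taken from the incident.
[Service]
# Let the service write its database snapshot to the data directory
# even when the rest of the file system is mounted read-only for it.
# Newer systemd versions name this directive ReadWritePaths=.
ReadWriteDirectories=/var/lib/redis
```

A change like this only takes effect after `systemctl daemon-reload` and a restart of the redis service. If puppet manages the unit file, the change would also have to go into puppet, otherwise the next puppet run could revert it.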


Actionables

Meta

  • Who responded to this incident?
    • Paladox and Southparkfan
  • What services were affected?
    • MediaWiki (sessions/login) and JobRunner.
  • Who, therefore, needs to review this report?
    • John or Southparkfan.
  • Timestamp.
    • <yet to sign>