A faulty SSL change (plus another cert/key mismatch) and the confusion that followed triggered a rare systemd fault, crashing stunnel4 on cp4 (which broke connectivity between Varnish and the MediaWiki servers) and leading to more than 1.5 hours of downtime for users accessing Miraheze wikis via cp4.
- What services were affected?
- All wikis for users accessing Miraheze wikis via cp4. Traffic going via cp2 was served fine.
- How long was there a visible outage?
- 2018-02-22 15:38 UTC until 17:16 UTC
- What was/were the response times by each Operations member?
- Reception123 responded at 15:37 on IRC, but because 503 errors appear at random times anyway, did not realize this was a persistent error
- Southparkfan responded at 17:12 on IRC, assisting Reception123 during troubleshooting
- Was it caused by human error, supply/demand issues or something unknown currently?
- Partially unknown. While the faulty change (and the cert/key mismatch) was clearly human error, the actual systemd fault remains a mystery.
- Was the incident aggravated by human contact, users, or investigation?
- How could response time be improved?
- Operations should not have ignored the Icinga alarms for cp4.
All times are in UTC.
- 15:26: Reception123 merged a bad change in puppet: the associated private key should have been pushed to the private git repo first, but was not (see the keypair check sketched after this timeline).
- 15:31: puppet runs on cp4. Coincidentally varnishd crashed (OOM) during this puppet run as well. We are seeing two problems at this point:
Feb 21 15:31:15 cp4 puppet-agent: (/Stage[main]/Ssl::Hiera/Ssl::Hiera::Certs[fikcyjnatv]/File[fikcyjnatv.pl_private]) Could not evaluate: Could not retrieve information from environment production source(s) puppet:///ssl-keys/fikcyjnatv.pl.key
Feb 21 15:31:30 cp4 puppet-agent: (/Stage[main]/Varnish::Nginx/Exec[nginx-syntax]/returns) nginx: [emerg] SSL_CTX_use_PrivateKey_file("/etc/ssl/private/www.reviwiki.info.key") failed (SSL: error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch)
- 15:34: First Icinga alarm for cp4:
PROBLEM - Varnish Backends on cp4 is CRITICAL: 3 backends are down: mw1 mw2 mw3. This may have been unrelated (see the next entry).
- 15:36: Second Icinga alarm for cp4:
PROBLEM - HTTP 4xx/5xx ERROR Rate on cp4 is CRITICAL: CRITICAL - NGINX Error Rate is 94%. Even though 94% is unusually high, it is possible that this alarm was unrelated and that the backends were down due to one of the frequently occurring 503 storms.
- 15:38: Reception123 performs a puppet run on cp4
- 15:38: Reception123 manually reloads nginx (while the above errors were still unfixed), which triggered a systemd error (cause unknown) that crashed stunnel4. After this point, cp4 was unusable:
Feb 21 15:38:15 cp4 puppet-agent: Finished catalog run in 5.88 seconds
Feb 21 15:38:23 cp4 systemd: Reloading LSB: Stop/start nginx.
Feb 21 15:38:23 cp4 systemd: Failed to reset devices.list on /system.slice/stunnel4.service: No such file or directory
Feb 21 15:38:23 cp4 nginx: Reloading nginx: nginx.
Feb 21 15:38:23 cp4 systemd: Reloaded LSB: Stop/start nginx.
- 16:31: Southparkfan mentions strange 503 errors (persisting for longer than usual) on IRC
- 17:12-17:15: Reception123 and Southparkfan are troubleshooting the 503 errors
- 17:15: Reception123 restarts stunnel4; the site comes back online
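The www.reviwiki.info mismatch seen at 15:31 could have been caught before the merge with a simple keypair consistency check. Below is a minimal sketch in Python, assuming the 'cryptography' package is available; the certificate path is an illustrative assumption (only the key path appears in the logs).

from cryptography import x509
from cryptography.hazmat.primitives.serialization import load_pem_private_key

def keypair_matches(cert_path, key_path):
    # True when the certificate's public key corresponds to the private key,
    # i.e. exactly the condition nginx's X509_check_private_key complained about.
    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    with open(key_path, "rb") as f:
        key = load_pem_private_key(f.read(), password=None)
    # Comparing the public numbers is sufficient for RSA/EC keys.
    return cert.public_key().public_numbers() == key.public_key().public_numbers()

# Example paths; the certificate location is assumed for illustration.
print(keypair_matches("/etc/ssl/certs/www.reviwiki.info.crt",
                      "/etc/ssl/private/www.reviwiki.info.key"))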
Quick facts
- Two SSL keypairs were wrong, but puppet runs an nginx syntax check before reloading, so nginx was not reloaded; this is expected behaviour.
- Reception123 forced a reload on cp4, which triggered the 'Failed to reset devices.list' error and crashed stunnel4.
- The cause of the error mentioned above is unknown, but it should definitely not be triggered just by reloading nginx, regardless of configuration issues.
- Health checks in gdnsd did not catch the cp4 failure.
- Operations did not investigate the Icinga alarms that told them something was really wrong with cp4.
- The procedure for managing SSL keypairs must be revised thoroughly to prevent configuration mistakes.
- It may be required to run a configtest ('sudo nginx -t') before attempting to touch the nginx service (reload, restart); see the first sketch after this list.
- Operations should not ignore the Icinga alarms for cache proxies. While they show up frequently, this time they didn't go away after a few minutes, which is a sign something is broken.
- Our gdnsd health checks should check for a valid HTTP response, not just whether a basic TCP connection succeeds; see the second sketch after this list.
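A minimal sketch of the configtest-before-reload idea above, in Python; it assumes passwordless sudo for the two commands and is an illustration, not part of our puppet code.

import subprocess
import sys

def safe_reload_nginx():
    # 'nginx -t' parses the full configuration, including SSL keypairs, and
    # exits non-zero on errors such as a certificate/key mismatch.
    check = subprocess.run(["sudo", "nginx", "-t"], capture_output=True, text=True)
    if check.returncode != 0:
        sys.stderr.write(check.stderr)
        sys.stderr.write("configtest failed; refusing to reload nginx\n")
        return False
    # Only touch the service once the configuration is known to be valid.
    subprocess.run(["sudo", "service", "nginx", "reload"], check=True)
    return True

sys.exit(0 if safe_reload_nginx() else 1)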
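To make the gdnsd point concrete, here is a small Python sketch contrasting a bare TCP check with an HTTP-level check. This is illustration only, not gdnsd configuration; the hostname, port and path are assumptions.

import http.client
import socket

HOST, PORT, PATH = "cp4.miraheze.org", 443, "/"  # assumed values for the example

def tcp_alive(host, port, timeout=3):
    # Succeeds as long as something accepts the connection, even if the proxy
    # behind it can only serve 503s (as during this incident).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_healthy(host, port, path, timeout=3):
    # Additionally requires a sane status code, so a proxy that only returns
    # 5xx responses is treated as down and can be depooled.
    try:
        conn = http.client.HTTPSConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return status < 500
    except (OSError, http.client.HTTPException):
        return False

print("tcp:", tcp_alive(HOST, PORT), "http:", http_healthy(HOST, PORT, PATH))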
- What services/sites were used to report the downtime?
- None at the right time.
- What other services/sites were available for reporting, but were not used?
- Adjust gdnsd health check - Done #1 #2 #3
- Revise the procedure for managing SSL keypairs - Not Done
- Reinstall cp4 if needed - Not Done
- May want to try to reproduce this error again (but this time while cp4 is properly depooled) - Not Done
- Who responded to this incident?
- Reception123, Southparkfan
- What services were affected?
- All wikis for users accessing Miraheze via cp4
- Who, therefore, needs to review this report?
- All Operations members
- Timestamp: 00:16, 1 March 2018 (UTC)