Tech:Incidents/2018-02-22-Stunnel

A faulty SSL change (plus a separate cert/key mismatch) and the confusion that followed triggered a rare systemd fault that crashed stunnel4 on cp4, cutting connectivity between Varnish and the MediaWiki servers. This led to more than 1.5 hours of downtime for users accessing Miraheze wikis via cp4.

Summary

 * What services were affected?
 * All wikis for users accessing Miraheze Wikis via cp4. Traffic going via cp2 was served fine.
 * How long was there a visible outage?
 * 2018-02-22 15:38 UTC until 17:16 UTC
 * What were the response times of each Site Reliability Engineering member?
 * Reception123 responded at 15:37 on IRC, but only because of 503 errors occurring at random times, and did not realize that this was a "permanent" error
 * Southparkfan responded at 17:12 on IRC, assisting Reception123 during troubleshooting
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Partially unknown. Although the faulty change and the cert/key mismatch were clearly human error, the cause of the actual systemd fault is still a mystery.
 * Was the incident aggravated by human contact, users or investigating?
 * No.
 * How could response time be improved?
 * Site Reliability Engineering should not have ignored the Icinga alarms for cp4.

Timeline
All times are in UTC.

Feb 21 15:31:15 cp4 puppet-agent[4019]: (/Stage[main]/Ssl::Hiera/Ssl::Hiera::Certs[fikcyjnatv]/File[fikcyjnatv.pl_private]) Could not evaluate: Could not retrieve information from environment production source(s) puppet:///ssl-keys/fikcyjnatv.pl.key
Feb 21 15:31:30 cp4 puppet-agent[4019]: (/Stage[main]/Varnish::Nginx/Exec[nginx-syntax]/returns) nginx: [emerg] SSL_CTX_use_PrivateKey_file("/etc/ssl/private/www.reviwiki.info.key") failed (SSL: error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch)
Feb 21 15:38:15 cp4 puppet-agent[5018]: Finished catalog run in 5.88 seconds
Feb 21 15:38:23 cp4 systemd[1]: Reloading LSB: Stop/start nginx.
Feb 21 15:38:23 cp4 systemd[1]: Failed to reset devices.list on /system.slice/stunnel4.service: No such file or directory
Feb 21 15:38:23 cp4 nginx[5498]: Reloading nginx: nginx.
Feb 21 15:38:23 cp4 systemd[1]: Reloaded LSB: Stop/start nginx.

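 * 15:26: Reception123 merged a bad change in puppet: the associated private key should have been pushed to the private git repo first.
 * 15:31: puppet runs on cp4. Coincidentally, varnishd crashed (OOM) during this puppet run as well. Two problems are visible at this point (see the first two log lines above): the private key for fikcyjnatv.pl could not be retrieved, and the private key for www.reviwiki.info did not match its certificate.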
 * 15:34: First Icinga alarm for cp4: . May have been unrelated (see one line below).
 * 15:36: Second Icinga alarm for cp4: . Even though 94% is unusually high, it is possible that this alarm was unrelated, and the backends were down due to the other, frequently occurring 503 storm.
 * 15:38: Reception123 performs a puppet run on cp4
 * 15:38: Reception123 manually reloads nginx (while the above errors were still unfixed), which triggers a systemd error (cause unknown) that crashes stunnel4. After this point cp4 was unusable.
 * 16:31: Southparkfan mentions strange 503 errors (occurring longer than usual) in IRC
 * 17:12-17:15: Reception123 and Southparkfan are troubleshooting the 503 errors
 * 17:15: Reception123 restarts stunnel4, site comes back online

Quick facts

 * Two SSL keypairs were wrong, but because puppet runs an nginx syntax check before reloading, nginx was not reloaded by puppet. This is the expected behaviour.
 * Reception123 forced a reload on cp4, which triggered the 'Failed to reset devices.list' error and crashed stunnel.
 * The cause of the error mentioned above is unknown, but should definitely not be triggered by just reloading nginx regardless of its configuration issues.
 * Health checks in gdnsd did not catch the cp4 failure.
 * Site Reliability Engineering did not investigate the Icinga alarms that told them something was really wrong with cp4.
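
Both keypair problems noted above (a key file that could not be retrieved, and a key that did not match its certificate) can be detected before a change reaches a cache proxy. The following is a minimal sketch using only the Python standard library, not the check puppet or nginx actually performs. The function name keypair_problem is hypothetical, and the sketch assumes unencrypted PEM files.

```python
import ssl


def keypair_problem(cert_path, key_path):
    """Return None if the pair loads cleanly, else a short problem description.

    Hypothetical helper: load_cert_chain() raises ssl.SSLError when the
    private key does not match the certificate, and a plain OSError
    (for example FileNotFoundError) when a file cannot be read.
    Assumes the key is not passphrase-protected.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except ssl.SSLError as exc:
        # Covers mismatched or malformed certificate/key material.
        return f"certificate/key problem: {exc}"
    except OSError as exc:
        # Covers missing or unreadable files.
        return f"cannot read keypair: {exc}"
    return None


if __name__ == "__main__":
    # Example paths only; the certificate path is an assumption.
    problem = keypair_problem(
        "/etc/ssl/certs/www.reviwiki.info.crt",
        "/etc/ssl/private/www.reviwiki.info.key",
    )
    print(problem or "keypair OK")
```

Running a check like this on every keypair before merging would flag both the missing-key case and the mismatch case with a readable message.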

Conclusions

 * The procedure for managing SSL keypairs must be adjusted thoroughly, to prevent configuration mistakes
 * It may be required to do a configtest ('sudo nginx -t') before attempting to touch the nginx service (reload, restart)
 * Site Reliability Engineering should not ignore the Icinga alarms for cache proxies. While they show up frequently, this time they didn't go away after a few minutes, which is a sign something is broken.
 * Our gdnsd health checks should check for a valid HTTP response, not just whether a basic TCP request gets a response.
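
The configtest conclusion above could be enforced by a small wrapper that refuses to reload nginx when 'nginx -t' fails. This is a hedged sketch, not an existing Miraheze tool: the function name reload_if_config_valid is hypothetical, and the reload command (systemctl reload nginx) is an assumption about how the service is managed on the host.

```python
import subprocess


def reload_if_config_valid(
    test_cmd=("nginx", "-t"),
    reload_cmd=("systemctl", "reload", "nginx"),  # assumed reload command
    run=subprocess.run,
):
    """Run the config test and reload only if it passes.

    Returns (reloaded, message). The `run` parameter exists so the
    function can be exercised without touching a real nginx.
    """
    check = run(list(test_cmd), capture_output=True, text=True)
    if check.returncode != 0:
        # nginx -t reports its findings on stderr; do not reload.
        return False, check.stderr.strip()
    result = run(list(reload_cmd), capture_output=True, text=True)
    return result.returncode == 0, result.stderr.strip()


if __name__ == "__main__":
    reloaded, message = reload_if_config_valid()
    print("reloaded" if reloaded else f"not reloaded: {message}")
```

With the keypair errors from this incident still present, the config test would fail and the reload step would never run.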

Reporting

 * What services/sites were used to report the downtime?
 * None at the right time.
 * What other services/sites were available for reporting, but were not used?
 * N/A

Actionables

 * Adjust gdnsd health check - ✅ #1 #2 #3
 * Revise the procedure for managing SSL keypairs? ❌
 * Reinstall cp4 if needed ❌
 * May want to try to reproduce this error again (but this time while cp4 is properly depooled) ❌
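
To illustrate why the gdnsd health check needed adjusting: during this outage nginx on cp4 still accepted connections, so a TCP-only check passes while an HTTP-aware check fails. The sketch below is not gdnsd configuration. Both function names are hypothetical, and it uses plain HTTP for brevity where a production check would likely use HTTPS.

```python
import http.client
import socket


def tcp_check(host, port, timeout=3.0):
    """Pass if a TCP connection can be opened (the weaker check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host, port, path="/", timeout=3.0, ok_statuses=range(200, 400)):
    """Pass only if the server answers with an acceptable HTTP status."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
    # A 503 from the proxy counts as a failure here.
    return status in ok_statuses
```

A proxy that is up but answers every request with a 503 would pass tcp_check and fail http_check, which is the distinction the health check has to make.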

Meta

 * Who responded to this incident?
 * Reception123, Southparkfan
 * What services were affected?
 * All wikis for users accessing Miraheze using cp4
 * Who, therefore, needs to review this report?
 * All Site Reliability Engineering members
 * Timestamp: 00:16, 1 March 2018 (UTC)