Tech:Icinga/MediaWiki Monitoring

This page is used to provide monitoring guidance for MediaWiki-related services if an alert goes off.

JobChron Service

 * Why does this check exist? What does it monitor? This check monitors whether the JobChron service is up and running.
 * Is an alert a bad thing? Yes, JobChron should always be running on the redis servers.
 * If its warning/critical, how do we fix it? It may be possible to resolve by either manually starting the JobChron service, or restarting it . You can also try and see why it failed to start by viewing the service status . If you don't have access to the relevant server(s), a task on Phabricator should be filed and those with access should be alerted to be able to assist.

Redis Service

 * Why does this check exist? What does it monitor? This check simply monitors whether the Redis service is up and running.
 * Is an alert a bad thing? Yes, Redis should always be running.
 * If it's warning/critical, how do we fix it? A task on Phabricator should be filed immediately and everyone (including Infrastructure) should be pinged to investigate why it is down and resolve.

PHP-FPM

 * Why does this check exist? What does it monitor? This check monitors whether PHP-FPM processes are up and running.
 * Is an alert a bad thing? Yes, PHP-FPM should always be running.
 * If it's warning/critical, how do we fix it? It may be possible to resolve by either manually starting the php-fpm service, or restarting it . You can also try and see why it failed to start, by viewing the service status . If you are unable to resolve the issue, then a Phabricator task, triaged as "Unbreak Now" should be filed, and the relevant technical team members alerted to assist. If this alert is given, chances are users may receive errors on their end with PHP-FPM down. If that's the case, an incident report should also be filed for the issue.

Memcached

 * Why does this check exist? What does it monitor? This check simply monitors whether the memcached service is up and running.
 * Is an alert a bad thing? Yes, memcached should always be running.
 * If it's warning/critical, how do we fix it? A Phabricator task should be created, triaged as "Unbreak Now" until the issue is resolved. While in theory the MediaWiki Team manages, since only Reception123 has access, Infrastructure should also be alerted to be able to assist.

Nutcracker

 * Why does this check exist? What does it monitor? This check monitors whether the nutcracker service is running.
 * Is an alert a bad thing? Yes, nutcracker should always be running.
 * If it's warning/critical, how do we fix it? It may be possible to resolve by either manually starting the nutcracker service, or restarting it . You can also try to see why it failed to start, by viewing the service status . If you are unable to immediately resolve this, a Phabricator task should be created, triaged as "Unbreak Now" until the issue is resolved.

JobRunner Service

 * Why does this check exist? What does it monitor? This check monitors whether the jobrunner process is running correctly.
 * Is an alert a bad thing? Yes, unless done intentionally by a member of SRE, jobrunner should be running on all designated servers at all times.
 * If it's warning/critical, how do we fix it? If the jobrunner is not running on a server, it can be restarted by doing . If there is an error when restarting, that must be investigated. You can try to see what error occured using  .

MediaWiki Rendering

 * Why does this check exist? What does it monitor? This check monitors whether MediaWiki is accessible.
 * Is an alert a bad thing? Yes, it means MediaWiki may be down, and users may be receiving errors on their end.
 * If it's warning/critical, how do we fix it? If the issue persists and users are reporting errors such as 502 or 503, then a Phabricator task should be created, triaged as "Unbreak Now" until the issue is resolved for users. To attempt to resolve it, restarting PHP-FPM could help under some circumstances. If you have root access, you could also try restarting the relevant server(s). If that still doesn't work, try looking at logs, to determine the cause of the outage. Chances are it will resolve itself after some time if you're unable to. If the outage was user-facing, an incident report should also be filed for it.

MirahezeRenewSSL

 * Why does this check exist? What does it monitor? This check monitors whether the SSL renewal services is up and running.
 * Is an alert a bad thing? Yes, this must be investigated.
 * If it's warning/critical, how do we fix it? If the service is down it should be restarted. If it cannot be restarted/is not functional, that should be further investigated.

SSL Validity Checks

 * Why does this check exist? What does it monitor? This check monitors whether SSL certificates are set to expire in the next 15 days.
 * Is an alert a bad thing? Usually no, MirahezeSSLBot will renew non-wildcard certs. If it is CRITICAL, however, there might be a problem with the bot, so that needs reporting.
 * If it's warning/critical, how do we fix it? If it is critical (and non-wildcard) the reason for why the bot is not auto-generating should be investigated. If it cannot be figured out in time, the certs should be manually renewed. For wildcards certs, they need to be renewed manually in any case once there is a warning.

Reverse DNS Checks

 * Why does this check exist? What does it monitor? This check performs a reverse DNS lookup to determine domains that have been pointed to us are still pointing to us
 * Is an alert a bad thing? Yes, if not just a temporary issue that resolves itself, it is a problem.
 * If it's warning/critical, how do we fix it? The person who requested the custom domain (or other wiki bureaucrats) should be notified and asked to point their domains to us again. If there is no response and/or they fail to do so in a reasonable time (not less than a week) the custom domain and SSL certificate should be removed until the issue is resolved.