Tech:Icinga/Base Monitoring

This page provides generic guidance for when an alert fires for core base monitoring. Where specific guidance exists for a service (e.g. cloud infrastructure, cache proxies, MediaWiki), a link should be added from this page to the relevant guidance in that service's documentation.

APT
If critical packages are listed for upgrade, log in to the alerting server and check which packages require upgrading. If you are able to perform the upgrades, do so. If you are not, or if the packages need further review, please open a task and assign it to the relevant service owner.
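As an illustration of checking which packages are pending, the sketch below parses the output of `apt list --upgradable`. It assumes the usual `name/suites version arch [upgradable from: old]` line format. The function names, and the rule that a suite name containing `security` marks a security update, are assumptions made for this example and are not necessarily how the Icinga check decides what is critical.

```python
import subprocess


def parse_upgradable(output):
    """Parse `apt list --upgradable` output into (name, suites, version) tuples."""
    packages = []
    for line in output.splitlines():
        # Skip the "Listing..." header and anything that is not a package line.
        if "/" not in line or "[upgradable from:" not in line:
            continue
        name_and_suites, version = line.split()[:2]
        name, suites = name_and_suites.split("/", 1)
        packages.append((name, suites.split(","), version))
    return packages


def security_upgrades(packages):
    """Assumption: a suite name containing 'security' marks a security update."""
    return [p for p in packages if any("security" in s for s in p[1])]


def list_upgradable():
    """Run apt on the local machine and return the parsed package list."""
    result = subprocess.run(
        ["apt", "list", "--upgradable"],
        capture_output=True, text=True, check=True,
    )
    return parse_upgradable(result.stdout)
```

Calling `security_upgrades(list_upgradable())` on the alerting server would then list the packages that most likely triggered the alert, under the assumption stated above.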

Current Load
Please review the demand on the server. If a server is constantly under high load, it might be time to review the demand being placed on it.

Use tools from the *top family (e.g. top, htop, iotop) to determine what might be causing the high load and whether any services running on the server are consuming an unfair share of CPU time. Disk I/O might also be causing abnormally high CPU wait times.
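A load average only means something relative to the number of CPUs on the server. The following is a minimal sketch of that comparison; the per-CPU thresholds and function names are illustrative assumptions, not the values configured in the Icinga check.

```python
import os


def classify_load(load, cpus, warn_per_cpu=1.0, crit_per_cpu=2.0):
    """Classify a load average relative to the number of CPUs.

    The per-CPU thresholds are illustrative, not the values used by the check.
    """
    per_cpu = load / cpus
    if per_cpu >= crit_per_cpu:
        return "CRITICAL"
    if per_cpu >= warn_per_cpu:
        return "WARNING"
    return "OK"


def current_load_report():
    """Return the 1, 5 and 15 minute load averages, each with a status."""
    cpus = os.cpu_count() or 1
    return [(load, classify_load(load, cpus)) for load in os.getloadavg()]
```

If the 1 minute value is high but the 15 minute value is not, the load is probably a short spike; if all three are high, the demand is sustained.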

Disk Space
First, please try to free up some space by clearing out cache files or log files, if possible. If you are not the service owner, feel free to create a Phabricator task so the service owner is aware and can investigate further.
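To find out what is worth clearing, it helps to see how full the filesystem is and which files are largest. The sketch below uses only the Python standard library; the function names are made up for this example, and the directory to scan is left to the caller.

```python
import os
import shutil


def percent_used(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed and counted twice.
                sizes.append((os.lstat(full).st_size, full))
            except OSError:
                continue  # File vanished or is unreadable; skip it.
    return sorted(sizes, reverse=True)[:limit]
```

The percentage may differ slightly from what `df` reports, because `df` accounts for blocks reserved for root.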

If additional disk space is required, please file a server resource request and seek the relevant approval to increase disk space.

If this is not a VM on cloud infrastructure, please notify Infrastructure using a generic Phabricator task so decisions can be made regarding how to resolve the alert.

NTP
An NTP time offset that immediately corrects itself is not a major concern as long as it does not recur consistently. If it does recur, or if the deviation keeps growing, the clock should be reset to the correct time (UTC), either manually or automatically through NTP.
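The distinction between a one-off offset, a recurring one, and a growing one can be sketched as a small classifier over recent offset samples. The threshold, the labels, and the rule that "growing" means every sample is larger in magnitude than the one before are assumptions chosen for illustration only.

```python
def assess_offsets(offsets_ms, threshold_ms=100.0):
    """Classify a chronological series of NTP offset samples (milliseconds).

    threshold_ms is an illustrative tolerance, not the check's real threshold.
    """
    over = [abs(o) > threshold_ms for o in offsets_ms]
    if not any(over):
        return "ok"
    magnitudes = [abs(o) for o in offsets_ms]
    # "Growing" here means at least three samples, each larger than the last.
    growing = len(magnitudes) >= 3 and all(
        later > earlier for earlier, later in zip(magnitudes, magnitudes[1:])
    )
    if growing:
        return "growing"
    # A single excursion that has already recovered is treated as transient.
    if sum(over) == 1 and not over[-1]:
        return "transient"
    return "recurring"
```

Under these assumptions, only "recurring" and "growing" would call for resetting the clock.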

PowerDNS Recursor
PowerDNS Recursor is our server-side DNS cache. It gives us quick load times for repeated DNS lookups by caching them for up to 5 minutes at a time. If this service is returning any non-OK status codes for DNS responses for miraheze.org, debugging it is a priority. The service might just require a restart, but even if a restart fixes the problem, please flag this immediately to anyone who handles DNS so they can investigate whether more thorough debugging is required.
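To see which status code the recursor is actually returning, a single DNS query can be sent to it and the response code read from the reply header. The sketch below does this with the Python standard library only. It assumes the recursor listens on 127.0.0.1 port 53 over UDP, which may not match the real configuration; the function names are hypothetical.

```python
import socket
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN",
          4: "NOTIMP", 5: "REFUSED"}


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for the A record of `name`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_rcode(response):
    """Return the response code name from the low 4 bits of the flags field."""
    _query_id, flags = struct.unpack("!HH", response[:4])
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def query_rcode(name, server="127.0.0.1", port=53, timeout=3.0):
    """Send one UDP query to `server` and return the response code name.

    The default server address is an assumption about where the recursor listens.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        response, _addr = sock.recvfrom(512)
    return parse_rcode(response)
```

Running `query_rcode("miraheze.org")` on the affected server should return `NOERROR` when the recursor is healthy; a timeout or `SERVFAIL` is worth including when flagging the issue.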

Puppet
If the alert fired because of an administrative disablement and is only a warning, no action is needed. If it is critical, it may be a good idea to ask the person who disabled Puppet whether they still need it disabled.

If there is a Puppet failure, debug the failure and attempt a fix if possible. If you are unable to debug the problem, raise a task with Infrastructure, who will assist as the general service owners of Puppet.
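When the agent has been disabled with `puppet agent --disable "reason"`, the reason is normally recorded in the agent's disable lock file, which can help identify who to ask. The sketch below assumes that file is JSON with a `disabled_message` key, and that its path has been looked up separately (for example with `puppet config print agent_disabled_lockfile`), since the location varies between installs. The function name is hypothetical.

```python
import json


def disabled_reason(lockfile):
    """Return the message recorded when the agent was disabled, or None if enabled.

    Assumes the lock file is JSON containing a "disabled_message" key.
    """
    try:
        with open(lockfile, encoding="utf-8") as handle:
            data = json.load(handle)
    except FileNotFoundError:
        return None  # No lock file: the agent is not administratively disabled.
    return data.get("disabled_message", "")
```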

SSH
If you are the service owner and are able to restart the SSH service, please do so; this should resolve the alert. If you are not able to, or are not the service owner, contact Infrastructure, who will resolve the alert.
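After a restart, one way to confirm that SSH is answering again is to connect to the port and read the server's identification line, which begins with `SSH-`. The following is a minimal sketch of such a probe, assuming the default port 22; the function name is hypothetical, and the probe ignores the case where a server sends other lines before its identification string.

```python
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Connect to `host` and return the server's identification line, or None."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(255).decode("ascii", errors="replace").strip()
    except OSError:
        return None  # Refused, unreachable, or timed out.
    return banner if banner.startswith("SSH-") else None
```

A return value of `None` means the service is still not answering as an SSH server, in which case Infrastructure should be contacted as described above.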