Tech:Icinga/Base Monitoring

This page provides generic guidance for responding to alerts raised by core base monitoring. Where specific guidance exists for a particular service (e.g. cloud infrastructure, cache proxies, MediaWiki), a link should be added from this page to the relevant guidance in that service's documentation.

APT
If critical packages are listed for upgrade, log in to the alerting server and check which packages require upgrading. If you are able to perform the upgrades, feel free to do so. If you are not, or if the packages need further review, please open a task and assign it to the relevant service owner.
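As a rough illustration of the checking step, the sketch below parses text in the format printed by `apt list --upgradable` into package names and versions. The function name `parse_upgradable` and the sample output are made up for illustration and do not come from a real server.

```python
import re

# Matches lines such as:
#   curl/stable 7.88.1-10+deb12u5 amd64 [upgradable from: 7.88.1-10+deb12u4]
LINE_RE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<new>\S+)\s+\S+\s+"
    r"\[upgradable from: (?P<old>[^\]]+)\]$"
)


def parse_upgradable(output):
    """Return (package, installed_version, candidate_version) tuples."""
    packages = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # the "Listing..." header and blank lines do not match
            packages.append((match["name"], match["old"], match["new"]))
    return packages


# Made-up sample output, for illustration only.
SAMPLE = """Listing... Done
openssl/stable-security 3.0.11-1~deb12u2 amd64 [upgradable from: 3.0.11-1~deb12u1]
curl/stable 7.88.1-10+deb12u5 amd64 [upgradable from: 7.88.1-10+deb12u4]
"""

for name, old, new in parse_upgradable(SAMPLE):
    print(f"{name}: {old} -> {new}")
```

A list like this can be pasted into the task if the upgrade needs review by the service owner.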

Current Load
Check what is generating load on the server. If the server is constantly under high load, it may be time to review the demand being placed on it.
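When judging whether load is genuinely high, it can help to normalise the load average by the number of CPU cores. The following is a minimal sketch; the threshold of 1.0 per core is an illustrative assumption and is not the threshold used by the Icinga check.

```python
import os


def load_per_core(load_average, cpu_count):
    """Normalise a load average by the number of CPU cores."""
    return load_average / max(cpu_count, 1)


def describe_load(per_core, threshold=1.0):
    """Classify a per-core load figure against an assumed threshold."""
    return "high" if per_core > threshold else "ok"


if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()  # available on Unix-like systems only
    cores = os.cpu_count() or 1
    for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
        ratio = load_per_core(value, cores)
        print(f"{label}: {value:.2f} ({ratio:.2f} per core, {describe_load(ratio)})")
```

A 15-minute figure that stays well above the core count suggests sustained demand rather than a short spike.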

[TODO: More guidance might be useful here?]

Disk Space
First, try to free up space where possible by clearing out cache files or log files. If you are not the service owner, create a Phabricator task so that the service owner is aware and can investigate further.
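To decide what to clear, it helps to know which files take up the most space. The sketch below lists the largest files under a directory; the helper name `largest_files` and the `/var/log` starting point are assumptions chosen for illustration.

```python
import os
import shutil


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                if not os.path.islink(path):  # ignore symbolic links
                    sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or cannot be read; skip it
    return sorted(sizes, reverse=True)[:limit]


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print(f"/ is {usage.used / usage.total:.0%} full")
    # /var/log is only an assumed starting point.
    for size, path in largest_files("/var/log"):
        print(f"{size / 1024 ** 2:8.1f} MiB  {path}")
```

The sketch only reports sizes and deletes nothing, so whether a given file is safe to remove remains a decision for the operator or service owner.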

If additional disk space is required, please file a server resource request and seek the relevant approval to increase disk space.

If this is not a VM on cloud infrastructure, please notify Infrastructure using a generic Phabricator task so decisions can be made regarding how to resolve the alert.

NTP
An NTP time offset that immediately corrects itself is not a major concern, as long as it does not recur consistently. If the offset does recur, or if the deviation keeps growing, the clock should be reset to the correct time (UTC), either manually or automatically through NTP.
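The distinction between a one-off offset, a recurring offset and a growing deviation can be sketched as a small classifier over a series of offset samples. The function name, the 100 ms tolerance and the sample values are all hypothetical, and the real check may use different thresholds.

```python
def classify_offsets(offsets_ms, tolerance_ms=100.0):
    """Classify chronological clock offsets given in milliseconds.

    Returns "ok", "transient", "recurring" or "growing".
    """
    magnitudes = [abs(value) for value in offsets_ms]
    breaches = [magnitude > tolerance_ms for magnitude in magnitudes]
    if not any(breaches):
        return "ok"
    # Growing: each sample is further from zero than the one before it.
    steadily_worse = all(
        later > earlier for earlier, later in zip(magnitudes, magnitudes[1:])
    )
    if len(magnitudes) >= 3 and steadily_worse:
        return "growing"
    if sum(breaches) > 1:
        return "recurring"
    return "transient"


print(classify_offsets([5, -8, 250, 4, 6]))        # one spike that corrected itself
print(classify_offsets([5, 250, 4, 300, 6, 280]))  # repeated spikes
print(classify_offsets([20, 60, 140, 310]))        # deviation keeps increasing
```

Under these assumptions, only the "recurring" and "growing" results correspond to the cases where the clock should be reset.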

Puppet
If the alert fired because Puppet has been administratively disabled, no action is needed while it is only a warning. If it is critical, it is a good idea to ask the person who disabled Puppet whether they still need it to be disabled.
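To find out why Puppet was disabled, the reason given at disable time can usually be read back from the agent's lock file. The sketch below assumes the lock file is JSON with a `disabled_message` key and that its location is whatever `puppet config print agent_disabled_lockfile` reports; both should be verified on the server in question, and the helper names are hypothetical.

```python
import json


def disabled_reason(lock_text):
    """Return the disable message from lock-file text, or None if unreadable."""
    try:
        data = json.loads(lock_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return data.get("disabled_message")


def read_disabled_reason(lock_path):
    """Return the disable message, or None if there is no lock file or message."""
    try:
        with open(lock_path, encoding="utf-8") as handle:
            return disabled_reason(handle.read())
    except FileNotFoundError:
        return None  # no lock file: the agent is not administratively disabled


# Example with made-up lock-file content.
print(disabled_reason('{"disabled_message": "maintenance in progress"}'))
```

If the message names a person or task, that gives a starting point for asking whether the disablement is still needed.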

If there is a Puppet failure, debug it and attempt a fix if possible. If you are unable to debug the problem, raise a task with Infrastructure, who will assist as the general service owners of Puppet.

SSH
If you are the service owner and are able to restart the SSH service, please do so; this should resolve the alert. If you are unable to, or are not the service owner, contact Infrastructure, who will resolve the alert.
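After a restart, one way to confirm that the SSH daemon is answering again is to connect to its port and read the identification string it sends. This is a minimal sketch: the helper name `ssh_banner` and the hostname in the example are hypothetical, and it assumes the banner is the first data the server sends.

```python
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification string, or None on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(255)
    except OSError:  # refused, unreachable, timed out, and similar errors
        return None
    banner = data.decode("ascii", errors="replace").strip()
    return banner if banner.startswith("SSH-") else None


if __name__ == "__main__":
    # "server.example.org" is a placeholder hostname.
    print(ssh_banner("server.example.org") or "no SSH banner received")
```

A returned banner only shows that the daemon is listening; it does not prove that logins succeed.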