Tech:Organisation/Site Reliability Engineering

This is a guide for all Miraheze Site Reliability Engineers. They have access to all Miraheze GitHub repositories and are in charge of maintaining all Miraheze servers and making sure they run smoothly.

Rules

 * 1) Be respectful to other volunteers and users. You represent the Miraheze project.
 * 2) Don't suddenly change big parts of the infrastructure (MediaWiki, Varnish, Bacula, etc.), such as the way things are currently done, without first discussing it with the other site reliability engineers (and any sysadmins).
 * 3) Be VERY careful when manipulating sensitive data (such as on db* or nfs* servers), as a mistake could lead to data loss.
 * 4) Don't use the servers for non-Miraheze purposes.
 * 5) Don't put abnormally high load on the servers if it can be avoided. (Grafana can be used to check the details.)
 * 6) Respect privacy. Don't publish access logs, IP addresses, content of private wikis, or other personally identifiable information. If in doubt, ask before publishing.
 * 7) Likewise, don't publish database passwords, private keys, or other secrets.

Violation of these rules can result in warnings or revocation of access.

Deployment

 * When deploying a change (SSL certificate, database rename, etc.), you are required to closely watch the change going live.
 * After committing a change to any repo (and making sure it should work), run 'sudo puppet agent -tv' on the server involved. It can take a while before the change is actually deployed.
 * Watch the error logs:

Further specifics to be filled in by SRE
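
As an illustration of the Puppet step above, here is a minimal Python sketch that runs the same command and explains its exit code. It relies on the fact that the -t (--test) flag turns on Puppet's detailed exit codes, where 2 means "changes applied" rather than failure. The function names and the wording of the messages are made up for this example and are not part of any Miraheze tooling.

```python
import subprocess

# Exit codes reported by `puppet agent` when detailed exit codes are on,
# which the -t (--test) flag enables. Message wording is illustrative.
PUPPET_EXIT_MEANINGS = {
    0: "run succeeded, no changes were needed",
    1: "run failed, or another run was already in progress",
    2: "run succeeded and changes were applied",
    4: "run finished, but some resources failed",
    6: "run applied changes, but some resources failed",
}


def describe_puppet_exit(code: int) -> str:
    """Translate a puppet agent exit code into a short explanation."""
    return PUPPET_EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_puppet_agent() -> int:
    """Run the command from this guide and return its exit code."""
    # check=False: exit code 2 is a success case, so don't raise on non-zero.
    result = subprocess.run(["sudo", "puppet", "agent", "-tv"], check=False)
    return result.returncode
```

On the server involved, something like print(describe_puppet_exit(run_puppet_agent())) would show whether the change was applied. Codes 1, 4 and 6 are the ones that call for a look at the error logs.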

Monitoring errors
''To be filled in for specific servers''

Debugging

 * Look at the error logs
 * Try sending the failing HTTP request with the header 'X-Miraheze-Debug: 1'. The error could be one that is cached in Varnish.
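
The debug-header step above could be sketched in Python as follows, using only the standard library. The header name and value come from this guide. The helper names are hypothetical, and so is the idea of comparing the status code with and without the header, which is only one possible way to spot a cached error.

```python
import urllib.error
import urllib.request

DEBUG_HEADER = "X-Miraheze-Debug"  # header name taken from this guide


def build_request(url: str, debug: bool = True) -> urllib.request.Request:
    """Build a GET request, optionally carrying the debug header."""
    headers = {DEBUG_HEADER: "1"} if debug else {}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_status(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def statuses_differ(url: str) -> bool:
    """Return True if the status changes once the debug header is sent."""
    plain = fetch_status(build_request(url, debug=False))
    debugged = fetch_status(build_request(url, debug=True))
    return plain != debugged
```

Calling statuses_differ() with the URL of the failing request returns True when the response changes once the header is present. Under the assumption made here, that would hint that the plain response is a cached error, though it is not proof.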