User:Void/SRE/Deployments policy

This is a draft of a proposed policy; it should not be taken as an actual policy or guideline.

The goal of this document is to establish best practices for deploying updates and other changes, both to MediaWiki and to Miraheze's infrastructure at large.

Accountability
When merging a change, particularly one that may introduce a breaking change, the SRE member must remain available until after the change is deployed to all servers (typically after a Puppet run) to ensure that it has not brought down any services. The SRE member may hand off this responsibility to another SRE member, provided that they also have access to revert or fix whatever service is being modified. Should a deployed change cause an outage, the deploying SRE member is responsible for creating an incident report, both to document the outage and to identify what they should do differently in the future to prevent similar incidents.

Updates
When planning an update, SRE should evaluate the following:
 * Does this change require downtime?
 * If so, how long (estimate)?
 * How long will the update process take (including possible downtime)?
 * What services will/may be affected by the update?
 * What is the specific plan for completing the update? What needs to be done, which commands need to be run and when, etc.
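
The planning questions above can be recorded in a structured form so that an incomplete plan is easy to spot before the upgrade window. The following is only a minimal sketch: the `UpdatePlan` class, its field names, and the `problems` method are hypothetical illustrations, not an existing Miraheze tool or a required format.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class UpdatePlan:
    """Hypothetical record of the answers to the planning questions."""

    requires_downtime: bool
    # Estimated downtime; only expected when downtime is required.
    estimated_downtime: Optional[timedelta]
    # Estimated length of the whole update process, including downtime.
    estimated_duration: timedelta
    # Services that will or may be affected.
    affected_services: List[str]
    # Ordered list of tasks/commands to carry out.
    steps: List[str]

    def problems(self) -> List[str]:
        """Return a list of gaps in the plan (empty if none were found)."""
        issues = []
        if self.requires_downtime and self.estimated_downtime is None:
            issues.append("downtime is required but no estimate was given")
        if (
            self.estimated_downtime is not None
            and self.estimated_downtime > self.estimated_duration
        ):
            issues.append("estimated downtime exceeds the total update duration")
        if not self.affected_services:
            issues.append("no affected services listed")
        if not self.steps:
            issues.append("no step-by-step plan")
        return issues


# Example with made-up values: a plan that is missing its downtime estimate.
plan = UpdatePlan(
    requires_downtime=True,
    estimated_downtime=None,
    estimated_duration=timedelta(hours=2),
    affected_services=["mediawiki"],
    steps=["announce downtime", "run the update", "verify services"],
)
for issue in plan.problems():
    print(issue)
```

A check like this only catches missing answers; it does not judge whether the estimates or steps are realistic, which remains the responsibility of the SRE member planning the update.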

Before performing the update, and using the information above, SRE should first:
 * Identify an optimal upgrade window, determining who is available when.
 * Ensure that at least one SRE member has full access to the impacted servers during this upgrade window. This individual is responsible for planning the update and delegating roles and responsibilities as necessary. Longer upgrades should include a plan for an emergency hand-off, even if it is never used.
 * Give all involved SRE clearly defined roles. SRE should also be able to swap roles if necessary, which requires a full command/task list for every server/service touched by each role.
 * Announce any planned downtime as far in advance as possible. A full week's notice is preferable, but the notice period may be longer or shorter as required.
 * Perform testing where applicable (ideally on a test server).
 * Include a plan for issues that may occur during the update. Some common issues can be anticipated (such as an extension not working due to a MediaWiki update), whereas others cannot. Specific plans should be in place for the common issues, and a general plan should be prepared for all other issues that may occur.

During the upgrade window, SRE should:
 * Announce the ongoing update, and reiterate any planned downtime.
 * Be aware of each other's roles and responsibilities. SRE members should not perform any action not specifically assigned to them, nor should they engage with issues unrelated to the update, particularly where this might delay their work on the update (personal issues are obviously excluded from this clause).
 * Regularly check in with the other SRE involved in the update, making sure that everything is progressing normally and without issue.
 * Provide status updates regularly, and inform end users of any changes to the planned downtime.

Configuration changes
Unfinished

All configuration changes not made for security reasons must be verified to ensure that they reflect the wishes of the community they impact. When in doubt, contact your team manager. A private conversation does not count as consensus, and neither does a conversation held on a platform separate from the one affected by the change (e.g. a conversation on IRC cannot be used to justify a change impacting metawiki).