User:Void/SRE/Deployments policy

From Miraheze Meta, Miraheze's central coordination wiki

This is a draft of a proposed policy; it should not be taken as an actual policy or guideline.

The goal of this document is to establish best practices for deploying updates and other changes, both to MediaWiki and to Miraheze's infrastructure at large.

Accountability

When merging a change, particularly one that may introduce a breaking change, the SRE member must remain available until the change has been deployed to all servers (typically after a Puppet run) to ensure that it has not brought down any services. The SRE member may hand off this responsibility to another SRE member, provided that they also have the access needed to revert or fix whatever service is being modified. Should a deployed change cause an outage, the deploying SRE member is responsible for creating an incident report, both to document the outage and to identify what they should do differently in the future to prevent similar incidents.

Updates

When planning an update, SRE should evaluate the following:

  • Does this change require downtime?
    • If so, how long (estimate)?
  • How long will the update process take (including possible downtime)?
  • What services will/may be affected by the update?
  • What is the specific plan for completing the update? What needs to be done, which commands need to be run and when, etc.
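
The planning questions above could be captured in a small record so that unanswered items are easy to spot before a window is scheduled. The following is only a minimal sketch: `UpdatePlan`, its field names, and the example values are hypothetical and do not describe any existing Miraheze tooling.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List, Optional


@dataclass
class UpdatePlan:
    """Hypothetical record of the planning questions (illustration only)."""

    summary: str
    # Time for the whole update process, including any downtime.
    estimated_duration: timedelta
    # None means no downtime is expected.
    estimated_downtime: Optional[timedelta] = None
    # Services that will or may be affected.
    affected_services: List[str] = field(default_factory=list)
    # Ordered list of tasks/commands that make up the plan.
    steps: List[str] = field(default_factory=list)

    def requires_downtime(self) -> bool:
        """True when a non-zero downtime estimate has been recorded."""
        return (
            self.estimated_downtime is not None
            and self.estimated_downtime > timedelta(0)
        )

    def missing_items(self) -> List[str]:
        """Return the planning questions that still lack an answer."""
        missing = []
        if self.estimated_duration <= timedelta(0):
            missing.append("estimated duration")
        if not self.affected_services:
            missing.append("affected services")
        if not self.steps:
            missing.append("step-by-step plan")
        return missing


# Example values are placeholders, not a real maintenance plan.
plan = UpdatePlan(
    summary="Example MediaWiki upgrade",
    estimated_duration=timedelta(hours=2),
    estimated_downtime=timedelta(minutes=30),
    affected_services=["mediawiki"],
    steps=["announce downtime", "run upgrade", "verify services"],
)
print(plan.requires_downtime(), plan.missing_items())
```

An empty result from `missing_items()` only means every question has some answer recorded; it says nothing about whether the answers are realistic.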

Before performing the update, and using the information above, SRE should first:

  • Identify an optimal upgrade window, determining who is available when.
    • At least one SRE member should have full access to the impacted servers during this upgrade window. This individual is responsible for planning the update and delegating roles and responsibilities as necessary. Longer upgrades should include a plan for an emergency hand-off, even if this is never used.
    • All involved SRE should have clearly defined roles. Additionally, SRE should be able to swap roles if necessary. This includes having a full command/task list for every server/service touched by each role.
  • Announce any planned downtime as far in advance as possible. A full week's notice is preferable, but the notice period may be longer or shorter as required.
  • Perform testing where applicable (ideally on a test server).
  • Include a plan for issues that may occur during the update. Some common issues can be anticipated (such as an extension not working due to a MediaWiki update), whereas others cannot. Specific plans should be in place for the common issues, and a general plan should be prepared for all other issues that may occur.

During the upgrade window, SRE should:

  • Announce the ongoing update, and reiterate any planned downtime.
  • Be aware of each other's roles and responsibilities. SRE members should not perform any action not specifically assigned to them, nor should they engage with issues unrelated to the update, particularly where this might delay their work on the update (personal issues are obviously excluded from this clause).
  • Regularly check in with the other SRE involved in the update, making sure that everything is progressing normally and without issue.
  • Provide status updates regularly, and inform end users of any changes to the downtime.

Configuration changes

Unfinished

All configuration changes not made for security reasons must be verified to ensure that they reflect the wishes of the community they impact. When in doubt, contact your team manager. A private conversation does not count as consensus, and neither does a conversation held on a platform separate from the one affected by the change (e.g. a conversation on IRC cannot be used to justify a change impacting metawiki).

Comments by Reception123

In addition to 'security reasons' there are also other changes which I don't think should need a community discussion. Such changes would be technical in nature and the question to determine whether community consensus is required could be "Is there any conceivable reason for why someone would oppose this?". For example, the recent change from ReCaptcha v2 to v3. It was of course obvious that the community supported the change but what possible reason could there be to oppose it? But definitely for anything where there is scope for disagreement in the community or for differences of opinion a discussion of some sort should take place, I'd just change the scope a bit from the above. Reception123 (talk) (C) 09:35, 9 September 2021 (UTC)