Tech:Incidents

From Meta

Every incident that can be made public will be made public in a written report detailing how the incident began, the social and technical factors involved, and how it can be prevented in future. Any incident with user-facing impact should have a report filed, regardless of size, to record the actions that could have prevented it.

Incident Response

All issues should be treated equally regardless of their impact. Any outage will always be critical to someone, whether users, active volunteers, or software or system administrators.

The immediate response to any reported or suspected outage is to diagnose the cause, the affected part of the service stack, the impact on each type of user, and any fallout likely to result from the outage (missing data, rejected contributions, etc.).

Once a plausible cause has been found, fixing it becomes the top priority. Long-term or more efficient fixes should not take priority over an immediate fix: restore service first, then work on an effective fix that prevents the cause from recurring.

Once everything has returned to its prior state of functioning, time should be taken to write up an incident report. The report acts as a guide for preventing the incident from recurring, a record of the work surrounding the incident, and a reference if a similar incident occurs.

All incidents should follow a diagnose -> restore -> report -> solve pattern. Communication is vital, as the community is not closely connected with the team, and it should always take place on IRC in #miraheze.

Incident Reports

The template for incident reports is here

List of incidents