Tech:Incidents

From Meta

All incidents that can be made public will be made public, with written reports detailing how the incident began, the social and technical factors involved, and how it can be prevented in future. User-facing impacts should always have a report filed, regardless of size, to record preventive actions.

Incident Response

All issues should be treated equally regardless of their impact. Any outage will always be critical to someone, whether that is users, active volunteers, or software or system administrators.

The immediate response to any reported or suspected outage is to diagnose the cause, the service stack involved, the impact on each type of user, and any fallout likely to occur from the outage (missing data, rejected contributions, etc.).

Once a plausible cause has been found, fixing it becomes the main priority. Long-term or efficient fixes should not take priority over an immediate fix: work on restoring service first, then work on producing an effective fix that prevents the cause from occurring again.

Once everything has returned to its prior state of functioning, time should be taken to write up an incident report. The report then acts as a guide for preventing the incident from recurring, a record of the work surrounding the incident, and a reference if a similar incident were to occur again.

All incidents should follow a diagnose->restore->report->solve pattern. Communication is vital, as the community is not closely interlocked with the team. Communication should always take place on IRC in #miraheze.
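The pattern above can be sketched as a small ordered checklist. This is only an illustration: the `Phase` and `Incident` names are hypothetical and do not describe any tooling the team actually uses. The sketch assumes each phase must be completed in order, so a long-term fix cannot be recorded before service is restored and the report is written.

```python
from enum import Enum


class Phase(Enum):
    """The four phases named in the pattern, in their required order."""
    DIAGNOSE = 1
    RESTORE = 2
    REPORT = 3
    SOLVE = 4


class Incident:
    """Hypothetical tracker that only lets phases complete in order."""

    def __init__(self, summary):
        self.summary = summary
        self.completed = []

    def next_phase(self):
        """Return the next phase to work on, or None when all are done."""
        phases = list(Phase)
        if len(self.completed) == len(phases):
            return None
        return phases[len(self.completed)]

    def complete(self, phase):
        """Mark a phase as done; reject it if an earlier phase is pending."""
        expected = self.next_phase()
        if phase is not expected:
            wanted = expected.name if expected else "no further phase"
            raise ValueError(f"expected {wanted}, got {phase.name}")
        self.completed.append(phase)
```

For example, after `incident.complete(Phase.DIAGNOSE)`, calling `incident.complete(Phase.SOLVE)` raises a `ValueError`, because restoring service and filing the report come first.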

Table of Incidents

Date started | Length | Services affected | Details
2017-04-20 | 15 hours | All services that relied on db2 (MediaWiki, Piwik) | Tech:Incidents/2017-04-20-Database
2016-12-05 | Sporadic outages/uptime from late 12/5 to afternoon 12/8 (3 days) | All services that relied on MariaDB (MediaWiki, Phabricator and Piwik) | Tech:Incidents/2016-12-Database
2016-10-18 | 29 hours | All wikis (completely inaccessible: not readable, not editable) | Tech:Incidents/2016-10-18-Database
2016-07-10 | Started at 22:24 UTC (July 10th), resolved at 05:26 (July 11th) | | Tech:Incidents/2016-07-10-cp1
2016-05-23 | 00:00 mw1 php5-fpm crashes due to OOM (see below for dmesg); 01:07 revi restarts HHVM, mw1 recovers | Site (~50%, everything that was not cached in Varnish) | Tech:Incidents/2016-05-23-mw1

Incident Reports

The template for incident reports is here

List of incidents