Tech:Incidents/2017-04-20-Database

From Meta
< Tech:Incidents
Revision as of 13:52, 3 May 2017 by Revi (talk | contribs) (→‎Actionables: task-fied)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

DRAFT Database maintenance on db2.miraheze.org resulted in duplicated InnoDB dictionary IDs, which would result into a situation where the InnoDB recovery process would fail after a crash.

Summary[edit source]

  • What services were affected?
    • All services where any part has a dependency on db2 (Piwik, MediaWiki[1])
  • How long was there a visible outage?
    • From 2017-04-20 02:00 UTC until 2017-04-20 16:49 UTC (15 hours)
  • What was/were the response times by each Operations member?
    • NDKilla responded at 02:45 UTC and tried to recover the database, unsuccessfully
    • John responded at 08:58 UTC and fixed the database issues for 6 minutes, after which MariaDB crashed again
    • Southparkfan responded at 16:44 UTC and successfully fixed the issue
  • Was it caused by human error, supply/demand issues or something unknown currently?
    • The cause of the crash itself is unknown (but *not* OOM killer);
    • The reason the crash didn't fix itself (failed recovery) is because some dictionary IDs were assigned to two or more tables, due to directly copying tables from one database to another. Since innodb_force_recovery = 0, the recovery process stops immediately when discovering such faults.
  • Was the incident aggravated by human contact, users or investigating?
    • Does not seem to be aggravated in any way.
  • How could response time by improved?
    • NDKilla and John members were able to respond in the first 7 hours of the incident, Southparkfan only showed up 13 hours after the incident began. It was probably luck Southparkfan found a solution, but these database crashes are a tough job. More Operations members with quite a bit of MariaDB experience (overall) seem to be the only way to fix these kind of issues faster.

Timeline[edit source]

All times are in UTC.

<= April 20, 2017 (exact dates unknown)
  • John had performed database work on "bpwiki", "johntestwiki" and "johnmesswiki", during that process /srv/mariadb/johntestwiki and /srv/mariadb/johnmesswiki has tables in them with the same InnoDB dictionary IDs as their original copies in /srv/mariadb/bpwiki
April 20
  • 02:00: MariaDB crashes (cause unknown), but refuses to recover
  • 02:04: First reports of issues by users
  • 02:42 NDKilla: attempts to bring MariaDB back online
  • 02:45 NDKilla: restarted MariaDB (failed, transaction couldn't be rolled back)
  • 08:58 John: fixed owner/group flags of /srv/mariadb/johntestwiki + ran mysqld --tc-heuristic-recover=commit, brought db2 back online
  • 09:02: MariaDB crashes due to assertion failure
  • 14:55: Southparkfan starts investigating the issue on db2
  • 15:00-16:40: MariaDB was restarted a few times with innodb_force_recovery=1. Since MariaDB worked fine with this setting, there may have been a few seconds where the site was back online (though only responding to non-write queries), but php5-fpm was killed as soon as possible to prevent server traffic and dangerous actions (e.g. edits)
  • 16:44 Southparkfan: drops database "johntestwiki" (and johnmesswiki too, somewhere in the process)
  • 16:49 Southparkfan: remove read-only flag on db2

Quick facts[edit source]

  • Due to an unknown cause, the MariaDB instance on db2 crashed.
  • While MariaDB usually performs InnoDB crash recovery, after which the MariaDB will be running normal again, the recovery failed.

Conclusions[edit source]

  • MariaDB crashed at 02:00 with the following errors logged:
170420 02:00:23 mysqld_safe Number of processes running now: 0
170420 02:00:23 mysqld_safe mysqld restarted
170420  2:00:23 [Warning] option 'table_definition_cache': unsigned value 350 adjusted to 400
170420  2:00:23 [Warning] option 'table_definition_cache': unsigned value 350 adjusted to 400
170420  2:00:23 [Note] /usr/sbin/mysqld (mysqld 10.0.28-MariaDB-0+deb8u1) starting as process 4892 ...
170420  2:00:24 [Note] InnoDB: Using mutexes to ref count buffer pool pages
170420  2:00:24 [Note] InnoDB: The InnoDB memory heap is disabled
170420  2:00:24 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
170420  2:00:24 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
170420  2:00:24 [Note] InnoDB: Compressed tables use zlib 1.2.8
170420  2:00:24 [Note] InnoDB: Using Linux native AIO
170420  2:00:24 [Note] InnoDB: Using CPU crc32 instructions
170420  2:00:24 [Note] InnoDB: Initializing buffer pool, size = 768.0M
170420  2:00:24 [Note] InnoDB: Completed initialization of buffer pool
170420  2:00:24 [Note] InnoDB: Highest supported file format is Barracuda.
170420  2:00:24 [Note] InnoDB: Log scan progressed past the checkpoint lsn 142498800993
170420  2:00:24 [Note] InnoDB: Database was not shutdown normally!
170420  2:00:24 [Note] InnoDB: Starting crash recovery.
170420  2:00:24 [Note] InnoDB: Reading tablespace information from the .ibd files...
170420  2:03:18 [Note] InnoDB: Restoring possible half-written data pages
170420  2:03:18 [Note] InnoDB: from the doublewrite buffer...
InnoDB: Doing recovery: scanned up to log sequence number 142499100519
InnoDB: Transaction 307516150 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 307516672
170420  2:03:18 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Last MySQL binlog file position 0 506367102, file name /srv/mariadb/mysql-bin.000003
170420  2:03:21 [ERROR] InnoDB: Table johntestwiki/archive in the InnoDB data dictionary has tablespace id 434789, but tablespace with that id or name does not exist. Have you deleted or moved .ibd files? This may also be a table created with CREATE TEMPORARY TABLE whose .ibd and .frm files MySQL automatically removed, but the table still exists in the InnoDB internal data dictionary.
InnoDB: Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting-datadict.html
InnoDB: for how to resolve the issue.
170420  2:03:21 [ERROR] InnoDB: Table johntestwiki/category in the InnoDB data dictionary has tablespace id 434790, but tablespace with that id or name does not exist. Have you deleted or moved .ibd files? This may also be a table created with CREATE TEMPORARY TABLE whose .ibd and .frm files MySQL automatically removed, but the table still exists in the InnoDB internal data dictionary.
InnoDB: Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting-datadict.html
InnoDB: for how to resolve the issue.
170420  2:03:21 [ERROR] InnoDB: Table johntestwiki/categorylinks in the InnoDB data dictionary has tablespace id 434791, but tablespace with that id or name does not exist. Have you deleted or moved .ibd files? This may also be a table created with CREATE TEMPORARY TABLE whose .ibd and .frm files MySQL automatically removed, but the table still exists in the InnoDB internal data dictionary.


[the above error was also logged for all other tables in johntestwiki]


InnoDB: Starting in background the rollback of uncommitted transactions
170420  2:03:32 [Note] InnoDB: 128 rollback segment(s) are active.
2017-04-20 02:03:32 7f18befff700  InnoDB: Rollback of non-prepared transactions completed
170420  2:03:32 [Note] InnoDB: Waiting for purge to start
170420  2:03:32 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.32-79.0 started; log sequence number 142499100519
170420  2:03:32 [Note] Plugin 'FEEDBACK' is disabled.
2017-04-20 02:03:32 7f192dfb7780  InnoDB: Starting recovery for XA transactions...
2017-04-20 02:03:32 7f192dfb7780  InnoDB: Transaction 307516150 in prepared state after recovery
2017-04-20 02:03:32 7f192dfb7780  InnoDB: Transaction contains changes to 98 rows
2017-04-20 02:03:32 7f192dfb7780  InnoDB: 1 transactions in prepared state after recovery
170420  2:03:32 [Note] Found 1 prepared transaction(s) in InnoDB
170420  2:03:32 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
170420  2:03:32 [ERROR] Aborting

170420  2:03:32 [Note] InnoDB: FTS optimize thread exiting.
170420  2:03:32 [Note] InnoDB: Starting shutdown...
170420  2:03:33 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
170420  2:03:36 [Note] InnoDB: Shutdown completed; log sequence number 142499103968
170420  2:03:36 [Note] /usr/sbin/mysqld: Shutdown complete

170420 02:03:36 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
  • The second crash (assertion failure after John 'fixed' the issue):
[the error below was also logged for some other tables in johntestwiki]


2017-04-20 09:02:24 7f4a0a4d2700  InnoDB: Error: table `johntestwiki`.`ajaxpoll_info` does not exist in the InnoDB internal
InnoDB: data dictionary though MySQL is trying to drop it.
InnoDB: Have you copied the .frm file of the table to the
InnoDB: MySQL database directory from another database?
InnoDB: You can look for further help from
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
2017-04-20 09:02:24 7f4a0a4d2700  InnoDB: Error: table `johntestwiki`.`ajaxpoll_vote` does not exist in the InnoDB internal
InnoDB: data dictionary though MySQL is trying to drop it.
InnoDB: Have you copied the .frm file of the table to the
InnoDB: MySQL database directory from another database?
InnoDB: You can look for further help from
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
InnoDB: Error: tablespace id is 434789 in the data dictionary
InnoDB: but in file ./johntestwiki/archive.ibd it is 434570!
2017-04-20 09:02:24 7f4a0a4d2700  InnoDB: Assertion failure in thread 139955977135872 in file fil0fil.cc line 637
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
170420  9:02:24 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.0.28-MariaDB-0+deb8u1
key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=17
max_threads=92
thread_count=14
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 218379 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x7f4a116e3008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4a0a4d1e88 thread_stack 0x48000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0xc0376e]
/usr/sbin/mysqld(handle_fatal_signal+0x3af)[0x734b9f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f4aa7e568d0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f4aa69ff067]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f4aa6a00448]
/usr/sbin/mysqld[0xae36f8]
/usr/sbin/mysqld[0xae379e]
/usr/sbin/mysqld[0xaec77c]
/usr/sbin/mysqld[0xab8029]
/usr/sbin/mysqld[0xa2989d]
/usr/sbin/mysqld[0xa2a56c]
/usr/sbin/mysqld[0x9e6d10]
/usr/sbin/mysqld[0x9e72f2]
/usr/sbin/mysqld[0xa0dab4]
/usr/sbin/mysqld[0x964d47]
/usr/sbin/mysqld(_Z15ha_delete_tableP3THDP10handlertonPKcS4_S4_b+0x211)[0x738d61]
/usr/sbin/mysqld(_Z23mysql_rm_table_no_locksP3THDP10TABLE_LISTbbbbb+0x5a3)[0x645ac3]
/usr/sbin/mysqld(_Z11mysql_rm_dbP3THDPcbb+0xd65)[0x59a2a5]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x4d04)[0x5c33a4]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x1ca)[0x5c528a]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1531)[0x5c6d51]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x25b)[0x695bab]
/usr/sbin/mysqld(handle_one_connection+0x39)[0x695bf9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f4aa7e4f0a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f4aa6ab262d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f4a37841020): is an invalid pointer
Connection ID (thread ID): 2873
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

We think the query pointer is invalid, but we will try to print it anyway.
Query: DROP DATABASE johntestwiki

170420 09:02:24 mysqld_safe Number of processes running now: 0
170420 09:02:24 mysqld_safe mysqld restarted
170420  9:02:24 [Warning] option 'table_definition_cache': unsigned value 350 adjusted to 400
170420  9:02:24 [Warning] option 'table_definition_cache': unsigned value 350 adjusted to 400
170420  9:02:24 [Note] /usr/sbin/mysqld (mysqld 10.0.28-MariaDB-0+deb8u1) starting as process 30763 ...
170420  9:02:24 [Note] InnoDB: Using mutexes to ref count buffer pool pages
170420  9:02:24 [Note] InnoDB: The InnoDB memory heap is disabled
170420  9:02:24 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
170420  9:02:24 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
170420  9:02:24 [Note] InnoDB: Compressed tables use zlib 1.2.8
170420  9:02:24 [Note] InnoDB: Using Linux native AIO
170420  9:02:24 [Note] InnoDB: Using CPU crc32 instructions
170420  9:02:24 [Note] InnoDB: Initializing buffer pool, size = 768.0M
170420  9:02:24 [Note] InnoDB: Completed initialization of buffer pool
170420  9:02:24 [Note] InnoDB: Highest supported file format is Barracuda.
170420  9:02:24 [Note] InnoDB: Log scan progressed past the checkpoint lsn 142500426163
170420  9:02:24 [Note] InnoDB: Database was not shutdown normally!
170420  9:02:24 [Note] InnoDB: Starting crash recovery.
170420  9:02:24 [Note] InnoDB: Reading tablespace information from the .ibd files...
170420  9:03:18 [ERROR] InnoDB: Attempted to open a previously opened tablespace. Previous tablespace johntestwiki/msg_resource_links uses space ID: 431701 at filepath: ./johntestwiki/msg_resource_links.ibd. Cannot open tablespace johnmesswiki/msg_resource_links which uses space ID: 431701 at filepath: ./johnmesswiki/msg_resource_links.ibd
2017-04-20 09:03:18 7f0f045d4780  InnoDB: Operating system error number 2 in a file operation.
InnoDB: The error means the system cannot find the path specified.
InnoDB: If you are installing InnoDB, remember that you must create
InnoDB: directories yourself, InnoDB does not create them.
InnoDB: Error: could not open single-table tablespace file ./johnmesswiki/msg_resource_links.ibd
InnoDB: We do not continue the crash recovery, because the table may become
InnoDB: corrupt if we cannot apply the log records in the InnoDB log to it.
InnoDB: To fix the problem and start mysqld:
InnoDB: 1) If there is a permission problem in the file and mysqld cannot
InnoDB: open the file, you should modify the permissions.
InnoDB: 2) If the table is not needed, or you can restore it from a backup,
InnoDB: then you can remove the .ibd file, and InnoDB will do a normal
InnoDB: crash recovery and ignore that table.
InnoDB: 3) If the file system or the disk is broken, and you cannot remove
InnoDB: the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
InnoDB: and force InnoDB to continue crash recovery here.
170420 09:03:18 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
170420 09:03:37 mysqld_safe Starting mysqld daemon with databases from /srv/mariadb
170420  9:03:37 [Warning] option 'table_definition_cache': unsigned value 350 adjusted to 400
170420  9:03:37 [Warning] option 'table_definition_cache': unsigned value 350 adjusted to 400
170420  9:03:37 [Note] /usr/sbin/mysqld (mysqld 10.0.28-MariaDB-0+deb8u1) starting as process 31210 ...
170420  9:03:37 [Note] InnoDB: Using mutexes to ref count buffer pool pages
170420  9:03:37 [Note] InnoDB: The InnoDB memory heap is disabled
170420  9:03:37 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
170420  9:03:37 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
170420  9:03:37 [Note] InnoDB: Compressed tables use zlib 1.2.8
170420  9:03:37 [Note] InnoDB: Using Linux native AIO
170420  9:03:37 [Note] InnoDB: Using CPU crc32 instructions
170420  9:03:37 [Note] InnoDB: Initializing buffer pool, size = 768.0M
170420  9:03:37 [Note] InnoDB: Completed initialization of buffer pool
170420  9:03:37 [Note] InnoDB: Highest supported file format is Barracuda.
170420  9:03:37 [Note] InnoDB: Log scan progressed past the checkpoint lsn 142500426163
170420  9:03:37 [Note] InnoDB: Database was not shutdown normally!
170420  9:03:37 [Note] InnoDB: Starting crash recovery.
170420  9:03:37 [Note] InnoDB: Reading tablespace information from the .ibd files...
170420  9:04:25 [ERROR] InnoDB: Attempted to open a previously opened tablespace. Previous tablespace johntestwiki/msg_resource_links uses space ID: 431701 at filepath: ./johntestwiki/msg_resource_links.ibd. Cannot open tablespace johnmesswiki/msg_resource_links which uses space ID: 431701 at filepath: ./johnmesswiki/msg_resource_links.ibd
2017-04-20 09:04:25 7f95716e4780  InnoDB: Operating system error number 2 in a file operation.
InnoDB: The error means the system cannot find the path specified.
InnoDB: If you are installing InnoDB, remember that you must create
InnoDB: directories yourself, InnoDB does not create them.
InnoDB: Error: could not open single-table tablespace file ./johnmesswiki/msg_resource_links.ibd
InnoDB: We do not continue the crash recovery, because the table may become
InnoDB: corrupt if we cannot apply the log records in the InnoDB log to it.
InnoDB: To fix the problem and start mysqld:
InnoDB: 1) If there is a permission problem in the file and mysqld cannot
InnoDB: open the file, you should modify the permissions.
InnoDB: 2) If the table is not needed, or you can restore it from a backup,
InnoDB: then you can remove the .ibd file, and InnoDB will do a normal
InnoDB: crash recovery and ignore that table.
InnoDB: 3) If the file system or the disk is broken, and you cannot remove
InnoDB: the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
InnoDB: and force InnoDB to continue crash recovery here.
170420 09:04:25 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Reporting[edit source]

  • What services/sites were used to report the downtime?
    • Facebook, IRC (topic)
  • What other services/sites were available for reporting, but were not used?
    • Mostly Twitter

Actionables[edit source]

Permanent fix for the caused problems
  • Make clear that everyone with access to a production instance of MariaDB must use SQL queries (e.g. mysqldump) to copy databases;
  • Investigate why db3 wikis were also affected by this incident [PHABRICATOR TICKET NEEDED];
Improve communication to the users
  • Expand access to Facebook if possible T1754
  • Share Twitter credentials/access T1753
  • Improve information on HTTP error pages [PHABRICATOR TICKET NEEDED];

Notes[edit source]

  1. Due to an issue, this also included wikis that were on other shards.

Meta[edit source]

  • Who responded to this incident?
    • Southparkfan, John, NDKilla
  • What services were affected?
    • All services with even the slightest dependency on db2 (Piwik and MediaWiki)
  • Who, therefore, needs to review this report?
    • All Operations members
  • Timestamp: ...