Tech:Incidents/2016-10-18-Database

Summary
A search query crashed MariaDB on October 18, getting MariaDB into a state where InnoDB corruption was present and it refused to restart. An attempt to restore Bacula backups (2 days old) failed. All wikis have been completely inaccessible (not readable, not editable) for almost 25 hours, and the whole incident lasted 29 hours.

Timeline
All times are in UTC.
 * October 18
 * 17:17: MariaDB crashes
 * 17:18 PuppyKun: noticed two DB errors and repeated 503s, notified JohnLewis
 * 17:18 JohnLewis: acknowledges errors and tries to restart MariaDB
 * ~17:30 JohnLewis: reboots db2
 * 19:16 JohnLewis: alerts Southparkfan via e-mail
 * 19:19 Southparkfan joins IRC
 * 19:46 Southparkfan: tries to find out which databases are corrupt using "mysqlcheck --all-databases", but needs to get MariaDB running first
 * 20:24 Southparkfan: rsyncing /srv/mariadb @ db2 to the same server into another location, backup measure before trying to touch the actual databases
 * 21:10-21:41: John and Southparkfan issue a Bacula job to restore the most recent database backup (2 days old) on db1
 * 22:09 Southparkfan: reports db2 backup is complete
 * 22:45: Bacula restore progress is 39%
 * October 19
 * 01:05: Bacula restore is done
 * 01:10 Southparkfan: tries to restart MariaDB on db1 with the Bacula backup, but MariaDB refuses to restart here as well
 * 01:54 Southparkfan goes to sleep
 * 03:58 NDKilla: set InnoDB force recovery to level 6 on db2, MariaDB came alive (previously tested at 3 through 5 with no success)
 * 03:58-05:02 NDKilla: Restarted MariaDB on non-standard port to prevent external connections as he wasn't sure if the databases were in read only mode. During this time NDKilla also started to dump databases one at a time to another directory on db2, but stopped at allthetropeswiki because of disk space and time reasons.
 * 05:02 NDKilla: stopped MariaDB on db2
 * 05:02: NDKilla sends an email to other staff members explaining what he's done and apologizing for ridiculously high CPU usage, then goes to bed
 * 11:01: NDKilla and Southparkfan are having a conversation about level 6 InnoDB recovery
 * 11:29-11:57 Southparkfan: manually running mysqld_safe on db2 with level 6 InnoDB recovery, dumping metawiki on db2 and working on getting scp work between db2 and db1
 * 11:57 Southparkfan: moved corrupt Bacula backup out of the way, re-installed MariaDB on db1
 * ~11:58 Southparkfan: started importing metawiki dump on db1
 * ~11:58 Southparkfan: changed DB server to db1 on mw1, depooled mw2
 * 12:00: confirmed metawiki was back online
 * 12:08 Southparkfan: commited db1 patch, repooled mw2
 * 12:00-12:30 Southparkfan: recovered several other wikis to see if my method was working
 * 13:09 Southparkfan: starting the first recovery batch. Dumped 267 wikis, transferred those to db1 and importing them all there
 * From now on Southparkfan was running batches (~200-250 wikis each batch) all the time.
 * 14:50: 505 wikis imported
 * 14:55 Southparkfan: deleted Bacula backup from db1
 * 16:41: 1168 wikis imported
 * 17:50: all wikis (except All The Tropes) imported
 * 18:00: Site Reliabilty Engineering decided to continue operating in read-only mode, after ATT has been backupped to db1 we'll be working on transferring db1's SQL files to db2 again
 * 18:55 Southparkfan: a compressed version of ATT's database dump has been transferred to db1
 * 19:00-19:34: Southparkfan was not sure if all databases were imported, some valuable time was wasted here
 * 19:41 Southparkfan: gives John an OK for re-installing db2
 * 19:42-~21:55 JohnLewis: re-installing db2, and importing all files (except ATT) from db1 to db2
 * 21:59 JohnLewis: remove read-only flag for all non-ATT wikis
 * 22:36 JohnLewis: removed read-only flag for ATT and global site notice
 * 22:43: JohnLewis declares 'migration done'
 * 22:43: incident is over

Conclusions
2016-10-18 17:17:18 7fc84e16c700 InnoDB: Assertion failure in thread 140498280302336 in file fts0que.cc line 3391 InnoDB: Failing assertion: ret == 0 InnoDB: We intentionally generate a memory trap. InnoDB: Submit a detailed bug report to http://bugs.mysql.com. InnoDB: If you get repeated assertion failures or crashes, even InnoDB: immediately after the mysqld startup, there may be InnoDB: corruption in the InnoDB tablespace. Please refer to InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html InnoDB: about forcing recovery. """"""""""""""""161018 17:17:18 [ERROR] mysqld got signal 6 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware.
 * A FULLTEXT search query crashed MariaDB:

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail.

Server version: 10.0.27-MariaDB-0+deb8u1 key_buffer_size=33554432 read_buffer_size=131072 max_used_connections=76 max_threads=77 thread_count=18 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 201826 K bytes of memory Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x7fc802837008 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... stack_bottom = 0x7fc84e16be88 thread_stack 0x48000 /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0xbfff4e] /usr/sbin/mysqld(handle_fatal_signal+0x3af)[0x7344af] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7fc8b043b8d0] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fc8aefe4067] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fc8aefe5448] /usr/sbin/mysqld[0xa54a79] /usr/sbin/mysqld[0x8ad0e5] /usr/sbin/mysqld(_ZN15Item_func_match11init_searchEb+0x3f3)[0x79a0d3] /usr/sbin/mysqld(_Z12init_ftfuncsP3THDP13st_select_lexb+0x30)[0x57cb90] /usr/sbin/mysqld[0x615b00] /usr/sbin/mysqld(_ZN4JOIN8optimizeEv+0x11b)[0x616fdb] /usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0xa47)[0x61af57] /usr/sbin/mysqld[0x5b7beb] /usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x4c64)[0x5c3574] /usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x1ca)[0x5c523a] /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1531)[0x5c6d01] /usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x25b)[0x69543b] /usr/sbin/mysqld(handle_one_connection+0x39)[0x695489] /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7fc8b04340a4] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc8af09762d]

Trying to get some variables. Some pointers may be invalid and cause the dump to abort. Query (0x7fc8068af020): is an invalid pointer Connection ID (thread ID): 10364277 Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains information that should help you find out what is causing the crash.

We think the query pointer is invalid, but we will try to print it anyway. Query: SELECT /* SearchMySQL::searchInternal [REDACTED] */ page_id,page_namespace,page_title  FROM `page`,`searchindex`   WHERE (page_id=si_page) AND ( MATCH(si_text) AGAINST('+\"[REDACTED]\" ' IN BOOLEAN MODE) ) AND page_namespace = '0'  LIMIT 20

161018 17:17:21 mysqld_safe Number of processes running now: 0 161018 17:17:21 mysqld_safe mysqld restarted 161018 17:17:21 [Note] /usr/sbin/mysqld (mysqld 10.0.27-MariaDB-0+deb8u1) starting as process 29358 ... 161018 17:17:22 [Note] InnoDB: Using mutexes to ref count buffer pool pages 161018 17:17:22 [Note] InnoDB: The InnoDB memory heap is disabled 161018 17:17:22 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins 161018 17:17:22 [Note] InnoDB: GCC builtin __atomic_thread_fence is used for memory barrier 161018 17:17:22 [Note] InnoDB: Compressed tables use zlib 1.2.8 161018 17:17:22 [Note] InnoDB: Using Linux native AIO 161018 17:17:22 [Note] InnoDB: Using CPU crc32 instructions 161018 17:17:22 [Note] InnoDB: Initializing buffer pool, size = 1.0G 161018 17:17:22 [Note] InnoDB: Completed initialization of buffer pool 161018 17:17:22 [Note] InnoDB: Highest supported file format is Barracuda. 161018 17:17:22 [Note] InnoDB: Log scan progressed past the checkpoint lsn 50369686803 161018 17:17:22 [Note] InnoDB: Database was not shutdown normally! 161018 17:17:22 [Note] InnoDB: Starting crash recovery. 161018 17:17:22 [Note] InnoDB: Reading tablespace information from the .ibd files... 161018 17:19:10 mysqld_safe Starting mysqld daemon with databases from /srv/mariadb 161018 17:19:10 [Note] /usr/sbin/mysqld (mysqld 10.0.27-MariaDB-0+deb8u1) starting as process 29849 ... 161018 17:19:10 [Note] InnoDB: Using mutexes to ref count buffer pool pages 161018 17:19:10 [Note] InnoDB: The InnoDB memory heap is disabled 161018 17:19:10 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins 161018 17:19:10 [Note] InnoDB: GCC builtin __atomic_thread_fence is used for memory barrier 161018 17:19:10 [Note] InnoDB: Compressed tables use zlib 1.2.8 161018 17:19:10 [Note] InnoDB: Using Linux native AIO 161018 17:19:10 [Note] InnoDB: Using CPU crc32 instructions 161018 17:19:10 [Note] InnoDB: Initializing buffer pool, size = 1.0G 161018 17:19:10 [Note] InnoDB: Completed initialization of buffer pool 161018 17:19:10 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:10 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:10 [Note] InnoDB: Retrying to lock the first data file 161018 17:19:11 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:11 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:12 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:12 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:13 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:13 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:14 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:14 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:15 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:15 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:16 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:16 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:17 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:17 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:18 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 161018 17:19:18 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files. 161018 17:19:19 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11 2016-10-19 01:10:55 7feb77fdb780 InnoDB: Assertion failure in thread 140649307158400 in file fil0fil.cc line 5598
 * A restart failed:
 * Restoring a Bacula backup on db1 failed:
 * The way to dump all wikis was by setting innodb-force-recovery to 6. Southparkfan had been messing with that variable a big part of the evening, but nobody tried setting it to higher than 3. After 10.5 hours site outage NDKilla finally found the way to get MariaDB running with non-corrupted data.
 * Two out of three members of the Site Reliabilty Engineering team live in Europe (NL and UK), and the incident started in the evening. Southparkfan left at 4AM local time (shortly before NDKilla found out how to dump all databases), and 9 hours passed by before Southparkfan came back. During those 9 hours there was only one system administrator online. John's absence was a few hours longer. While real-life obligations are unavoidable (thus no personal attack towards anyone), it is unfortunate a lot of hours went by without multiple available Site Reliabilty Engineering members.
 * We originally suspected a Bacula failure and MariaDB bug, but after Southparkfan opened a bug report, it seems data corruption happened before the incident, and is rather the cause of this incident and refusing to restart.
 * The above theory is supported by running the offending query after the db2 re-install: it ran without issues.

Actionables

 * Report FULLTEXT bug to MariaDB: ✅ MDEV-11093
 * Find out why corruption happened: ❌ - corrupted data gone after db2 install
 * Find out if the restore failure was caused by corruption or Bacula (restore a fresh backup on another server): in progress
 * Find out if there is a tool to detect data corruption (could be integrated with Icinga): in progress

Meta

 * Incident handled by: John, NDKilla and Southparkfan (with the help of Labster and Reception123)
 * Affected services: all services depending on db2
 * Signature: 