Postmortem-29022020

= The story =
This page is a postmortem of the RamNode -> OVH migration attempt that took place between Friday, 14 February and Sunday, 16 February 2020.

Comments about this postmortem can be made inline, on the Talk: page for this postmortem, or on Phabricator at https://phabricator.miraheze.org/T5244

Over the weekend, the Site Reliability Engineering team (SRE) worked on the RamNode -> OVH migration (see https://phabricator.miraheze.org/T5221). The migration involved moving 4,000+ wikis and several services (such as Phabricator) to a new database platform on Miraheze's new servers.

What Happened? (functional)
For several months, Site Reliability Engineering had been developing plans for new technical infrastructure for Miraheze. This was necessary to fix performance problems, increase capacity and provide a new layer of services for the future. In the process, we would move from our current provider (RamNode) to a new one (OVH). The largest part of this operation was moving all databases to the new database platform. Paladox and Southparkfan, members of Site Reliability Engineering, agreed to work on this huge migration process.

On February 14, 2020, Paladox and Southparkfan started the migration at 19:00 UTC, in accordance with a maintenance window communicated through site notices. Unfortunately, a lack of testing and resource constraints caused multiple technical issues during the migration, which delayed it considerably. In the middle of the night one of them had to leave, and the other had to leave a few hours later; by this point, the maintenance window had already been exceeded. To minimise the disruption, the migration process was automated so it could run in the background without the need for manual intervention.

While the two engineers were asleep, users woke up and found the wikis broken. Due to the lack of communication, the only active member of Site Reliability Engineering (Reception123) did not know what was going on. A still-unknown technical error had taken the wikis completely down; after a quick fix from Reception123, the wikis were readable again. However, they remained in read-only mode, with the only two situation-aware engineers still away.

Eventually Paladox and Southparkfan came back online and decided to abort the operation, move the wikis back to the old infrastructure and reopen the wikis. However, due to a mistake, editing was enabled while the wikis were still on the new servers. This caused another issue, called data drift: to save time, the edits made during this period were not copied back in the process of rolling back to the old cluster. Eventually the rollback was completed successfully and the incident was over.

Finally, another communication mistake led to the new database server being wiped, causing the permanent loss of 30+ edits that could otherwise have been recovered.

Timeline
For technical reasons, the 'old infrastructure' does not consist of one giant database server. Instead, wikis are spread across two database servers, so-called database clusters. Over 4,000 wikis are stored on one server, which we call 'database cluster 1' or 'c1'. The other ~10 wikis are stored on the other server, which we call 'database cluster 2' or 'c2'. All of these wikis were to be migrated to a new server, called 'database cluster 3' or 'c3'.
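To make the naming concrete for readers who also look at the technical timeline further down, here is a small sketch of how the cluster names used in this section appear to map onto the server names (db4, db5, db6) used there. This mapping is an inference from the rest of this postmortem and is recorded purely as an illustration.

<syntaxhighlight lang="python">
# Assumed mapping between the cluster names used in this section and the
# server names used in the technical timeline below; the wiki counts are the
# approximate figures quoted above, not an authoritative inventory.
clusters = {
    "c1": {"server": "db4", "contents": "4,000+ wikis (old infrastructure)"},
    "c2": {"server": "db5", "contents": "~10 wikis (old infrastructure)"},
    "c3": {"server": "db6", "contents": "migration target (new OVH infrastructure)"},
}

for name, info in clusters.items():
    print(f"{name}: {info['server']} holds {info['contents']}")
</syntaxhighlight>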

Involved people: Paladox, Reception123 and Southparkfan, on behalf of Site Reliability Engineering

All times are in UTC.

February 14, 2020
 * 19:06: Southparkfan puts all wikis into read-only mode. This means they cannot be edited anymore.
 * 19:26: Southparkfan begins transferring the wikis (c1) to the new servers, using so-called plan A.
 * 20:48: Paladox announces through social media channels that SRE is going to change plans, to the so-called plan B. Wikis will no longer be readable for the rest of the migration.
 * 21:44: Southparkfan switches to plan C.

Note 1: in the meantime, Southparkfan discovered that plan A would take over 40 hours, so plan B was devised. The downside of plan B was that it would take all wikis completely offline.
Note 2: in the meantime, Southparkfan discovered that plan B would still exceed the predicted maintenance window by a few hours, so plan C was devised.

February 15, 2020
 * 01:08: Paladox says most of the wikis (on database cluster 1) have been transferred successfully. All wikis are back online, in read-only state.
 * 01:41: Due to technical difficulties, all wikis are taken offline again to speed up fixing the remnants of the migration.
 * 01:46: Southparkfan leaves due to time constraints. Paladox takes over.
 * 01:46 - 05:14: Paladox works on importing wikis from database cluster 2.
 * 05:14: Paladox heads off.
 * 06:40: Reception123 gets a question from a user on IRC asking why the wikis are still down.
 * 13:30: Paladox comes online again.
 * 14:40: Southparkfan comes online again.
 * 14:58: Southparkfan and Paladox agree that the import running in the background (see 'Note 3') is taking too long. The import is waiting for one wiki (the last wiki out of all 4,000+) to be imported. It is decided to keep this one wiki on the old servers for now.
 * 15:06 - 16:26: Southparkfan discovers database cluster 3 is not capable of handling Miraheze's traffic. Several attempts are made to improve the situation, without success.
 * 16:28 - 16:50: Paladox and Southparkfan roll back to the old infrastructure.
 * 18:18: Incident declared over.

Note 3: in the background, wikis on database cluster 2 were still being imported to database cluster 3 during the rest of the night.
Note 4: in the coming hours, several people (on IRC), including Reception123, wondered what had happened and why technical issues were still occurring.

What Happened? (technical)
On Friday 14th February 2020 at 19:00 UTC, the Site Reliability Engineering team started the planned maintenance of switching to the new infrastructure (RamNode -> OVH), which was supposed to finish at 02:00 UTC. During the migration, Southparkfan found that the way he was copying files over was far too slow and would have led to a 2-3 day migration window. It was decided to switch to plan B (stopping MariaDB and copying the raw database files with SCP) and subsequently plan C, which used rsync over SSH and was much faster. By the time things were copied over, the 7-hour maintenance window was almost up, and db5 still needed to be done. Southparkfan did one DB before going to bed. Paladox handled the rest, which is where the problems began. Paladox started the import of some dbs 15 minutes before the maintenance window was supposed to end, leaving the wikis in a bad state. Due to a lack of communication and the absence of a written plan, Reception123 was unaware of what was happening and thought the jobrunner was causing all the issues.
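As a rough illustration only: the three approaches mentioned above correspond approximately to the commands sketched below, where plan A streams a logical dump over SSH, plan B copies the raw database files with SCP while MariaDB is stopped, and plan C does the same copy with rsync over SSH. The hostnames, paths and exact flags are assumptions made for this example, not the commands that were actually run.

<syntaxhighlight lang="python">
# Hypothetical sketch of the three copy strategies; db4/db6 and the data
# directory path are placeholders, not the real Miraheze configuration.
import subprocess

# Plan A: stream a consistent logical dump from db4 straight into db6 over SSH.
# Needs no downtime, but is far too slow for 4,000+ databases.
plan_a = "mysqldump --single-transaction --all-databases | ssh db6 mysql"

# Plan B: with MariaDB stopped on db4, copy the raw data directory with scp.
# Requires full downtime and still turned out to be too slow.
plan_b = "scp -r /var/lib/mysql/ db6:/var/lib/"

# Plan C: same idea as plan B, but rsync over SSH transfers the files much
# faster and can resume if interrupted.
plan_c = "rsync -a -e ssh /var/lib/mysql/ db6:/var/lib/mysql/"

for name, cmd in (("plan A", plan_a), ("plan B", plan_b), ("plan C", plan_c)):
    print(f"{name}: {cmd}")
    # subprocess.run(cmd, shell=True, check=True)  # run only on the source server
</syntaxhighlight>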

The wikis were eventually rolled back to the old cluster, but editing was erroneously enabled for over an hour while the c1 wikis were still on the new cluster. As a result, db6 had edits that were not present on db4. These edits were not copied back during the rollback, causing data drift. Unfortunately, due to another communication error, the new database server was wiped, permanently losing the 30+ known-to-be-missing edits despite internal instructions to recover them.
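To make the notion of data drift concrete, a check along the following lines could in principle list the wikis whose newest revision on the new server (db6) is ahead of the old server (db4), by comparing MediaWiki's revision table on both sides. The hostnames, credentials and wiki list below are placeholders; this is a sketch of the idea, not the procedure SRE actually used.

<syntaxhighlight lang="python">
import pymysql

def latest_revision(host, database):
    """Return (rev_id, rev_timestamp) of the newest revision in one wiki database."""
    conn = pymysql.connect(host=host, user="report", password="secret",
                           database=database)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(rev_id), MAX(rev_timestamp) FROM revision")
            return cur.fetchone()
    finally:
        conn.close()

# Placeholder wiki list; in reality this would cover every migrated wiki database.
for wiki_db in ["examplewiki", "testwiki"]:
    old = latest_revision("db4.example.org", wiki_db)
    new = latest_revision("db6.example.org", wiki_db)
    if new[0] is not None and (old[0] is None or new[0] > old[0]):
        print(f"{wiki_db}: revisions up to {new[0]} exist on db6 but not on db4")
</syntaxhighlight>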

Technical Timeline
All times are in UTC.

February 14, 2020
 * 19:06: Southparkfan puts all wikis into read only, marking the beginning of the maintenance window.
 * 19:26: Southparkfan begins dumping db4 to db6 using mysqldump piping.
 * 20:48: Paladox announces that there has been a change of plans due to the time it was taking to move db4 over to db6 using the current method (mysqldump piping).
 * 20:53: Southparkfan shuts down mariadb on db4 to prepare for plan B.
 * 21:05: Southparkfan tries plan B, which is to use SCP to copy the dbs from db4 over to db6.
 * 21:44: Southparkfan switches to plan C, which uses rsync over SSH and is much faster than SCP.
 * 23:46: Southparkfan announces he has finished copying the dbs over to db6 and signals to Paladox to start MariaDB on db6.

February 15, 2020
 * 00:08: Paladox starts mysql_upgrade to upgrade all the databases to MariaDB 10.4.
 * 01:08: Paladox signals to Southparkfan that mysql_upgrade has run, and nginx is started.
 * 01:16: Southparkfan signals that he is ready to copy over db5 to db6.
 * 01:41: Paladox stops nginx on mw[4567], as having live traffic hit the database was causing high I/O, making the process very slow.
 * 01:46: Southparkfan leaves for the night (after restoring one db from the db5 batch, with Paladox to finish the rest). Paladox takes over and starts restoring more dbs from the db5 batch.
 * 14:38: Paladox leaves allthetropeswiki in read only, as it still hadn't been restored on db6, and takes all other wikis out of read only.
 * 16:13: Southparkfan announces on Discord that we are switching back to the old infrastructure.
 * 16:28: Paladox puts all wikis back into read only after the decision to switch back to the old infrastructure. This led to some data loss regarding edits. Southparkfan put out a notice asking users to request that their edits be restored.
 * 16:28 - 16:50: Paladox and Southparkfan roll back to the old infrastructure by putting cp4 and the old mw[123] back into service.
 * 17:49: Paladox sets read_only on the db servers.
 * 18:18: Paladox takes all wikis out of read only; incident declared over.

Banner posted after the weekend
Here is the banner that was posted after the weekend.

This is very important! If you have performed an action (edited/created a page, uploaded content, created an account, etc.; basically everything that's not reading) between 14:45 UTC and 16:30 UTC on 2020-02-15, you can't see those actions anymore. For (most) public wikis, we are able to find edits, but for private wikis we cannot do that. If you have edited/uploaded something during this time period, regardless of whether the wiki is public or private, please contact us as soon as possible using our procedure. It can be found at https://phabricator.miraheze.org/maniphest/task/edit/form/15/. Regarding the migration issues, system administrators are working on fixing remnants of the rollback. All wikis can be read and edited without issues now. Miraheze would like to apologise again; due to the huge complexity of this migration, things went wrong (not only technically but also communication-wise). We are focussing on restoring full functionality; that is our highest priority now. A post-mortem is in the works and will be provided when ready.

Retrospectives
Each subsection below is a retrospective told in first-person voice. It optionally starts with an answer to the implicit question "why should we care what this person thinks?", and then has two subsections: "what went well?" and "what should we look into?". The first section below was written by User:RobLa (as of 2020-02-18) because this is a postmortem format he created and he wanted to give an example. The other sections are for anyone else who wants to add their perspective.

User:RobLa’s retrospective
I wasn't terribly involved in the migration. I'm on the board, so I was one of the people who approved the general plan to move many services from RamNode to OVH, but I trusted the site reliability team to keep the site running. As of 2020-02-17 (PST), I'm still learning much of what happened.

What went well

 * The site seems to be running now (on Monday afterward), and the wikis I normally read were all available every time I looked over the weekend
 * I was largely unaffected. And it's all about me!  Whee!  ;-)

What should we look into?

 * Did we actually lose a few edits because of migration problems? Should we have put the site in read-only mode during the migration?
 * Did we have a written timeline for this migration?
 * We should have a big "planned site outage" timeline template. I may need to volunteer to write that
 * What's the revised plan for the RamNode migration?
 * Tuesday morning PST was the first time I really had problems using the site. Seems to be working fine for me as of this edit (at 21:13, 18 February 2020 (UTC))

What went well

 * Ensuring there were CentralNotices/Site notices

What should we look into?

 * Why was it left to me and Reception to blindly troubleshoot why wikis couldn't be read, and then “fix” the issue by starting jobrunner on jobrunner1? (Why wasn’t it started in the first place?)
 * The lack of communication to uninvolved sysadmins about what steps were being taken.
 * No information was left on what could be done in case of failure; this could have been useful when Reception and I were trying to blindly fix things.
 * Where’s the documentation?
 * Test, test, test. From what I can tell, it doesn’t seem like anything was tested once moved to the new infra (I’m basing this off the fact that jobrunner wasn’t even started).

What went well

 * The initial communication prior to the migration (Sitenotice, Twitter, etc.)
 * The idea of not having to require read-only was an interesting one

What should we look into?

 * A new plan for a migration to OVH
 * Make sure that next time, if whoever is doing the migration has to leave, they either ensure everything works properly or at least leave instructions for others to take over
 * Test an example wiki and make sure there are no latency/slow connection problems, as there were this time
 * Have better communication during a migration

What went well

 * Communication was great up till Southparkfan left for the night.

What should we look into?

 * A better plan should have been made; this has now been done at https://etherpad.wikimedia.org/p/zv81kYuPnWa3T2brEc0r
 * The migration should be done during the day, so that you're not working through the night or having to leave things due to tiredness.
 * Real testing using real wikis should have been done (this has now been done).
 * Communication should be better.

What went well

 * Up until the point of Paladox and me leaving, communication with the community was excellent.
 * The initial technical idea of the database migration was an excellent one, and still is. The biggest benefit is requiring only twenty minutes (at most) to perform the database failover.
 * If it weren't for the data drift caused by the lack of read-only mode for over an hour, the migration would have gone without any data loss. During the db4 -> db6 migration, all data was intact.

What should we look into?

 * Planning! Just about everything done here was ad hoc.
 * Splitting the process. Moving 4,000+ wikis in one go takes too long and does not allow us to test the new infrastructure by loading it with just a part of our wikis.
 * Communication with the rest of SRE. I don't think we made a mistake by going with Paladox and me for the migration, but the aforementioned plan should have been made and forwarded to SRE, so they would have been aware of what was going on.
 * Documenting the technical lessons of this migration: rsync performs far better than SCP, and a migration without much downtime (piping mysqldump --single-transaction via SSH to a new server) takes much, much longer than the planned seven hours.

= Lessons learned / things to do better =
This section is where we come up with a collaborative plan to figure out what to do next. These should be brief bullet points.

Action items

 * Create a "planned site outage" timeline template
 * Engage with the community for getting communication help during migrations
 * Make a written plan (and a better one too)
 * Split into multiple, smaller migrations
 * Who does what? Who is responsible for what? Who handles communication? Who handles the technical aspects of the migration?
 * Make sure there's someone on a B team on standby, ready to take over in case a member of the A team has to go AFK due to real life? CleavagesCwars (talk)