Tech:SRE noticeboard

Welcome to the SRE noticeboard. This page is used to post updates from SRE, not for general help. For assistance, see the Help center.

Cloud14 issues
 LATEST UPDATE: 

Originally posted on 28 December, 2022:

Most wikis that were on the original db141 have now been brought back online. The db141 hostname has been retired due to its history, and is now replaced with db142.

A few of the original wikis have not yet been restored, and wikis that were recreated after the initial outage are also not back online yet. We are still working on bringing all wikis fully back online.

Wikis whose initial creation was after November 25 cannot be restored. However, we have backups for all public wikis beginning with A-O, and can restore those from backups.

If you requested a recreation after the initial outage, it will take a bit more time to get those wikis back online. The version that will eventually be restored is the pre-November 16 version. If you would rather have the wiki reset, imported from your personal backups, or restored from our backups (where we have them), you may later request that the wiki be reset to that state.

Also note that around 20 original wikis have so far been unable to be restored due to varying degrees of issues, including missing or corrupted tables. We are not certain that we will be able to restore these wikis later. The updated list of affected wikis may be read here.

We will do our best to become fully operational again. Thank you for bearing with us during this time, and for all your patience.

I would also like to note that, in light of these issues, some additional plans have been put in place, such as resuming active automated wiki backups (stored externally, off-site). See Backups. Additionally, an internal SRE policy has been drafted and should become public soon, to prevent a repeat of the issues that occurred during the previous maintenance window. It formalizes our policy regarding server maintenance, and requires that backups be 100% complete and stored off the affected servers before any maintenance can be conducted.
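As an illustration only (this is neither the drafted policy itself nor our actual tooling), the rule that backups must be complete and stored off the affected servers could be checked before maintenance roughly as follows. The `BackupRecord` type, the `maintenance_blockers` function, and the wiki and location names in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical record describing one wiki's most recent backup."""
    wiki: str        # database name of the wiki
    complete: bool   # True once the backup finished successfully
    stored_on: str   # host or location where the backup file lives


def maintenance_blockers(wikis, backups, affected_servers):
    """Return the reasons maintenance must not start (empty list = safe).

    Assumed rule, paraphrasing the notice: every wiki needs a complete
    backup that is not stored on any server affected by the maintenance.
    """
    by_wiki = {record.wiki: record for record in backups}
    affected = set(affected_servers)
    blockers = []
    for wiki in wikis:
        record = by_wiki.get(wiki)
        if record is None:
            blockers.append(f"{wiki}: no backup found")
        elif not record.complete:
            blockers.append(f"{wiki}: backup incomplete")
        elif record.stored_on in affected:
            blockers.append(f"{wiki}: backup stored on affected server {record.stored_on}")
    return blockers


if __name__ == "__main__":
    # Made-up example data; "offsite" stands in for external storage.
    example_backups = [
        BackupRecord("alphawiki", True, "offsite"),
        BackupRecord("betawiki", True, "db142"),
        BackupRecord("gammawiki", False, "offsite"),
    ]
    problems = maintenance_blockers(
        ["alphawiki", "betawiki", "gammawiki", "deltawiki"],
        example_backups,
        affected_servers=["db142"],
    )
    for problem in problems:
        print(problem)
    print("Safe to proceed" if not problems else "Maintenance blocked")
```

The check returns a list of reasons instead of a single yes/no, so that whoever is planning the maintenance can see exactly which wikis still need a backup.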

We are very sorry for all the issues that have occurred over the past month. We hope never to have an issue like this again, and backups are now in place.

If you encounter certain database exceptions on any of the now-restored wikis, please let us know. Thank you!

Update: Wikis that were recreated are now being brought back online with their original data. Data added after recreation is not included. If a backup exists, it may be imported, or you may later request a reset.


 * 20:00 (UTC), Monday, 26 December 2022 - We have uploaded the databases to our new database server. We are now working to get the wikis back online. Due to the holidays, this may take a bit longer than usual. Thank you for your understanding.

Originally posted on 19 December, 2022:

We have very important news regarding db141.

Yesterday, we were able to access and recover the data on the corrupted drives, including the drives which contained the original db141. We intend to begin restoring affected wikis as soon as possible and we will release more information once the details are finalised.

Now, during our scheduled maintenance yesterday, we encountered an issue while attempting to get new storage drives detected by our server. We asked our hosting provider to reseat the disks, but we then deemed it unnecessary and cancelled the request. Unfortunately, the request was still being processed and our hosting provider mistakenly reseated the server's drives while the server was on. Due to this, the server locked up and we had to run some file system repair tools (fsck specifically) to get it back online.

Once the server and all its virtual servers came back up, we discovered that because the new db141 was running and writing data when that happened, the database had become corrupted. Thankfully, we ran backups for most wikis yesterday so they should be safe. This has made the task of restoring wikis affected by the original db141 outage a bit easier. What we plan to do is restore all original db141 wikis from the recovered disks and then, using our backups, merge new edits made on the recreated wikis back into these original wikis. We do not have an ETA for when we will do this but we are thrilled to have recovered the data.

We apologise for yet another downtime on these wikis, but this incident has prompted stronger backup procedures to prevent disasters like this from occurring again. We are now working to restore these wikis and we will provide more information once we have it. Thank you for your understanding.

TL;DR: We recovered the data from the broken, original db141 disks but the new db141 was corrupted due to a hosting provider error. We have backups from yesterday so we will now restore the original db141 and merge the edits from recreated wikis back into the old wikis. Miraheze Site Reliability Engineering 00:00, 26 December 2022 (UTC)