Tech:SRE noticeboard/Archive 1

__NOINDEX__

Cloud14 issues
We are very pleased to inform users that we have been able to restore almost all wikis affected by the November crash of db141. Most wikis affected by the initial crash should be online once again. A very small subset of wikis (initially 19 wikis, now only 2) were affected by varying levels of database corruption, and additional intervention by our system administrators is required. You may view the original list here. We are working to get these back up and running again and will contact the wikis' bureaucrats where possible to inform them if any additional steps are required on their part.

Wikis created after November 16th which are on db141 have now been reopened. We have backups of all public wikis on db141 from the letters A-O and part of P, and will reimport these backups automatically for any post-November 16 wiki. Pre-November wikis will need to opt in to restoration, as several users have expressed a desire not to restore these backups.

Please read the following FAQ for more information.

What happened?
On 16 November 2022, one of our cloud servers, cloud14, suffered disk failures which brought all services on it, including db141, offline. We were able to successfully restore the data on those servers and have brought most of the initially affected wikis back online. On the day that we were able to restore the data, we were conducting scheduled maintenance and requested that our hosting provider reseat some disks. Although we cancelled that request, the on-site personnel did not see the cancellation in time and proceeded to reseat disks, but they reseated the wrong ones. Instead of reseating the disks for the server named in our request, they reseated the disks for cloud14 while it was powered on, and this ultimately caused data corruption. Cloud14 hosted a new database server (also called db141 but unrelated to the one involved in the November crash, and colloquially called db141.1), which was affected by the December crash.

What does this mean?
While wikis initially affected by the db141 outage in November are back online, wikis created on the 'new' db141 (db141.1) after that were still inaccessible. All post-November crash wikis on the new db141 have been recreated.

What wikis on db141.1 were backed up?
All public wikis from A-O (inclusive) had a database dump done for them when the disk issue occurred in December. Some wikis in P were captured but not all. Wikis after that and private wikis did not have a database dump done for them just yet.
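
The coverage rule above can be written out as a small check. This is only an illustrative sketch, not a tool used by Miraheze: the function name `backup_status` is hypothetical, and it assumes that "A-O" refers to the first letter of the wiki's name, which the announcement does not state explicitly.

```python
def backup_status(wiki_name: str, is_private: bool) -> str:
    """Classify a db141.1 wiki by the coverage described in this FAQ.

    Assumption: "A-O" is matched against the first character of the
    wiki's name. Names that do not start with an ASCII letter are
    reported as "unknown" because the FAQ does not cover them.
    """
    if is_private:
        # Private wikis had not been dumped when the disk issue occurred.
        return "not backed up"
    first = wiki_name[:1].upper()
    if not (first.isascii() and first.isalpha()):
        return "unknown"
    if "A" <= first <= "O":
        return "backed up"
    if first == "P":
        # Only some wikis in P were captured, so this needs a manual check.
        return "partial"
    return "not backed up"


for name in ("examplewiki", "pizzawiki", "zebrawiki"):
    print(name, "->", backup_status(name, is_private=False))
```

A "partial" result only means the wiki falls in the range where some backups exist; whether a specific wiki in P was captured still has to be confirmed by a system administrator.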

I created a wiki after the November crash, why am I affected?
Due to an error by our hosting provider, the disks containing the 'new' db141 were corrupted, which is why your wiki went offline.

All wikis created after the November crash are back online. If you are experiencing any issues, please contact us.

I created a wiki after the November crash, when will my wiki return?
Reopening of impacted wikis was completed at 03:49, 30 December 2022 (UTC).

My wiki was affected by the November crash and recreated by request, how can I restore the edits made on the 'recreated' version?
If we have a backup of your wiki, we can merge in those edits on request. This will be done by default for all wikis created after the November outage, with no action required.

The newest edit on a page will prevail as the version shown. If you restored your wiki previously but had not edited a page since before the wiki went down then the page should remain the same. If you have edited the page since it went down then the newest edit will be the one shown. All other edits will appear as normal in the edit history.
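
The "newest edit prevails" rule can be illustrated with a minimal Python sketch. It is not the actual import procedure: `Revision` and `merge_histories` are hypothetical names, and the sketch assumes every edit carries a timestamp that can be compared between the restored copy and the recreated copy of a page.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Revision:
    """Hypothetical stand-in for one saved edit of a page."""
    timestamp: datetime
    text: str


def merge_histories(restored, recreated):
    """Combine two revision lists for the same page, oldest first.

    The last element of the result is the newest edit, which is the
    version readers would see; every other edit stays in the history.
    """
    # dict.fromkeys drops exact duplicates while keeping first-seen order.
    combined = list(dict.fromkeys(list(restored) + list(recreated)))
    return sorted(combined, key=lambda rev: rev.timestamp)


# Example data invented for illustration only.
restored = [Revision(datetime(2022, 11, 10), "original text")]
recreated = [Revision(datetime(2022, 12, 1), "text rewritten after the crash")]

history = merge_histories(restored, recreated)
current = history[-1]  # the newest edit is the one shown
```

If the recreated wiki never touched the page, `recreated` is empty and the restored version remains the one shown, which matches the behaviour described above.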

How do I request an import for my December 18 backup?
To request an import, you must meet two requirements:
 * 1) You must be a bureaucrat on the wiki for which you are filing the request.
 * 2) The wiki must meet the requirements for backup listed above, in the section detailing which wikis were backed up.

We are providing two avenues for making this request:
 * Requests can be filed by authenticated users in this Discord thread.
 * Requests can also be filed by making a post on the Community noticeboard. Please include the URL of the wiki for which you are requesting import of backups.

I don't want edits merged in!
No action is needed on your part if your wiki existed prior to the November crash. Per overwhelming feedback, we will NOT be automatically importing for this set of wikis.

My wiki was affected by database corruption, what can I expect?
System administrators are doing everything possible to restore these wikis. Most wikis have been fixed and will be coming back online shortly. We will notify wiki bureaucrats if anything is needed on their part.

I don't want my wiki restored from its pre-November crash state, I like it better as it was prior to the December crash, what can I do?
We can delete the restored wiki and restore any backup we have on hand. Please first file a request to reset your wiki on Phabricator. Once this is done, follow the above instructions to request an import of the Dec 18 backup.

What is Miraheze doing to prevent this from happening ever again?
We have introduced robust backups to prevent a widescale issue like this from affecting us as it did, and to help us come back online faster than we did this time. During this incident, vital services such as Puppet were affected, so we were unable to do much for the first 2-3 days while we restored Puppet and other services such as Mail and Monitoring. Our backups were also outdated at the time, which caused problems when restoring wikis. We have now introduced reliable backups and we are fully committed to publishing monthly public backups on archive.org, which our users can download to confirm for themselves that the backups exist. You may see our backup schedule at Backups.

The below are archived announcements relating to this.


 * 20:00 (UTC), Monday, 26 December 2022 - We have uploaded the databases to our new database server. We are now working to get the wiki back online. Due to the holidays, this may take a bit longer than usual. Thank you for your understanding.

Originally posted on 19 December, 2022:

We have very important news regarding db141.

Yesterday, we were able to access and recover the data on the corrupted drives, including the drives which contained the original db141. We intend to begin restoring affected wikis as soon as possible and we will release more information once the details are finalised.

Now, during our scheduled maintenance yesterday, we encountered an issue attempting to get new storage drives detected by one of our servers. We asked our hosting provider to reseat the disks but we then deemed it unnecessary and cancelled the request. Unfortunately, the request was still being processed and our hosting provider mistakenly reseated cloud14's drives while the server was on. Due to this, the server locked up and we had to run file system repair tools (fsck specifically) to get it back online.

Once cloud14 and all its virtual servers came back up, we discovered that because the new db141 was running and writing data when that happened, the database had become corrupted. Thankfully, we ran backups for most wikis yesterday so they should be safe. This has made the task of restoring wikis affected by the original db141 outage a bit easier. What we plan to do is restore all original db141 wikis from the recovered disks and then, using our backups, merge new edits made on the recreated wikis back into these original wikis. We do not have an ETA for when we will do this but we are thrilled to have recovered the data.

We apologise for yet another period of downtime on these wikis, but this incident has prompted stronger backup procedures to prevent catastrophic disasters from occurring. We are now working to restore these wikis and we will provide more information once we have it. Thank you for your understanding.

TL;DR: We recovered the data from the broken, original db141 disks but the new db141 was corrupted due to a hosting provider error. We have backups from yesterday so we will now restore the original db141 and merge the edits from recreated wikis back into the old wikis. Miraheze Site Reliability Engineering 00:00, 19 December 2022 (UTC)

The cloud server (cloud14) which hosts one of our databases, db141, experienced a disk issue. As a result, a small number of wikis hosted on db141 are unavailable. We have reinstalled the affected server on new disks and are working to recover the data from the affected disks. The earliest ETA for these wikis being back online is early next week. We deeply apologise for the inconvenience, but rest assured we're working diligently to have this issue fixed ASAP.


 * LATEST UPDATE Outdated, see above.


 * 4AM (UTC), Tuesday, Nov. 29 - The affected disks have been shipped to Owen as of November 24th. We are still in the process of determining how to recover the data and if it is even feasible by our means. The previous update has been amended to reflect the fact that we have not yet involved a professional data recovery service as it may be prohibitively expensive to do so.


 * 2AM (UTC), Monday, Nov. 21 - We have reinstalled cloud14 and have begun re-provisioning servers affected by the disk issue. Mail and IRC bots are now functional. We are working on re-provisioning servers for MediaWiki, which should improve loading speeds. We are in the process of sending the disks containing db141 to Owen to review the physical disks and determine how to proceed with professional data recovery. The earliest ETA we can provide for when wikis may be back online is early next week.

 * FAQ

 * What happened?
A cloud server (cloud14) hosting one of our databases, db141, ran into disk issues. As a result, the database cannot be accessed and some services hosted by the cloud server have been knocked offline. We have reinstalled the affected cloud server on new disks and are working to restore affected services.

 * Who is affected?
Only wikis on db141 are affected. Affected wikis display an error saying "Wiki temporarily unavailable." Most wikis on Miraheze are fine.

 * When will this be fixed?
While cloud14 has been reinstalled, we will have to send the affected disks to professional data recovery. The earliest ETA for having wikis restored is potentially early next week.

 * Is data loss involved?
We are unsure. It may be that the disks are not actually faulty but rather that the RAID controller is, which would mean your data is safe, or it's possible the actual disks have gone bad. If it is the latter, that would indicate we received a bad batch of SSDs from the manufacturer.

 * What other services are affected?
At this moment, the only user-facing affected service is MediaWiki, due to some servers being knocked offline. We are working to provision new MediaWiki servers, which should fix loading.

 * What is the plan for now?
We have reinstalled the affected cloud server on new disks. Most affected services (excluding wikis) are fully functional once again. We are going to send the affected disks to a professional data recovery service to see what can be done. While costly, we thank each one of our donors who has supported us along the way. If all goes well, the earliest estimate for affected wikis coming back online is early next week.

Our number one priority at this moment is restoring wikis. About 500 open public wikis are affected by this, so we understand this has certainly had an impact on many of Miraheze's users. Rest assured we have not forgotten about those wikis. Every one of our 5,500+ wikis is important, so we are working very hard to restore these wikis' data and bring them back online. We are so grateful for the patience our users have shown during this unprecedented issue. We will be posting updates here, so please stay tuned. If you have any questions, please join us on our Discord. Thank you.