Tech:SRE noticeboard

Welcome to the SRE noticeboard. This page is used to post updates from SRE, not for general help. For assistance, see the Help center.

Cloud14 issues
The cloud server (cloud14) that hosts one of our databases, db141, experienced a disk issue. As a result, a small number of wikis hosted on db141 are unavailable. We have reinstalled the affected server on new disks and are working to recover the data from the affected disks. The earliest ETA for these wikis being back online is early next week. We deeply apologise for the inconvenience, but rest assured we're working diligently to have this issue fixed ASAP.


 * LATEST UPDATE


 * 4AM (UTC), Tuesday, Nov. 29 - The affected disks have been shipped to Owen as of November 24th. We are still in the process of determining how to recover the data and whether recovery is even feasible by our means. The previous update has been amended to reflect the fact that we have not yet involved a professional data recovery service, as doing so may be prohibitively expensive.


 * 2AM (UTC), Monday, Nov. 21 - We have reinstalled cloud14 and have begun re-provisioning servers affected by the disk issue. Mail and IRC bots are now functional. We are working on re-provisioning servers for MediaWiki, which should improve loading speeds. We are in the process of sending the disks containing db141 to Owen, who will review the physical disks and determine how to proceed with professional data recovery. The earliest ETA we can provide for when wikis may be back online is early next week.

 * FAQ
 * What happened?

A cloud server (cloud14) hosting one of our databases, db141, ran into disk issues. As a result, the database cannot be accessed and some services hosted by the cloud server have been knocked offline. We have reinstalled the affected cloud server on new disks and are working to restore affected services.

 * Who is affected?

Only wikis on db141 are affected. Affected wikis display an error saying "Wiki temporarily unavailable." Most wikis on Miraheze are fine.

 * When will this be fixed?

While cloud14 has been reinstalled, we will have to send the affected disks to professional data recovery. The earliest ETA for having wikis restored is potentially early next week.

 * Is data loss involved?

We are unsure. It is possible that the disks are not actually faulty and that the RAID controller is at fault instead, which would mean your data is safe. It is also possible that the disks themselves have gone bad. If it is the latter, that would indicate we received a bad batch of SSDs from the manufacturer.

 * What other services are affected?

At this moment, the only affected user-facing service is MediaWiki, due to some servers being knocked offline. We are working to provision new MediaWiki servers, which should fix loading.

 * What is the plan for now?

We have reinstalled the affected cloud server on new disks. Most affected services (excluding wikis) are fully functional once again. We are going to send the affected disks to a professional data recovery service to see what can be done. While this is costly, we thank each one of our donors who has supported us along the way. If all goes well, the earliest estimate for affected wikis coming back online is early next week.

Our number one priority at this moment is restoring wikis. About 500 open public wikis are affected by this, so we understand this has certainly had an impact on many of Miraheze's users. Rest assured we have not forgotten about those wikis. Every one of our 5,500+ wikis is important, so we are working very hard to restore these wikis' data and bring them back online. We are so grateful for the patience our users have shown in the face of this unprecedented issue. We will be posting updates here, so please stay tuned. If you have any questions, please join us on our Discord. Thank you. Miraheze Site Reliability Engineering 00:00, 20 November 2022 (UTC)