On Tuesday, August 15, we received a notification that a disk array on a storage server was in a degraded state due to a failed disk. This storage device is used by the compute cluster, specifically for Nexus, some websites, and all the CI instances. An internal ticket was filed and a new disk was sourced, but it was not immediately available.
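Notifications like this one typically come from monitoring that watches the array's member status. As a minimal sketch, assuming a Linux md software RAID (the actual device may well use a hardware RAID controller with its own tooling), a check might parse `/proc/mdstat` and flag any array whose status string shows a missing member:

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status shows a missing member.

    In /proc/mdstat, a status such as "[6/5] [UUUU_U]" means the array
    wants 6 members but only 5 are up; "_" marks the failed disk.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        # Status line, e.g. "... blocks level 6, 64k chunk ... [6/5] [UUUU_U]"
        s = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if s and current:
            want, up = int(s.group(1)), int(s.group(2))
            if up < want or "_" in s.group(3):
                degraded.append(current)
    return degraded

# Hypothetical sample resembling a RAID 6 array with one failed disk;
# a real check would read /proc/mdstat instead.
SAMPLE = """\
Personalities : [raid6]
md0 : active raid6 sda1[0] sdb1[1] sdc1[2] sdd1[3] sde1[5]
      3906519040 blocks level 6, 64k chunk, algorithm 2 [6/5] [UUUU_U]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # → ['md0']
```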
A degraded array remains functional, but its performance is reduced because each I/O operation must do extra work to reconstruct data and maintain protection. Since the device is protected with RAID Level 6, which tolerates two simultaneous disk failures, the loss of a single disk is less critical than it would be with RAID Level 1 or 5, where a second failure could mean data loss.
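The performance cost of a degraded array can be illustrated with the simple XOR (P) parity used by RAID 5 and 6: serving a read from a failed disk requires fetching every surviving block in the stripe plus the parity and recomputing the missing data. This toy example uses hypothetical four-disk stripes; RAID 6 additionally maintains a second, independently computed parity (Q, over a Galois field) so that two failures are survivable, but the degraded-read mechanics are the same.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (the P parity in RAID 5/6)."""
    return bytes(reduce(lambda a, b: a ^ b, tpl) for tpl in zip(*blocks))

# Hypothetical stripe across 4 data disks; P parity is the XOR of them all.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Simulate losing disk 2: a degraded read must fetch EVERY surviving data
# block plus the parity and recompute the missing one -- four reads instead
# of one, which is why a degraded array slows down noticeably under load.
survivors = data[:2] + data[3:]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[2]
```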
On August 16 at 14:45 GMT+2 the storage device became unresponsive, which caused some of the services listed earlier to become unavailable. The unresponsive device was accessed remotely and restarted. It was determined that the device's I/O was overwhelmed, which led to a kernel lockup. The device's filesystem was, however, damaged in the event, and required a non-interactive (automatic) repair.
With the device back online, services were restored with the exception of Nexus at repo.eclipse.org, which refused to start. It was determined that the config.xml file was corrupted, and the Lucene indexes were all damaged. We concluded that the device's crash was partly due to Nexus reindexing itself at a time when I/O usage was already very elevated.
The infra team searched the backup archives for the configuration file, and was able to quickly locate an older copy from 2015. Since restoring from the backup archives is a slow process, we started the Nexus instance with the 2015 configuration file. The service started, and with minimal assessment it was determined to be operating normally.
The following day, on August 17, after receiving numerous reports of missing content in Nexus, it was clear the configuration file was too old and was missing critical information. We sourced a current copy from the backup archives, cleared the Lucene indexes, and restarted the service, which was finally restored in its entirety.
Replacing a disk requires a large amount of I/O while the array rebuilds itself. It was determined that replacing the disk during peak demand could have caused a further outage; therefore the activity was scheduled for the following Saturday, when I/O demand would be much lower. For the remainder of the week, the team actively monitored I/O demand and intervened as needed to ensure the device was not overwhelmed once again.
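One way to monitor I/O demand on a Linux host, sketched here under the assumption of the standard `/proc/diskstats` interface (device names and samples are hypothetical), is to sample the milliseconds-spent-doing-I/O counter twice and derive a utilisation percentage for the interval:

```python
def parse_io_ms(diskstats_text):
    """Map device name -> ms spent doing I/O (field 13 of /proc/diskstats,
    the 10th per-device statistic after major, minor, and name)."""
    busy = {}
    for line in diskstats_text.splitlines():
        f = line.split()
        if len(f) >= 13:
            busy[f[2]] = int(f[12])
    return busy

def utilization(before, after, interval_ms):
    """Percent of the interval each device spent busy with I/O."""
    return {dev: 100.0 * (after[dev] - before[dev]) / interval_ms
            for dev in after if dev in before}

# Hypothetical samples taken 1000 ms apart; a real monitor would read
# /proc/diskstats twice with a sleep in between, and alert near 100%.
T0 = "   8  0 sda 120 0 4000 300 80 0 2000 150 0 500 450"
T1 = "   8  0 sda 180 0 6000 420 95 0 2400 190 0 1450 610"

util = utilization(parse_io_ms(T0), parse_io_ms(T1), 1000)
print(util)  # sda busy 95% of the interval -- close to saturation
```

A device hovering near 100% utilisation is saturated, which is the condition that preceded the kernel lockup; during a rebuild, md also lets administrators cap rebuild bandwidth to keep foreground I/O responsive.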
The failed disk was replaced on Saturday, August 19, and the array returned to an Optimal state approximately 25 hours later.