Downtime

Storage issues

Apr 23 at 05:32pm CEST
Affected services
Core Services
Secondary Services
API
CBI

Resolved
May 16 at 05:40pm CEST

At this time we are returning to normal operations and continuing to recover CI instances.

Updated
May 13 at 04:12pm CEST

Storage Cluster Outage post-mortem

Summary

On April 23, 2025, a high-impact outage occurred with the Eclipse Foundation’s storage cluster. The affected system is an eight-node Ceph-based cluster that supports several services, including CBI (which comprises 230 Jenkins instances and a Nexus repository, both classified as Tier II services), as well as a few Tier I services such as user authentication. During this outage, our CBI services were heavily impacted, and the outage was not completely resolved until May 8, 2025. Tier I services (including the eclipse.org website, GitLab, and downloads) were largely unaffected during the incident.

Before the outage, the team had observed slow cluster deployments. While attempting to resolve these issues, the team took several actions that unintentionally damaged the storage cluster. As a result, the Tier I and Tier II services that relied on it went offline. After multiple unsuccessful repair attempts, the team temporarily restored core services by switching to legacy data services. Recovery for Tier I services proceeded smoothly. For Tier II services such as Jenkins and Nexus, however, only older backups were available. These were used for the initial recoveries and, while outdated, had minimal impact.

The Infra team retained the services of a third-party team that specializes in the technology behind the storage cluster to assist with recovery.

On April 28, the Foundation’s IT team began offering to restore CBI services from backups, and later from the recovered production data, in order to minimise the ongoing disruption.

Timeline Summary

Outage start: April 23, 9:30 AM EDT
First recovery (Tier I): April 24, 2:56 PM (T plus ~30 hours)
Final recovery (Tier I): April 25, approximately 4:00 PM (T plus ~2 days)
Most active CI recovery: May 7, EOB (T plus ~15 days)

A more detailed timeline of events is listed at the end of this report.

Root Causes

It is the Infra team’s opinion that the following are the most likely causes of the outage:

  1. Network Misconfiguration
    The cluster had been operational for over one year and was put through rigorous tests to assess its ability to endure multiple hardware failures under heavy loads. However, these tests did not reveal the hidden network misconfiguration that disabled the entire system.

  2. The issues reported on April 17 were not fully diagnosed or escalated in a timely manner.
    This was influenced by a known issue in the interplay between our storage and compute clusters and SELinux, which was misidentified as the cause of the initial symptoms.

  3. Insufficient monitoring of storage cluster performance details.
    While the Infra team does monitor the storage cluster, it’s now clear that our general monitoring is not granular enough to spot issues with MDS services (a sketch of what more granular checks could look like follows this list).

  4. Not enough storage separation.
    Currently, 90% of our compute workload shares storage with our core services, which increases both the impact of storage issues and the time required to recover from them.

  5. Insufficient experience with the storage system.
    The Infra team had limited experience with this storage solution.
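
To illustrate the monitoring gap described in point 3, below is a minimal sketch (hypothetical tooling, not something the Infra team runs today) of a check at the MDS level: it polls CephFS daemon states through the ceph CLI and reports any daemon that is not in a healthy state. The exact JSON field names and state strings may vary between Ceph releases.

    # Hypothetical MDS health check; not our production tooling.
    # Field names ("mdsmap", "state", "name") are assumptions that may differ
    # across Ceph releases.
    import json
    import subprocess

    HEALTHY_STATES = {"active", "standby", "standby-replay"}

    def check_mds_health():
        raw = subprocess.run(
            ["ceph", "fs", "status", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        status = json.loads(raw)
        problems = []
        for daemon in status.get("mdsmap", []):
            state = daemon.get("state", "unknown")
            if state not in HEALTHY_STATES:
                problems.append(f"{daemon.get('name', '?')}: {state}")
        return problems

    if __name__ == "__main__":
        issues = check_mds_health()
        if issues:
            # In production this would feed an alerting pipeline rather than stdout.
            print("MDS daemons needing attention:", ", ".join(issues))
        else:
            print("All MDS daemons report a healthy state.")

Run on a schedule and wired into alerting, a check of this granularity might have surfaced the MDS degradation before it cascaded into the wider outage.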

Corrective Actions

Our recommendations to help prevent this kind of issue in the future, and to lessen the impact of such an incident, are:

  • Augment the IT team with at least one SME resource for storage, with the task of overseeing deployment and architecture, fault-tolerant operations, data backups, disaster recovery, performance monitoring and alerting, and outage contingency procedures.

  • Separate core service storage from general workload storage (see the sketch after this list for what dedicated pools could look like).

  • Implement more granular monitoring and alerting for the storage cluster and IT services in general.

  • Engage an external expert to review both our storage and compute cluster architecture and recommend improvements.

  • Improve disaster resilience by strengthening backups and fall-back plans.
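
To make the storage-separation recommendation more concrete, the following is a hypothetical sketch of dedicated CI storage on a Ceph cluster: separate data and metadata pools backing their own filesystem, so that build traffic no longer shares pools with core services. The pool and filesystem names are illustrative only, and placement-group sizing and multi-filesystem support would need to be validated against the actual cluster.

    # Hypothetical separation sketch: dedicated pools and a dedicated CephFS for
    # CI/CBI workloads. Names are illustrative and do not reflect our cluster.
    import subprocess

    def ceph(*args):
        cmd = ["ceph", *args]
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Dedicated data and metadata pools for the CI workload.
    ceph("osd", "pool", "create", "cbi_data")
    ceph("osd", "pool", "create", "cbi_metadata")

    # A separate filesystem for CI, isolated from the core-services filesystem.
    # Depending on the Ceph release, multiple filesystems may need to be enabled
    # explicitly before this will succeed.
    ceph("fs", "new", "cbi-fs", "cbi_metadata", "cbi_data")

The intent of such a layout is that a heavy or misbehaving CI workload can be isolated, or taken offline for recovery, without degrading the storage that backs Tier I services.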

Looking Ahead

The Eclipse Foundation remains committed to transparency, operational excellence, and continuous improvement. The steps outlined here will help ensure greater resilience and reduce the likelihood of similar incidents in the future.

Timeline Of Events - Detailed

April 17, 2025 8:43 AM EDT - WebDev team opens a ticket reporting issues deploying a new build of projects.eclipse.org. The Infra team investigates, but cannot pinpoint the specific issue. The deployment eventually succeeds and investigation ends.

April 22, 2025 1:57 AM - RelEng team reports issues deploying workloads in our compute cluster. The usual resolution for similar issues is to reboot the affected nodes, which the Infra team does.

April 22, 2025 2:30 PM - Infra team is notified that a new deployment of accounts.eclipse.org has failed to start.

April 22, 2025 8:47 PM - Infra team reports that accounts.eclipse.org is back up. Investigation points at interactions between SELinux permissions and the storage cluster (a known issue). Infra team deploys a vendor-suggested mitigation.

April 23, 2025 3:19 AM - RelEng team reports storage cluster metrics are showing limited availability and lag.

April 23, 2025 9:11 AM - Infra team begins investigating; the cluster's metadata server (MDS) component is identified as the possible culprit.

April 23, 2025 9:30 AM - Infra team attempts to deploy another MDS process to increase availability; however, this causes increased load and starts impacting other storage cluster services.

April 23, 2025 10:11 AM - Infra team begins attempting to block new compute jobs from starting, while also restarting clients to try to break the storage cluster deadlock.

April 23, 2025 1:36 PM - Infra team manages to temporarily restore the storage cluster, but this is short-lived due to process crashes in other components.

April 23, 2025 4:36 PM - Storage cluster is still deadlocked; Infra team decides to cut the connections between the compute and storage clusters to prevent spiking load from overwhelming recovery attempts.

April 23, 2025 8:30 PM - Infra team elects to attempt a reboot of one of the storage cluster control nodes to try to break the deadlock.

April 23, 2025 9:51 PM - Infra team restarts all MDS services in an attempt to restore service.

April 23, 2025 10:53 PM - Infra team begins following cluster disaster recovery workflow per the documentation.

April 23, 2025 11:45 PM - Recovery workflow is in progress.

April 24, 2025 7:05 AM - Initial recovery operations complete. Infra team begins working on next steps.

April 24, 2025 8:03 AM - WebDev and RelEng teams begin failing core services back to our legacy storage service.

April 24, 2025 2:56 PM - Core services begin to come back online after being moved to our legacy data store. Attempts to recover the storage cluster are ongoing.

April 25, 2025 3:56 AM - RelEng team reports more services are back online. Infra team is encountering software crashes in daemons used by the storage cluster.

April 25, 2025 4:15 PM - The storage cluster reports one volume has been repaired, but others remain offline.

April 27, 2025 9:56 AM - The Infra team manages to start a single test volume; however, not all expected daemons are created.

April 27, 2025 9:48 PM - The Infra team does a full reboot of all storage cluster nodes in an attempt to clear the issue. The cluster restarts correctly and appears to begin recovering the outstanding filesystems.

April 28, 2025 7:15 AM - The storage cluster continues to report filesystem recovery in progress; however, this is not supported by IO monitoring.

April 28, 2025 1:46 PM - After discussions with another IT team, the Infra team elects to hire a specialist consultant to advise on rebuilding efforts.

April 28, 2025 5:05 PM - Consultant begins analyzing cluster status.

April 28, 2025 6:40 PM - Issues identified with configuration and deployment of the storage cluster. These are resolved.

April 28, 2025 8:15 PM - Secondary issues are resolved and work begins to resolve the primary issues and rebuild filesystems.

April 28, 2025 8:50 PM - Storage cluster is now stable and filesystem rebuild/recovery operations are in progress. On the advice of the consultant, the Infra team continues to deny client access to the storage cluster, pending a re-check by the consultant.

May 1, 2025 9:25 PM - Filesystems are rebuilt successfully; utilising this up-to-date data for service recovery becomes a possibility. A backup of that data is initiated.

May 4, 2025 8:55 AM - RelEng team is able to begin service restoration using recovered, up-to-date data.

Updated
May 08 at 06:54pm CEST

We're continuing to restore services from backups while working to recover our storage cluster and confirm that it's ready to be put back into service.

Updated
May 07 at 02:24pm CEST

The backups have now completed. We're going to begin testing things via our secondary compute cluster to confirm the state of the storage cluster.

Updated
May 05 at 02:20pm CEST

We're currently working to back up the data that has been recovered; however, this is taking longer than hoped.
In the meantime, we'll be working to recover services by copying data from the 'damaged' storage cluster to our fallback storage service where possible.

Updated
Apr 29 at 03:02pm CEST

At this time the storage cluster appears to be stable. We are working to confirm the status of the data it held.

Updated
Apr 26 at 06:27am CEST

Most of our services have been restored to legacy storage. However, the state of the storage cluster is not good, and we will likely need to abandon it. Starting Monday in CET, the EF IT team will enact Plan B: to provision enough legacy storage and begin restoring Nexus (repo.eclipse.org) and each of the 250+ Jenkins instances so that builds can resume.

Updated
Apr 25 at 02:35pm CEST

Newsroom, Marketplace, and Open-vsx were brought back online on legacy storage. Our storage cluster is still in recovery, and this is blocking the restoration of CI and related services. We unfortunately do not have an ETA.

Updated
Apr 24 at 11:14pm CEST

Rebuild of the backend metadata is still proceeding.

Updated
Apr 24 at 09:33pm CEST

Cluster storage is slowly being restored. All Tier I services have been restored with minor degradations. Other services are being restored using temporary storage services.

Updated
Apr 24 at 03:00pm CEST

At this time we believe the primary cause has been identified, and now that we have collected snapshots we're in the process of clearing and rebuilding the storage system layers.

Updated
Apr 24 at 08:43am CEST

We are still working to resolve the issue and recover the storage system.

Updated
Apr 24 at 05:24am CEST

We are still working on this issue and do not have an ETA to recovery at this time.

Created
Apr 23 at 05:32pm CEST

We are currently experiencing issues with our storage backend which are preventing services from operating normally.

We are continuing to work on a resolution.