Websites are unreachable with backend storage issues
Incident Report for Eclipse Foundation Services
Postmortem

On Saturday, July 31, 2021, around 21:30 EDT, eclipse.org suffered a severe outage across all of its core services. The outage was extensive and, for core services, lasted approximately 18 hours. Non-core services were degraded for an additional 12 hours. Some Git data was lost as a result of this failure.

External factors

At the time of the outage, the Eclipse Foundation had entered a week-long shutdown period. Infra staff were either on call or on vacation, with varying degrees of Internet accessibility.

Summary

Many Eclipse.org services, including core services like www.eclipse.org, Bugzilla, and Git/Gerrit, as well as non-core websites, rely on a pair of NFS file servers for backend storage. One server acts as the primary, and a secondary server receives periodic (batched) updates. The failover process is entirely manual.
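
For illustration, a minimal sketch of such a batched primary-to-secondary sync is shown below. The export paths, standby host name, and rsync options are assumptions for the example, not the Foundation's actual configuration.

    # Illustrative sketch only: batched replication from the primary NFS
    # server to the standby. Paths, host name, and options are placeholders.
    import subprocess

    EXPORTS = ["/exports/www", "/exports/git", "/exports/bugzilla"]  # hypothetical exports
    SECONDARY = "nfs-secondary.example.org"                          # placeholder standby host

    def batch_sync() -> None:
        """Push one batched update of every export to the standby server."""
        for path in EXPORTS:
            subprocess.run(
                ["rsync", "-a", "--delete", f"{path}/", f"{SECONDARY}:{path}/"],
                check=True,
            )

    if __name__ == "__main__":
        batch_sync()  # typically run on a schedule; failover itself stays a manual step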

This Single Point of Failure is well known to the Infra team. Plans to replace it have been ongoing for years.

On Saturday, July 31 around 21:30 EDT, the xfs filesystem on the primary NFS server encountered a deadlock on the 45TB volume hosting data for all the aforementioned services, which left all CPU cores blocked on unresponsive I/O. On Sunday, Aug 1, at 03:46 EDT, Mikaël Barbero (EF-EU) discovered that these services were down, immediately updated our status page, and called in the Infra team (EF-CA) to investigate. Minutes later, the server problem was established, but the path to recovery was unclear.

Early focus was placed on restarting the failed server, since the primary and secondary NFS servers also replicate user authentication data. As the Infra team was working remotely, it took an extended period of time to reach the console of the failed server. At 07:51, we established that the cause of the failure was a deadlock and proceeded to restart the server.

Restarting the server did not restore service, as data corruption was present on the data volume. xfs_repair was invoked, and it reported and fixed numerous errors. At 08:46, the xfs_repair tool itself crashed (core dump) and destroyed the file system it was repairing.
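
For reference, the repair attempt followed the usual pattern for a damaged xfs volume; the device path below is a placeholder, and this is a sketch of the general procedure rather than a record of the exact commands run.

    # Sketch of a typical xfs repair attempt (device path is a placeholder).
    import subprocess

    DEVICE = "/dev/mapper/nfs-data"  # hypothetical device name for the 45TB volume

    # Read-only pass first: -n reports problems without modifying the filesystem.
    subprocess.run(["xfs_repair", "-n", DEVICE], check=False)

    # Actual repair pass on the unmounted volume. In this incident, a run of
    # this kind crashed and left the filesystem unrecoverable.
    subprocess.run(["xfs_repair", DEVICE], check=True)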

Focus then shifted towards creating a new volume (using a different file system) and back-filling data from the secondary, rather than failing over to the secondary server to restore service. We were aware this could extend the outage by several hours, but we knew the end state would be better (from a data protection perspective) than failing over, breaking sync, and having a single secondary server with no backup.

At 09:03, the new volume was created, and we established that LDAP user synchronization between the primary and secondary servers was intact, as LDAP data is on a separate, unaffected volume. At 09:17, the data sync had begun, and it was estimated that a minimum of 13 hours would be required at full wire speed.
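
As a rough sanity check of that estimate, copying the roughly 9.2TB involved works out to about 13 hours; the ~200 MB/s effective throughput below is an assumed figure, used only to illustrate the arithmetic.

    # Back-of-envelope transfer-time estimate; the throughput figure is assumed.
    data_bytes = 9.2e12           # ~9.2TB to copy (see timeline)
    throughput = 200e6            # assumed effective rate in bytes/second (~1.6 Gbps)
    hours = data_bytes / throughput / 3600
    print(f"~{hours:.1f} hours")  # ~12.8 hours, in line with the 13-hour estimate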

During the outage, we made every effort to post periodic updates to the status page – www.eclipsestatus.io – with accurate state information as we had it.

At 16:17 on Aug 1, almost 19 hours after the initial server failure, www.eclipse.org was responsive and serving its normal content. As the sync progressed, more and more services became available, but the process was very time-consuming.

By 16:13 EDT on Aug 2, all websites were assessed to be up and operational, as the sync had been allowed to complete.

Timeline of events

All times in EDT.

July 31 21:30: The primary NFS server experiences an I/O deadlock that locks all of the unit’s CPU cores. All core services, and many of the Eclipse web properties, grind to a halt as they wait for NFS I/O.

Aug 1 03:46: Mikaël discovers the outage and, after some exploratory work, contacts the Infra team minutes later.

04:13: Denis (on call) and Jakub (on call) from the Infra team respond.

04:22: The primary NFS server is determined to be the culprit. It is decided that the best path to recovery is to restart the failed server, attempting to preserve as much synchronization as possible.

04:28: The only server accessible to the Infra team is missing some remote access tools, as this was not its intended function.

05:06: Required tools are installed after much struggle. The downed server’s IPMI interface cannot be accessed, as secure credential storage is unavailable on a non-EF laptop.

05:39: A ticket is filed with Rogers Datacentres (our colocation provider) for remote hands to restart the server, as we are unsuccessful using IPMI. Jakub notes that the request is mission-critical.

06:15: It becomes clear that the datacenter facility does not have on-site staff; someone is on their way, but no ETA is provided.

07:02: Planning begins for failing over to the secondary server, even though it is not the best long-term choice.

07:32: Rogers will have someone at the datacenter, “ETA 1 hour” (two hours after the initial call).

07:35: It is decided to wait the hour for remote hands rather than attempt a failover.

07:43: Jakub and Mikaël locate the required credentials in our secure store.

07:51: An IPMI connection is established, and the CPU lockup is confirmed on the failed server. The server is restarted.

08:03: xfs corruption is confirmed on the 45TB data volume; we begin attempting a repair.

08:46: xfs_repair dumps core; the filesystem appears permanently damaged. We create a new filesystem and decide to back-sync the data.

09:03: LDAP user account replication is fully functional, LDAP service resumes HA operation.

09:13: Data sets for core services begin a priority sync to return those services to normal function faster. We estimate the sync time to be at least 13 hours.

16:17: NFS service on the primary server is started. www.eclipse.org recovers. Bugzilla partially recovers. Jakub sends an email to (list_unknown). Data sync continues, but there are 9.2TB of data to copy. It is a lengthy process.

Aug 2 00:42: Most services are fully restored. download.eclipse.org is mostly complete, but is still missing critical data.

08:55: 24 hours after the sync began, 2TB remain.

12:34: Marketplace, Blogs, EclipseCon data still incomplete.

14:46: Most data has been replicated, and the outage is over. Over the next few days, all teams (Infra, Releng, WebDev) work out kinks and glitches associated with the outage.

Aug 4 10:21: Most of the remaining issues have been fixed. A couple of VMs will need more work, and the projects concerned have been notified.

Aug 5 10:11: Denis sends an email to the eclipse.org-committers mailing list to communicate the outage, the fix, and the potential data loss: “The areas where data loss would be noticeable are in the Git repos and mailing list archives. Website data, such as that on www.eclipse.org, bugzilla, Wiki are not affected.

As Git is decentralized, there should be no real data loss, and we ask all projects hosting on Gerrit to re-sync their repos with Eclipse Gerrit. If the ECA validator complains that you're trying to push changes that are not yours, please file a bug/issue and we'll work with you to re-merge your repos.”

Key observations

  • This outage was caused by a Single Point of Failure component. This risk was known to the team, but the component in question has proven itself to be robust for over a decade, and plans to eliminate that Single Point of Failure are costly to implement.
  • The outage was caused by kernel-level software, not hardware. It is arguably among the worst types of outages, at the worst possible time (i.e., a weekend, during vacation and an office shutdown, when staff alertness is at its lowest).
  • Although external monitors alerted on the outage as soon as it happened, the alerts went to a Slack channel that we typically only observe during work hours. Recovery time could have been much faster had we been notified immediately.
  • As a matter of circumstance, the only machine available to log in from was a newly implemented bastion host, which lacked one critical tool.
  • Our Datacenter “Remote Hands” service is not 24/7. As we never use this service, we were unaware of how long it would take for remote hands to become available.
  • The outage was prolonged by the sheer size of the dataset on a single data volume.
  • Having an updated “We are completely down” page, with a link to the Status page and Twitter feed, was likely useful (as opposed to a timeout)
  • The Infra & Releng teams did a remarkable job of coordinating work and ideas, despite the circumstances faced.

Key recommendations

  • External monitoring software needs to communicate outage status to the Infra team via a non-Slack channel (e.g., SMS text), especially when an outage happens outside of normal hours. A minimal sketch of such a notification path follows this list.
  • We need to remove the Single Point of Failure NFS dependency. We’ve been planning a modern, resilient storage cluster for years, and will advocate making this a priority.

    • If NFS cannot be eliminated, we can consider creating smaller volumes with smaller, targeted data sets.
    • Alternatively, having multiple NFS servers, each serving service-specific data, could also be considered. Although this increases the number of Single Points of Failure, and the odds of having an outage, it would avoid the “All Eggs in One Basket” situation that causes a total outage.
  • The Infra team needs a better process for notifications during vacation & office shutdown periods.

  • We need to better communicate outages and, especially, their fallout. Denis sent an email to eclipse.org-committers with valuable information on Aug 5, a full three days after recovery. That email should have been sent on Aug 2.
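
The sketch below illustrates the first recommendation above: routing monitoring alerts to SMS in addition to (or instead of) Slack. Twilio is used purely as an example provider; the environment variable names, credentials, and phone numbers are placeholders, not existing Eclipse Foundation configuration.

    # Hypothetical alert forwarder: monitoring calls page_infra_team() when a
    # check fails, and the message goes out over SMS rather than only Slack.
    # Twilio is an example provider; credentials and numbers are placeholders.
    import os
    from twilio.rest import Client

    def page_infra_team(message: str) -> None:
        client = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])
        for number in os.environ["ONCALL_NUMBERS"].split(","):
            client.messages.create(
                to=number,
                from_=os.environ["TWILIO_FROM"],
                body=f"[eclipse.org monitoring] {message}",
            )

    if __name__ == "__main__":
        page_infra_team("Primary NFS server unresponsive; core services down.")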

Posted Aug 09, 2021 - 22:36 EDT

Resolved
All outstanding issues have been resolved. As there was data loss, all Gerrit-hosted projects are asked to re-sync their workspaces with Gerrit, and to push any missing commits to git.eclipse.org. We will work with projects to ensure Git commits are restored, and CI systems are working as expected.
Posted Aug 09, 2021 - 13:40 EDT
Update
Most of the remaining issues have been fixed. A couple of VMs will need more work, and the projects concerned have been notified.
Posted Aug 04, 2021 - 10:21 EDT
Update
Access to projects-storage.eclipse.org has been restored on all CI instances.
Posted Aug 04, 2021 - 08:50 EDT
Update
There is still some potential fallout from the outage. We're working on resolving it.
Posted Aug 04, 2021 - 08:03 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 02, 2021 - 14:46 EDT
Update
Most websites are back online; the sync continues for the remaining ones.
Posted Aug 02, 2021 - 02:46 EDT
Update
We are continuing to work on a fix for this issue.
Posted Aug 01, 2021 - 10:38 EDT
Update
The issue comes from our main storage backend, which requires a full resync to be in working condition again. Given the data size, this will take about 13 hours. Thanks for your patience. We will keep you posted.
Posted Aug 01, 2021 - 09:37 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 01, 2021 - 04:48 EDT
Investigating
We are currently investigating this issue.
Posted Aug 01, 2021 - 03:50 EDT
This incident affected: Others (blogs.eclipse.org), Core Services (www.eclipse.org, download.eclipse.org, bugs.eclipse.org, git.eclipse.org/c, git.eclipse.org/r, wiki.eclipse.org, accounts.eclipse.org, archive.eclipse.org, marketplace.eclipse.org, www.eclipsecon.org, www.eclipse.org/forums), and API (api.eclipse.org).