On Saturday, July 31, 2021, at around 21:30 EDT, eclipse.org suffered a severe outage across all of its core services. The outage lasted approximately 18 hours for core services, and non-core services were degraded for an additional 12 hours. Some Git data was lost as a result of this failure.
At the time of the outage, the Eclipse Foundation had entered a week-long shutdown period. Infra staff were either on call or on vacation, with varying degrees of Internet access.
Many eclipse.org services, including core services such as www.eclipse.org, Bugzilla, and Git/Gerrit, as well as non-core websites, rely on a pair of NFS file servers for backend storage. One server acts as primary, and a secondary server receives periodic (batched) updates. The failover process is entirely manual.
This Single Point of Failure is well known to the Infra team. Plans to replace it have been ongoing for years.
On Saturday, July 31 around 21:30 EDT, the xfs filesystem on the primary NFS server encountered a deadlock on the 45TB volume hosting data for all the aforementioned services; unresponsive I/O then locked up all of the server's CPU cores. On Sunday, Aug 1, at 03:46 EDT, Mikaël Barbero (EF-EU) discovered the down state of these services, immediately updated our status page, and called in the Infra team (EF-CA) to investigate. Minutes later, the failed server was identified as the cause, but the path to recovery was unclear.
Early focus was placed on restarting the failed server, since the primary and secondary NFS servers also replicate user authentication data. As the Infra team was working remotely, it took an extended period of time to reach the console of the failed server. At 07:51, we established that the cause of the failure was a deadlock state and proceeded to restart the server.
The restart did not restore service, as data corruption was present on the data volume. xfs_repair was invoked, and it reported and fixed numerous errors. At 08:46, the xfs_repair tool itself crashed (core dump) and destroyed the file system it was repairing.
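For reference, an xfs repair attempt of this kind typically looks like the sketch below. The device path and mount point are placeholders, not values from our servers, and the commands are wrapped in an echo so the sketch is safe to run as-is.

```shell
DEV=/dev/mapper/nfs-data     # hypothetical 45TB data volume
MNT=/srv/nfs                 # hypothetical mount point
RUN=echo                     # set RUN="" to actually execute the commands

$RUN umount "$MNT"           # xfs_repair must not run on a mounted filesystem
$RUN xfs_repair -n "$DEV"    # -n: dry run, report damage without modifying anything
$RUN xfs_repair "$DEV"       # real repair pass; -L (zero the log) is a last resort
```

The dry-run pass first is the safer order of operations: it shows the extent of the damage before any writes happen.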
Focus then shifted towards creating a new volume (using a different file system) and back-filling data from the secondary, rather than failing over to the secondary server to restore service. We were aware this could extend the outage by several hours, but we knew the end state would be better (from a data protection perspective) than failing over, breaking sync, and having a single secondary server with no backup.
At 09:03, the new volume was created, and we established that LDAP user synchronization between primary and secondary servers was intact, as LDAP data is on a separate, unaffected volume. At 09:17, data sync had begun, and it was estimated that 13 hours minimum would be required at full wire speed.
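The back-fill itself was a bulk copy from the secondary onto the new volume. A minimal sketch, assuming rsync over SSH; the hostname and paths are placeholders, and the command is wrapped in an echo so it is safe to run:

```shell
SRC=secondary-nfs.example.org:/srv/nfs/   # hypothetical secondary server and export
DST=/srv/nfs-new/                         # hypothetical new volume mount point
RUN=echo                                  # set RUN="" to actually execute

# -a: preserve permissions/ownership/timestamps; -H: hard links;
# -A -X: ACLs and extended attributes; --delete: exact mirror of the source.
$RUN rsync -aHAX --delete --info=progress2 "$SRC" "$DST"
```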
During the outage, we made every effort to provide periodic updates to the status page (www.eclipsestatus.io) with accurate state information as we had it.
At 16:17 Aug 1, almost 19 hours after the initial server failure, www.eclipse.org was responsive and serving its normal content. As the sync progressed, more and more services became available, but the process was very time-consuming.
By 16:13 EDT Aug 2, once the sync had been allowed to complete, all websites were assessed to be up and operational.
All times in EDT.
July 31 21:30: primary NFS server experiences an I/O deadlock, which locks all of the unit's CPU cores. All core services, and many of the Eclipse web properties, grind to a halt as they wait for NFS I/O.
Aug 1 03:46: Mikaël discovers the outage and, after some exploratory work, contacts the Infra team minutes later.
04:13: Denis (on call) and Jakub (on call) from the Infra team respond.
04:22: Primary NFS server is determined to be the culprit. It is decided that the best path to recovery is to restart the failed server, preserving as much synchronization as possible.
04:28: The only server accessible to the Infra team is missing some remote access tools, as this was not its intended function.
05:06: Required tools are installed after much struggle. We cannot access the downed server's IPMI interface, as the secure credential store is unavailable on a non-EF laptop.
05:39: A ticket is filed with Rogers Datacentres (our colo provider) for remote hands to restart the server, as we are unsuccessful using IPMI. Jakub flags the request as mission-critical.
06:15: It becomes clear that the Datacenter facility does not have on-prem staff, and that someone is on their way, with no ETA provided.
07:02: Planning begins to fail over to the secondary server, even though it's not the best long-term choice.
07:32: Rogers will have someone at the datacenter "ETA 1 hour" (two hours after the initial call).
07:35: It's decided to wait the hour for remote hands rather than attempt failover.
07:43: Jakub and Mikaël locate the required credentials in our secure store.
07:51: IPMI connection is established, and CPU lockup is confirmed on the failed server. The server is restarted.
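Remotely power-cycling a hung server over IPMI typically looks like the following; the BMC hostname and credentials are placeholders, and the commands are wrapped in an echo so the sketch is safe to run:

```shell
BMC=nfs1-ipmi.example.org   # hypothetical BMC/IPMI address of the failed server
IPMI_PASS=changeme          # placeholder; real credentials come from the secure store
RUN=echo                    # set RUN="" to actually execute

# Check chassis state first, then issue a hard power cycle.
$RUN ipmitool -I lanplus -H "$BMC" -U admin -P "$IPMI_PASS" chassis power status
$RUN ipmitool -I lanplus -H "$BMC" -U admin -P "$IPMI_PASS" chassis power cycle
```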
08:03: xfs corruption is confirmed on the 45TB data volume; we begin attempting repair.
08:46: xfs_repair core dumps; the filesystem appears permanently damaged. We create a new filesystem and decide to back-sync data from the secondary.
09:03: LDAP user account replication is fully functional, LDAP service resumes HA operation.
09:13: Data sets for core services begin a priority sync to return them to normal function faster. We estimate total sync time to be at least 13 hours.
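The 13-hour estimate can be sanity-checked with a back-of-envelope calculation; the effective link speed and efficiency below are assumptions for illustration, not figures from our monitoring:

```python
# Rough sync-time estimate; link speed and efficiency are assumptions.
def sync_hours(data_tb: float, link_gbit: float, efficiency: float = 0.8) -> float:
    """Hours needed to copy data_tb terabytes over a link_gbit Gbit/s link."""
    bytes_total = data_tb * 1e12                      # TB -> bytes (decimal)
    bytes_per_sec = link_gbit * 1e9 / 8 * efficiency  # Gbit/s -> usable bytes/s
    return bytes_total / bytes_per_sec / 3600

# 9.2 TB over an assumed ~2 Gbit/s effective link lands near the 13 h estimate.
print(round(sync_hours(9.2, 2.0), 1))
```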
16:17: NFS service on the primary server is started. www.eclipse.org recovers. Bugzilla partially recovers. Jakub sends an email to (list_unknown). Data sync continues, but there are 9.2TB of data to copy, and it is a lengthy process.
Aug 2 00:42: Most services are fully restored. download.eclipse.org is mostly complete, but is still missing critical data.
08:55: 24 hours after sync has begun, 2TB remain.
12:34: Marketplace, Blogs, and EclipseCon data are still incomplete.
14:46: Most data has been replicated, and the outage is over. Over the next few days, all teams (Infra, Releng, WebDev) work out kinks and glitches associated with the outage.
Aug 4 10:21: Most of the remaining issues have been fixed. A couple of VMs will need more work, and the projects concerned have been notified.
Aug 5 10:11: Denis sends an email to the eclipse.org-committers mailing list to communicate the outage, the fix, and the potential data loss: “The areas where data loss would be noticeable are in the Git repos and mailing list archives. Website data, such as that on www.eclipse.org, bugzilla, Wiki are not affected.
As Git is decentralized, there should be no real data loss, and we ask all projects hosting on Gerrit to re-sync their repos with Eclipse Gerrit. If the ECA validator complains that you're trying to push changes that are not yours, please file a bug/issue and we'll work with you to re-merge your repos.”
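For projects following that advice, re-syncing a local clone back to Gerrit amounts to roughly the following; the remote name is the usual default, but your repo URL and branch layout may differ (commands wrapped in an echo so the sketch is safe to run outside a real clone):

```shell
RUN=echo   # set RUN="" to actually execute inside your clone

$RUN git fetch origin        # see what survived on the server side
$RUN git push origin --all   # push all local branches to restore lost commits
$RUN git push origin --tags  # tags are not covered by --all
```

Pushing from an up-to-date local clone works precisely because Git is decentralized: every clone carries the full history.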
We need to remove the Single Point of Failure NFS dependency. We've been planning a modern, resilient storage cluster for years, and will advocate for this becoming a priority.
The Infra team needs a better process for notifications during vacation and office shutdown times.
We need to better communicate outages, and especially the fallout of such outages. Denis sent an email to eclipse.org-committers with valuable information on Aug 5, a full 3 days after recovery. That email should have been sent on Aug 2.