OKD Cluster - outage
Incident Report for Eclipse Foundation Services
Postmortem

OKD Cluster Outage, June 23, 2021

Summary

On June 23, 2021, at approximately 17:50 EDT/23:50 CET, a core part of our OKD cluster’s networking was damaged during the cleanup of a test deployment of a new service.  This was noticed within minutes, and a known-good copy of the configuration was reloaded in an attempt to restore service.  However, that attempt did not restore the cluster, at which point other members of the IT team were called in to assist in the recovery.

After more investigation, around 19:30 EDT/1:30 CET, it was decided to attempt a graceful restart of the cluster to try to clear the current state and roll out the corrected configuration.  After one hour, with the issue still unresolved, the current cluster was declared a loss, and focus shifted to backing up as much live data as possible before rebuilding the cluster from the ground up.

European and North American teams worked through their respective nights to restore service.  Once the cluster was rebuilt, we experienced some issues restoring network connectivity due to differences between the cluster software versions before and after the rebuild.  All issues were resolved by 4:53 EDT/10:53 CET on the morning of June 24, and services started coming back online.

Root Cause

The IT team has been investigating deploying a Ceph network storage cluster on top of OKD via the Rook project.  As part of that work, we had been attempting to route storage replication traffic over a separate network, per the official documentation.  When this was done, the Ceph deployment went out of sync, and after initial troubleshooting produced no success, it was decided to remove the Rook/Ceph deployment and start over.
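For reference, the separate replication network was being set up roughly along the following lines.  This is a hedged sketch based on the Rook and Multus documentation rather than our exact manifests; the namespace, interface name, and address range are illustrative placeholders.

# Illustrative only: define a dedicated Multus network for Ceph replication
# traffic.  The interface (eth1) and the CIDR are placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ceph-cluster-net
  namespace: rook-ceph
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "eth1",
    "ipam": { "type": "whereabouts", "range": "192.168.200.0/24" }
  }'
EOF

# The Rook CephCluster spec then selects that network for its replication
# ("cluster") traffic, roughly as follows:
#   spec:
#     network:
#       provider: multus
#       selectors:
#         cluster: rook-ceph/ceph-cluster-net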

As part of that cleanup, the replication network that had been created was deleted via the kubectl delete command.  This had the unexpected effect of removing the OKD cluster network operator, which then disabled the cluster’s internal network; this was the core of the outage.
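One way this can happen, expressed as commands (a hedged reconstruction: the object names are the standard OKD ones, but the exact command and manifest used during the cleanup are an assumption):

# In OKD, additional networks are declared on the cluster-wide Network
# operator object.  Deleting that object (or applying 'kubectl delete' to a
# saved manifest that happens to contain it) removes far more than the one
# extra network:
kubectl delete network.operator.openshift.io cluster    # takes the cluster network configuration with it

# A non-destructive cleanup removes only the additionalNetworks entry:
oc patch network.operator.openshift.io cluster --type=json \
  -p '[{"op": "remove", "path": "/spec/additionalNetworks"}]'

# Operator health can be checked at any time with:
oc get clusteroperator network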

Recommendations

  1. We need a better backup strategy for the core configuration components of the cluster.  While team members had individual copies of parts of this configuration, there was no centralised ‘known good’ backup that we could turn to.  This also applies to running services that carry configuration details.  A sketch of what such a backup could look like follows this list.
  2. Create a blue/green cluster pair to allow us to stage cluster upgrades and to demo/prototype on a non-production cluster.
  3. Build a standalone storage cluster.  While running such a service on top of OKD was initially tempting, this incident has demonstrated that should another such event occur (regardless of cause), we could not only lose data, but the recovery time would also be that much longer.
  4. More configuration as code.  This isn’t directly related to this event, but it became clear that keeping more of our operations and configuration changes in source control would make recovery and troubleshooting easier and increase the number of staff who can assist in a recovery.
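As a starting point for recommendations 1 and 4, the sketch below shows what a centralised, version-controlled snapshot of the core cluster configuration could look like.  The resource names are the standard OKD ones; the repository layout is an assumption, and the list of objects would need to be extended to cover running services.

# Sketch only: export core cluster configuration objects and commit them to
# a Git repository so a 'known good' copy exists outside the cluster.
BACKUP_DIR="cluster-config/$(date +%F)"
mkdir -p "$BACKUP_DIR"

# Cluster-scoped configuration
oc get network.operator.openshift.io cluster -o yaml > "$BACKUP_DIR/network-operator.yaml"
oc get network.config.openshift.io cluster -o yaml > "$BACKUP_DIR/network-config.yaml"
oc get proxy.config.openshift.io cluster -o yaml > "$BACKUP_DIR/proxy.yaml"

# Namespaced operator configuration (ingress)
oc get ingresscontroller default -n openshift-ingress-operator -o yaml \
  > "$BACKUP_DIR/ingresscontroller-default.yaml"

# Keep a reviewable history of every change
git -C cluster-config add . && git -C cluster-config commit -m "config snapshot $(date +%F)"

Running something like this on a schedule, or better, generating these objects from source control in the first place, would have provided the centralised copy that was missing on June 23.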

Timeline of events

  All times in EDT.

June 23

17:50  Rook/Ceph deployment cleanup begins

17:55  The cluster network operator is noticed to be damaged.  Investigation begins

18:10  First attempt is made to recover the network operator from a local copy of the configuration file

18:20  It becomes apparent that the fix has not worked

18:30  Other members of the IT team are contacted to provide assistance in resolving the issue

18:45  The cluster is determined to be badly damaged.  Notice sent to cross-projects-issues-dev and an event created on eclipsestatus.io.

21:00  All attempts to restore the cluster have proven ineffective.  It is decided to make a last attempt to restore service by doing a graceful shutdown and restart of the cluster.

21:15  It becomes clear that the restart has not helped.  The cluster is declared lost and plans begin for a complete rebuild.  All on-hand staff begin working towards this.

23:10  The IT team finishes extracting information from the current cluster.  Rebuild of the cluster begins.

Jun 24

00:04  Initial cluster rebuild with 3 masters and 3 workers completed.

00:43  All attempts to start a test service have failed.  It’s decided to keep adding nodes to the cluster and debug the test service when more staff are on hand.

2:06  Nodes continue to be added to the cluster, and the test service is now running.  However, it is not yet reachable from outside the cluster, which demonstrates that there is still an issue with getting network traffic into the cluster.

2:46  Work continues to resolve the ingress network issues as more nodes are added.

3:12  Ingress controller issue continues to cause problems.  Various solutions are being tried.

3:30  The IT team believes it has found the issue with the ingress controller; however, fixing it may require yet another cluster rebuild.  We continue to look for other solutions.

3:46  The team discusses whether or not to fail over to our old OpenShift Cluster, and chooses not to due to potential software version compatibility issues.

4:53  The ingress controller issue is resolved.  Newer versions of OKD handle ingress in a different way, so the previous configuration is no longer required.  The test service is now accessible from the internet.
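(For reference: in recent OKD releases, ingress is managed through the ingress operator and its IngressController objects.  The checks below are a hedged sketch of how that state can be verified, not a record of the exact commands run at this point.)

# Verify that the ingress operator and its router pods are healthy
oc get clusteroperator ingress
oc get ingresscontroller default -n openshift-ingress-operator -o yaml
oc get pods -n openshift-ingress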

5:25  Basic services start coming back online, beginning with ECA validation.

6:39  Repo.eclipse.org is restored to service.

6:50  From this point on, more services continue to be re-added to the cluster and restored.

14:00  The final 3 nodes that had not yet been added to the rebuilt cluster are added after manual intervention on their consoles.

17:11  The cluster is declared operational and all core services are now back online.  The eclipsestatus.io incident is closed to signal that the IT team considers this resolved.

Posted Jun 25, 2021 - 11:16 EDT

Resolved
All services are back online.
Posted Jun 24, 2021 - 17:11 EDT
Update
open-vsx.org was restored a couple of hours ago, and most Jenkins instances are already back online. We are continuing to work to bring all of them back.
Posted Jun 24, 2021 - 15:39 EDT
Update
Most services have been restored. https://open-vsx.org is next.

Jenkins instances are all coming back online one by one.
Posted Jun 24, 2021 - 11:24 EDT
Update
We are continuing to work on a fix for this issue.
Posted Jun 24, 2021 - 08:43 EDT
Identified
Network traffic now seems to be flowing. Services will be started slowly.
Posted Jun 24, 2021 - 05:01 EDT
Update
Cluster rebuild appears to have been successful. We are continuing to work on this issue.
Posted Jun 24, 2021 - 00:52 EDT
Update
We have determined that the cluster requires a complete rebuild. We will attempt to restore service as soon as possible.
Posted Jun 23, 2021 - 20:45 EDT
Update
We are continuing to investigate this issue.
Posted Jun 23, 2021 - 19:35 EDT
Investigating
We are currently working on a networking issue within our OKD cluster. This affects all hosted services.
Posted Jun 23, 2021 - 18:42 EDT
This incident affected: Working Groups Websites (ecdtools.eclipse.org, edgenative.eclipse.org, events.eclipse.org, iot.eclipse.org, lts.eclipse.org, jakarta.ee, jakartaone.org, openadx.eclipse.org, openhwgroup.org, openmdm.org, openmobility.eclipse.org, openpass.eclipse.org, osdforum.org, science.eclipse.org, sparkplug.eclipse.org, tangle.ee, www.eclipse.org/org/research, open-vsx.org), Others (blogs.eclipse.org, planeteclipse.org), CBI (repo.eclipse.org, ci.eclipse.org, help.eclipse.org), and Core Services (git.eclipse.org/c).