On June 23, 2021, at approximately 17:50 EDT/23:50 CET, a core part of our OKD cluster’s networking was damaged during the cleanup of a test deployment of a new service. This was noticed within minutes, and a known-good copy of the configuration was reloaded in an attempt to restore service. However, the reload did not restore service, at which point other members of the IT team were called in to assist in the recovery.
After further investigation, around 19:30 EDT/1:30 CET, it was decided to attempt a graceful restart of the cluster to clear the current state and roll out the corrected configuration. After an hour without progress, the cluster was declared a loss, and focus shifted to backing up as much live data as possible before rebuilding the cluster from the ground up.
European and North American teams worked through their respective nights to restore service. Once the cluster was rebuilt, we experienced issues restoring network connectivity due to differences between the cluster software versions before and after the rebuild. All issues were resolved by 4:53 EDT/10:53 CET on the morning of June 24, and services started coming back online.
The IT team has been investigating deploying a Ceph network storage cluster on top of OKD via the Rook project. As part of that work, we had been attempting to route storage replication traffic over a separate network, per the official documentation. When this was done, the Ceph deployment went out of sync, and after initial troubleshooting produced no results, it was decided to remove the Rook/Ceph deployment and start over.
As part of that cleanup, the replication network that had been created was deleted via the kubectl delete command. This had the unexpected effect of removing the OKD cluster network operator, which in turn disabled the cluster’s internal network; this was the core of the outage.
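Previewing a delete before executing it can surface this failure mode in advance. The following is a minimal sketch of that approach; the manifest file name, resource names, and namespace are hypothetical stand-ins for illustration, not the actual objects involved in this incident:

```shell
# Hypothetical cleanup of a Rook/Ceph test deployment.
# File, resource, and namespace names below are illustrative only.

# Client-side dry run: lists every object a manifest-based delete
# would remove, without changing anything on the cluster.
kubectl delete -f replication-network.yaml --dry-run=client

# Scoping the delete to the test deployment's namespace reduces the
# chance of matching cluster-scoped resources owned by an operator.
kubectl delete networkattachmentdefinitions.k8s.cni.cncf.io ceph-replication \
  -n rook-ceph --dry-run=client
```

For operator-managed objects, inspecting the resource first with `kubectl get <resource> -o yaml` and checking `metadata.ownerReferences` can also reveal when an object is owned or reconciled by a cluster operator before it is deleted.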
All times below are in EDT; entries from 00:04 onward fall on June 24.
17:50 Rook/Ceph deployment cleanup begins
17:55 Cluster network operator is noticed as being damaged. Investigation begins
18:10 First attempt is made to recover the network operator from a local copy of the configuration file
18:20 It becomes apparent that the fix has not worked
18:30 Other members of the IT team are contacted to provide assistance in resolving the issue
18:45 The cluster is determined to be badly damaged. Notice sent to cross-projects-issues-dev and an event created on eclipsestatus.io.
21:00 All attempts to restore the cluster have proven ineffective. It is decided to make a last attempt to restore service by doing a graceful shutdown and restart of the cluster
21:15 It becomes clear that the restart has not helped. The cluster is declared lost and plans begin for a complete rebuild. All on-hand staff begin working towards this.
23:10 The IT team finishes extracting information from the current cluster. Rebuild of the cluster begins.
00:04 Initial cluster rebuild with 3 masters and 3 workers completed.
00:43 All attempts to start a test service have failed. It’s decided to keep adding nodes to the cluster and debug the test service when more staff are on hand.
2:06 Nodes continue to be added to the cluster and the test service is now running; however, it is not reachable from outside the cluster, demonstrating that there is still an issue with getting network traffic into the cluster.
2:46 Work continues to resolve the ingress network issues as more nodes are added.
3:12 Ingress controller issue continues to cause problems. Various solutions are being tried.
3:30 The IT team believes it has found the issue with the ingress controller, however fixing it may require yet another cluster rebuild. We continue to look for other solutions.
3:46 The team discusses whether or not to fail over to our old OpenShift Cluster, and chooses not to due to potential software version compatibility issues.
4:53 The ingress controller issue is resolved. Newer versions of OKD handle ingress in a different way and so previous configurations are no longer required. The test service is now accessible from the internet.
5:25 Basic services start coming online, starting with ECA validation.
6:39 repo.eclipse.org is restored to service.
6:50 From this point on, more services continue to be re-added to the cluster and restored.
14:00 The final 3 nodes that have not been added to the cluster are re-added after manual intervention on the consoles.
17:11 The cluster is declared operational and all core services are now back online. The eclipsestatus.io incident is closed to signal that the IT team considers the issue resolved.