An error in the configuration for our primary load balancer software was missed during code review and, combined with other unrelated work, resulted in the primary load balancer going offline and the secondary taking over. However, the secondary contained the same error, which required staff intervention to correct before service could be restored.
A similar sequence of events played out again a few hours later with the downloads service, which was also recovered by staff.
Root cause analysis
Yesterday a member of the IT staff noticed that the SSL certificates used by our configuration management software to secure client/server communications had expired, and thus no new changes had been deployed since 2023-05-08. As there had been no recent changes, this was deemed a low-risk repair.
The solution to this SSL certificate issue was to generate a new CA certificate, then remove and recreate the SSL certificates on the clients. As part of this process, the configuration management agent on each client server attempts to apply any pending updates that have not previously been applied.
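For illustration, the repair might look roughly like the following, assuming a Puppet-style agent/server setup (the actual configuration management tool is not named above, and all hostnames, paths, and commands here are illustrative assumptions):

```shell
# Illustrative only: assumes Puppet; the tool in use is not named in this post.
# On the configuration management server: clean the stale agent cert.
sudo puppetserver ca clean --certname gateway1.example.com

# On each client: remove the expired certs and request fresh ones.
sudo rm -rf /etc/puppetlabs/puppet/ssl
sudo puppet agent --test
```

The hazard is in the last step: an agent run does not just re-certify, it also fetches and applies every change queued since the last successful run, which is how the unreviewed configuration error reached the gateways.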
This SSL update work was performed sequentially on the primary and secondary gateways without issue, until an Nginx configuration error that had gone unnoticed caused the configuration management client to restart the Nginx service after the reload process (the usual method of applying updates) failed with an error. This resulted in a ‘failure’ of the primary gateway.
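The reload-versus-restart distinction matters here: a reload with a broken config fails and leaves the old workers running, while a restart with a broken config takes the service down entirely. A minimal sketch of a guard that validates before reloading (the function and its parameters are hypothetical; in practice the check would be `nginx -t` and the action `nginx -s reload`):

```shell
#!/bin/sh
# Hedged sketch: only reload a service if its configuration validates.
# The check and reload commands are parameters so the guard logic is
# visible without assuming nginx is installed.
safe_reload() {
    check_cmd="$1"   # e.g. "nginx -t"
    reload_cmd="$2"  # e.g. "nginx -s reload"
    if $check_cmd; then
        $reload_cmd
    else
        echo "config test failed; refusing to reload" >&2
        return 1
    fi
}
```

Had the configuration management run tested the config and refused to act on failure, the primary would have kept serving with its old, working configuration.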
Once the primary had been down for the requisite period of time, the secondary server took over in an automatic attempt to restore service. Unfortunately, it suffered from the same configuration error, so traffic remained halted until IT staff found and corrected the problem.
Several hours later, the download and archive servers restarted their web services to complete log rotations, and essentially the same failure occurred.
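The log-rotation trigger is typically a post-rotate hook that asks the web server to reopen or reload; if the on-disk configuration is invalid at that moment, the hook fails the same way the earlier reload did. A hypothetical stanza, assuming logrotate is the rotation tool (paths and options are illustrative):

```
# Illustrative logrotate config; the actual rotation tool is not named above.
/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    sharedscripts
    postrotate
        # Validate before signaling, so a bad config does not take the
        # service down during an unattended overnight rotation.
        nginx -t && nginx -s reopen
    endscript
}
```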
2023-05-09 2:35pm EDT A member of the IT staff updates the SSL certificates used to secure the configuration management client/server interactions on both our primary and secondary gateways.
2023-05-09 2:36pm EDT The configuration management tool pulls in pending changes.
2023-05-09 2:37pm EDT A reload of our load balancer software on the primary fails due to a configuration error.
2023-05-09 2:43pm EDT Remote service monitors begin reporting outages.
2023-05-09 2:44pm EDT With the primary load balancer unresponsive, the secondary assumes control and begins redirecting traffic to itself.
2023-05-09 2:49pm EDT IT staff begin investigating why the secondary is not serving traffic.
2023-05-09 2:50pm EDT The configuration error that disabled the primary is identified on the secondary and corrected.
2023-05-09 2:51pm EDT Traffic begins flowing and services begin to return to normal.
2023-05-09 3:20pm EDT An IT staff member is dispatched to the IDC to confirm the state of the primary and work through the disaster recovery process for the gateway pair.
2023-05-09 5:00pm EDT IT team declares the disaster recovery workflow complete, and begins monitoring.
2023-05-10 2:46am EDT The download and archive servers begin log rotations and restart their web services, which fail due to the same configuration error.
2023-05-10 3:25am EDT The Releng team begins receiving reports that downloads and archive servers are offline, and begins initial investigations.
2023-05-10 4:04am EDT The Releng team contacts the on-call Infra team staff members to assist with resolution.
2023-05-10 4:44am EDT The issue is identified and service is restored.