OpenVSX returns 502 'Bad Gateway' errors
Incident Report for Eclipse Foundation Services
Postmortem

Summary

On July 20th, we rolled out a new build of the open-vsx.org staging website in preparation for updating the main deployment.  After almost a day with no errors in the logs, we decided to go live with the update to the production deployment.  A few hours later it became clear that something was causing errors on the site.  Initial indications were that some network requests were taking too long to respond.  Webmaster re-ran the build and restarted the deployment, which brought the site back online, although it started to degrade again a little while later.

At that point Webmaster attempted to roll back the deployment by retriggering the last known good build, but that did not resolve the issue as expected.  It became clear at that time that Webmaster would need to manually redeploy a known good version of the site.  Work began on the morning of the 22nd to determine which of the published container images were viable, and by early afternoon a suitable image had been found and the site was declared operational.

Root cause

While the specific root cause is not known, it is possibly related to either https://github.com/eclipse/openvsx/issues/501 (UI code sends high number of requests to backend) or https://github.com/eclipse/openvsx/commit/8c1f272b39e227a65af29fe6e7794f148aebfb68 (run resource migration job in the background).  Over the course of this outage it became clear that the update had changed the way the site interacted with the backend database, causing the public network interface on the database server to saturate its 1Gbps uplink, which in turn delayed other queries.  This also began to impact our ingress machines, as other sites started to experience delays in network access via the ingress process.

Overall system load on the database server and ingress nodes was high (2-3 as returned by top), but not indicative of a completely CPU-bound process; load average also counts tasks blocked on I/O, which is consistent with the network saturation described above rather than CPU exhaustion.

Recommendations

Improve deployment and rollback documentation.

Documentation at the time provided little to no guidance on these tasks, so the Webmaster involved was left guessing at the correct workflow for such activities.

Improve testing (including performance) on the staging deployment.

While the staging instance was updated prior to this event, it's clear that the minimal amount of traffic it receives is insufficient to accurately test a new deployment.  Increasing the amount of automated testing, or providing some kind of traffic replay from the production instance, could be useful.  This also includes testing a wider surface, such as the Public and User API endpoints, rather than the more limited testing that occurs now.
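As one possible starting point, a small smoke test run against staging after each build could exercise the public endpoints and fail the pipeline when responses are slow or erroring.  The sketch below is only illustrative: the staging hostname, the two endpoint paths, and the two-second latency budget are assumptions, not the project's actual test configuration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    /**
     * Minimal staging smoke test: fails if key public API endpoints are slow
     * or return non-200 responses.  The base URL, endpoint paths, and latency
     * budget are illustrative assumptions, not a confirmed test plan.
     */
    public class StagingSmokeTest {
        private static final String BASE = "https://staging.open-vsx.org"; // assumed staging host
        private static final Duration BUDGET = Duration.ofSeconds(2);      // assumed latency budget

        public static void main(String[] args) {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(5))
                    .build();

            String[] paths = {
                    "/api/-/search?query=java&size=10", // public search endpoint
                    "/api/redhat/java",                 // extension metadata endpoint
            };

            boolean ok = true;
            for (String path : paths) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(BASE + path))
                        .timeout(BUDGET)
                        .GET()
                        .build();
                long start = System.nanoTime();
                try {
                    HttpResponse<Void> response =
                            client.send(request, HttpResponse.BodyHandlers.discarding());
                    long millis = (System.nanoTime() - start) / 1_000_000;
                    System.out.printf("%s -> %d in %d ms%n", path, response.statusCode(), millis);
                    if (response.statusCode() != 200 || millis > BUDGET.toMillis()) {
                        ok = false;
                    }
                } catch (Exception e) {
                    System.out.printf("%s -> FAILED (%s)%n", path, e);
                    ok = false;
                }
            }
            // A non-zero exit code lets the build pipeline block promotion to production.
            System.exit(ok ? 0 : 1);
        }
    }

Run as part of the deployment pipeline, a failing check like this would stop a promotion before real traffic is affected.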

Get more people involved in the testing process.

Currently testing is limited to the local dev environment, but more eyes on changes once they are deployed to staging are clearly desirable.  The team behind VSCodium uses open-vsx.org, so perhaps we could make an effort to contact them in order to test things more effectively.

Improve incident communication.

Communication was strained while a fix was being pursued, which left the community and other staff in the dark about what was happening.  The team working on an issue needs either to make a concerted effort to communicate or to ask an uninvolved team member to handle communication.

Add more instrumentation.

Currently there is little information about what is going on within the site, so we are relying on external monitoring to provide insights about issues.  Adding instrumentation to the application code itself could help pinpoint issues sooner.
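For example, timers around hot paths such as search and extension downloads would have shown latency climbing as the database uplink saturated, well before external probes started returning 502s.  Below is a minimal sketch assuming the backend runs with a Micrometer-style metrics registry; the class, the repository interface, and the metric name are illustrative and do not come from the openvsx code base.

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;

    /**
     * Illustrative application-level instrumentation, assuming Micrometer is
     * available on the classpath.  The repository type and metric name are
     * placeholders, not actual openvsx identifiers.
     */
    public class InstrumentedSearchService {
        private final ExtensionRepository repository; // hypothetical data-access dependency
        private final Timer searchTimer;

        public InstrumentedSearchService(ExtensionRepository repository, MeterRegistry registry) {
            this.repository = repository;
            this.searchTimer = Timer.builder("openvsx.search.duration")
                    .description("Time spent answering search requests, including database access")
                    .register(registry);
        }

        public SearchResult search(String query) {
            // record() surfaces both call rate and latency to the monitoring backend,
            // so a regression like the database saturation seen in this incident would
            // show up in dashboards before external probes start failing.
            return searchTimer.record(() -> repository.search(query));
        }

        // Minimal placeholder types so the sketch is self-contained.
        public interface ExtensionRepository { SearchResult search(String query); }
        public record SearchResult(java.util.List<String> extensionNames) {}
    }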

Document (and keep up to date) who is responsible for ‘what’.

Knowing who to contact when there are issues is a prime requirement, especially with an evolving infrastructure.  We should also determine which Eclipse SLA level applies to open-vsx.org or, if funding is available, look at a custom SLA.

Increase the number of people participating in the project.

At this time neither the website ‘application’ nor the open source project it is based on has any dedicated developers or support staff to help improve both code quality and the release process.  While this is a perennial issue in open source, perhaps the working group can help out.

Posted Jul 29, 2022 - 14:50 EDT

Resolved
This incident has been resolved.
Posted Jul 24, 2022 - 12:42 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 23, 2022 - 14:17 EDT
Update
We are continuing to investigate this issue.
Posted Jul 22, 2022 - 11:15 EDT
Investigating
We are currently investigating this issue.
Posted Jul 22, 2022 - 04:50 EDT