projects.eclipse.org and marketplace.eclipse.org are not operational
Incident Report for Eclipse Foundation Services
Postmortem

On Monday, November 22, hosts in the group of machines serving marketplace.eclipse.org, projects.eclipse.org, newsroom.eclipse.org and blogs.eclipse.org started becoming unresponsive due to exceptionally high system load.

The Webmaster team was investigating by 8:45 AM Eastern. Initially only some of the hosts in the group had become unresponsive, so we decided to restart them. Of course, this led to the functioning hosts slowly becoming unresponsive as well, and then to the load being directed back at the machines that had just been restarted.

While the above was happening we were able to gain access to the hosts, but all we could see was that the system OOM-Killer was running against the web server process, which forced the kernel to begin swapping and effectively ground the hosts to a halt.

Examination of traffic into our web gateway didn’t show any obvious signs of a DoS-like attack.

At 10:00 AM we began by increasing the RAM allocated to the hosts for these sites, which provided some temporary relief. Again, the system and web server logs provided no insight into what was causing the resource consumption.

By 10:45 AM the team chose to try to restore service by separating traffic for the sites into 2 groups of hosts, and adding an extra host for projects and marketplace in order to spread the load out further. A new host was created, and by 11:00 AM the load had begun to stabilize and the sites were returning to active service.

With nothing obvious in the logs, the team presumed that there had been some kind of DoS-style event and that it had abated.

However, the problem began to recur at 8:00 PM on Monday, and by Tuesday the 23rd services were again mostly unavailable. Webmasters again checked logs where possible, finding nothing more than we had on Monday. By 9:00 AM Eastern on the 23rd, Webmasters had doubled the RAM for the affected hosts again, and had taken to restarting nodes when they became completely unresponsive.

Checks of the incoming traffic didn’t indicate excessive requests or abuse, so again a DoS attack seemed unlikely.

As an emergency attempt to keep the hosts from needing to be restarted, Webmasters cut down on the maximum number of web server processes that could spawn and the amount of resources they could consume. This restored some service, although it remained badly degraded.
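
As a rough illustration of the arithmetic behind capping web server processes, here is a minimal Python sketch. The function name and every figure in it are hypothetical and are not taken from the incident; the report does not say which web server was in use or what limits were chosen. In Apache's prefork MPM the resulting number would correspond to a setting such as MaxRequestWorkers, and in PHP-FPM to pm.max_children.

```python
def max_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Largest worker count whose combined memory fits in the RAM
    left over after reserving some for the OS and other services."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    # Never return a negative count if the reserve exceeds total RAM.
    return max(usable_mb // per_worker_mb, 0)


# Hypothetical host: 8 GiB RAM, 1 GiB reserved, 128 MiB per worker process.
print(max_workers(8192, 1024, 128))  # 56
```

Keeping the cap below this bound means a burst of requests queues at the web server instead of pushing the host into swap, at the cost of slower responses under load.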

At this point the Webmaster and WebDev teams were both looking into the issue by increasing the verbosity of the logs and adding extra debugging telemetry. With this additional data, Webmasters were able to see 10-15 requests a minute that were returning a 500 due to a fatal PHP error. Once the incoming requests had been correlated with those returning the 500, the WebDev team was able to determine that a change in PHP version was breaking some previously functional code. A patch was created and deployed, at which point the load began stabilizing and service started returning to normal. By 12:00 PM on the 23rd, Webmasters felt that the service had returned to operational status, although we continued to monitor it.
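
The kind of correlation described above can be sketched in Python as follows. This is only an illustration, not the tooling the teams used: it assumes access logs in the common "combined" format, and the function name, sample lines and request paths are all hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Pulls the timestamp, request path and status code out of a
# combined-format access log line (the log format is an assumption).
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def server_errors(lines):
    """Count 5xx responses per minute and per request path."""
    per_minute = Counter()
    per_path = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None or not m.group("status").startswith("5"):
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        per_minute[ts.strftime("%Y-%m-%d %H:%M")] += 1
        per_path[m.group("path")] += 1
    return per_minute, per_path


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [23/Nov/2021:09:30:01 -0500] "GET /api/example HTTP/1.1" 500 0 "-" "curl"',
    '192.0.2.2 - - [23/Nov/2021:09:30:40 -0500] "GET /api/example HTTP/1.1" 500 0 "-" "curl"',
    '192.0.2.3 - - [23/Nov/2021:09:31:05 -0500] "GET /index.html HTTP/1.1" 200 512 "-" "curl"',
]

per_minute, per_path = server_errors(SAMPLE)
print(per_minute.most_common(5))
print(per_path.most_common(5))
```

Grouping the failures by path is what narrows things down: if a handful of endpoints account for nearly all of the 500s, those are the code paths to compare against the fatal errors in the PHP log.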

After talking to the Releng team, we think the trigger was a validation job that runs at 8:00 PM, which, when combined with the broken code, was causing a large increase in the number of internal requests. In effect, the hosts were DoSing themselves.

Posted Nov 24, 2021 - 10:15 EST

Resolved
This incident has been resolved.
Posted Nov 24, 2021 - 08:52 EST
Update
We are continuing to monitor the fix.
Posted Nov 23, 2021 - 15:14 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 23, 2021 - 13:24 EST
Update
We are continuing to investigate this issue.
Posted Nov 23, 2021 - 09:09 EST
Investigating
We are currently investigating this issue.
Posted Nov 23, 2021 - 09:07 EST
This incident affected: API (projects.eclipse.org) and Core Services (marketplace.eclipse.org).