
Internal Issues with GitLab related to login and commit activity

Oct 10 at 03:52pm CEST
Affected services
gitlab.eclipse.org

Resolved
Oct 14 at 03:07pm CEST

Postmortem: GitLab Outage Due to OS-Level Firewall Change

Incident Date: Monday, October 6, 2025

Duration: ~7 days (October 6–13)

Impact: GitLab Workhorse nodes could not reach external services, which degraded authentication and CI/CD pipeline performance and caused external integrations to fail.


Summary

On October 6, 2025, an operating system update was applied to GitLab Workhorse nodes as part of routine patching. The update included a newer version of firewalld (≥ 0.9.11), which removed support for the deprecated firewalld.direct interface. The SNAT rules on these nodes were defined as direct passthrough rules, so the change silently invalidated them and severed outbound connectivity from the nodes to the public Internet.

The issue went undetected until October 10, when CI jobs and authenticated GitLab operations began failing because external endpoints (e.g., authentication and webhook targets) were unreachable. Investigation revealed that the SNAT configuration was no longer active, and that the underlying cause was the removal of the direct passthrough mechanism from firewalld.


Timeline

Date      Event
Oct 6     OS patch applied to GitLab Workhorse nodes.
Oct 6-10  SNAT rules silently ignored due to removal of firewalld.direct.
Oct 9-10  Failures reported; initial investigations.
Oct 13    Root cause identified: deprecated passthrough rules no longer processed.
Oct 13    SNAT rules migrated to nftables and validated.
Oct 13    Full service restoration confirmed.

Root Cause

The incident stemmed from a failure to detect and act on the deprecation and removal of firewalld.direct in the updated OS. The IT team did not review the changelog or validate firewall rule persistence post-upgrade. As a result, critical SNAT rules were silently dropped, isolating GitLab Workhorse nodes from the Internet.
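
A pre-upgrade audit could flag this dependency before the firewall package changes. The following is a minimal sketch, not the tooling used during the incident: it assumes the direct rules live in firewalld's direct configuration file (normally /etc/firewalld/direct.xml) and scans it for rule or passthrough entries that perform source NAT. The sample XML, its addresses, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a firewalld direct configuration. The addresses are
# documentation placeholders, not the rules from the incident.
SAMPLE_DIRECT_XML = """<direct>
  <passthrough ipv="ipv4">-t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j SNAT --to-source 203.0.113.10</passthrough>
  <rule ipv="ipv4" table="filter" chain="INPUT" priority="0">-p tcp --dport 22 -j ACCEPT</rule>
</direct>
"""

# iptables targets that rewrite the source address of outbound traffic.
NAT_TARGETS = ("SNAT", "MASQUERADE")


def find_direct_nat_entries(xml_text):
    """Return (tag, args) for each direct rule or passthrough doing source NAT."""
    root = ET.fromstring(xml_text)
    hits = []
    for element in root.iter():
        if element.tag not in ("rule", "passthrough"):
            continue
        args = (element.text or "").strip()
        # Match whole tokens so that unrelated strings containing "SNAT"
        # as a substring are not reported.
        if any(target in args.split() for target in NAT_TARGETS):
            hits.append((element.tag, args))
    return hits
```

Calling find_direct_nat_entries on the contents of the real file before an upgrade would list every entry that needs a migration plan; an empty list suggests that no source NAT depends on the direct interface.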


Resolution

  • Migrated SNAT rules to a persistent nftables configuration.
  • Validated outbound connectivity and authentication / CI/CD pipeline health.
  • Updated internal documentation to reflect firewall rule migration.
  • Documented a post-upgrade check for SNAT functionality.
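
The post-upgrade SNAT check mentioned above could be automated along these lines. This is a sketch under stated assumptions, not the check that was actually documented: it inspects the JSON form of the ruleset as printed by `nft -j list ruleset` and reports rules that contain an snat or masquerade statement. The sample ruleset, interface name, and address are hypothetical.

```python
import json
import subprocess


def find_source_nat_rules(ruleset):
    """Return the rule objects that contain an snat or masquerade statement."""
    matches = []
    for item in ruleset.get("nftables", []):
        rule = item.get("rule")
        if rule is None:
            continue  # metainfo, table, chain, and other non-rule objects
        for statement in rule.get("expr", []):
            if isinstance(statement, dict) and (
                "snat" in statement or "masquerade" in statement
            ):
                matches.append(rule)
                break
    return matches


def load_live_ruleset():
    """Read the running ruleset as JSON (requires root and the nft binary)."""
    output = subprocess.run(
        ["nft", "-j", "list", "ruleset"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(output)


# Hypothetical ruleset in the shape nft emits; values are placeholders.
SAMPLE_RULESET = {
    "nftables": [
        {"metainfo": {"json_schema_version": 1}},
        {"table": {"family": "ip", "name": "nat"}},
        {"chain": {"family": "ip", "table": "nat", "name": "postrouting",
                   "type": "nat", "hook": "postrouting", "prio": 100,
                   "policy": "accept"}},
        {"rule": {"family": "ip", "table": "nat", "chain": "postrouting",
                  "expr": [
                      {"match": {"op": "==",
                                 "left": {"meta": {"key": "oifname"}},
                                 "right": "eth0"}},
                      {"snat": {"addr": "203.0.113.10"}},
                  ]}},
    ]
}
```

On a node, find_source_nat_rules(load_live_ruleset()) returning an empty list after an upgrade would indicate that the SNAT rules are no longer loaded, which is the condition that went unnoticed here.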

Preventive Actions

  • Monitoring: Added outbound connectivity checks to GitLab health probes.
  • Process: Added a post-upgrade step to validate outbound connectivity.
  • Environment: Set up a staging environment to build and test GitLab/OS upgrades and new features before deploying to production.
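
An outbound connectivity check of the kind listed above can be as small as a TCP connection attempt to each external dependency. The sketch below is illustrative only and does not reproduce the probes that were deployed. The function name is hypothetical, and the `connect` parameter exists so that the check can be exercised without network access.

```python
import socket


def check_outbound(targets, timeout=3.0, connect=socket.create_connection):
    """Try a TCP connection to each (host, port) pair.

    Returns a dict mapping each target to None on success or to an error
    message on failure.
    """
    results = {}
    for host, port in targets:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError as exc:
            # Covers refused connections, timeouts, DNS failures, and
            # unreachable networks, which is what a missing SNAT rule causes.
            results[(host, port)] = str(exc) or type(exc).__name__
        else:
            conn.close()
            results[(host, port)] = None
    return results
```

A health probe could call check_outbound with the authentication and webhook endpoints the nodes depend on and report unhealthy if any value in the result is not None. Unlike a service-status check, this fails as soon as outbound traffic stops flowing, even when every local service still reports as running.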

Lessons Learned

  • Silent deprecations in infrastructure tools (like firewalld) can have outsized impact.
  • Post-upgrade validation must include functional checks, not just service status.

Created
Oct 10 at 03:52pm CEST

We are experiencing an incident in which login is unavailable and Git activity is non-operational within the hosted GitLab service. We are currently investigating the cause.