Internal Issues with GitLab related to login and commit activity
Resolved
Oct 14 at 03:07pm CEST
Postmortem: GitLab Outage Due to OS-Level Firewall Change
Incident Date: Monday, October 6, 2025
Duration: ~7 days (October 6–13)
Impact: GitLab Workhorse nodes were unable to reach external services, resulting in degraded authentication and CI/CD pipeline performance, and in failed external integrations.
Summary
On October 6, 2025, an operating system update was applied to GitLab Workhorse nodes as part of routine patching. The update included a newer version of firewalld (≥ 0.9.11), which removed support for the deprecated firewalld.direct interface. This change silently invalidated the SNAT rules, severing outbound connectivity from the Workhorse nodes to the public Internet.
The issue went undetected until October 10, when CI jobs and authenticated GitLab operations began failing due to unreachable external endpoints (e.g., authentication and webhook targets). Investigation revealed that the SNAT configuration was no longer active; the underlying cause was the removal of the direct passthrough mechanism from firewalld.
Timeline
Date | Event |
---|---|
Oct 6 | OS patch applied to GitLab Workhorse nodes |
Oct 6-10 | SNAT rules silently ignored due to removal of firewalld.direct. |
Oct 9-10 | Failures reported; initial investigations. |
Oct 13 | Root cause identified: deprecated passthrough rules no longer processed. |
Oct 13 | SNAT rules migrated to nftables and validated. |
Oct 13 | Full service restoration confirmed. |
Root Cause
The incident stemmed from a failure to detect and act on the deprecation and removal of firewalld.direct in the updated OS. The IT team did not review the changelog or validate firewall rule persistence post-upgrade. As a result, critical SNAT rules were silently dropped, isolating GitLab Workhorse nodes from the Internet.
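A pre-upgrade audit of the firewalld direct configuration could have flagged the dependency described above. The following is a minimal sketch only, assuming the direct rules live in firewalld's direct.xml file (normally /etc/firewalld/direct.xml). The sample XML, the interface names and the 203.0.113.10 address are hypothetical placeholders, not the rules from this incident.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a firewalld direct.xml document (illustration only).
SAMPLE_DIRECT_XML = """<direct>
  <passthrough ipv="ipv4">-t nat -A POSTROUTING -o eth0 -j SNAT --to-source 203.0.113.10</passthrough>
  <rule ipv="ipv4" table="filter" chain="FORWARD" priority="0">-i eth1 -j ACCEPT</rule>
</direct>"""


def find_direct_entries(direct_xml: str) -> list[tuple[str, str]]:
    """Return (element name, argument string) for every direct-interface entry."""
    root = ET.fromstring(direct_xml)
    entries = []
    for child in root:
        if child.tag in ("passthrough", "rule", "chain"):
            entries.append((child.tag, (child.text or "").strip()))
    return entries


def nat_entries(direct_xml: str) -> list[str]:
    """Return the argument strings of entries that perform source NAT."""
    return [
        args
        for _tag, args in find_direct_entries(direct_xml)
        if "SNAT" in args or "MASQUERADE" in args
    ]


def audit_file(path: str = "/etc/firewalld/direct.xml") -> list[str]:
    """Audit a direct.xml file on disk; the default path is an assumption."""
    with open(path, encoding="utf-8") as handle:
        return nat_entries(handle.read())
```

Any non-empty result from `nat_entries` would mean that outbound NAT depends on the direct interface and should be migrated before the firewalld package is upgraded.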
Resolution
- Migrated SNAT rules to a persistent nftables configuration.
- Validated outbound connectivity and authentication / CI/CD pipeline health.
- Updated internal documentation to reflect firewall rule migration.
- Documented a post-upgrade check for SNAT functionality.
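A post-upgrade check for SNAT functionality might look like the sketch below. It inspects the text output of `nft list ruleset` and reports whether a postrouting chain contains a `snat to` or `masquerade` rule. The sample ruleset, the `eth0` interface and the 203.0.113.10 address are hypothetical; the actual rules deployed during this incident are not shown in this postmortem. The line-based parsing is deliberately simple and is an assumption about typical `nft` output layout, not a full parser.

```python
import subprocess

# Hypothetical ruleset in the layout printed by `nft list ruleset`.
SAMPLE_RULESET = """table ip nat {
	chain postrouting {
		type nat hook postrouting priority srcnat; policy accept;
		oifname "eth0" snat to 203.0.113.10
	}
}
"""


def has_snat_rule(ruleset: str) -> bool:
    """Return True if a chain hooked at postrouting holds an SNAT or masquerade rule."""
    in_postrouting = False
    for raw_line in ruleset.splitlines():
        line = raw_line.strip()
        if line.startswith("chain "):
            # A new chain starts; its hook is declared on a following line.
            in_postrouting = False
        elif "hook postrouting" in line:
            in_postrouting = True
        elif line == "}":
            in_postrouting = False
        elif in_postrouting and ("snat to" in line or "masquerade" in line):
            return True
    return False


def live_ruleset() -> str:
    """Read the active ruleset; requires the nft binary and root privileges."""
    result = subprocess.run(
        ["nft", "list", "ruleset"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

On a patched node, `has_snat_rule(live_ruleset())` returning False would indicate that the NAT rules did not survive the upgrade, even when the firewall service itself reports as running.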
Preventive Actions
- Monitoring: Added outbound connectivity checks to GitLab health probes.
- Process: Added a step to the upgrade procedure to validate outbound connectivity post-upgrade.
- Environment: Set up a staging environment to build and test GitLab/OS upgrades and new features before deploying to production.
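The outbound connectivity checks mentioned above can be approximated with a plain TCP connect test. This is a sketch under stated assumptions, not the probe actually added to the GitLab health checks: the target hostnames below are placeholders, and a real probe would list the authentication and webhook endpoints the nodes depend on.

```python
import socket

# Placeholder targets; replace with the external endpoints the nodes rely on.
EXAMPLE_TARGETS = [("idp.example.com", 443), ("hooks.example.com", 443)]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures and unreachable networks.
        return False


def probe(targets, timeout: float = 3.0) -> dict[str, bool]:
    """Check every (host, port) pair and map "host:port" to its reachability."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout) for host, port in targets
    }
```

Running `probe(EXAMPLE_TARGETS)` from each node after patching, and alerting when any value is False, tests the function that failed in this incident (outbound reachability) instead of only the status of the firewall service.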
Lessons Learned
- Silent deprecations in infrastructure tools (like firewalld) can have outsized impact.
- Post-upgrade validation must include functional checks, not just service status.
Affected services
Created
Oct 10 at 03:52pm CEST
We are experiencing an incident in which login is unavailable and Git activity is non-operational within the hosted GitLab service. We are currently investigating the cause.