Postmortem: We want to share the following Root Cause Analysis with you
Update: 25.02.2026 - Initial publication
What happened?
On February 24, 2026, during a scheduled maintenance window intended to improve network resiliency in the FRA region ([Link]), a configuration deployment led to a loss of connectivity for public services.
Internal monitoring detected the outage shortly after the start of the announced maintenance, once the configuration change was applied. Our network engineering teams identified link failures between critical hardware components. By 20:55 UTC, a manual fix had been applied to the affected devices, and the affected services were restored by 21:00 UTC.
How could this happen? (Root Cause)
The incident was caused by a configuration drift between our central software repository and the live hardware settings in the FR7 production environment. This drift went undetected before the rollout.
Specifically, the outage involved Forward Error Correction (FEC) settings: link-level parameters that must match on both ends of a link so that networking hardware from different vendors can communicate reliably.
- The Discrepancy: Unlike the staging environments, the production environment contained unique settings required for multi-vendor hardware interoperability, and these settings existed only on the live devices, not in the central repository. Despite a "four-eyes" review of the configuration changes, the rendered output did not make this discrepancy visible, so the difference went unnoticed.
- The Trigger: The automated deployment performed a "full rebuild" of the device configuration. Because the repository did not contain the specific FEC settings (the discrepancy), it omitted them during the rebuild.
- The Result: Once the new configuration was pushed, the resulting FEC mismatch caused the physical links between devices from different vendors to fail, dropping traffic for all public services in the region.
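The failure mode above can be illustrated with a minimal sketch (all keys and values are hypothetical, not our actual device configuration or tooling): a full rebuild renders the device configuration solely from the repository, so any setting that lives only on the device silently disappears.

```python
# Hypothetical illustration of a "full rebuild" dropping a live-only setting.

repo_config = {                      # intended state in the central repository
    "interface": "eth-1/1",
    "speed": "100G",
    # note: no FEC entry -- the drifted setting exists only on the device
}

live_config = {                      # running state on the production device
    "interface": "eth-1/1",
    "speed": "100G",
    "fec": "rs-544",                 # set directly for multi-vendor interop
}

def full_rebuild(repo):
    """Render the device config solely from the repository."""
    return dict(repo)                # anything not in the repo is omitted

new_config = full_rebuild(repo_config)
assert "fec" not in new_config       # the FEC setting is gone after the rebuild

# The peer device still expects matching FEC, so the link no longer comes up.
peer_fec = "rs-544"
link_up = new_config.get("fec") == peer_fec
print(link_up)  # False -> traffic drops on the mismatched link
```

The point of the sketch is that the rebuild itself behaves exactly as designed; the loss of the setting is only detectable by comparing the rendered output against the running state, which is what the corrective actions below target.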
What are we doing to prevent recurrence?
We are committed to ensuring this specific failure mode does not happen again. Our engineering teams have initiated the following corrective actions:
- Comprehensive Configuration Audit: We are performing a full audit of all production devices to identify and resolve any "drifts" where live settings (like FEC) differ from our central repository. (to be completed within Q1 2026)
- Improved Validation Checks: We are implementing an automated pre-flight check that compares the "intended" configuration against the "running" configuration to clearly flag any potential omissions before a change is finalized. This increases the visibility of drifts and unexpected discrepancies and reduces the surface area for human error. (to be completed within Q1 2026)
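As a sketch of the second measure (function and key names are hypothetical; the real pre-flight check is part of our deployment tooling), such a check diffs the intended configuration against the running one and flags any setting a rollout would silently remove:

```python
# Hypothetical pre-flight drift check: flag settings present on the device
# but missing from the intended (repository-rendered) configuration.

def preflight_drift(intended: dict, running: dict) -> dict:
    """Report settings a full rebuild would remove, add, or change."""
    return {
        "would_remove": sorted(k for k in running if k not in intended),
        "would_add": sorted(k for k in intended if k not in running),
        "would_change": sorted(
            k for k in intended
            if k in running and intended[k] != running[k]
        ),
    }

intended = {"interface": "eth-1/1", "speed": "100G"}
running = {"interface": "eth-1/1", "speed": "100G", "fec": "rs-544"}

report = preflight_drift(intended, running)
print(report["would_remove"])  # ['fec'] -> block the rollout and investigate
```

A non-empty "would_remove" list halts the change before it is finalized, surfacing exactly the kind of omission that caused this incident.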
The scheduled maintenance was expected to cause only a few seconds of service disruption; due to the issues described above, the actual impact was significantly greater. This maintenance was part of our ongoing initiative to improve stability and performance in our data centers, and our network team remains committed to driving this initiative forward. We understand the impact this incident has caused and are working with due diligence and urgency to incorporate the lessons learned, further reducing risk during maintenance operations on our core network components.