Postmortem: Investigation
On June 24, 2026, at 2:54 PM CET, the engineering team identified an issue in the gateway infrastructure that routes customer traffic. Some requests were directed to service instances that were no longer available, causing intermittent request failures.
The investigation determined that the gateway controller, which manages traffic routing, did not consistently refresh its routing configuration after infrastructure changes such as deployments or platform scaling events. As a result, outdated routing information remained active on some gateway instances, and traffic was sent to endpoints that were no longer serving requests.
The issue affected both West Europe (Production 0) and West Europe (Production 5) during separate deployment activities.
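The failure mode described above can be illustrated with a minimal sketch. It assumes a hypothetical view in which each gateway instance holds a set of endpoint addresses it routes to, and the platform exposes the set of endpoints currently serving requests. The function name, instance names, and addresses are illustrative only and do not reflect the gateway vendor's API.

```python
from typing import Dict, Set


def find_stale_routes(
    gateway_routes: Dict[str, Set[str]],
    live_endpoints: Set[str],
) -> Dict[str, Set[str]]:
    """Return, per gateway instance, the routed endpoints that are no longer live.

    Hypothetical data shapes: a real gateway exposes this state differently.
    """
    stale: Dict[str, Set[str]] = {}
    for instance, routes in gateway_routes.items():
        # Endpoints the instance still routes to but that are no longer serving.
        missing = routes - live_endpoints
        if missing:
            stale[instance] = missing
    return stale


# Example: a deployment replaces 10.0.0.1 with 10.0.0.3.
# gateway-a refreshed its routing table; gateway-b did not.
live = {"10.0.0.2", "10.0.0.3"}
routes = {
    "gateway-a": {"10.0.0.2", "10.0.0.3"},
    "gateway-b": {"10.0.0.1", "10.0.0.2"},
}
stale_routes = find_stale_routes(routes, live)
print(stale_routes)  # {'gateway-b': {'10.0.0.1'}}
```

In this example, requests handled by gateway-b that are routed to 10.0.0.1 would fail, while requests handled by gateway-a succeed, which matches the intermittent failure pattern observed.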
Mitigation
The engineering team immediately initiated mitigation by restarting the gateway instances responsible for traffic routing. Restarting these components forced the routing configuration to refresh, allowing traffic to be directed to the correct service instances.
The mitigation restored normal traffic routing in both clusters. The impact windows were:
- West Europe (Production 0): 2:54 PM CET – 3:08 PM CET
- West Europe (Production 5): 3:22 PM CET – 3:42 PM CET
Following the incident, the engineering team worked closely with the gateway vendor to investigate the observed behavior and evaluate long-term remediation options.
Resolution
The issue was resolved after the affected gateway instances were restarted and the routing configuration was successfully refreshed.
As an additional precaution, the gateway controller was subsequently downgraded to a previously proven stable version while the investigation with the vendor continues.
Post-Incident Actions
The engineering team completed a detailed review of the incident together with the gateway vendor to better understand the observed behavior.
To reduce the likelihood of similar incidents, we have:
- Reverted to a stable gateway controller version while the vendor continues its investigation.
- Increased monitoring of gateway configuration synchronization during deployments and scaling events.
- Continued collaboration with the vendor to validate future gateway releases before production rollout.
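As an illustration of the synchronization monitoring mentioned above, the following is a minimal sketch and not the monitoring actually deployed. It assumes each gateway instance reports the configuration version it has loaded and the time since its last successful refresh. The `GatewayStatus` fields, the `out_of_sync` function, and the 60-second threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GatewayStatus:
    name: str
    applied_version: int       # config version the instance reports (hypothetical field)
    seconds_since_sync: float  # time since last successful refresh (hypothetical field)


def out_of_sync(
    statuses: List[GatewayStatus],
    desired_version: int,
    max_lag_seconds: float = 60.0,  # assumed grace period, not a vendor default
) -> List[str]:
    """Return names of instances that are behind the desired config version
    and have not refreshed within the allowed grace period."""
    return [
        s.name
        for s in statuses
        if s.applied_version < desired_version
        and s.seconds_since_sync > max_lag_seconds
    ]


# Example: a deployment moves the desired configuration to version 42.
statuses = [
    GatewayStatus("gateway-a", applied_version=42, seconds_since_sync=5.0),
    GatewayStatus("gateway-b", applied_version=41, seconds_since_sync=300.0),
    GatewayStatus("gateway-c", applied_version=41, seconds_since_sync=10.0),
]
lagging = out_of_sync(statuses, desired_version=42)
print(lagging)  # ['gateway-b']
```

Here gateway-c is behind but still within the grace period, so only gateway-b would raise an alert. A grace period avoids alerting on the short delay that is expected immediately after a deployment or scaling event.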
Impact and Scope
The incident affected a subset of customer requests hosted in the following production clusters:
- West Europe (Production 0): June 24, 2026, from 2:54 PM CET until 3:08 PM CET (14 minutes)
- West Europe (Production 5): June 24, 2026, from 3:22 PM CET until 3:42 PM CET (20 minutes)
During these periods, some requests failed because traffic was temporarily routed to service instances that were no longer available. Once the gateway configuration was refreshed, normal service resumed.
We sincerely apologize for the disruption caused by this incident. Providing a reliable service remains our highest priority. We continue to work closely with our technology partners and are strengthening our deployment and monitoring processes to help prevent similar issues in the future.