
InfluxDB Cloud Status
Real-time updates of InfluxDB Cloud issues and outages
InfluxDB Cloud status is Operational
Components: Web UI, API Writes, API Queries, Tasks, Other
Active Incidents
We’re aware of an incident affecting your cluster that is impacting both write and read operations. Our team is actively investigating the issue.
Investigating: We’re aware of an incident affecting your cluster that is impacting both write and read operations. Our team is actively investigating the issue.
We are currently investigating this issue.
Postmortem: RCA - Query and Write Outage on May 28, 2025

Summary

On May 28, 2025, at 14:33 UTC, we deployed an infrastructure change to multiple production InfluxDB Cloud (TSM and IOx) clusters. At 15:07 UTC, the continuous deployment (CD) process required a manual step to update an immutable Kubernetes resource. This step involves deleting and recreating the resource. Processes like this are not uncommon and have been executed successfully in staging environments without issue.

Shortly after the deployment completed, an alert at 15:09 UTC indicated that certain microservices were unable to authenticate with each other across multiple production clusters. These services are critical to data ingestion and querying, and the authentication failure prevented them from processing incoming writes and queries.

Writes and queries began to fail at 15:09 UTC and remained unavailable until authentication paths were restored on each affected cluster. The outage duration differed for each cluster because each cluster was examined individually to confirm our diagnosis, apply a fix, and ensure all services restored communication properly.

The full outage timeframe ranged from 15:09 UTC to 17:43 UTC, with the first cluster fully recovering at 17:04 UTC. We define a full recovery as the point when a cluster can accept writes, return queries, and resume normal software deployments.
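
As an illustration only, the following minimal sketch shows one way to check both halves of that definition from the outside, using the public influxdb-client Python package: it writes a probe point and then queries it back. The URL, token, organization, and bucket values are placeholders, not references to any specific environment.

    # Illustrative recovery probe: write one point, then read it back.
    # All connection values below are placeholders.
    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    URL = "https://us-west-2-1.aws.cloud2.influxdata.com"  # example region URL
    TOKEN = "YOUR_API_TOKEN"
    ORG = "your-org"
    BUCKET = "health-check"

    with InfluxDBClient(url=URL, token=TOKEN, org=ORG) as client:
        # Write path: a synchronous write raises an exception on failure.
        write_api = client.write_api(write_options=SYNCHRONOUS)
        write_api.write(bucket=BUCKET, record=Point("probe").field("ok", 1))

        # Read path: query the probe point back; an empty result means the
        # read path is still unhealthy.
        tables = client.query_api().query(
            f'from(bucket: "{BUCKET}") |> range(start: -5m) '
            '|> filter(fn: (r) => r._measurement == "probe")'
        )
        print("write and query OK" if tables else "query returned no data")
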
Cause of the Incident

Our software is deployed via a CD pipeline to three staging clusters (one per cloud provider), where a suite of automated tests is run. If the tests pass, the change is deployed simultaneously to all production clusters. This is our standard cloud service deployment process.

Prior to the incident, an engineer submitted a Kubernetes infrastructure change aimed at improving the resiliency of the service responsible for secure application authentication. The change was peer reviewed and applied successfully, and initial deployments across several clusters appeared to complete without issue.

We were alerted to a malfunction when the change landed on larger, more active clusters. Upon investigation, we found that while the manual CD step to remove and recreate the Kubernetes resource completed successfully, it did not behave as expected. As of Kubernetes 1.29, certain service account tokens are no longer generated automatically. As a result, the token required for microservices to authenticate with each other was missing, preventing them from processing writes and queries.
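
To make the failure mode concrete: on these Kubernetes versions, a long-lived service account token exists only if a Secret of type kubernetes.io/service-account-token is created explicitly. The sketch below is illustrative only (the namespace and service account names are hypothetical, not our actual manifests) and shows how such a check could be scripted with the official Kubernetes Python client.

    # Hypothetical check: confirm a ServiceAccount still has a token Secret
    # after a delete/recreate step. Namespace and account names are examples.
    from kubernetes import client, config

    NAMESPACE = "influxdb"        # hypothetical namespace
    SERVICE_ACCOUNT = "ingester"  # hypothetical service account

    config.load_kube_config()     # or config.load_incluster_config()
    v1 = client.CoreV1Api()

    token_secrets = [
        s for s in v1.list_namespaced_secret(NAMESPACE).items
        if s.type == "kubernetes.io/service-account-token"
        and (s.metadata.annotations or {}).get(
            "kubernetes.io/service-account.name") == SERVICE_ACCOUNT
    ]

    if not token_secrets:
        print(f"ALERT: no token Secret for ServiceAccount {SERVICE_ACCOUNT!r}")
    else:
        print(f"OK: {len(token_secrets)} token Secret(s) present")
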
Investigation and Recovery

Our initial focus was to verify that the PR itself did not modify or remove the resource. Next, we examined the CD mechanics to ensure that the CD platform and pipeline were performing as designed. Our attention then turned to Kubernetes itself, as the resource was missing in-cluster yet remained in our Infrastructure-as-Code (IaC) repository. Further investigation revealed that the missing resource was the result of a deliberate change in a recent Kubernetes release.

Once we identified the root cause, we quickly mitigated the outage by recreating and applying the new resource to each cluster. As soon as this mitigation change was deployed within each cluster, success rates improved from 0 percent to 100 percent within minutes.

Each cluster was examined to ensure that failure rates returned to pre-incident levels. We verified that all cluster components were consuming the new resource, and each cluster was peer reviewed to confirm full functionality before being marked as restored (both internally and on the InfluxData status page).

Future Mitigations

We are implementing several measures to reduce the likelihood of a similar incident in the future:

- Alert on or increase telemetry of secret-related deletions. We are examining improvements to monitoring and alerting for the deletion or absence of critical authentication tokens (a rough sketch of this idea follows the list).
- Isolate and stage infrastructure changes to critical systems. Infrastructure changes impacting key services will be implemented in stages. Time-based staging will be added between cluster deployments, along with further validation checks at each step.
- Ongoing investigation and continued hardening. We are continuing to investigate additional contributing factors and will implement further mitigation steps as they are identified.
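
As a rough sketch of the first item above (illustrative only; the namespace is hypothetical, and real alerting would feed a monitoring pipeline rather than stdout), a watch on Secret deletions could look like this with the official Kubernetes Python client:

    # Hypothetical watcher: raise an alert when a service-account token
    # Secret is deleted in the namespace that holds authentication material.
    from kubernetes import client, config, watch

    NAMESPACE = "influxdb"  # hypothetical namespace

    config.load_incluster_config()  # assumes the watcher runs in-cluster
    v1 = client.CoreV1Api()

    for event in watch.Watch().stream(v1.list_namespaced_secret,
                                      namespace=NAMESPACE):
        secret = event["object"]
        if (event["type"] == "DELETED"
                and secret.type == "kubernetes.io/service-account-token"):
            # Replace with a page/alert to the on-call rotation.
            print(f"ALERT: token Secret deleted: {secret.metadata.name}")
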
Resolved: All regions are fully back online. A full RCA will be provided as soon as it is completed.
Monitoring: AWS EU-Central is now operational. We are continuing to monitor.
Identified: All regions except EU-Central are now operational. Work continues on EU-Central.
Monitoring: A fix has been implemented and we are monitoring the results.
Identified: The issue has been identified and a fix is being implemented.
Investigating: We are continuing to investigate this issue.
Investigating: We are currently investigating this issue.
Recently Resolved Incidents
We are currently investigating this issue.
Resolved: This incident has been resolved.
Investigating: We are continuing to investigate this issue.
Investigating: We are currently investigating this issue.
InfluxDB Cloud Outage Survival Guide
InfluxDB Cloud Components
InfluxDB Cloud Serverless: GCP
Web UI
API Writes
API Queries
Tasks
Persistent Storage
Compute
Other
InfluxDB Cloud Serverless: AWS, EU-Central
Web UI
API Writes
We’re aware of an incident affecting your cluster that is impacting both write and read operations. Our team is actively investigating the issue.
API Queries
We’re aware of an incident affecting your cluster that is impacting both write and read operations. Our team is actively investigating the issue.
Tasks
Persistent Storage
Compute
Other
InfluxDB Cloud Serverless: AWS, US-West-2-1
Web UI
API Writes
API Queries
Tasks
Persistent Storage
Compute
Other
InfluxDB Cloud Serverless: AWS, US-West-2-2
Web UI
API Writes
We are currently investigating this issue.
Postmortem: RCA - Query and Write Outage on May 28, 2025
‌
Summary
‌
On May 28, 2025, at 14:33 UTC, we deployed an infrastructure change to multiple production InfluxDB Cloud (TSM and IOx) clusters. At 15:07 UTC, the continuous deployment (CD) process required a manual step to update an immutable Kubernetes resource. This step involves deleting and recreating the resource. Processes like this are not uncommon and have been executed successfully in staging environments without issue.Â
‌
Shortly after the deployment completed, an alert at 15:09 UTC indicated that certain microservices were unable to authenticate with each other across multiple production clusters. These services are critical to data ingestion and querying, and the authentication failure prevented them from processing incoming writes and queries.
‌
Writes and queries began to fail at 15:09 UTC and remained unavailable until authentication paths were restored on each affected cluster. The outage duration differed for each cluster because each cluster was examined individually to confirm our diagnosis, apply a fix, and ensure all services restored communication properly.Â
‌
The full outage timeframe ranged from 15:09 UTC to 17:43 UTC, with the first cluster fully recovering at 17:04 UTC. We define a full recovery as the point when a cluster can accept writes, return queries, and resume normal software deployments.
‌
Cause of the Incident
‌
Our software is deployed via a CD pipeline to three staging clusters (one per cloud provider) where a suite of automated tests is run. If the tests pass, it is deployed simultaneously to all production clusters. This is our standard cloud service deployment process.Â
‌
On May 29, 2025, an engineer submitted a Kubernetes infrastructure change aimed at improving the resiliency of the service responsible for secure application authentication. The change was peer reviewed and successfully applied, and initial deployments across several clusters appeared to complete without issue.
‌
We were alerted to a malfunction when the change landed on larger, more active clusters. Upon investigation, we found that while the manual CD step to remove and recreate the Kubernetes resource completed successfully, it did not behave as expected. After Kubernetes 1.29, certain service account tokens are no longer generated automatically. As a result, the token required for microservices to authenticate with each other was missing, preventing them from processing writes and queries.
‌
Investigation and Recovery
‌
Our initial focus was to verify that the PR itself did not tamper with the resource. Next, we examined the CD mechanics to ensure that the CD platform and pipeline were performing as designed. Our attention then turned to Kubernetes itself as the resource was missing in-cluster yet remained in our IaC (Infrastructure-as-Code) repository. Further investigation revealed that the missing resource was the result of a deliberate change in a recent Kubernetes release.
‌
Once we identified the root cause, we quickly mitigated the outage by recreating and applying the new resource to each cluster. As soon as this mitigation change was deployed within each cluster, success rates improved from 0 percent to 100 percent within minutes.
‌
Each cluster was examined to ensure that failure rates returned to pre-incident levels. We verified that all cluster components were consuming the new resource, and each cluster was peer reviewed to confirm full functionality being marked as restored (both internally and on the InfluxData status page).
‌
Future Mitigations

We are implementing several measures to reduce the likelihood of a similar incident in the future:

- Alert on or increase telemetry for secret-related deletions. We are examining improvements to monitoring and alerting for the deletion or absence of critical authentication tokens (see the sketch after this list).
- Isolate and stage infrastructure changes to critical systems. Infrastructure changes impacting key services will be rolled out in stages, with time-based gaps between cluster deployments and additional validation checks at each step.
- Ongoing investigation and continued hardening. We are continuing to investigate additional contributing factors and will implement further mitigation steps as they are identified.
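
One possible shape for the monitoring in the first item, sketched here with client-go informers purely as an illustration and not as the specific tooling we are deploying, is a watcher that surfaces deletions of service account token Secrets so they can be forwarded to alerting:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Watch Secret deletions cluster-wide and log service account token Secrets
        // specifically; a real pipeline would forward these events to alerting
        // rather than printing them.
        factory := informers.NewSharedInformerFactory(client, 0)
        secrets := factory.Core().V1().Secrets().Informer()
        secrets.AddEventHandler(cache.ResourceEventHandlerFuncs{
            DeleteFunc: func(obj interface{}) {
                if s, ok := obj.(*corev1.Secret); ok && s.Type == corev1.SecretTypeServiceAccountToken {
                    fmt.Printf("service account token secret deleted: %s/%s\n", s.Namespace, s.Name)
                }
            },
        })

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        factory.WaitForCacheSync(stop)
        select {} // keep watching
    }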
Resolved: All regions fully back online. A full RCA will be provided as soon as it is completed.
Monitoring: AWS EU-Central is now operational. We are continuing to monitor.
Identified: All regions except EU-Central are now operational. Work continues on EU-Central.
Monitoring: A fix has been implemented and we are monitoring the results.
Identified: The issue has been identified and a fix is being implemented.
Investigating: We are continuing to investigate this issue.
Investigating: We are currently investigating this issue.