Identified: According to Microsoft, this issue has been mitigated. Here is the incident summary shared by Microsoft:
SUMMARY OF IMPACT:
What happened?
Between 09:12 UTC and 18:50 UTC on 10 September 2025, a platform issue impacted multiple Azure services in the East US 2 region, specifically in two availability zones (Az02 and Az03). Impacted customers may have experienced error notifications when performing service management operations - such as create, delete, update, scale, start, or stop - for resources hosted in this region. The primary services affected were Virtual Machines and Virtual Machine Scale Sets, but this would also have caused issues for services dependent on those Compute resources, such as Azure Databricks, Azure Kubernetes Service, Azure Synapse Analytics, Backup, and Data Factory.
Customers who still see failed or unhealthy resources should attempt to update or redeploy those resources.
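For customers scripting that recovery step, the sketch below shows one way to check a VM's state and redeploy it with the Azure SDK for Python (azure-identity and azure-mgmt-compute). The subscription, resource group, and VM names are placeholders, and the same result can be achieved through the Azure portal or CLI.

    # Minimal sketch: redeploy a VM that is still unhealthy after the incident.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    subscription_id = "<subscription-id>"   # placeholder
    resource_group = "<resource-group>"     # placeholder
    vm_name = "<vm-name>"                   # placeholder

    client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

    # Inspect the VM's current statuses before acting.
    view = client.virtual_machines.instance_view(resource_group, vm_name)
    print([s.display_status for s in view.statuses])

    # Redeploy moves the VM to a new host, which typically clears a failed placement.
    poller = client.virtual_machines.begin_redeploy(resource_group, vm_name)
    poller.wait()
    print("Redeploy completed")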
What do we know so far?
Our investigation identified that the issue impacting resource provisioning in East US 2 was linked to a failure in the platform component responsible for managing resource placement. The system is designed to recover quickly from transient issues, but in this case, the prolonged performance degradation caused recovery mechanisms themselves to become a source of instability.
The incident was primarily driven by a combination of platform recovery behavior and sustained performance degradation. While customer-generated load remained within expected limits, internal platform services began retrying failed operations aggressively when performance issues emerged. These retries, intended to support resilience, instead created a surge in internal system activity.
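The retry surge described here is the pattern that bounded, jittered backoff is designed to avoid. Purely as an illustration - the report does not describe Microsoft's internal retry logic - a minimal Python sketch of that approach:

    # Illustrative only: capped retries with exponential backoff and full jitter,
    # so many callers retrying a degraded dependency do not do so in lockstep.
    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts:
                    raise
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(random.uniform(0, delay))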
How did we respond?
• 09:12 UTC on 10 September 2025 – Customer impact began.
• 09:13 UTC on 10 September 2025 – Our monitoring systems observed a rise in failure rates, triggering an alert and prompting our team to initiate an investigation.
• 12:08 UTC on 10 September 2025 – We identified unhealthy dependencies in core infrastructure components as initial contributing factors.
• 13:34 UTC on 10 September 2025 – Began mitigation efforts, which included restarting critical service components to restore functionality, rerouting workloads away from affected infrastructure, initiating multiple recovery cycles for the impacted backend service, allowing internal workloads to process through their backlogs and return to a healthy state on recovery, and executing capacity operations to free up resources.
• 18:50 UTC on 10 September 2025 – After a period of monitoring to validate service health, we confirmed that the control plane service was restored and that no further impact to downstream services was observed for this issue.