# Postmortem

## Customer Impact
Between December 9, 2025, at 16:00 UTC and December 10, 2025, at 07:00 UTC, customers in the US region experienced intermittent failures and delays when using the Document Understanding service, specifically affecting document extraction and training workloads.
Scope:
The impact was limited to the US region. Customers may have observed increased error rates or delays when processing documents, particularly during peak usage periods. Other regions and services were not affected.
## Root Cause
The incident was triggered by a regional capacity constraint in our cloud provider's GPU infrastructure in the US region. Although our quota for GPU resources was sufficient, the cloud provider was unable to allocate new GPU nodes due to high demand and limited physical availability in the region.
As customer traffic increased, clusters attempted to scale up, but new nodes could not be provisioned. This resulted in resource contention between extraction and training workloads:
- Extraction jobs, which are customer-facing and time-sensitive, began failing intermittently as GPU resources were exhausted.
- Training jobs stalled or were delayed.
Manual attempts to scale node pools or create new pools failed for the same reason. The cloud provider's support team confirmed that no additional GPU capacity was available and that alternative VM sizes were also constrained. Recovery was delayed because mitigation was only possible by reprioritizing workloads and waiting for GPU resources to free up.
## Detection
The issue was first detected on December 9, 2025, at 16:00 UTC by automated monitoring systems reporting increased extraction error rates in the US region. Customer reports followed shortly after.
Initial investigation identified GPU allocation failures in node pools. Status page updates were posted proactively, and severity was escalated as impact was confirmed.
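For illustration, the sketch below shows one way GPU allocation failures of this kind can be surfaced during investigation: listing Pending pods that request GPUs and printing their scheduling conditions. It assumes a Kubernetes-based deployment and the Kubernetes Python client; the namespace name is hypothetical, and this is not necessarily the exact tooling used during the incident.

```python
# Sketch: list Pending pods that request GPUs and print why they cannot be scheduled.
# The namespace is a hypothetical example.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
core = client.CoreV1Api()

pending = core.list_namespaced_pod(
    namespace="document-understanding",      # hypothetical namespace
    field_selector="status.phase=Pending",
)

for pod in pending.items:
    requests_gpu = any(
        c.resources and c.resources.requests
        and "nvidia.com/gpu" in c.resources.requests
        for c in pod.spec.containers
    )
    if not requests_gpu:
        continue
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.status == "False":
            # Typically reason is "Unschedulable", with a message explaining why no
            # existing or newly provisioned node could satisfy the GPU request.
            print(pod.metadata.name, cond.reason, cond.message)
```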
## Response
Engineering teams analyzed cluster telemetry, node pool behavior, and platform logs. Multiple attempts to scale node pools or create new ones using alternative VM sizes failed due to our cloud provider's regional capacity issues.
To mitigate customer impact, the team took the following actions:
- Prioritized extraction workloads by scaling down or pausing training jobs, freeing GPU resources for extraction (a sketch of this step follows this list).
- Collaborated with the cloud provider's support team, who confirmed the capacity shortfall and initiated an escalation.
- Monitored service health continuously until GPU resources became available naturally as regional demand fluctuated.
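The first mitigation above can be illustrated with a minimal sketch, assuming training runs as a Kubernetes Deployment and using the Kubernetes Python client; the deployment and namespace names are hypothetical, not the actual workload names.

```python
# Sketch: free GPU capacity for extraction by scaling a training Deployment to zero.
# "training-worker" and "document-understanding" are hypothetical names.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Record the current replica count so training can be restored later.
scale = apps.read_namespaced_deployment_scale(
    name="training-worker", namespace="document-understanding"
)
print(f"Pausing training: {scale.spec.replicas} -> 0 replicas")

apps.patch_namespaced_deployment_scale(
    name="training-worker",
    namespace="document-understanding",
    body={"spec": {"replicas": 0}},
)
# Once extraction is healthy and regional GPU capacity recovers, the saved
# replica count can be patched back in the same way.
```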
Service stability returned once extraction had sufficient GPU capacity, and training workloads were gradually reintroduced.
## Follow-Up Actions
To prevent recurrence, the following improvements are underway:
- Capacity Planning: Working with our cloud providers to secure additional reserved GPU capacity in the US region and evaluating alternative VM families.
- Workload Prioritization: Updating autoscaling and scheduling logic so extraction workloads are always prioritized during contention (see the sketch after this list).
- Monitoring & Alerting Enhancements: Improving detection of early GPU allocation failures and automating workload reprioritization before customer impact occurs.
- Runbook Improvements: Updating incident response runbooks to streamline cloud provider communication and document successful mitigation steps.
- Cloud Provider Collaboration: Strengthening engagement with our GPU infrastructure providers to gain better visibility into regional GPU availability and to advocate for increased capacity.
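For the workload prioritization item, one common approach, sketched below under the assumption that both workloads run on Kubernetes, is a dedicated PriorityClass that allows extraction pods to preempt lower-priority training pods when GPUs are scarce. The class name and priority value are illustrative, not the values used in production.

```python
# Sketch: a PriorityClass that lets extraction pods preempt training pods
# during GPU contention. Name and value are hypothetical.
from kubernetes import client, config

config.load_kube_config()
sched = client.SchedulingV1Api()

extraction_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="extraction-critical"),
    value=1_000_000,                    # higher than the class used by training pods
    global_default=False,
    preemption_policy="PreemptLowerPriority",
    description="Customer-facing extraction workloads; may preempt training on contention.",
)
sched.create_priority_class(body=extraction_priority)

# Extraction pod specs would then set priorityClassName to "extraction-critical",
# while training pods keep a lower-value class so the scheduler can evict them
# when extraction otherwise cannot be scheduled.
```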
We remain committed to delivering a reliable service and minimizing the impact of underlying infrastructure constraints. We will continue to share updates on these long-term improvements as they are implemented.