Postmortem: Between 17:25 and 18:25 UTC on Tuesday, April 16th, 2025, some Availability Query requests triggered an issue in the Availability calculation. This caused resource usage to balloon to the point where some servers ran out of memory. During this time a small number (< 1%) of requests to our US data center received an HTTP 500 response rather than being processed correctly.
The root cause was an issue where some requests to the Availability Engine attempted to return values several orders of magnitude larger than intended, requiring significant resources to process over a period of 20 minutes. The subset of servers handling these requests eventually exhausted their available resources and began to fail. This led all in-flight requests on those servers to receive an HTTP 500 status. Less than 1% of traffic was impacted between 17:25 and 18:25 UTC.
During the incident, we identified and patched the root cause.
We have also held a retrospective and agreed on further actions to harden these code paths and our monitoring, both to prevent similar issues and to make them easier to spot.
Timeline
17:25 UTC - A customer makes some Availability Queries that will go on to cause the memory exhaustion.
17:59 UTC - Several servers run out of memory and are killed.
18:03 UTC - On-call engineers are paged as the number of HTTP 5XX responses increases.
18:05 UTC - Engineers begin investigating, finding little that stands out as traffic and load patterns seem usual.
18:23 UTC - Engineers spot that a small number of servers have unusually high memory usage approaching the limit; those servers are replaced.
18:31 UTC - Memory limits are temporarily increased to see if that eases pressure. It does not.
18:47 UTC - Customer stops making the Availability Query calls that are driving the issue. While normal service resumes quickly, we decide to keep the incident open until we understand and mitigate the root cause.
19:11 UTC - After investigating several areas of the infrastructure configuration, the Availability Query calls driving the memory consumption are spotted.
19:57 UTC - A fix is written and put up for review.
20:03 UTC - The fix is approved and merged.
20:22 UTC - After confirming all systems are operating normally, the incident is resolved.
Retrospective
We always ask the questions:
- Could the issue have been resolved sooner?
- Could the issue have been identified sooner?
- Could the issue have been prevented?
In this case, we feel that our times to identify and resolve were fairly good, given that one symptom of the issue was that a small number of logs and metrics were missing because the affected servers had been killed. Counter-intuitively, a larger scale issue would have been easier to spot, but this one was subtle enough that it took some time to uncover.
Conversely, had the user who inadvertently triggered the bug been malicious, this could have led to an effective denial of service attack on our system. We have reviewed the steps we would have taken and believe we would have been in a position to mitigate that behavior as well, but we see an opportunity for clearer internal playbooks on the topic.
When considering how long it took us to identify the root cause, a significant part of the delay was uncovering that servers were being killed because they had run out of memory. This is not a trivial thing to spot and record, but we are going to look at how we can capture this data and alert on it. Doing so would have saved us a few minutes of waiting and watching as the next generation of servers hit the same issue.
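For illustration only, the sketch below shows one way such a signal could be captured on Linux by polling the cgroup v2 memory.events file for the oom_kill counter; the file path, polling interval, and the way an increase is reported are assumptions for this example rather than a description of our tooling.

```go
// Minimal sketch (not our production code): polls the cgroup v2 memory.events
// file and reports when the oom_kill counter increases.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readOOMKills returns the oom_kill counter from a cgroup v2 memory.events file.
func readOOMKills(path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, scanner.Err()
}

func main() {
	const path = "/sys/fs/cgroup/memory.events" // assumes cgroup v2
	var last int64

	for range time.Tick(30 * time.Second) {
		current, err := readOOMKills(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "could not read memory.events:", err)
			continue
		}
		if current > last {
			// In practice this would emit a metric or page, not just log.
			fmt.Printf("oom_kill counter increased: %d -> %d\n", last, current)
		}
		last = current
	}
}
```

In containerised environments the equivalent signal is typically also exposed by the orchestrator; for example, Kubernetes reports OOMKilled as the termination reason for a container killed this way.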
Finally, prevention is the clearest area for improvement here. The issue was hard but not impossible to spot, and we are now much stricter about the values allowed in the impacted area of the system. We have also identified several other places where our code can be more defensive to avoid similar incidents in the future.
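As an illustration of the general shape of that defensiveness, the hypothetical Go sketch below validates a query before it is expanded; the type names, fields, and specific limit values are assumptions for this post, not the actual Availability Engine code.

```go
// Illustrative sketch only: the kind of upper bound enforced before an
// Availability Query is expanded. Names and limits are hypothetical.
package availability

import (
	"errors"
	"time"
)

const (
	maxQueryWindow = 35 * 24 * time.Hour // cap on the period a query may span
	maxSlots       = 512                 // cap on the number of slots generated
)

var ErrQueryTooLarge = errors.New("availability query exceeds processing limits")

type Query struct {
	Start        time.Time
	End          time.Time
	SlotDuration time.Duration
}

// Validate rejects queries that would expand into an unreasonable amount of
// work, rather than letting them consume memory until the server is killed.
func (q Query) Validate() error {
	window := q.End.Sub(q.Start)
	if window <= 0 || window > maxQueryWindow {
		return ErrQueryTooLarge
	}
	if q.SlotDuration <= 0 || window/q.SlotDuration > maxSlots {
		return ErrQueryTooLarge
	}
	return nil
}
```

The important property is that an oversized query is rejected up front with a clear error, rather than being allowed to expand into an unbounded amount of work.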
Actions
Our Site Reliability Engineers will be looking at how we can track and monitor Out Of Memory errors, while our Product Engineers will be hardening the impacted parts of the Availability Engine to limit the scale of processing without impacting existing behavior.
Further questions?
If you have any further questions, please contact us at [email protected]