
Hosted Mender Status
Real-time updates of Hosted Mender issues and outages
Hosted Mender status is Operational
Hosted Mender US
Hosted Mender EU
Active Incidents
We are currently investigating a performance issue reported by our monitoring system
Postmortem: Database Overload from Device Limit Migration Bug
Date: 2026-02-20
Duration: ~3 hours 25 minutes (08:15 - 11:40 UTC)
Severity: High
Executive Summary
On February 20, 2026, the Hosted Mender platform experienced a critical service outage affecting device authentication and inventory operations. A change deployed as part of Mender v4.2.0-saas.2 failed to uniformly handle data inconsistencies in older tenant configurations. A specific call order of two independent backend endpoints, in combination with scheduled cache invalidation, uncovered a bug that caused a heavy increase in database load. This increase was beyond what the system is designed to handle, resulting in cascading errors and platform-wide degradation.
Impact
- Duration: Approximately 3 hours 25 minutes
- Scope: Multi-tenant platform-wide degradation
- Affected Services: Device authentication, device inventory
- User Experience: Unable to accept new devices
- Business Impact: Complete halt to device provisioning across the platform (Hosted Mender US only) during the incident window
Root Cause
In Mender v4.2.0-saas.2 we changed the definition of an “unlimited” device limit from 0 to -1 so the system would be able to represent limits that allow zero devices. This was done by introducing a database migration that rewrote existing limits with the value 0 to the value -1, and by clearing the limits cache to ensure data would be read fresh from the database after the migration. Lastly, we were aware of a known edge case where certain tenants would not have a limit defined in the database, and we took steps to ensure consistent handling of this scenario post migration.
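A minimal sketch of what such a migration step might look like (the function names and data shapes here are illustrative assumptions, not Mender's actual schema):

```python
UNLIMITED = -1  # new sentinel for "unlimited"; 0 previously carried this meaning


def migrate_limits(limits):
    """Rewrite stored limits of 0 (old 'unlimited') to the new sentinel -1."""
    return {tenant: UNLIMITED if value == 0 else value
            for tenant, value in limits.items()}


def invalidate_limits_cache(cache):
    """Clear the cache so post-migration reads come fresh from the database."""
    cache.clear()
```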
This new version of Mender also included an internal endpoint that incorrectly set the cached device limit of a tenant to 0 in the case where a) there was no limit in the cache from before and b) there also was no limit in the database. This endpoint was overlooked in the steps mentioned above.
When the internal endpoint was called after the cache was invalidated, but before any external endpoint that used device limits, a limit of 0 was incorrectly cached for some tenants that had a large number of devices and no device limit in the database. When the device authorization reprocessing logic ran for these devices, the incorrectly cached limit triggered a large number of database queries to check whether the limit had been exceeded (a check that is unnecessary when the device limit is “unlimited”). Regardless of the result of that check, a limit of 0 always prevents a device from authorizing with the system, and devices continuously retry in that case, amplifying the load manyfold until complete exhaustion of MongoDB resources.
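The faulty cache-fill pattern described above can be illustrated as follows. This is a simplified sketch; the names and data structures are assumptions, not the actual Mender code:

```python
UNLIMITED = -1  # sentinel meaning "no device limit"


def get_limit_buggy(cache, db, tenant):
    # BUG: a tenant missing from both the cache and the database is
    # cached as 0, which after the migration means "no devices allowed"
    # rather than "unlimited".
    if tenant not in cache:
        cache[tenant] = db.get(tenant, 0)
    return cache[tenant]


def get_limit_fixed(cache, db, tenant):
    # Fix: a missing database entry defaults to the "unlimited" sentinel.
    if tenant not in cache:
        cache[tenant] = db.get(tenant, UNLIMITED)
    return cache[tenant]
```

The buggy variant only misbehaves in the exact scenario from the incident: an empty cache (post-invalidation) combined with a tenant that has no limit stored in the database.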
Timeline (All times UTC)
2026-02-19
- 13:16 - Deployed Mender v4.2.0-saas.2 - Root cause introduced
2026-02-20
- ~08:00 - Devices of the affected tenants began authorization reprocessing
- 08:15 - A synthetic test failure alerted the On-call team
- 08:20 - On-call investigated tenant configuration; Admin Panel queries failing with 499/504 due to DB exhaustion
- 08:20 - Identified ongoing device authorization reprocessing consuming all database resources
- 09:25 - Attempted to stop problematic queries
- 10:30 - Discovered blocked queries still holding locks; initiated emergency database scaling
- 10:40 - Database scaled; locks cleared; device acceptance partially restored
- 10:55 - Cache for device-auth disabled
- 11:00 - Added missing limits with value -1 (unlimited) in the database for affected tenants
- 11:40 - Service fully restored
What went wrong
- Inadequate test coverage: The test coverage of the internal endpoint was inadequate, as it didn’t verify that the correct value was used and cached in this scenario.
- Inadequate manual testing: Manual testing was performed, but not with a cache that was explicitly invalidated for this purpose.
- Uncontrolled cascade: The device authorization reprocessing logic had a snowball effect on the platform.
Action Items
- Resolve the issue where limits that are intended to be “unlimited” can be incorrectly cached as 0 by this internal endpoint.
- Update the device authorization reprocessing logic to not execute unnecessary database queries if the limit is 0.
- Review and improve test coverage of the affected endpoints.
Conclusions
We want to sincerely apologize for the service disruption you experienced on February 20, 2026. For over three hours, our platform was unable to process device authentication and inventory operations, preventing you from onboarding new devices and managing your fleet. We are committed to preventing this kind of disruption in the future.
Resolved: This incident has been resolved.
Monitoring: We identified an issue and applied a fix. We're monitoring the results. It is now possible to accept new devices, and the performance issue should be resolved.
Investigating: It's not currently possible to accept new devices. We're continuing to investigate the issue.
Investigating: We are currently investigating a performance issue reported by our monitoring system
Recently Resolved Incidents
No recent incidents
Hosted Mender Components
Hosted Mender US
Hosted Mender EU