Current Status
All Systems Operational
Recent Incidents
Hosted Mender EU - Connectivity issue 2026-06-18
critical · Jun 18, 2026 · resolved Jun 18
Incident Summary - 2026-06-18

What happened

On June 18, between 18:46 UTC and approximately 19:00 UTC, some users may have experienced intermittent connectivity issues or elevated error rates when accessing Mender services from the EU cluster.

The incident was triggered by a scheduled maintenance job running on the infrastructure hosting our services. This job, which cleans up unused container images across cluster nodes, ran simultaneously on all nodes and consumed a significant amount of CPU. This created resource contention with the ingress controller pods responsible for routing incoming traffic, causing them to become temporarily unresponsive to health checks and restart repeatedly. The repeated restarts reduced the number of healthy ingress endpoints below the threshold required to serve traffic across all availability zones, leading to degraded routing and elevated error rates for most requests.

Resolution

The ingress controller stabilized once the image cleanup job completed. This job is a new managed feature introduced by the recent Kubernetes upgrades, and since it’s not a critical feature, we suspended it. Additional safeguards have been put in place to prevent ingress controller pods from being affected by resource pressure from unrelated workloads.

What we are doing

The image cleanup job has been permanently suspended. CPU resource limits have been tightened on the ingress layer to isolate it from competing workloads. Node pool capacity is being expanded with larger instances to provide additional headroom.

We apologize for the disruption.
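The failure condition described in this summary, where too few healthy ingress endpoints remain in some availability zone, can be illustrated with a small model. This is only a sketch under stated assumptions: the data shape, the function name `zones_below_threshold`, and the per-zone threshold are hypothetical and are not taken from Mender's infrastructure or from any Kubernetes API.

```python
from collections import Counter


def zones_below_threshold(endpoints, min_healthy_per_zone):
    """Return the zones whose healthy endpoint count is below the minimum.

    `endpoints` is a list of (zone, is_healthy) pairs. Both this shape and
    the per-zone threshold are assumptions made for illustration.
    """
    zones = {zone for zone, _ in endpoints}
    healthy = Counter(zone for zone, is_healthy in endpoints if is_healthy)
    # Counter returns 0 for zones that have no healthy endpoint at all.
    return sorted(z for z in zones if healthy[z] < min_healthy_per_zone)


# Hypothetical snapshot: both pods in zone "b" are restarting and fail
# their health checks, one pod in zone "c" is also failing.
snapshot = [
    ("a", True), ("a", True),
    ("b", False), ("b", False),
    ("c", True), ("c", False),
]

degraded = zones_below_threshold(snapshot, min_healthy_per_zone=1)
```

With this snapshot and a minimum of one healthy endpoint per zone, only zone "b" is reported as degraded. Raising the minimum to two also flags zone "c", which shows how little headroom remains once pods start restarting.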
Issues with deployment creation
minor · Jun 17, 2026 · resolved Jun 17
This incident has been resolved. The rollback worked, and we'll apply the fix in the next release.
Issues with the Mender Server UI
critical · Apr 8, 2026 · resolved Apr 8
The Mender UI was unavailable between 2026-04-08T13:16:15Z and 2026-04-08T13:42:09Z (26m) due to a breaking change in the Google Analytics dependency that led to a misconfiguration. The issue has been mitigated by temporarily disabling Google Analytics until a fix is deployed.
Degraded performance on hosted Mender
major · Feb 20, 2026 · resolved Feb 20
# **Database Overload from Device Limit Migration Bug**

**Date:** 2026-02-20
**Duration:** ~3 hours 25 minutes (08:15 - 11:40 UTC)
**Severity:** High

## **Executive Summary**

On February 20, 2026, the Hosted Mender platform experienced a critical service outage affecting device authentication and inventory operations. A change deployed as part of Mender v4.2.0-saas.2 failed to uniformly handle data inconsistencies in older tenant configurations. A specific call order of two independent backend endpoints, in combination with scheduled cache invalidation, uncovered a bug that caused a heavy increase in load on the database. The increase in load was beyond what the system is designed to handle, which resulted in cascading errors and platform-wide degradation.

## **Impact**

* **Duration:** Approximately 3 hours 25 minutes
* **Scope:** Multi-tenant platform-wide degradation
* **Affected Services:** Device authentication, device inventory
* **User Experience:** Unable to accept new devices
* **Business Impact:** Complete halt to device provisioning across the platform (hosted Mender US only) during the incident window

## **Root Cause**

In Mender v4.2.0-saas.2 we changed the definition of an “unlimited” device limit from 0 to -1 so the system would be able to represent limits that allow zero devices. This was done by introducing a database migration that changed existing limits with the value 0 to the value -1 and cleared the limits cache to ensure data would be read fresh from the database after the migration. We were also aware of a known edge case where certain tenants would not have a limit defined in the database, and took steps to ensure consistent handling of this scenario after the migration.

This new version of Mender also included an internal endpoint that incorrectly set the cached device limit of a tenant to 0 in the case where (a) there was no limit in the cache and (b) there was also no limit in the database. This endpoint was overlooked in the steps mentioned above. When the internal endpoint was called after the cache was invalidated, but before any external endpoints that used device limits, the limit 0 was incorrectly cached for some tenants with a large number of devices and no device limit in the database.

When the device authorization reprocessing logic was executed for these devices, the incorrectly cached limit caused a large number of database queries to be executed in order to check whether the limit had been exceeded (a check that is not necessary if the device limit is “unlimited”). Regardless of the result of the check, a limit of 0 always results in the device not being allowed to authorize with the system. Devices retry continuously in such a case, which amplified the issue manyfold until MongoDB resources were completely exhausted.

## **Timeline (All times UTC)**

**2026-02-19**

* **13:16** - Deployed Mender v4.2.0-saas.2 - _Root cause introduced_

**2026-02-20**

* **~08:00** - Devices of the affected tenants started the authorization reprocessing process
* **08:15** - A synthetic test failure alerted the On-call team
* **08:20** - On-call investigated tenant configuration; Admin Panel queries failing with 499/504 due to DB exhaustion
* **08:20** - Identified ongoing device authorization reprocessing consuming all database resources
* **09:25** - Attempted to stop problematic queries
* **10:30** - Discovered blocked queries still holding locks; initiated emergency database scaling
* **10:40** - Database scaled; locks cleared; device acceptance partially restored
* **11:55** - Cache for device-auth disabled
* **11:00** - Added missing limits with value -1 (unlimited) in the database for affected tenants
* **11:40** - Service fully restored

## **What went wrong**

1. **Inadequate test coverage:** The test coverage of the internal endpoint was inadequate, as it didn’t verify that the correct value was used and cached in this scenario.
2. **Inadequate manual testing:** Manual testing was performed, but not with a cache that was explicitly invalidated for this purpose.
3. **Uncontrolled cascade:** The device authorization reprocessing logic had a snowball effect on the platform.

## **Action Items**

* Resolve the issue where limits that are intended to be “unlimited” can be incorrectly cached as 0 by this internal endpoint.
* Update the device authorization reprocessing logic to not execute unnecessary database queries if the limit is 0.
* Review and improve test coverage of the affected endpoints.

## **Conclusions**

We want to sincerely apologize for the service disruption you experienced on February 20, 2026. For over three hours, our platform was unable to process device authentication and inventory operations, preventing you from onboarding new devices and managing your fleet. We are committed to preventing this kind of disruption in the future.
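The root cause and the first two action items in this postmortem can be illustrated with a minimal sketch. It is not Mender's code: the function names, the dictionary-based cache and database, and the `count_accepted_devices` callable are all hypothetical stand-ins. Only the values 0 and -1 and their meanings come from the postmortem.

```python
UNLIMITED = -1  # per the postmortem, -1 means "unlimited" after the migration


def resolve_limit_buggy(cache, db, tenant_id):
    """Hypothetical reconstruction of the faulty path.

    With no limit in the cache and none in the database, 0 is cached,
    which after the migration means "no devices allowed".
    """
    if tenant_id not in cache:
        cache[tenant_id] = db.get(tenant_id, 0)
    return cache[tenant_id]


def resolve_limit_fixed(cache, db, tenant_id):
    """A missing limit is treated as unlimited, matching the new meaning."""
    if tenant_id not in cache:
        cache[tenant_id] = db.get(tenant_id, UNLIMITED)
    return cache[tenant_id]


def may_authorize(limit, count_accepted_devices):
    """Decide whether one more device may be accepted.

    `count_accepted_devices` stands in for the expensive database query.
    It is only called when the answer actually depends on the count.
    """
    if limit == UNLIMITED:
        return True   # no query needed
    if limit == 0:
        return False  # no query needed: nothing can be accepted
    return count_accepted_devices() < limit


# The edge case from the postmortem: a tenant with no limit stored anywhere.
db = {}
query_log = []


def count_accepted_devices():
    query_log.append("count")  # records each simulated database query
    return 5000


buggy = may_authorize(resolve_limit_buggy({}, db, "tenant-1"), count_accepted_devices)
fixed = may_authorize(resolve_limit_fixed({}, db, "tenant-1"), count_accepted_devices)
```

In this sketch the buggy path rejects the device and the fixed path accepts it, and neither call touches the simulated database. The early returns for -1 and 0 correspond to the action item about avoiding unnecessary queries, so a wrongly cached 0 would still reject devices but would no longer multiply database load as devices retry.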
Rate limits issue for some customers
major · Nov 19, 2025 · resolved Nov 19
This incident has been resolved with a rate limit hotfix. We will schedule a new maintenance window soon to apply the definitive fix.