Current Status

All Systems Operational

Components

REST API

Operational

Web

Operational

AWS ec2-us-east-1

Operational

GitHub

Operational

GitHub Commit Status Notifications

Operational

Web

Operational

Hosted Agents

Operational

Web

Operational

Email Notifications

Operational

Agent API

Operational

AWS elasticache-us-east-1

Operational

GitHub API Requests

Operational

Ingestion

Operational

MacOS

Operational

Package Managers - API

Operational

Remote MCP Server

Operational

Slack Notifications

Operational

AWS elb-us-east-1

Operational

REST API

Operational

GitHub Webhooks

Operational

Recent Incidents

Increased latency and error rates

major

Jun 18, 2026 · resolved Jun 18

We have seen a full recovery of services.

Increased latency on REST and GraphQL APIs

major

Jun 11, 2026 · resolved Jun 12

The mitigation applied before the last update had the intended effect, and we have seen recovery in REST API latency.

Increased latency and error rates for Agent API

minor

Jun 11, 2026 · resolved Jun 11

Between 00:05 and 00:34 UTC, a subset of customers experienced increased latency and timeout errors on the Agent API, which affected job assignment. At peak impact, the error rate reached 1.3% of requests and job acceptance latency reached up to 53s.

Email deliveries are delayed

none

May 30, 2026 · resolved May 30

We received reports that email deliveries were not working, affecting signup and invite emails as well as build notification emails. This issue has now been resolved.

Delayed notifications

major

May 28, 2026 · resolved May 28

## Service Impact

Customers experienced delayed Buildkite notification delivery. The customer impact varied depending on how those notifications are used. For some customers, delayed notifications also delayed downstream CI, merge, or deployment workflows.

## Incident Summary

On 28 May, Buildkite experienced elevated notification delivery latency after part of our notification-processing infrastructure became underprovisioned. This happened because the Prometheus service used by our EKS autoscaling path ran out of storage, which meant some EKS-based workers could not autoscale correctly while queues were growing.

We mitigated the incident by moving affected workloads back to our previous ECS-based infrastructure and manually increasing worker capacity. Recovery took longer than expected because the rollback path did not fully handle this scenario.

### Impact window 1

At 20:01 UTC, notification-processing workers became underprovisioned and notification delivery latency increased. We detected the issue through internal queue latency monitoring and began shifting affected workloads from EKS back to ECS.

This rollback took longer than expected because the ECS services we were rolling back to were not ready to immediately take the full load. Engineers had to manually adjust scaling configuration and worker counts while the incident was active. Notification latency recovered for most customers by 21:00 UTC.

### Impact window 2

A second, shorter impact window occurred between 22:12 UTC and 22:40 UTC for a subset of customers. After the first recovery, some workloads were still running on EKS and had started autoscaling again after Prometheus recovered. We incorrectly believed those workloads were no longer serving traffic. When we reconciled our infrastructure configuration, those EKS workloads were scaled down before their ECS equivalents had been fully scaled up. This caused another period of underprovisioning for some notification-processing workers.
We resolved it by completing the rollback and scaling the remaining affected ECS services.

### Customer Impact

The impact was not identical for every customer. For customers who use Buildkite notifications as an input to other CI or deployment systems, notification latency can delay those downstream workflows. Some customers also experienced secondary or longer-running effects based on the specific notification types, retry behaviour, or integrations involved. We are following up directly with affected customers where their impact differed from the general incident.

## Changes we're making

We have made the following immediate changes:

* Increased Prometheus storage capacity and reconciled that change in infrastructure-as-code.
* Added monitoring to alert before Prometheus storage exhaustion can affect autoscaling.
* Moved affected notification-processing workloads back to known-good ECS capacity.
* Fixed GitHub notification retry behaviour for a class of errors that could cause repeated retries and extend notification delays.

We are also making the following reliability improvements:

* Hardening the EKS-to-ECS rollback process so it verifies destination capacity, autoscaling configuration, and traffic movement before and during rollback.
* Reviewing other EKS control-plane dependencies, including KEDA and Karpenter, to ensure their CPU, memory, and storage allocations are appropriate for production load.
* Reassessing the order and pace of future EKS migrations so customer-critical workloads move more gradually and with clearer settling periods.
* Improving customer-level monitoring for notification delivery latency, so we can detect customer-impacting regressions earlier.
* Reviewing which notification types are on the scheduling or CI hot path for customers, and whether they need tighter latency expectations, separate queueing, or more specific alerting than general notification work.
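
The rollback hardening listed above (verifying destination capacity, autoscaling configuration, and traffic movement before scaling anything down) can be illustrated with a small pre-flight check. This is a minimal sketch under stated assumptions, not Buildkite's actual tooling: `ServiceState`, `safe_to_scale_down`, and every field name and number below are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServiceState:
    """Snapshot of one worker pool. All fields are hypothetical."""
    name: str
    running_workers: int
    autoscaling_enabled: bool
    traffic_share: float  # fraction of queue traffic served, 0.0 to 1.0


def safe_to_scale_down(source: ServiceState, destination: ServiceState,
                       required_workers: int) -> list[str]:
    """Return the reasons a scale-down is blocked; an empty list means it may proceed."""
    blockers = []
    # Destination must already have enough capacity for the full load.
    if destination.running_workers < required_workers:
        blockers.append(
            f"{destination.name} has {destination.running_workers} workers, "
            f"needs {required_workers}"
        )
    # Destination must be able to keep up if queues grow after the move.
    if not destination.autoscaling_enabled:
        blockers.append(f"{destination.name} autoscaling is disabled")
    # Source must be confirmed idle, not just assumed idle.
    if source.traffic_share > 0.0:
        blockers.append(
            f"{source.name} still serves {source.traffic_share:.0%} of traffic"
        )
    return blockers


if __name__ == "__main__":
    eks = ServiceState("eks-notifications", 40, True, 0.3)
    ecs = ServiceState("ecs-notifications", 10, False, 0.7)
    for reason in safe_to_scale_down(eks, ecs, required_workers=40):
        print("blocked:", reason)
```

A check like this targets the failure mode in impact window 2, where workloads believed to be idle were scaled down before their replacements were fully scaled up.
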
## Areas we are improving: incident communication

During this incident, our public status page did not reflect customer-visible impact as quickly or clearly as it should have. In particular, notification delivery latency can affect customers differently depending on how notifications are used in their CI and deployment workflows.

We are improving how we communicate during notification latency incidents by:

* Updating the status page earlier when notification latency is likely to affect customer workflows
* Making status page updates clearer about the customer-visible impact, not just the affected internal service
* Improving internal escalation paths for customers who report critical CI impact before the incident is fully understood
* Using customer-level notification latency monitoring to help identify affected customers sooner
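
Customer-level notification latency monitoring, as mentioned above, could be approximated by grouping delivery latencies per customer and flagging those whose 95th percentile exceeds a threshold. The sketch below is illustrative only: the sample format, the `slow_customers` helper, and the 60-second threshold are assumptions, not details from the incident report.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def slow_customers(samples, threshold_s):
    """Return sorted customer IDs whose p95 delivery latency exceeds threshold_s.

    samples is an iterable of (customer_id, latency_seconds) pairs (hypothetical format).
    """
    by_customer = defaultdict(list)
    for customer_id, latency_s in samples:
        by_customer[customer_id].append(latency_s)
    return sorted(c for c, v in by_customer.items() if p95(v) > threshold_s)


if __name__ == "__main__":
    samples = [("acme", 1.0)] * 20 + [("globex", 1.0)] * 18 + [("globex", 120.0)] * 2
    print(slow_customers(samples, threshold_s=60.0))
```

Aggregating per customer rather than globally matters here because a fleet-wide average can look healthy while a subset of customers sees long delays.
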
