Coinbase outage blamed on AWS cooling failure and AI pivot struggles

Coinbase, the publicly traded crypto exchange, faced a significant outage on May 7th that disrupted trading, account access, and balance updates for several hours. The root cause, according to the company, was a cooling failure within an Amazon Web Services data center. This incident has raised questions about Coinbase’s strategy of moving toward AI-driven operations.

The problems started around 23:50 UTC, when internal systems flagged widespread quote failures. Engineers quickly declared Sev1 incidents as retail, institutional, and derivatives trading were all affected. CEO Brian Armstrong posted on X that the outage was “never acceptable” and explained that a room in an AWS data center overheated after multiple chillers failed.

Infrastructure design and the exchange’s weak point

Armstrong noted that most of Coinbase’s services are designed to survive an AWS availability zone failure. However, the exchange itself relies on a different setup to meet low-latency demands. Rob Witoff, who leads the platform team, provided more technical context. He said the outage began with a “thermal event” in a small number of server racks in AWS us-east-1.

Unlike its other services, Coinbase keeps the exchange infrastructure in a single availability zone to prioritize speed, Witoff explained. A distributed backup copy exists, but the failure did not stay contained. Two key components went down: the hardware underlying the matching engine, and the distributed Kafka cluster used for internal data sharing. Recovering Kafka required moving terabytes of data to new hardware.

The matching engine stall and recovery process

The matching engine, which processes orders and maintains order books, caused the biggest trading disruption. It operates as a distributed cluster and needs a majority of nodes healthy to elect a leader and execute trades. During the outage, too many nodes were down, so trading on Retail, Advanced, and Institutional exchanges stopped.
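The majority requirement described above is the standard quorum rule used by consensus-based clusters: a leader can only be elected while a strict majority of nodes is healthy. A minimal sketch, with illustrative node counts that are assumptions rather than Coinbase's actual cluster size:

```python
def has_quorum(healthy_nodes: int, cluster_size: int) -> bool:
    """A strict majority of nodes must be healthy to elect a leader."""
    return healthy_nodes > cluster_size // 2

# Example: a hypothetical 5-node cluster tolerates losing 2 nodes,
# but losing 3 leaves no majority, so trading halts.
print(has_quorum(3, 5))  # True  -> leader election possible
print(has_quorum(2, 5))  # False -> matching engine stalls
```

This is why a single-zone thermal event can stop trading outright: once enough racks in that zone fail, no subset of the remaining nodes constitutes a majority.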

Witoff said on-call engineers had to run disaster recovery procedures, re-establish quorum, and assess system health under difficult conditions. They also had to create, test, deploy, and validate fixes while managing the broader outage. Balance updates were delayed because Kafka was behind, but once replication caught up, those issues resolved. Coinbase claims no data was lost.
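The delayed balances follow from how Kafka consumers track progress: lag is the gap between the newest offset written to a partition and the offset a consumer has processed. A simplified sketch (illustrative numbers, not Coinbase's code):

```python
def consumer_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages written to the partition but not yet consumed."""
    return max(0, log_end_offset - committed_offset)

# While the rebuilt cluster replays data, the consumer runs behind:
print(consumer_lag(1_000_000, 880_000))    # 120000 -> balances stale
# Once replication catches up, lag hits zero and balances resolve:
print(consumer_lag(1_000_000, 1_000_000))  # 0
```

Because the log itself was preserved and only consumption fell behind, catching up restores correct balances without data loss, consistent with Coinbase's claim.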

Reopening markets and broader context

When the matching engine came back, markets were not enabled all at once. First, Coinbase switched products to cancel-only mode, checked statuses, moved to auction mode, and finally allowed trading again. Witoff emphasized that customers should not have been permanently locked out of their accounts. A full incident report is expected in a few weeks.
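The staged reopening reads like a simple state machine: each market advances through fixed phases and never skips a step. A hypothetical sketch of that sequence (stage names paraphrase the article; they are assumptions, not Coinbase's internal identifiers):

```python
# Ordered reopening phases: halted -> cancel-only -> auction -> trading
REOPEN_STAGES = ["halted", "cancel_only", "auction", "trading"]

def next_stage(current: str) -> str:
    """Advance a market one step through the reopening sequence."""
    i = REOPEN_STAGES.index(current)
    if i == len(REOPEN_STAGES) - 1:
        return current  # already fully open
    return REOPEN_STAGES[i + 1]

stage = "halted"
while stage != "trading":
    stage = next_stage(stage)
print(stage)  # trading
```

Forcing every market through cancel-only and auction phases before live trading gives operators a checkpoint to verify order-book health at each step, rather than reopening everything at once.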

This outage comes as Coinbase lays off about 700 workers, or 14% of its staff, to replace manual processes with AI. Some have questioned whether the AI pivot is going well. But Josh Ellithorpe, responding to Witoff’s post, pushed back on speculation. He argued that no one “vibe coded” a failure, and that Coinbase did design a failover system. “Things happen at scale,” he said, urging critics not to jump to conclusions.
