Distributed systems promise something that centralised systems cannot: the ability to keep working even when parts break. If one server crashes, another takes over. If a network link goes down, traffic is rerouted. This is the intuitive promise — and it is why we build distributed systems in the first place.
But this promise comes at a cost. The same properties that make a system resilient also introduce fundamental limits. The CAP theorem — first articulated by Eric Brewer in 2000 and proved by Seth Gilbert and Nancy Lynch in 2002 — is the most famous of these limits. It states that a shared-data distributed system cannot simultaneously guarantee consistency (all nodes see the same data at the same time), availability (every request gets a response), and partition tolerance (the system works despite network failures).
The CAP theorem is not about what is desirable but about what is possible. It sets a hard boundary on the design space of distributed systems.
A central motivation for distribution is the ability to mask failures. As Friedman and Birman (1996) observed, "being able to keep on providing services in spite of failures is supposed to be one of the main benefits of distributed systems over centralised ones." The idea is intuitive: if one component fails or becomes disconnected, another component replaces it, and the outside world never notices.
When a system can hide most of its failures, it appears to be always working — we call it highly available. The key question for system designers becomes: how much failure can a given system sustain before the failure becomes visible to the user?
"What is the relationship between failure masking and availability? Can you give an example where a system fails but a user does not notice?"
Gilbert and Lynch (2012) frame our expectations in terms of two classic properties from concurrent and distributed computing:
| Property | Safety / Liveness | Meaning |
|---|---|---|
| Consistency | Safety | We want correct behaviour: the system returns the right answer. A read after a write returns the value of that write (or a later one). |
| Availability | Liveness | We want the system to do something: every request to a non-failing node gets a response. The system stays live and responsive. |
However, real systems experience power losses, crashes, network failures, message loss, malicious attacks, and Byzantine failures. The environment is unreliable. The tension between what we want (correctness and responsiveness) and the environment we have (unreliable and unpredictable) is the central problem of distributed systems design.
Gilbert and Lynch (2012) place the CAP theorem in a broader context: "The CAP theorem is one example of a more general tradeoff between safety and liveness in unreliable systems."
Safety properties say "nothing bad happens." In a distributed data store, consistency is a safety property: it says the system must never return stale or conflicting data. Liveness properties say "something good eventually happens." Availability is a liveness property: it says every legitimate request eventually receives a response.
In a reliable system (perfect network, no crashes), safety and liveness can coexist. In an unreliable system — which every real distributed system is — they conflict. This conflict is what CAP captures with precision.
This framing is powerful: FLP impossibility, consensus lower bounds, and many other distributed-systems impossibility results can be understood as specific instances of the safety-liveness tension under asynchrony.
Informally, the CAP theorem says: in a distributed application, you can simultaneously provide at most two of the following three properties:
Brewer's original formulation (2000) described this as a conjecture about distributed databases. The community quickly recognised its broader significance: any shared-data distributed system must contend with this limit.
"According to the CAP theorem, it is only possible to simultaneously provide any two of the three following properties in distributed applications: consistency (C), availability (A), and partition tolerance (P)." — Shim, 2012
Building on Zhao (2014), we can state the three guarantees more precisely:
| Guarantee | Definition |
|---|---|
| Consistency (C) | The replicated data is always consistent with each other. All reads see the same value regardless of which node answers. |
| Availability (A) | The data is highly available to users. Every request to a non-failing node produces a response. |
| Partition Tolerance (P) | The system can continue providing services even when the network partitions — when communication between some nodes becomes impossible. |
The theorem itself is an impossibility result: no shared-data distributed system can satisfy all three simultaneously. Understanding why requires defining each term with care and working through the proof.
The classic CAP formulation says "pick two." The three options are:
| Choice | Gives up | Characterisation |
|---|---|---|
| CA | Partition tolerance | Consistency and availability, but only when the network is reliable. If a partition occurs, the system must stop accepting writes. |
| CP | Availability | Consistency and partition tolerance. The system blocks or rejects requests when it cannot guarantee consistency — e.g., during a partition, some nodes refuse to serve reads. |
| AP | (Strong) consistency | Availability and partition tolerance. The system keeps responding, but different nodes may return different values during a partition. |
However, as the slides note, the three properties are not symmetric:
As Grimm et al. (2004) noted, instability is the rule in pervasive systems. For IoT, mobile, and edge computing, P is non-negotiable. The CAP theorem therefore most often forces us to choose between availability and consistency.
Location-based games — BotFighters, GeoZombie, Ingress, Pokemon Go, Harry Potter: Wizards Unite, Minecraft Earth — are an excellent real-world illustration of CAP tradeoffs. They share a characteristic architecture: millions of mobile players moving through physical space, with game state that depends on location.
Mobile devices are inherently unstable in network connectivity. Players move in and out of coverage. Players concentrate in some areas (special events) and are sparse in others. Scalability is multi-level: not just total player count, but the number of distinct locations and the density of players at each location.
The Chicago 2017 Pokemon Go Fest was a costly disaster: high player density overloaded the network infrastructure, effectively causing a partition — and the system's CAP choices determined how (badly) it degraded.
Game data consistency (goals, achievements, items) is essential to keep players engaged. Location-based games typically use logically-centralised architectures with spatial replication to reduce latency and improve scalability. When strong consistency is required — e.g., for in-game transactions — players may have to wait for servers to confirm the operation. The first casualty is availability: the spinning wheel of death.
Pokemon Go's architecture (described in Google Cloud's blog) reveals how CAP choices are engineered in a real system serving millions of concurrent players.
Observe the CAP strategy: Spanner provides CP (strong consistency, partition-tolerant by design). The spatial cache and Bigtable provide AP (availability with eventual consistency). The system mixes models based on the operation: in-game transactions require CP, while location data and analytics use AP. This is exactly the kind of "maximise combinations" approach Brewer advocated in 2012.
To move from conjecture to theorem, we need precise definitions. Gilbert and Lynch (2002) formalised the three concepts:
"For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response." — Gilbert & Lynch, 2002
If the system is available, we get responses. A non-failing node cannot ignore or drop requests; it must reply.
A consistent (atomic) service is modelled as an atomic data object where operations are totally ordered and each operation occurs in a single instant of time. All read operations after a write completes must return the value of that write or a later one.
Note: in the CAP context, "consistency" means atomic consistency (linearisability), not the ACID notion of consistency (which encompasses both correctness and isolation). Here, consistency implies that every read returns the most recent write — as if there were only one copy of the data, even though many replicas exist. If the system is consistent, we get correct responses.
"When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost." Any pattern of message loss can be modelled as a temporary partition separating the communicating nodes at the instant the message is lost.
Roughly: if node A sends a message to node B but the message does not arrive, the network is partitioned. Availability means B receives the message and responds. Consistency means B's response is correct.
"Explain the difference between CAP consistency and ACID consistency. Why does Gilbert and Lynch's definition use the atomic data object abstraction?"
Gilbert and Lynch considered three network models: (i) asynchronous network with message loss, (ii) asynchronous network without message loss, and (iii) partially synchronous network with local clocks. For teaching purposes we focus on the asynchronous model: there is no global clock, and nodes act based on local computation and received messages.
"It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties: availability, atomic consistency, in all fair executions (including those in which messages are lost)."
The proof structure is elegant:
The proof shows that any system that tries to guarantee all three properties simultaneously will fail when a partition occurs: it must either block (losing availability) or return stale data (losing consistency).
An impossibility result like CAP might seem discouraging. "Should we stop distributing computational systems?" the slides ask — "in the same way we stopped using axiomatic systems after Godel?"
The answer is clearly no. Negative results set boundaries and reveal the limits of our reach. Understanding what is impossible is what makes good engineering possible: it tells us which tradeoffs are inherent and which are merely accidental. An impossibility result from computer science becomes a guide for engineering methods and practices.
CAP does not tell you to give up. It tells you where the design space ends and forces you to make an informed choice about which two properties matter for your application.
Brewer himself later refined the CAP theorem (Brewer, 2012), adding crucial nuance:
This means the choice is not a permanent architectural decision but a runtime strategy. A well-designed system detects partitions and switches behaviour accordingly:
| System state | Behaviour |
|---|---|
| Normal operation (no partition) | Full consistency + availability. All replicas are synchronised, reads are correct, writes are acknowledged. |
| Partition detected | Enter a degraded mode: either block writes to some replicas (CP) or serve stale data (AP). |
| Partition heals | Reconcile differences, merge conflicting writes, restore full consistency, resume normal operation. |
This partition-aware approach is the modern interpretation of CAP — not a static choice, but a dynamic policy that maximises consistency and availability given the current network conditions.
The traditional ACID semantics (Atomicity, Consistency, Isolation, Durability) were designed for centralised databases with strong assumptions. Fox et al. (1997) argued that for many Internet services, ACID is too strong — and proposed an alternative.
Atomicity: Transactions are all-or-nothing.
Consistency: Transactions preserve database invariants.
Isolation: Concurrent transactions do not interfere.
Durability: Committed data survives failures.
ACID guarantees strong consistency but at the cost of availability under partition. It is the right choice when correctness is paramount (e.g., financial transactions, inventory management).
Basically Available: The system guarantees a response, even if the data may be stale.
Soft State: State can change over time without input, as replicas converge.
Eventual Consistency: In the absence of updates, all replicas will eventually converge to the same value.
Fox et al. (1997) argued: "Stale data can be temporarily tolerated as long as all copies of data eventually reach consistency after a short time." "Approximate answers (based on stale data or incomplete soft state) delivered quickly may be more valuable than exact answers."
BASE is the natural model for AP systems. It prioritises availability and partition tolerance over immediate consistency.
The contrast is stark:
| Dimension | ACID | BASE |
|---|---|---|
| Consistency model | Strong (atomic) | Weak / eventual |
| Availability | Sacrificed under partition | Always available |
| Design philosophy | Correctness first | Responsiveness first |
| State management | Persistent, durable | Soft state, ephemeral |
| Typical use | Banking, ledgers, inventory | Social feeds, caching, analytics |
Most cloud services today adopt BASE — eBay, Amazon DynamoDB, and many others. However, the landscape has evolved beyond a simple ACID vs. binary choice:
Newer models — such as consistent soft-state replication (Birman et al., 2012) — attempt to bridge the gap, offering stronger guarantees than eventual consistency while preserving availability. The lesson is that CAP is not a static law but a design constraint that modern systems navigate with increasing sophistication.
The CAP literature is rich with extensions: PACELC (Partition/Else tradeoff), cobalt (consistency-based availability), and hybrid logical clocks. These are outside the scope of this lesson but are natural next steps.
Use this state explorer to understand how the system moves through different consistency-availability regimes as the network condition changes.
The CAP theorem is a cornerstone of distributed systems theory. The key takeaways:
"The modern CAP goal should be to maximise combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations."
It is impossible in the asynchronous network model to implement a read/write data object that simultaneously guarantees availability, atomic consistency, and partition tolerance in all fair executions (including those with message loss). It is an impossibility result because it proves that no distributed system can achieve all three properties — it is not a matter of better design but a fundamental limit.
CAP consistency (atomic consistency / linearisability) requires that all operations appear to occur in a single instant and that reads always return the most recent write — as if there were only one copy. ACID consistency means a transaction preserves database invariants (e.g., foreign keys, constraints). ACID consistency encompasses both the C and the I (isolation) of ACID, whereas CAP consistency is narrower and stronger in the temporal sense.
Because network partitions are inevitable. Mobile devices, IoT, edge computing, WAN links, and cloud infrastructure all experience intermittent connectivity. As Grimm et al. (2004) observed, instability rules in pervasive systems. A system that forfeits P simply stops working when a partition occurs — which is unacceptable for continuously available services. The practical choice is therefore between C and A.
(1) Assume all three properties hold. (2) Partition the network into G1 and G2 with no messages crossing. (3) Object o initially stores v0. (4) Write v1 != v0 in G1 — by availability it completes. (5) Read o in G2 — by availability it completes. (6) Due to the partition, G2 does not receive the write, so the read returns v0. (7) This violates atomic consistency: a read after a write should return the write's value or a later one. Contradiction. QED.
Pokemon Go uses Google Spanner (CP) for in-game transactions like catching a Pokemon — strong consistency is required for the game economy. It uses Bigtable (AP) for player action logging and analytics, where eventual consistency is acceptable. The spatial cache is also AP, optimising for responsiveness over perfect correctness in location data. This mixed approach reflects Brewer's 2012 recommendation: maximise useful combinations of C and A based on each operation's semantics.
BASE stands for Basically Available, Soft state, Eventual consistency. It is the natural data consistency model for AP systems — systems that favour partition tolerance and availability over strong consistency. BASE accepts that data may be stale in the short term, trusting that replicas will converge eventually. It is the practical alternative to ACID for large-scale, partition-tolerant systems.
Because partition tolerance is not a genuine design choice for real-world distributed systems: partitions are inevitable in any system operating over an unreliable network. This is especially true in the IoT era where pervasive, mobile, and edge computing dominate. Since P is forced, the tradeoff reduces to C vs. A. However, the choice is not binary: there are rich spectrums of both consistency (strong, causal, eventual, read-committed, etc.) and availability (response-time SLAs, error budgets, etc.).
Brewer clarified that CAP is not a static design choice but a dynamic one: (i) when partitioned, choose C vs. A; (ii) when not partitioned, you can have both. This partition-aware design approach treats the tradeoff as a runtime decision with recovery procedures. The modern CAP goal is to maximise the C-A combination that makes sense for the specific application, with plans for partition operation and post-partition recovery.