DS-C1 — The CAP Theorem

1. Availability, Consistency, Failure

Distributed systems promise something that centralised systems cannot: the ability to keep working even when parts break. If one server crashes, another takes over. If a network link goes down, traffic is rerouted. This is the intuitive promise — and it is why we build distributed systems in the first place.

But this promise comes at a cost. The same properties that make a system resilient also introduce fundamental limits. The CAP theorem — first articulated by Eric Brewer in 2000 and proved by Seth Gilbert and Nancy Lynch in 2002 — is the most famous of these limits. It states that a shared-data distributed system cannot simultaneously guarantee consistency (all nodes see the same data at the same time), availability (every request gets a response), and partition tolerance (the system works despite network failures).

Key idea

The CAP theorem is not about what is desirable but about what is possible. It sets a hard boundary on the design space of distributed systems.

2. Hiding Failure in Distributed Systems

A central motivation for distribution is the ability to mask failures. As Friedman and Birman (1996) observed, "being able to keep on providing services in spite of failures is supposed to be one of the main benefits of distributed systems over centralised ones." The idea is intuitive: if one component fails or becomes disconnected, another component replaces it, and the outside world never notices.

When a system can hide most of its failures, it appears to be always working — we call it highly available. The key question for system designers becomes: how much failure can a given system sustain before the failure becomes visible to the user?

The examiner will ask

"What is the relationship between failure masking and availability? Can you give an example where a system fails but a user does not notice?"

3. What We Expect from a Distributed System

Gilbert and Lynch (2012) frame our expectations in terms of two classic properties from concurrent and distributed computing:

Property	Safety / Liveness	Meaning
Consistency	Safety	We want correct behaviour: the system returns the right answer. A read after a write returns the value of that write (or a later one).
Availability	Liveness	We want the system to do something: every request to a non-failing node gets a response. The system stays live and responsive.

However, real systems experience power losses, crashes, network failures, message loss, malicious attacks, and Byzantine failures. The environment is unreliable. The tension between what we want (correctness and responsiveness) and the environment we have (unreliable and unpredictable) is the central problem of distributed systems design.

4. Safety vs. Liveness: The Deeper Tradeoff

Gilbert and Lynch (2012) place the CAP theorem in a broader context: "The CAP theorem is one example of a more general tradeoff between safety and liveness in unreliable systems."

Safety properties say "nothing bad happens." In a distributed data store, consistency is a safety property: it says the system must never return stale or conflicting data. Liveness properties say "something good eventually happens." Availability is a liveness property: it says every legitimate request eventually receives a response.

In a reliable system (perfect network, no crashes), safety and liveness can coexist. In an unreliable system — which every real distributed system is — they conflict. This conflict is what CAP captures with precision.

Editor's note

This framing is powerful: FLP impossibility, consensus lower bounds, and many other distributed-systems impossibility results can be understood as specific instances of the safety-liveness tension under asynchrony.

5. CAP at a Glance

Informally, the CAP theorem says: in a distributed application, you can simultaneously provide at most two of the following three properties:

C — Consistency: All nodes see the same data at the same time. The replicated data is always consistent with each other.
A — Availability: Every request receives a response, even if some nodes are down.
P — Partition Tolerance: The system continues operating despite network partitions that break communication between nodes.

Brewer's original formulation (2000) described this as a conjecture about distributed databases. The community quickly recognised its broader significance: any shared-data distributed system must contend with this limit.

"According to the CAP theorem, it is only possible to simultaneously provide any two of the three following properties in distributed applications: consistency (C), availability (A), and partition tolerance (P)." — Shim, 2012

Guarantee	Definition
Consistency (C)	The replicated data is always consistent with each other. All reads see the same value regardless of which node answers.
Availability (A)	The data is highly available to users. Every request to a non-failing node produces a response.
Partition Tolerance (P)	The system can continue providing services even when the network partitions — when communication between some nodes becomes impossible.

7. Pick Two — and Why P Is Different

The classic CAP formulation says "pick two." The three options are:

Choice	Gives up	Characterisation
CA	Partition tolerance	Consistency and availability, but only when the network is reliable. If a partition occurs, the system must stop accepting writes.
CP	Availability	Consistency and partition tolerance. The system blocks or rejects requests when it cannot guarantee consistency — e.g., during a partition, some nodes refuse to serve reads.
AP	(Strong) consistency	Availability and partition tolerance. The system keeps responding, but different nodes may return different values during a partition.

However, as the slides note, the three properties are not symmetric:

Consistency and availability are spectra: There are many shades of consistency (strong, eventual, causal, read-committed, etc.) and many degrees of availability (five nines, four nines, etc.).
Partition tolerance is essentially binary: Either your system survives network partitions or it does not. Given the realities of modern infrastructure — especially in IoT, edge, and mobile systems — forfeiting partition tolerance is rarely a realistic option.

Reality check

As Grimm et al. (2004) noted, instability is the rule in pervasive systems. For IoT, mobile, and edge computing, P is non-negotiable. The CAP theorem therefore most often forces us to choose between availability and consistency.

9. Location-Based Games: A CAP Lens

Location-based games — BotFighters, GeoZombie, Ingress, Pokemon Go, Harry Potter: Wizards Unite, Minecraft Earth — are an excellent real-world illustration of CAP tradeoffs. They share a characteristic architecture: millions of mobile players moving through physical space, with game state that depends on location.

Partition tolerance is non-negotiable

Mobile devices are inherently unstable in network connectivity. Players move in and out of coverage. Players concentrate in some areas (special events) and are sparse in others. Scalability is multi-level: not just total player count, but the number of distinct locations and the density of players at each location.

Cautionary tale

The Chicago 2017 Pokemon Go Fest was a costly disaster: high player density overloaded the network infrastructure, effectively causing a partition — and the system's CAP choices determined how (badly) it degraded.

Consistency and availability tension

Game data consistency (goals, achievements, items) is essential to keep players engaged. Location-based games typically use logically-centralised architectures with spatial replication to reduce latency and improve scalability. When strong consistency is required — e.g., for in-game transactions — players may have to wait for servers to confirm the operation. The first casualty is availability: the spinning wheel of death.

10. Case Study: Pokemon Go on Google Cloud

Pokemon Go's architecture (described in Google Cloud's blog) reveals how CAP choices are engineered in a real system serving millions of concurrent players.

1. Content delivery: When a player opens the app, all static media are downloaded from Cloud Storage via Cloud CDN, using Google's global edge network for low-latency delivery.
2. Request routing: Player requests go to an NGINX reverse proxy, which routes them to the Frontend game service hosted on Google Kubernetes Engine (GKE).
3. Spatial backend: The Spatial Query Backend handles location-based features with a cache sharded by location. It decides which Pokemon appear on the map, which gyms and Pokestops are nearby, and which time zone applies.
4. Strong consistency for writes: When a player catches a Pokemon, the Frontend (GKE) sends a write event to Google Spanner — a strongly consistent, globally replicated database. The player waits for the write to complete before getting a response.
5. Logging and analytics: Every player action is recorded in Bigtable (NoSQL) as a Protobuf representation for logging and tracking. It is also published to a Pub/Sub topic for the analysis pipeline.
6. Regional sync: Multiple players in the same geographical region are kept in sync through deterministic map rendering. Even on different machines, same-location players get the same inputs.
7. Shared world: All servers synchronise settings changes and event timings, so players feel they share a consistent world.

Observe the CAP strategy: Spanner provides CP (strong consistency, partition-tolerant by design). The spatial cache and Bigtable provide AP (availability with eventual consistency). The system mixes models based on the operation: in-game transactions require CP, while location data and analytics use AP. This is exactly the kind of "maximise combinations" approach Brewer advocated in 2012.

11. Back to CAP: Definitions for a Proof

To move from conjecture to theorem, we need precise definitions. Gilbert and Lynch (2002) formalised the three concepts:

Definition: Availability

"For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response." — Gilbert & Lynch, 2002

If the system is available, we get responses. A non-failing node cannot ignore or drop requests; it must reply.

Definition: Consistency (atomic consistency)

A consistent (atomic) service is modelled as an atomic data object where operations are totally ordered and each operation occurs in a single instant of time. All read operations after a write completes must return the value of that write or a later one.

Note: in the CAP context, "consistency" means atomic consistency (linearisability), not the ACID notion of consistency (which encompasses both correctness and isolation). Here, consistency implies that every read returns the most recent write — as if there were only one copy of the data, even though many replicas exist. If the system is consistent, we get correct responses.

Definition: Network Partition

"When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost." Any pattern of message loss can be modelled as a temporary partition separating the communicating nodes at the instant the message is lost.

Roughly: if node A sends a message to node B but the message does not arrive, the network is partitioned. Availability means B receives the message and responds. Consistency means B's response is correct.

The examiner will ask

"Explain the difference between CAP consistency and ACID consistency. Why does Gilbert and Lynch's definition use the atomic data object abstraction?"

12. The Theorem and Its Proof

Gilbert and Lynch considered three network models: (i) asynchronous network with message loss, (ii) asynchronous network without message loss, and (iii) partially synchronous network with local clocks. For teaching purposes we focus on the asynchronous model: there is no global clock, and nodes act based on local computation and received messages.

The Theorem

CAP Theorem (Gilbert & Lynch, 2002)

"It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties: availability, atomic consistency, in all fair executions (including those in which messages are lost)."

Proof by Contradiction

The proof structure is elegant:

Assume atomicity, availability, and partition tolerance all hold simultaneously.
Partition the network into two disjoint, non-empty sets G₁ and G₂. Communication between them is lost.
Let an atomic object o have initial value v₀.
Execution α₁: A write of v₁ ≠ v₀ to o occurs in G₁. During α₁, no messages cross the partition.
Execution α₂: A read of o occurs in G₂. During α₂, no messages cross the partition.
By availability: α₁ completes (the write succeeds in G₁), and α₂ completes (the read returns a value in G₂).
By the partition: G₂ never learns about the write. It sees only its local state, which still holds v₀. So the read returns v₀.
This violates consistency: A read after the write completed should return v₁ or a later value, not the stale v₀. Contradiction. QED.

The proof shows that any system that tries to guarantee all three properties simultaneously will fail when a partition occurs: it must either block (losing availability) or return stale data (losing consistency).

15. Partition-Aware Design: Switch Strategy

Brewer himself later refined the CAP theorem (Brewer, 2012), adding crucial nuance:

When the network is partitioned, a distributed system must choose a tradeoff between consistency and availability.
When there is no partition, a system can feature both consistency and availability.

This means the choice is not a permanent architectural decision but a runtime strategy. A well-designed system detects partitions and switches behaviour accordingly:

System state	Behaviour
Normal operation (no partition)	Full consistency + availability. All replicas are synchronised, reads are correct, writes are acknowledged.
Partition detected	Enter a degraded mode: either block writes to some replicas (CP) or serve stale data (AP).
Partition heals	Reconcile differences, merge conflicting writes, restore full consistency, resume normal operation.

This partition-aware approach is the modern interpretation of CAP — not a static choice, but a dynamic policy that maximises consistency and availability given the current network conditions.

16. ACID vs. BASE

The traditional ACID semantics (Atomicity, Consistency, Isolation, Durability) were designed for centralised databases with strong assumptions. Fox et al. (1997) argued that for many Internet services, ACID is too strong — and proposed an alternative.

Atomicity: Transactions are all-or-nothing.
Consistency: Transactions preserve database invariants.
Isolation: Concurrent transactions do not interfere.
Durability: Committed data survives failures.

ACID guarantees strong consistency but at the cost of availability under partition. It is the right choice when correctness is paramount (e.g., financial transactions, inventory management).

Basically Available: The system guarantees a response, even if the data may be stale.
Soft State: State can change over time without input, as replicas converge.
Eventual Consistency: In the absence of updates, all replicas will eventually converge to the same value.

Fox et al. (1997) argued: "Stale data can be temporarily tolerated as long as all copies of data eventually reach consistency after a short time." "Approximate answers (based on stale data or incomplete soft state) delivered quickly may be more valuable than exact answers."

BASE is the natural model for AP systems. It prioritises availability and partition tolerance over immediate consistency.

The contrast is stark:

Dimension	ACID	BASE
Consistency model	Strong (atomic)	Weak / eventual
Availability	Sacrificed under partition	Always available
Design philosophy	Correctness first	Responsiveness first
State management	Persistent, durable	Soft state, ephemeral
Typical use	Banking, ledgers, inventory	Social feeds, caching, analytics

17. Beyond BASE: Cloud Consistency Today

Most cloud services today adopt BASE — eBay, Amazon DynamoDB, and many others. However, the landscape has evolved beyond a simple ACID vs. binary choice:

Amazon S3 now allows for strong read-after-write consistency (not just eventual). The same object store can serve both strongly consistent and eventually consistent workloads.
Amazon DynamoDB writes are eventually consistent by default; reads are eventually consistent too. But strongly consistent reads can be configured — at a higher cost. This exposes CAP as a cost-performance tradeoff, not just a correctness choice.
Systems increasingly mask inconsistencies from users through client-side caching, version vectors, and reconciliation logic.

Newer models — such as consistent soft-state replication (Birman et al., 2012) — attempt to bridge the gap, offering stronger guarantees than eventual consistency while preserving availability. The lesson is that CAP is not a static law but a design constraint that modern systems navigate with increasing sophistication.

Editor's note

The CAP literature is rich with extensions: PACELC (Partition/Else tradeoff), cobalt (consistency-based availability), and hybrid logical clocks. These are outside the scope of this lesson but are natural next steps.

19. Lessons Learnt

The CAP theorem is a cornerstone of distributed systems theory. The key takeaways:

In the design and development of distributed systems, we must typically choose between a responsive system and a fully-consistent one when the network fails.
We can leverage CAP to design systems with different features depending on system status — partition-aware design turns an impossibility result into an engineering strategy.
ACID is the traditional model for reliable data management, but less stringent models — in particular, BASE — are both theoretically justified and practically useful in real-world distributed systems where availability matters more than immediate consistency.
The CAP theorem, like all impossibility results in computer science, does not paralyse us — it informs our engineering choices and helps us understand the fundamental boundaries of what distributed systems can achieve.

Brewer (2012)

"The modern CAP goal should be to maximise combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations."

Check Your Understanding

1. State the CAP theorem precisely. Why is it an "impossibility result"?

It is impossible in the asynchronous network model to implement a read/write data object that simultaneously guarantees availability, atomic consistency, and partition tolerance in all fair executions (including those with message loss). It is an impossibility result because it proves that no distributed system can achieve all three properties — it is not a matter of better design but a fundamental limit.

2. Explain the difference between CAP consistency and ACID consistency.

CAP consistency (atomic consistency / linearisability) requires that all operations appear to occur in a single instant and that reads always return the most recent write — as if there were only one copy. ACID consistency means a transaction preserves database invariants (e.g., foreign keys, constraints). ACID consistency encompasses both the C and the I (isolation) of ACID, whereas CAP consistency is narrower and stronger in the temporal sense.

3. Why is partition tolerance non-negotiable in most real-world distributed systems?

Because network partitions are inevitable. Mobile devices, IoT, edge computing, WAN links, and cloud infrastructure all experience intermittent connectivity. As Grimm et al. (2004) observed, instability rules in pervasive systems. A system that forfeits P simply stops working when a partition occurs — which is unacceptable for continuously available services. The practical choice is therefore between C and A.

4. Walk through the Gilbert and Lynch proof step by step. What is the key contradiction?

(1) Assume all three properties hold. (2) Partition the network into G1 and G2 with no messages crossing. (3) Object o initially stores v0. (4) Write v1 != v0 in G1 — by availability it completes. (5) Read o in G2 — by availability it completes. (6) Due to the partition, G2 does not receive the write, so the read returns v0. (7) This violates atomic consistency: a read after a write should return the write's value or a later one. Contradiction. QED.

5. How does Pokemon Go's architecture mix CP and AP? Why is this a good CAP strategy?

Pokemon Go uses Google Spanner (CP) for in-game transactions like catching a Pokemon — strong consistency is required for the game economy. It uses Bigtable (AP) for player action logging and analytics, where eventual consistency is acceptable. The spatial cache is also AP, optimising for responsiveness over perfect correctness in location data. This mixed approach reflects Brewer's 2012 recommendation: maximise useful combinations of C and A based on each operation's semantics.

6. What is BASE? How does it relate to the CAP theorem?

BASE stands for Basically Available, Soft state, Eventual consistency. It is the natural data consistency model for AP systems — systems that favour partition tolerance and availability over strong consistency. BASE accepts that data may be stale in the short term, trusting that replicas will converge eventually. It is the practical alternative to ACID for large-scale, partition-tolerant systems.

7. "The CAP theorem most often forces us to choose between availability and consistency." Why "most often"?

Because partition tolerance is not a genuine design choice for real-world distributed systems: partitions are inevitable in any system operating over an unreliable network. This is especially true in the IoT era where pervasive, mobile, and edge computing dominate. Since P is forced, the tradeoff reduces to C vs. A. However, the choice is not binary: there are rich spectrums of both consistency (strong, causal, eventual, read-committed, etc.) and availability (response-time SLAs, error budgets, etc.).

8. What did Brewer add in his 2012 retrospective? Why does it matter?

Brewer clarified that CAP is not a static design choice but a dynamic one: (i) when partitioned, choose C vs. A; (ii) when not partitioned, you can have both. This partition-aware design approach treats the tradeoff as a runtime decision with recovery procedures. The modern CAP goal is to maximise the C-A combination that makes sense for the specific application, with plans for partition operation and post-partition recovery.

The CAP Theorem

In this lesson

1. Availability, Consistency, Failure

2. Hiding Failure in Distributed Systems

3. What We Expect from a Distributed System

4. Safety vs. Liveness: The Deeper Tradeoff

5. CAP at a Glance

6. The CAP Theorem Formally

7. Pick Two — and Why P Is Different

8. Interactive: Choose Your CAP Combination

9. Location-Based Games: A CAP Lens

Partition tolerance is non-negotiable

Consistency and availability tension

10. Case Study: Pokemon Go on Google Cloud

11. Back to CAP: Definitions for a Proof

12. The Theorem and Its Proof

The Theorem

Proof by Contradiction

13. Interactive: Step Through the Proof

14. Building on an Impossibility Result

15. Partition-Aware Design: Switch Strategy

16. ACID vs. BASE

17. Beyond BASE: Cloud Consistency Today

18. State Explorer: Consistency vs. Availability Tradeoff

19. Lessons Learnt

Check Your Understanding