Distributed Systems — Prof. Omicini

The CAP Theorem

DS-C1 · The CAP Theorem. Availability, Consistency, Failure in Distributed SystemsAndrea Omicini — DISI, Univ. BolognaA.Y. 2025/2026

In this lesson

1. Availability, Consistency, Failure

Distributed systems promise something that centralised systems cannot: the ability to keep working even when parts break. If one server crashes, another takes over. If a network link goes down, traffic is rerouted. This is the intuitive promise — and it is why we build distributed systems in the first place.

But this promise comes at a cost. The same properties that make a system resilient also introduce fundamental limits. The CAP theorem — first articulated by Eric Brewer in 2000 and proved by Seth Gilbert and Nancy Lynch in 2002 — is the most famous of these limits. It states that a shared-data distributed system cannot simultaneously guarantee consistency (all nodes see the same data at the same time), availability (every request gets a response), and partition tolerance (the system works despite network failures).

Key idea

The CAP theorem is not about what is desirable but about what is possible. It sets a hard boundary on the design space of distributed systems.

2. Hiding Failure in Distributed Systems

A central motivation for distribution is the ability to mask failures. As Friedman and Birman (1996) observed, "being able to keep on providing services in spite of failures is supposed to be one of the main benefits of distributed systems over centralised ones." The idea is intuitive: if one component fails or becomes disconnected, another component replaces it, and the outside world never notices.

When a system can hide most of its failures, it appears to be always working — we call it highly available. The key question for system designers becomes: how much failure can a given system sustain before the failure becomes visible to the user?

The examiner will ask

"What is the relationship between failure masking and availability? Can you give an example where a system fails but a user does not notice?"

3. What We Expect from a Distributed System

Gilbert and Lynch (2012) frame our expectations in terms of two classic properties from concurrent and distributed computing:

PropertySafety / LivenessMeaning
ConsistencySafetyWe want correct behaviour: the system returns the right answer. A read after a write returns the value of that write (or a later one).
AvailabilityLivenessWe want the system to do something: every request to a non-failing node gets a response. The system stays live and responsive.

However, real systems experience power losses, crashes, network failures, message loss, malicious attacks, and Byzantine failures. The environment is unreliable. The tension between what we want (correctness and responsiveness) and the environment we have (unreliable and unpredictable) is the central problem of distributed systems design.

4. Safety vs. Liveness: The Deeper Tradeoff

Gilbert and Lynch (2012) place the CAP theorem in a broader context: "The CAP theorem is one example of a more general tradeoff between safety and liveness in unreliable systems."

Safety properties say "nothing bad happens." In a distributed data store, consistency is a safety property: it says the system must never return stale or conflicting data. Liveness properties say "something good eventually happens." Availability is a liveness property: it says every legitimate request eventually receives a response.

In a reliable system (perfect network, no crashes), safety and liveness can coexist. In an unreliable system — which every real distributed system is — they conflict. This conflict is what CAP captures with precision.

Editor's note

This framing is powerful: FLP impossibility, consensus lower bounds, and many other distributed-systems impossibility results can be understood as specific instances of the safety-liveness tension under asynchrony.

5. CAP at a Glance

Informally, the CAP theorem says: in a distributed application, you can simultaneously provide at most two of the following three properties:

Brewer's original formulation (2000) described this as a conjecture about distributed databases. The community quickly recognised its broader significance: any shared-data distributed system must contend with this limit.

"According to the CAP theorem, it is only possible to simultaneously provide any two of the three following properties in distributed applications: consistency (C), availability (A), and partition tolerance (P)." — Shim, 2012

6. The CAP Theorem Formally

Building on Zhao (2014), we can state the three guarantees more precisely:

GuaranteeDefinition
Consistency (C)The replicated data is always consistent with each other. All reads see the same value regardless of which node answers.
Availability (A)The data is highly available to users. Every request to a non-failing node produces a response.
Partition Tolerance (P)The system can continue providing services even when the network partitions — when communication between some nodes becomes impossible.

The theorem itself is an impossibility result: no shared-data distributed system can satisfy all three simultaneously. Understanding why requires defining each term with care and working through the proof.

7. Pick Two — and Why P Is Different

The classic CAP formulation says "pick two." The three options are:

ChoiceGives upCharacterisation
CAPartition toleranceConsistency and availability, but only when the network is reliable. If a partition occurs, the system must stop accepting writes.
CPAvailabilityConsistency and partition tolerance. The system blocks or rejects requests when it cannot guarantee consistency — e.g., during a partition, some nodes refuse to serve reads.
AP(Strong) consistencyAvailability and partition tolerance. The system keeps responding, but different nodes may return different values during a partition.

However, as the slides note, the three properties are not symmetric:

Reality check

As Grimm et al. (2004) noted, instability is the rule in pervasive systems. For IoT, mobile, and edge computing, P is non-negotiable. The CAP theorem therefore most often forces us to choose between availability and consistency.

8. Interactive: Choose Your CAP Combination

C A P Consistency Availability Partition Tolerance

9. Location-Based Games: A CAP Lens

Location-based games — BotFighters, GeoZombie, Ingress, Pokemon Go, Harry Potter: Wizards Unite, Minecraft Earth — are an excellent real-world illustration of CAP tradeoffs. They share a characteristic architecture: millions of mobile players moving through physical space, with game state that depends on location.

Partition tolerance is non-negotiable

Mobile devices are inherently unstable in network connectivity. Players move in and out of coverage. Players concentrate in some areas (special events) and are sparse in others. Scalability is multi-level: not just total player count, but the number of distinct locations and the density of players at each location.

Cautionary tale

The Chicago 2017 Pokemon Go Fest was a costly disaster: high player density overloaded the network infrastructure, effectively causing a partition — and the system's CAP choices determined how (badly) it degraded.

Consistency and availability tension

Game data consistency (goals, achievements, items) is essential to keep players engaged. Location-based games typically use logically-centralised architectures with spatial replication to reduce latency and improve scalability. When strong consistency is required — e.g., for in-game transactions — players may have to wait for servers to confirm the operation. The first casualty is availability: the spinning wheel of death.

10. Case Study: Pokemon Go on Google Cloud

Pokemon Go's architecture (described in Google Cloud's blog) reveals how CAP choices are engineered in a real system serving millions of concurrent players.

  1. 1. Content delivery: When a player opens the app, all static media are downloaded from Cloud Storage via Cloud CDN, using Google's global edge network for low-latency delivery.
  2. 2. Request routing: Player requests go to an NGINX reverse proxy, which routes them to the Frontend game service hosted on Google Kubernetes Engine (GKE).
  3. 3. Spatial backend: The Spatial Query Backend handles location-based features with a cache sharded by location. It decides which Pokemon appear on the map, which gyms and Pokestops are nearby, and which time zone applies.
  4. 4. Strong consistency for writes: When a player catches a Pokemon, the Frontend (GKE) sends a write event to Google Spanner — a strongly consistent, globally replicated database. The player waits for the write to complete before getting a response.
  5. 5. Logging and analytics: Every player action is recorded in Bigtable (NoSQL) as a Protobuf representation for logging and tracking. It is also published to a Pub/Sub topic for the analysis pipeline.
  6. 6. Regional sync: Multiple players in the same geographical region are kept in sync through deterministic map rendering. Even on different machines, same-location players get the same inputs.
  7. 7. Shared world: All servers synchronise settings changes and event timings, so players feel they share a consistent world.

Observe the CAP strategy: Spanner provides CP (strong consistency, partition-tolerant by design). The spatial cache and Bigtable provide AP (availability with eventual consistency). The system mixes models based on the operation: in-game transactions require CP, while location data and analytics use AP. This is exactly the kind of "maximise combinations" approach Brewer advocated in 2012.

11. Back to CAP: Definitions for a Proof

To move from conjecture to theorem, we need precise definitions. Gilbert and Lynch (2002) formalised the three concepts:

Definition: Availability

"For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response." — Gilbert & Lynch, 2002

If the system is available, we get responses. A non-failing node cannot ignore or drop requests; it must reply.

Definition: Consistency (atomic consistency)

A consistent (atomic) service is modelled as an atomic data object where operations are totally ordered and each operation occurs in a single instant of time. All read operations after a write completes must return the value of that write or a later one.

Note: in the CAP context, "consistency" means atomic consistency (linearisability), not the ACID notion of consistency (which encompasses both correctness and isolation). Here, consistency implies that every read returns the most recent write — as if there were only one copy of the data, even though many replicas exist. If the system is consistent, we get correct responses.

Definition: Network Partition

"When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost." Any pattern of message loss can be modelled as a temporary partition separating the communicating nodes at the instant the message is lost.

Roughly: if node A sends a message to node B but the message does not arrive, the network is partitioned. Availability means B receives the message and responds. Consistency means B's response is correct.

The examiner will ask

"Explain the difference between CAP consistency and ACID consistency. Why does Gilbert and Lynch's definition use the atomic data object abstraction?"

12. The Theorem and Its Proof

Gilbert and Lynch considered three network models: (i) asynchronous network with message loss, (ii) asynchronous network without message loss, and (iii) partially synchronous network with local clocks. For teaching purposes we focus on the asynchronous model: there is no global clock, and nodes act based on local computation and received messages.

The Theorem

CAP Theorem (Gilbert & Lynch, 2002)

"It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties: availability, atomic consistency, in all fair executions (including those in which messages are lost)."

Proof by Contradiction

The proof structure is elegant:

  1. Assume atomicity, availability, and partition tolerance all hold simultaneously.
  2. Partition the network into two disjoint, non-empty sets G1 and G2. Communication between them is lost.
  3. Let an atomic object o have initial value v0.
  4. Execution α1: A write of v1 ≠ v0 to o occurs in G1. During α1, no messages cross the partition.
  5. Execution α2: A read of o occurs in G2. During α2, no messages cross the partition.
  6. By availability: α1 completes (the write succeeds in G1), and α2 completes (the read returns a value in G2).
  7. By the partition: G2 never learns about the write. It sees only its local state, which still holds v0. So the read returns v0.
  8. This violates consistency: A read after the write completed should return v1 or a later value, not the stale v0. Contradiction. QED.

The proof shows that any system that tries to guarantee all three properties simultaneously will fail when a partition occurs: it must either block (losing availability) or return stale data (losing consistency).

13. Interactive: Step Through the Proof

14. Building on an Impossibility Result

An impossibility result like CAP might seem discouraging. "Should we stop distributing computational systems?" the slides ask — "in the same way we stopped using axiomatic systems after Godel?"

The answer is clearly no. Negative results set boundaries and reveal the limits of our reach. Understanding what is impossible is what makes good engineering possible: it tells us which tradeoffs are inherent and which are merely accidental. An impossibility result from computer science becomes a guide for engineering methods and practices.

Key insight

CAP does not tell you to give up. It tells you where the design space ends and forces you to make an informed choice about which two properties matter for your application.

15. Partition-Aware Design: Switch Strategy

Brewer himself later refined the CAP theorem (Brewer, 2012), adding crucial nuance:

  1. When the network is partitioned, a distributed system must choose a tradeoff between consistency and availability.
  2. When there is no partition, a system can feature both consistency and availability.

This means the choice is not a permanent architectural decision but a runtime strategy. A well-designed system detects partitions and switches behaviour accordingly:

System stateBehaviour
Normal operation (no partition)Full consistency + availability. All replicas are synchronised, reads are correct, writes are acknowledged.
Partition detectedEnter a degraded mode: either block writes to some replicas (CP) or serve stale data (AP).
Partition healsReconcile differences, merge conflicting writes, restore full consistency, resume normal operation.

This partition-aware approach is the modern interpretation of CAP — not a static choice, but a dynamic policy that maximises consistency and availability given the current network conditions.

16. ACID vs. BASE

The traditional ACID semantics (Atomicity, Consistency, Isolation, Durability) were designed for centralised databases with strong assumptions. Fox et al. (1997) argued that for many Internet services, ACID is too strong — and proposed an alternative.

Atomicity: Transactions are all-or-nothing.
Consistency: Transactions preserve database invariants.
Isolation: Concurrent transactions do not interfere.
Durability: Committed data survives failures.

ACID guarantees strong consistency but at the cost of availability under partition. It is the right choice when correctness is paramount (e.g., financial transactions, inventory management).

Basically Available: The system guarantees a response, even if the data may be stale.
Soft State: State can change over time without input, as replicas converge.
Eventual Consistency: In the absence of updates, all replicas will eventually converge to the same value.

Fox et al. (1997) argued: "Stale data can be temporarily tolerated as long as all copies of data eventually reach consistency after a short time." "Approximate answers (based on stale data or incomplete soft state) delivered quickly may be more valuable than exact answers."

BASE is the natural model for AP systems. It prioritises availability and partition tolerance over immediate consistency.

The contrast is stark:

DimensionACIDBASE
Consistency modelStrong (atomic)Weak / eventual
AvailabilitySacrificed under partitionAlways available
Design philosophyCorrectness firstResponsiveness first
State managementPersistent, durableSoft state, ephemeral
Typical useBanking, ledgers, inventorySocial feeds, caching, analytics

17. Beyond BASE: Cloud Consistency Today

Most cloud services today adopt BASE — eBay, Amazon DynamoDB, and many others. However, the landscape has evolved beyond a simple ACID vs. binary choice:

Newer models — such as consistent soft-state replication (Birman et al., 2012) — attempt to bridge the gap, offering stronger guarantees than eventual consistency while preserving availability. The lesson is that CAP is not a static law but a design constraint that modern systems navigate with increasing sophistication.

Editor's note

The CAP literature is rich with extensions: PACELC (Partition/Else tradeoff), cobalt (consistency-based availability), and hybrid logical clocks. These are outside the scope of this lesson but are natural next steps.

18. State Explorer: Consistency vs. Availability Tradeoff

Use this state explorer to understand how the system moves through different consistency-availability regimes as the network condition changes.

19. Lessons Learnt

The CAP theorem is a cornerstone of distributed systems theory. The key takeaways:

Brewer (2012)

"The modern CAP goal should be to maximise combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations."

Check Your Understanding

1. State the CAP theorem precisely. Why is it an "impossibility result"?

It is impossible in the asynchronous network model to implement a read/write data object that simultaneously guarantees availability, atomic consistency, and partition tolerance in all fair executions (including those with message loss). It is an impossibility result because it proves that no distributed system can achieve all three properties — it is not a matter of better design but a fundamental limit.

2. Explain the difference between CAP consistency and ACID consistency.

CAP consistency (atomic consistency / linearisability) requires that all operations appear to occur in a single instant and that reads always return the most recent write — as if there were only one copy. ACID consistency means a transaction preserves database invariants (e.g., foreign keys, constraints). ACID consistency encompasses both the C and the I (isolation) of ACID, whereas CAP consistency is narrower and stronger in the temporal sense.

3. Why is partition tolerance non-negotiable in most real-world distributed systems?

Because network partitions are inevitable. Mobile devices, IoT, edge computing, WAN links, and cloud infrastructure all experience intermittent connectivity. As Grimm et al. (2004) observed, instability rules in pervasive systems. A system that forfeits P simply stops working when a partition occurs — which is unacceptable for continuously available services. The practical choice is therefore between C and A.

4. Walk through the Gilbert and Lynch proof step by step. What is the key contradiction?

(1) Assume all three properties hold. (2) Partition the network into G1 and G2 with no messages crossing. (3) Object o initially stores v0. (4) Write v1 != v0 in G1 — by availability it completes. (5) Read o in G2 — by availability it completes. (6) Due to the partition, G2 does not receive the write, so the read returns v0. (7) This violates atomic consistency: a read after a write should return the write's value or a later one. Contradiction. QED.

5. How does Pokemon Go's architecture mix CP and AP? Why is this a good CAP strategy?

Pokemon Go uses Google Spanner (CP) for in-game transactions like catching a Pokemon — strong consistency is required for the game economy. It uses Bigtable (AP) for player action logging and analytics, where eventual consistency is acceptable. The spatial cache is also AP, optimising for responsiveness over perfect correctness in location data. This mixed approach reflects Brewer's 2012 recommendation: maximise useful combinations of C and A based on each operation's semantics.

6. What is BASE? How does it relate to the CAP theorem?

BASE stands for Basically Available, Soft state, Eventual consistency. It is the natural data consistency model for AP systems — systems that favour partition tolerance and availability over strong consistency. BASE accepts that data may be stale in the short term, trusting that replicas will converge eventually. It is the practical alternative to ACID for large-scale, partition-tolerant systems.

7. "The CAP theorem most often forces us to choose between availability and consistency." Why "most often"?

Because partition tolerance is not a genuine design choice for real-world distributed systems: partitions are inevitable in any system operating over an unreliable network. This is especially true in the IoT era where pervasive, mobile, and edge computing dominate. Since P is forced, the tradeoff reduces to C vs. A. However, the choice is not binary: there are rich spectrums of both consistency (strong, causal, eventual, read-committed, etc.) and availability (response-time SLAs, error budgets, etc.).

8. What did Brewer add in his 2012 retrospective? Why does it matter?

Brewer clarified that CAP is not a static design choice but a dynamic one: (i) when partitioned, choose C vs. A; (ii) when not partitioned, you can have both. This partition-aware design approach treats the tradeoff as a runtime decision with recovery procedures. The modern CAP goal is to maximise the C-A combination that makes sense for the specific application, with plans for partition operation and post-partition recovery.