DS-C3: The Problem of Consensus in Distributed Systems

1. Agreement Problems in Distributed Systems

Distributed systems are composed of multiple components that interact to achieve a common goal. However, each component has a partial and potentially divergent view of the system state: processes may observe events in different orders, communication delays create uncertainty, and failures introduce conflicting information.

The agreement problem captures this challenge: when the different components of a system have potentially divergent views over the system state, and about what is happening in the system, a coherent overall system behaviour can be achieved only when all the components come to agree about everything relevant for the system itself — state, events, and the ordering of events. The key point is that they reach the very same conclusion, independently of what that conclusion is about.

Key idea

Agreement is about convergence to a shared understanding. It does not matter which specific value the system settles on — what matters is that all correct processes settle on the same value.

Examples of Agreement Problems

Agreement problems appear in many practical scenarios [Fischer, 1983]:

Domain	Agreement needed
Distributed databases	Data managers must agree on whether to commit or abort a distributed transaction
Replicated file systems	Nodes must agree on where file copies reside
Flight control systems	Engine control and flight surface modules must agree on continue/abort for landing
Blockchain / ledgers	Validators must agree on the next block and its ordering
Clock synchronisation	Processes must agree on the current time despite drift

What makes these problems non-trivial is the presence of faults: processes may crash, messages may be lost or delayed, and the system may be subject to unpredictable load. An agreement protocol must work correctly despite these faults — this is the essence of fault-tolerant distributed computing.

2. Basic Ontology: Time & Failure Models

To reason precisely about consensus, we need a shared vocabulary for the key dimensions of distributed systems.

Components & Connectors

Non-trivial computational systems are concurrent or distributed — systems where more than one computation occurs either logically or physically at the same time. These systems are modelled as collections of computing (logical processes) or (physical processors) interconnected via communication channels: either message-passing (one-to-one direct channels) or shared-memory (shared communication channels). The literature often uses processes and processors interchangeably whenever the physical nature does not affect the problem.

Time Model

The time model determines what a process can infer about the state of other processes:

Model	Properties	Implication
Synchronous	Bound Δ on message delay; bound Φ on relative process speed	Accurate failure detection is possible: if no response within Δ + Φ, the process has likely crashed
Asynchronous	No bounds on message delays or process speeds	Impossible to distinguish a crashed process from an extremely slow one

The asynchronous model is the most general and realistic for Internet-scale systems — networks can experience congestion, and processes can be slowed by load. However, it is also the model in which consensus becomes impossible (as FLP shows).

Crucial distinction

In an asynchronous system, you can never be certain whether a process has crashed or is merely slow. This single fact is the root of the FLP impossibility result.

Failure Model

Processes (or processors) may fail in diverse ways, and communication links may fail too. A fault-tolerant system is one that keeps working beyond faults of any sort. The key failure types relevant to consensus:

Crash failure: a process stops permanently (the most common assumption)
Omission failure: a process fails to send or receive messages
Byzantine failure: a process behaves arbitrarily, possibly maliciously
Network partition: communication between groups of processes is lost

Paxos (and most practical consensus algorithms) assumes crash failures only — Byzantine failures require more heavyweight protocols.

3. The Consensus Problem

Consensus is the process by which we reach agreement over system state between unreliable machines connected by (possibly) asynchronous networks. It is the most fundamental agreement problem in distributed systems.

Why Consensus?

When processes fail, the reliable processes need to agree on:

What has happened till now in the system (current state)
What is going to happen from now on (future evolution)

This ensures that the current system state and its evolution are both consistent across all non-faulty replicas. For example, when a crash makes replica synchronisation impossible, a recovery process requires consensus among the surviving replicas about the state of each replica and the events handled or still to handle.

Formal Requirements

A consensus protocol must satisfy three properties:

Property	Meaning
Agreement	All non-faulty processes decide on the same value
Validity	If all non-faulty processes propose the same value v, then any decided value must be v
Termination	Every non-faulty process eventually decides some value

The examiner will ask

Explain why termination is the property that makes consensus hard or even impossible in asynchronous systems. Hint: it requires detecting that other processes have finished, which needs some form of synchrony assumption.

Non-triviality

Besides the core properties, we also require non-triviality: for both y = 0 and y = 1, there exists some initial configuration and protocol execution that leads to y. This rules out trivial protocols that always decide 0 regardless of input.

Additional dependency requirements can refine the problem:

Strong unanimity: if all initial values x_i are the same, the consensus value y must equal x
Weak unanimity: same as strong, but only required when no runtime failures occur

4. Variants of Consensus

The consensus problem appears in several forms depending on the specific requirements [Fischer, 1983]. Understanding these variants helps identify which solution applies to a given practical setting.

The most basic form: achieving consensus on a single bit. There are n processors, possibly faulty. Each processor i has an initial bit x_i. The problem is for the non-faulty processes to agree on a bit y, called the consensus value. Solvability depends on whether a protocol exists such that all non-faulty processors terminate with the same value y.

Analogous to single-bit consensus, except the goal is for non-faulty processes to agree on a vector y. The dependency requirements are:

Strong: for each non-faulty i, y_i = x_i
Weak: same, but only if no failures occur during execution

This is consensus about what every processor thinks the initial values are.

Here, one specific processor (the general) tries to send its initial bit x to all others. All reliable processes must reach consensus on bit y with:

Strong: y = x if the general is not faulty
Weak: y = x if no failures occur during execution

This is also known as the reliable broadcast problem, where the general is called the transmitter.

In distributed databases [Dolev and Strong, 1982], all data manager processes that participated in a given transaction must agree on whether to install the transaction's results or to discard them. Discarding may be necessary if some data managers were unable to carry out the required processing. Whatever decision is made, all data managers must make the same decision to preserve database consistency.

This is the classic two-phase commit (2PC) problem — though 2PC can block under failures, which is why consensus-based approaches like Paxos improve upon it.

Unifying view

Many technical and real-world problems can be reduced to some version of the consensus problem. This abstract representation lets us reason about them in a general, synthetic way and provides general solutions.

5. Consensus & Fault Tolerance

Consensus is particularly important for providing a distributed system with some level of fault tolerance because it is essential to ensure the consistency of replicas.

When a crash makes it impossible to ensure replica synchronisation, a recovery process involves consensus among the many replicas about the state of each replica and the events handled or still to handle. In essence:

Replicas must maintain consistent state despite failures
Consensus provides the mechanism to agree on the current state and next actions
Without consensus, replicas diverge and the system becomes inconsistent

Fundamental tension

Reaching agreement is straightforward if the participating processes and the network are completely reliable. However, real systems face process crashes, network partitioning, lost, distorted, or duplicated messages — possibly including Byzantine failures where faulty processes go completely out of control or behave malevolently.

Any protocol can be overwhelmed by too frequent or severe faults. The best we can hope for is a protocol tolerant to a prescribed number of "expected" faults. Paxos, for example, tolerates up to f crash failures with n ≥ 2f + 1 processes.

6. Impossibility Results: FLP

Distributed computing is rich with impossibility results — "hundreds of impossibility results for distributed computing" [Fich and Ruppert, 2003]. These results are not just academic curiosities: they define the boundary of what is solvable and are essential for understanding the inherent difficulty of distributed problems.

The FLP Theorem

The most famous impossibility result in distributed computing is the FLP theorem (Fischer-Lynch-Paterson, 1985):

FLP Impossibility

In a fully asynchronous system, no consensus protocol can tolerate even one single crash failure under the mere requirement of non-triviality.

Let us be precise about what FLP does and does not say:

The message system is assumed to be reliable — it delivers all messages correctly and exactly once
No Byzantine failures are considered — only a clean crash (stop)
Yet, even under these best-case assumptions, a single crash at an inopportune time can cause any distributed commit protocol to fail to reach agreement
The theorem says that no deterministic protocol can solve consensus in an asynchronous system with even one crash failure

The examiner will ask

Explain the role of asynchrony in the FLP impossibility. Why does the theorem not apply to synchronous systems? Because in synchronous systems, timeouts allow failure detection, which enables the protocol to make progress despite crashes.

Intuition Behind FLP

The core insight is captured by the concept of bivalence. A state is bivalent if both outcomes (0 and 1) are still possible from that state. A state is univalent if only one outcome is possible.

FLP proves that:

Any consensus protocol must start in a bivalent state (otherwise the protocol would be trivial, always deciding the same value)
From any bivalent state, some step (receiving a message) leads to another bivalent state
If a process crashes, the remaining processes cannot distinguish between "crashed" and "slow" — so they might block waiting for a message that never arrives
This makes it impossible to guarantee termination (the third required property of consensus)

In essence: in an asynchronous system, a crashed process looks identical to a very slow one. The system can either wait forever (losing termination) or guess incorrectly (losing agreement). You cannot have both guarantees.

8. Making Impossible Possible

FLP tells us that consensus is impossible in a fully asynchronous system with even one crash. But real-world distributed systems do solve consensus every day. How?

Solutions to FLP

In practice, we accept that sometimes the system will not be available and mitigate this with timers and backoffs. In theory, we make weaker assumptions about synchrony:

Messages will eventually arrive within some (unknown) bound
Process speeds have some (unknown) maximum
The system is mostly synchronous enough for consensus to terminate

Key idea

The FLP impossibility does not mean we should give up. It defines the boundary of what is possible: we need to relax either the asynchrony assumption or the fault-tolerance requirement. Practical consensus algorithms like Paxos relax asynchrony by assuming eventual synchrony — the system may be asynchronous for arbitrary periods, but eventually becomes synchronous long enough to reach agreement.

This is the theme of Howard's (2016) talk "Distributed Consensus: Making Impossible Possible" — how to construct resilient distributed systems on top of unreliable components, starting from Lamport's work on organising parliament for a Greek island (Paxos), through to today's datacenters powering Google, Amazon, and Microsoft.

9. Paxos: Overview & Roles

Paxos [Lamport, 1998] is the most influential consensus algorithm. First presented in 1989 as a technical report by Leslie Lamport, it was recast in a more academically-viable form in 1998. It is used extensively inside Google (Chubby, Google File System, Cloud Bigtable) and many other large-scale systems.

Key Features

Feature	Detail
Fault tolerance	Tolerates up to `f` crash failures where `n` ≥ 2`f` + 1
Network	Completely connected network of `n` processes
Failure model	Crash failures and message loss; Byzantine failures excluded
Guarantees	Always ensures agreement and validity
Termination	Ensured only if there is a sufficiently long interval with no protocol restarts

Process Roles

Paxos defines three logical roles that a process may play:

Role	Responsibility
Proposer	Submits proposed values on behalf of clients. Initiates the consensus protocol.
Acceptor	Decides the candidate values for the final decision. Acceptors form a quorum — a bare majority must agree for a value to be chosen.
Learner	Collects information from the acceptors and reports the final decision back to the clients.

Why quorum?

Using a majority quorum (not all processes) is what gives Paxos its fault tolerance: up to f processes can crash, and the remaining n - f ≥ f + 1 can still form a majority. Any two majorities always intersect, ensuring that a value once chosen remains the only possible choice.

Basic Idea

Paxos is fundamentally a two-phase protocol:

Prepare phase: figure out whether any value has already been chosen
Accept phase: propose the appropriate value (either a previously chosen one or a new one)

The protocol uses monotonically increasing sequence numbers to label proposals, ensuring that later proposals can always override earlier ones if needed.

10. Paxos Phases: Prepare & Promise

A proposal sent by a proposer is a pair (v, n) where v is a value and n is a sequence number. To deal with possible crashes, multiple acceptors are selected for the proposal. The sequence number distinguishes successive attempts to invoke the protocol.

Phase 1: Prepare (Preparatory Phase)

Each proposer sends a proposal (v, n) to each acceptor
If n is the largest sequence number received by an acceptor so far, the acceptor sends an acknowledgement Promise(n, ⊥, ⊥) back to the proposer
This Promise is a commitment: the acceptor will ignore all proposals numbered lower than n
If the acceptor has already accepted a proposal with sequence number n' < n and value v, it responds with Promise(n, v, n') — signalling that the proposer should use v as the proposed value

sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P->>A1: Prepare(1)
    P->>A2: Prepare(1)
    P->>A3: Prepare(1)
    A1-->>P: Promise(1, null)
    A2-->>P: Promise(1, null)
    Note over P: Majority (2 of 3) promises received
    A3--xP: (crashed - no response)
    Note over P: Can proceed with 2/3 promises

The examiner will ask

Why does the acceptor need to report a previously accepted value v when responding to a Prepare? Because if a value v was already accepted by some majority, the new proposal must carry forward v to preserve the agreement guarantee — otherwise, two different majorities might choose different values.

11. Paxos Phases: Accept & Learn

Phase 2: Accept (Request for Acceptance)

If a proposer receives Promise(n, ⊥, ⊥) from a majority of acceptors, it sends Accept(v, n) to all acceptors, asking them to accept this value
If, however, an acceptor returned Promise(n, v, n') in Phase 1 (meaning it already accepted v), then the proposer must include the value v with the highest sequence number in its Accept request
An acceptor accepts a proposal (v, n) unless it has already promised to consider a proposal with sequence number n'' > n

sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P->>A1: Accept("X", 1)
    P->>A2: Accept("X", 1)
    P->>A3: Accept("X", 1)
    A1-->>P: Accepted("X", 1)
    A2-->>P: Accepted("X", 1)
    Note over P: Majority accepted! Value "X" is chosen.
    A3--xP: (crashed)

Phase 3: Learn (Final Decision)

When a majority of the acceptors accept a proposed value, it becomes the final decision value
The acceptors multicast the accepted value to the learners
Learners determine if a proposal has been accepted by a majority
Learners convey the decision to the client processes invoking the consensus

sequenceDiagram
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant L1 as Learner 1
    participant L2 as Learner 2
    participant C as Client
    A1->>L1: Accepted("X", 1)
    A1->>L2: Accepted("X", 1)
    A2->>L1: Accepted("X", 1)
    A2->>L2: Accepted("X", 1)
    Note over L1,L2: Majority detected: value "X"
    L1->>C: Decision: X
    L2->>C: Decision: X
    Note over C: Both learners report same value

Paxos Observations

An acceptor accepts a proposal (v, n) if it has not sent a Promise for any n' > n
If a proposer sends Accept(v, n) in Phase 2, then either:
- No acceptor in a majority has accepted a proposal with sequence n' < n, or
- v is the value of the highest-numbered proposal among all proposals with n' < n accepted by at least one acceptor in a majority

Livelock possibility

Theoretically, Paxos may suffer from livelock: if two proposers keep issuing Prepare messages with increasing sequence numbers before the other can complete, neither succeeds. In practice, leader election (choosing a single distinguished proposer) prevents this — this is what Multi-Paxos and Raft do.

13. State Machine Replication

State Machine Replication (SMR) [Schneider, 1990] is the general method for implementing fault-tolerant services in distributed systems. It connects directly to consensus:

Key insight

Replication relates to consensus because all (distributed) replicas of a resource need to agree on the same state over time. SMR shifts the focus from consensus over distributed state to consensus over distributed actions on state.

How SMR Works

The same state machine (deterministic) is replicated over a distributed system
Each replica has a consensus module
Any operation on any replicated machine is performed only if and when all machines have agreed on the same ordering of execution
The assumption of determinism is critical: given the same initial state and the same sequence of inputs in the same order, each replica produces the same outputs and arrives at the same final state

flowchart LR
    C1[Client 1] -->|Request| SM[State Machine
Replication Layer]
    C2[Client 2] -->|Request| SM
    SM -->|Consensus on order| CO[Consensus Module]
    CO -->|Ordered log| R1[Replica 1
State Machine]
    CO -->|Ordered log| R2[Replica 2
State Machine]
    CO -->|Ordered log| R3[Replica 3
State Machine]
    R1 -->|Same output| C1
    R2 -->|Same output| C1
    R3 -->|Same output| C1

In SMR, the consensus problem shifts from "what is the current value?" to "in what order should operations be applied?" — this is called total order broadcast or atomic broadcast. Once all replicas agree on the order of operations, they produce identical state transitions deterministically.

14. Multi-Paxos & Beyond

Multi-Paxos

Paxos as described so far solves consensus for a single value. But SMR needs consensus over an array of values — a log of operations. Multi-Paxos [Van Renesse and Altinbuken, 2015] extends basic Paxos in two ways:

One separate Paxos instance (indexed) is used for each entry of the array
A leader election mechanism reduces proposer conflict (dropping off iterations)

With a stable leader, the Prepare phase can be skipped for subsequent log entries (the leader is already known and trusted), making Multi-Paxos efficient in the common case — only the Accept phase runs.

Editor's note

Multi-Paxos is "not really well-specified overall" [Van Renesse and Altinbuken, 2015]; this fuzziness motivated the development of Raft, which provides a clearer specification with the same guarantees.

Other Paxos-Based Algorithms

Algorithm	Key contribution
Fast Paxos [Lamport, 2006]	Reduces message delay by allowing acceptors to propose values directly
ZooKeeper / ZAB [Reed and Junqueira, 2008]	ZooKeeper Atomic Broadcast — used by Apache ZooKeeper for coordination
Egalitarian Paxos [Moraru et al., 2013]	All nodes are equal (no distinguished leader) for better load distribution
Raft [Ongaro and Ousterhout, 2014]	Designed for understandability; uses leader election and log replication

Real-World Example: Chubby

Google's Chubby [Burrows, 2006] is a fault-tolerant distributed lock service that uses Paxos internally. It is used extensively inside Google systems as a reliable lock service and nameserver, powering Google File System (GFS), Cloud Bigtable, and other critical infrastructure. The engineering experience of building Chubby taught the industry many practical lessons about running Paxos in production [Chandra et al., 2007].

15. Lessons Learnt & Conclusion

General Lessons

Impossibility results define the space of solvable problems — knowing what cannot be done is as important as knowing what can
Formal models are essential to a sound engineering of distributed systems — reasoning about the relationship between formal models and practical instances provides the insight for technically-feasible solutions
This is not just for complex settings: even seemingly simple problems (single-bit consensus) reveal deep impossibility results

Specific Lessons

Faults are paradoxical: they are at the same time the sources of the problems of distributed systems, and the reason we need distributed systems
Consensus is a universal modelling tool: it is one of the most expressive and general ways to model relevant application problems in a distributed setting
The complexity of consensus both defies common sense and causes archetypal cases of impossibility (FLP)
Yet, protocols can be defined that actually solve the problem in meaningful cases with some relaxation of basic hypotheses — Paxos and its derivatives are the prime example

Takeaway

Consensus is at the heart of fault-tolerant distributed computing. The FLP impossibility tells us there is no free lunch: any practical solution must relax some assumption (synchrony, determinism, or failure model). Paxos does this by assuming eventual synchrony and tolerating only crash failures, and by guaranteeing safety under all conditions while liveness depends on sufficiently long periods of stability.

Check Your Understanding

Explain the difference between the synchronous and asynchronous time models. Why does this distinction matter for consensus?

In the synchronous model, there is a known bound Δ on message transmission delay and a known bound Φ on relative process speed. This allows accurate failure detection: if no response arrives within Δ + Φ time, the process can be assumed to have crashed. In the asynchronous model, no such bounds exist — a process can never distinguish between a crashed process and an extremely slow one. This distinction matters because FLP proves that consensus is impossible in the asynchronous model with even one crash failure, whereas synchronous systems can solve consensus (e.g., using timeouts).

State the FLP impossibility theorem precisely. What are its assumptions and what does it prove?

FLP Theorem (Fischer, Lynch, Paterson, 1985): In a fully asynchronous distributed system, no deterministic consensus protocol can tolerate even one crash failure under the non-triviality requirement. Assumptions: reliable message delivery (no loss, no duplication), no Byzantine failures, asynchronous communication (unbounded delays), processes can only fail by crashing. It proves that any protocol that guarantees agreement and validity must violate termination in some execution where a single process crashes at an inopportune moment. The proof uses the concept of bivalent states (where both outcomes are still possible) and shows that from any bivalent state, there is always a sequence of events leading to another bivalent state, preventing the protocol from ever deciding.

Describe the three roles in Paxos and their responsibilities. Why are these roles logical rather than physical?

The three logical roles are: Proposer — submits proposed values on behalf of clients, initiates the protocol; Acceptor — decides which values become candidates for the final decision, forms majority quorums; Learner — collects accepted values from acceptors and reports the final decision to clients. These are logical roles because a single physical process may play multiple roles simultaneously (e.g., a server might act as both acceptor and learner). This separation of concerns allows each role to be optimised independently and allows the protocol to be deployed flexibly.

Why does Paxos require n ≥ 2f + 1 processes to tolerate f crash failures?

Paxos uses majority quorums. To tolerate f crash failures, there must be n - f processes still alive. For any two quorums to intersect (which is necessary for correctness — a value chosen by one quorum must be known to the next), we need 2(n - f) > n, which simplifies to n ≥ 2f + 1. Intuitively, each majority quorum has size ⌊n/2⌋ + 1, and with n ≥ 2f + 1, any two majorities always share at least one non-faulty process.

Explain the relationship between consensus and State Machine Replication (SMR).

SMR is a general method for building fault-tolerant services. A deterministic state machine is replicated across multiple servers. The key challenge is ensuring that all replicas process the same sequence of operations in the same order. This is precisely where consensus comes in: replicas use a consensus protocol to agree on the next operation to execute (or, equivalently, on the ordering of operations in a log). Consensus over single values is generalised to consensus over an ordered log of values via protocols like Multi-Paxos or Raft. Once all replicas agree on the order, determinism ensures they all reach the same final state.

What is the difference between basic Paxos and Multi-Paxos? How does leader election help?

Basic Paxos achieves consensus on a single value. Multi-Paxos extends this to an ordered sequence of values (a log) by running one Paxos instance per log entry. Multi-Paxos adds leader election: a distinguished proposer is chosen, which eliminates the possibility of conflicting proposals. With a stable leader, the Prepare phase can be skipped for subsequent log entries (since the leader is already known and trusted by a majority), reducing the protocol to a single message round-trip per entry. This makes Multi-Paxos efficient in the common case while preserving the same safety guarantees as basic Paxos.

In this lesson

1. Agreement Problems in Distributed Systems

Examples of Agreement Problems

2. Basic Ontology: Time & Failure Models

Components & Connectors

Time Model

Failure Model

3. The Consensus Problem

Why Consensus?

Formal Requirements

Non-triviality

4. Variants of Consensus

5. Consensus & Fault Tolerance

6. Impossibility Results: FLP

The FLP Theorem

Intuition Behind FLP

7. Interactive: FLP State Explorer

8. Making Impossible Possible

Solutions to FLP

9. Paxos: Overview & Roles

Key Features

Process Roles

Basic Idea

10. Paxos Phases: Prepare & Promise

Phase 1: Prepare (Preparatory Phase)

11. Paxos Phases: Accept & Learn

Phase 2: Accept (Request for Acceptance)

Phase 3: Learn (Final Decision)

Paxos Observations

12. Interactive: Paxos Consensus Stepper

13. State Machine Replication

How SMR Works

14. Multi-Paxos & Beyond

Multi-Paxos

Other Paxos-Based Algorithms

Real-World Example: Chubby

15. Lessons Learnt & Conclusion

General Lessons

Specific Lessons

Check Your Understanding