Distributed Systems — Prof. Omicini

The Problem of Consensus in Distributed Systems

The Problem of Consensus in Distributed Systems — A.Y. 2025/2026

In this lesson

1. Agreement Problems in Distributed Systems

Distributed systems are composed of multiple components that interact to achieve a common goal. However, each component has a partial and potentially divergent view of the system state: processes may observe events in different orders, communication delays create uncertainty, and failures introduce conflicting information.

The agreement problem captures this challenge: when the different components of a system have potentially divergent views over the system state, and about what is happening in the system, a coherent overall system behaviour can be achieved only when all the components come to agree about everything relevant for the system itself — state, events, and the ordering of events. The key point is that they reach the very same conclusion, independently of what that conclusion is about.

Key idea

Agreement is about convergence to a shared understanding. It does not matter which specific value the system settles on — what matters is that all correct processes settle on the same value.

Examples of Agreement Problems

Agreement problems appear in many practical scenarios [Fischer, 1983]:

DomainAgreement needed
Distributed databasesData managers must agree on whether to commit or abort a distributed transaction
Replicated file systemsNodes must agree on where file copies reside
Flight control systemsEngine control and flight surface modules must agree on continue/abort for landing
Blockchain / ledgersValidators must agree on the next block and its ordering
Clock synchronisationProcesses must agree on the current time despite drift

What makes these problems non-trivial is the presence of faults: processes may crash, messages may be lost or delayed, and the system may be subject to unpredictable load. An agreement protocol must work correctly despite these faults — this is the essence of fault-tolerant distributed computing.

2. Basic Ontology: Time & Failure Models

To reason precisely about consensus, we need a shared vocabulary for the key dimensions of distributed systems.

Components & Connectors

Non-trivial computational systems are concurrent or distributed — systems where more than one computation occurs either logically or physically at the same time. These systems are modelled as collections of computing (logical processes) or (physical processors) interconnected via communication channels: either message-passing (one-to-one direct channels) or shared-memory (shared communication channels). The literature often uses processes and processors interchangeably whenever the physical nature does not affect the problem.

Time Model

The time model determines what a process can infer about the state of other processes:

ModelPropertiesImplication
SynchronousBound Δ on message delay; bound Φ on relative process speedAccurate failure detection is possible: if no response within Δ + Φ, the process has likely crashed
AsynchronousNo bounds on message delays or process speedsImpossible to distinguish a crashed process from an extremely slow one

The asynchronous model is the most general and realistic for Internet-scale systems — networks can experience congestion, and processes can be slowed by load. However, it is also the model in which consensus becomes impossible (as FLP shows).

Crucial distinction

In an asynchronous system, you can never be certain whether a process has crashed or is merely slow. This single fact is the root of the FLP impossibility result.

Failure Model

Processes (or processors) may fail in diverse ways, and communication links may fail too. A fault-tolerant system is one that keeps working beyond faults of any sort. The key failure types relevant to consensus:

Paxos (and most practical consensus algorithms) assumes crash failures only — Byzantine failures require more heavyweight protocols.

3. The Consensus Problem

Consensus is the process by which we reach agreement over system state between unreliable machines connected by (possibly) asynchronous networks. It is the most fundamental agreement problem in distributed systems.

Why Consensus?

When processes fail, the reliable processes need to agree on:

This ensures that the current system state and its evolution are both consistent across all non-faulty replicas. For example, when a crash makes replica synchronisation impossible, a recovery process requires consensus among the surviving replicas about the state of each replica and the events handled or still to handle.

Formal Requirements

A consensus protocol must satisfy three properties:

PropertyMeaning
AgreementAll non-faulty processes decide on the same value
ValidityIf all non-faulty processes propose the same value v, then any decided value must be v
TerminationEvery non-faulty process eventually decides some value
The examiner will ask

Explain why termination is the property that makes consensus hard or even impossible in asynchronous systems. Hint: it requires detecting that other processes have finished, which needs some form of synchrony assumption.

Non-triviality

Besides the core properties, we also require non-triviality: for both y = 0 and y = 1, there exists some initial configuration and protocol execution that leads to y. This rules out trivial protocols that always decide 0 regardless of input.

Additional dependency requirements can refine the problem:

4. Variants of Consensus

The consensus problem appears in several forms depending on the specific requirements [Fischer, 1983]. Understanding these variants helps identify which solution applies to a given practical setting.

The most basic form: achieving consensus on a single bit. There are n processors, possibly faulty. Each processor i has an initial bit xi. The problem is for the non-faulty processes to agree on a bit y, called the consensus value. Solvability depends on whether a protocol exists such that all non-faulty processors terminate with the same value y.

Analogous to single-bit consensus, except the goal is for non-faulty processes to agree on a vector y. The dependency requirements are:

  • Strong: for each non-faulty i, yi = xi
  • Weak: same, but only if no failures occur during execution

This is consensus about what every processor thinks the initial values are.

Here, one specific processor (the general) tries to send its initial bit x to all others. All reliable processes must reach consensus on bit y with:

  • Strong: y = x if the general is not faulty
  • Weak: y = x if no failures occur during execution

This is also known as the reliable broadcast problem, where the general is called the transmitter.

In distributed databases [Dolev and Strong, 1982], all data manager processes that participated in a given transaction must agree on whether to install the transaction's results or to discard them. Discarding may be necessary if some data managers were unable to carry out the required processing. Whatever decision is made, all data managers must make the same decision to preserve database consistency.

This is the classic two-phase commit (2PC) problem — though 2PC can block under failures, which is why consensus-based approaches like Paxos improve upon it.

Unifying view

Many technical and real-world problems can be reduced to some version of the consensus problem. This abstract representation lets us reason about them in a general, synthetic way and provides general solutions.

5. Consensus & Fault Tolerance

Consensus is particularly important for providing a distributed system with some level of fault tolerance because it is essential to ensure the consistency of replicas.

When a crash makes it impossible to ensure replica synchronisation, a recovery process involves consensus among the many replicas about the state of each replica and the events handled or still to handle. In essence:

Fundamental tension

Reaching agreement is straightforward if the participating processes and the network are completely reliable. However, real systems face process crashes, network partitioning, lost, distorted, or duplicated messages — possibly including Byzantine failures where faulty processes go completely out of control or behave malevolently.

Any protocol can be overwhelmed by too frequent or severe faults. The best we can hope for is a protocol tolerant to a prescribed number of "expected" faults. Paxos, for example, tolerates up to f crash failures with n ≥ 2f + 1 processes.

6. Impossibility Results: FLP

Distributed computing is rich with impossibility results — "hundreds of impossibility results for distributed computing" [Fich and Ruppert, 2003]. These results are not just academic curiosities: they define the boundary of what is solvable and are essential for understanding the inherent difficulty of distributed problems.

The FLP Theorem

The most famous impossibility result in distributed computing is the FLP theorem (Fischer-Lynch-Paterson, 1985):

FLP Impossibility

In a fully asynchronous system, no consensus protocol can tolerate even one single crash failure under the mere requirement of non-triviality.

Let us be precise about what FLP does and does not say:

The examiner will ask

Explain the role of asynchrony in the FLP impossibility. Why does the theorem not apply to synchronous systems? Because in synchronous systems, timeouts allow failure detection, which enables the protocol to make progress despite crashes.

Intuition Behind FLP

The core insight is captured by the concept of bivalence. A state is bivalent if both outcomes (0 and 1) are still possible from that state. A state is univalent if only one outcome is possible.

FLP proves that:

  1. Any consensus protocol must start in a bivalent state (otherwise the protocol would be trivial, always deciding the same value)
  2. From any bivalent state, some step (receiving a message) leads to another bivalent state
  3. If a process crashes, the remaining processes cannot distinguish between "crashed" and "slow" — so they might block waiting for a message that never arrives
  4. This makes it impossible to guarantee termination (the third required property of consensus)

In essence: in an asynchronous system, a crashed process looks identical to a very slow one. The system can either wait forever (losing termination) or guess incorrectly (losing agreement). You cannot have both guarantees.

7. Interactive: FLP State Explorer

Explore the state space of a consensus protocol in an asynchronous system. The FLP theorem shows that once a process crashes, the system either blocks indefinitely (cannot tell crash from delay) or must guess — and a wrong guess leads to inconsistency. Click states to explore the transitions.

Editor's note

This state explorer abstracts the FLP proof. The key observation is the CRASH_OCCURSTIMEOUT_WAIT transition: once a process is suspected of crashing in an asynchronous system, the remaining processes can never be certain, which forces the protocol into deadlock unless it can fall back on some synchrony assumption.

8. Making Impossible Possible

FLP tells us that consensus is impossible in a fully asynchronous system with even one crash. But real-world distributed systems do solve consensus every day. How?

Solutions to FLP

In practice, we accept that sometimes the system will not be available and mitigate this with timers and backoffs. In theory, we make weaker assumptions about synchrony:

Key idea

The FLP impossibility does not mean we should give up. It defines the boundary of what is possible: we need to relax either the asynchrony assumption or the fault-tolerance requirement. Practical consensus algorithms like Paxos relax asynchrony by assuming eventual synchrony — the system may be asynchronous for arbitrary periods, but eventually becomes synchronous long enough to reach agreement.

This is the theme of Howard's (2016) talk "Distributed Consensus: Making Impossible Possible" — how to construct resilient distributed systems on top of unreliable components, starting from Lamport's work on organising parliament for a Greek island (Paxos), through to today's datacenters powering Google, Amazon, and Microsoft.

9. Paxos: Overview & Roles

Paxos [Lamport, 1998] is the most influential consensus algorithm. First presented in 1989 as a technical report by Leslie Lamport, it was recast in a more academically-viable form in 1998. It is used extensively inside Google (Chubby, Google File System, Cloud Bigtable) and many other large-scale systems.

Key Features

FeatureDetail
Fault toleranceTolerates up to f crash failures where n ≥ 2f + 1
NetworkCompletely connected network of n processes
Failure modelCrash failures and message loss; Byzantine failures excluded
GuaranteesAlways ensures agreement and validity
TerminationEnsured only if there is a sufficiently long interval with no protocol restarts

Process Roles

Paxos defines three logical roles that a process may play:

RoleResponsibility
ProposerSubmits proposed values on behalf of clients. Initiates the consensus protocol.
AcceptorDecides the candidate values for the final decision. Acceptors form a quorum — a bare majority must agree for a value to be chosen.
LearnerCollects information from the acceptors and reports the final decision back to the clients.
Why quorum?

Using a majority quorum (not all processes) is what gives Paxos its fault tolerance: up to f processes can crash, and the remaining n - ff + 1 can still form a majority. Any two majorities always intersect, ensuring that a value once chosen remains the only possible choice.

Basic Idea

Paxos is fundamentally a two-phase protocol:

  1. Prepare phase: figure out whether any value has already been chosen
  2. Accept phase: propose the appropriate value (either a previously chosen one or a new one)

The protocol uses monotonically increasing sequence numbers to label proposals, ensuring that later proposals can always override earlier ones if needed.

10. Paxos Phases: Prepare & Promise

A proposal sent by a proposer is a pair (v, n) where v is a value and n is a sequence number. To deal with possible crashes, multiple acceptors are selected for the proposal. The sequence number distinguishes successive attempts to invoke the protocol.

Phase 1: Prepare (Preparatory Phase)

  1. Each proposer sends a proposal (v, n) to each acceptor
  2. If n is the largest sequence number received by an acceptor so far, the acceptor sends an acknowledgement Promise(n, ⊥, ⊥) back to the proposer
  3. This Promise is a commitment: the acceptor will ignore all proposals numbered lower than n
  4. If the acceptor has already accepted a proposal with sequence number n' < n and value v, it responds with Promise(n, v, n') — signalling that the proposer should use v as the proposed value
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P->>A1: Prepare(1)
    P->>A2: Prepare(1)
    P->>A3: Prepare(1)
    A1-->>P: Promise(1, null)
    A2-->>P: Promise(1, null)
    Note over P: Majority (2 of 3) promises received
    A3--xP: (crashed - no response)
    Note over P: Can proceed with 2/3 promises
    
The examiner will ask

Why does the acceptor need to report a previously accepted value v when responding to a Prepare? Because if a value v was already accepted by some majority, the new proposal must carry forward v to preserve the agreement guarantee — otherwise, two different majorities might choose different values.

11. Paxos Phases: Accept & Learn

Phase 2: Accept (Request for Acceptance)

  1. If a proposer receives Promise(n, ⊥, ⊥) from a majority of acceptors, it sends Accept(v, n) to all acceptors, asking them to accept this value
  2. If, however, an acceptor returned Promise(n, v, n') in Phase 1 (meaning it already accepted v), then the proposer must include the value v with the highest sequence number in its Accept request
  3. An acceptor accepts a proposal (v, n) unless it has already promised to consider a proposal with sequence number n'' > n
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P->>A1: Accept("X", 1)
    P->>A2: Accept("X", 1)
    P->>A3: Accept("X", 1)
    A1-->>P: Accepted("X", 1)
    A2-->>P: Accepted("X", 1)
    Note over P: Majority accepted! Value "X" is chosen.
    A3--xP: (crashed)
    

Phase 3: Learn (Final Decision)

  1. When a majority of the acceptors accept a proposed value, it becomes the final decision value
  2. The acceptors multicast the accepted value to the learners
  3. Learners determine if a proposal has been accepted by a majority
  4. Learners convey the decision to the client processes invoking the consensus
sequenceDiagram
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant L1 as Learner 1
    participant L2 as Learner 2
    participant C as Client
    A1->>L1: Accepted("X", 1)
    A1->>L2: Accepted("X", 1)
    A2->>L1: Accepted("X", 1)
    A2->>L2: Accepted("X", 1)
    Note over L1,L2: Majority detected: value "X"
    L1->>C: Decision: X
    L2->>C: Decision: X
    Note over C: Both learners report same value
    

Paxos Observations

Livelock possibility

Theoretically, Paxos may suffer from livelock: if two proposers keep issuing Prepare messages with increasing sequence numbers before the other can complete, neither succeeds. In practice, leader election (choosing a single distinguished proposer) prevents this — this is what Multi-Paxos and Raft do.

12. Interactive: Paxos Consensus Stepper

Step through a complete Paxos consensus round. With 3 acceptors (one simulated as crashed), you'll need at least 2 promises and 2 accepted responses to constitute a majority. Click each process button to advance its step.

13. State Machine Replication

State Machine Replication (SMR) [Schneider, 1990] is the general method for implementing fault-tolerant services in distributed systems. It connects directly to consensus:

Key insight

Replication relates to consensus because all (distributed) replicas of a resource need to agree on the same state over time. SMR shifts the focus from consensus over distributed state to consensus over distributed actions on state.

How SMR Works

  1. The same state machine (deterministic) is replicated over a distributed system
  2. Each replica has a consensus module
  3. Any operation on any replicated machine is performed only if and when all machines have agreed on the same ordering of execution
  4. The assumption of determinism is critical: given the same initial state and the same sequence of inputs in the same order, each replica produces the same outputs and arrives at the same final state
flowchart LR
    C1[Client 1] -->|Request| SM[State Machine
Replication Layer] C2[Client 2] -->|Request| SM SM -->|Consensus on order| CO[Consensus Module] CO -->|Ordered log| R1[Replica 1
State Machine] CO -->|Ordered log| R2[Replica 2
State Machine] CO -->|Ordered log| R3[Replica 3
State Machine] R1 -->|Same output| C1 R2 -->|Same output| C1 R3 -->|Same output| C1

In SMR, the consensus problem shifts from "what is the current value?" to "in what order should operations be applied?" — this is called total order broadcast or atomic broadcast. Once all replicas agree on the order of operations, they produce identical state transitions deterministically.

14. Multi-Paxos & Beyond

Multi-Paxos

Paxos as described so far solves consensus for a single value. But SMR needs consensus over an array of values — a log of operations. Multi-Paxos [Van Renesse and Altinbuken, 2015] extends basic Paxos in two ways:

With a stable leader, the Prepare phase can be skipped for subsequent log entries (the leader is already known and trusted), making Multi-Paxos efficient in the common case — only the Accept phase runs.

Editor's note

Multi-Paxos is "not really well-specified overall" [Van Renesse and Altinbuken, 2015]; this fuzziness motivated the development of Raft, which provides a clearer specification with the same guarantees.

Other Paxos-Based Algorithms

AlgorithmKey contribution
Fast Paxos [Lamport, 2006]Reduces message delay by allowing acceptors to propose values directly
ZooKeeper / ZAB [Reed and Junqueira, 2008]ZooKeeper Atomic Broadcast — used by Apache ZooKeeper for coordination
Egalitarian Paxos [Moraru et al., 2013]All nodes are equal (no distinguished leader) for better load distribution
Raft [Ongaro and Ousterhout, 2014]Designed for understandability; uses leader election and log replication

Real-World Example: Chubby

Google's Chubby [Burrows, 2006] is a fault-tolerant distributed lock service that uses Paxos internally. It is used extensively inside Google systems as a reliable lock service and nameserver, powering Google File System (GFS), Cloud Bigtable, and other critical infrastructure. The engineering experience of building Chubby taught the industry many practical lessons about running Paxos in production [Chandra et al., 2007].

15. Lessons Learnt & Conclusion

General Lessons

Specific Lessons

Takeaway

Consensus is at the heart of fault-tolerant distributed computing. The FLP impossibility tells us there is no free lunch: any practical solution must relax some assumption (synchrony, determinism, or failure model). Paxos does this by assuming eventual synchrony and tolerating only crash failures, and by guaranteeing safety under all conditions while liveness depends on sufficiently long periods of stability.

Check Your Understanding

Explain the difference between the synchronous and asynchronous time models. Why does this distinction matter for consensus?

In the synchronous model, there is a known bound Δ on message transmission delay and a known bound Φ on relative process speed. This allows accurate failure detection: if no response arrives within Δ + Φ time, the process can be assumed to have crashed. In the asynchronous model, no such bounds exist — a process can never distinguish between a crashed process and an extremely slow one. This distinction matters because FLP proves that consensus is impossible in the asynchronous model with even one crash failure, whereas synchronous systems can solve consensus (e.g., using timeouts).

State the FLP impossibility theorem precisely. What are its assumptions and what does it prove?

FLP Theorem (Fischer, Lynch, Paterson, 1985): In a fully asynchronous distributed system, no deterministic consensus protocol can tolerate even one crash failure under the non-triviality requirement. Assumptions: reliable message delivery (no loss, no duplication), no Byzantine failures, asynchronous communication (unbounded delays), processes can only fail by crashing. It proves that any protocol that guarantees agreement and validity must violate termination in some execution where a single process crashes at an inopportune moment. The proof uses the concept of bivalent states (where both outcomes are still possible) and shows that from any bivalent state, there is always a sequence of events leading to another bivalent state, preventing the protocol from ever deciding.

Describe the three roles in Paxos and their responsibilities. Why are these roles logical rather than physical?

The three logical roles are: Proposer — submits proposed values on behalf of clients, initiates the protocol; Acceptor — decides which values become candidates for the final decision, forms majority quorums; Learner — collects accepted values from acceptors and reports the final decision to clients. These are logical roles because a single physical process may play multiple roles simultaneously (e.g., a server might act as both acceptor and learner). This separation of concerns allows each role to be optimised independently and allows the protocol to be deployed flexibly.

Why does Paxos require n ≥ 2f + 1 processes to tolerate f crash failures?

Paxos uses majority quorums. To tolerate f crash failures, there must be n - f processes still alive. For any two quorums to intersect (which is necessary for correctness — a value chosen by one quorum must be known to the next), we need 2(n - f) > n, which simplifies to n ≥ 2f + 1. Intuitively, each majority quorum has size ⌊n/2⌋ + 1, and with n ≥ 2f + 1, any two majorities always share at least one non-faulty process.

Explain the relationship between consensus and State Machine Replication (SMR).

SMR is a general method for building fault-tolerant services. A deterministic state machine is replicated across multiple servers. The key challenge is ensuring that all replicas process the same sequence of operations in the same order. This is precisely where consensus comes in: replicas use a consensus protocol to agree on the next operation to execute (or, equivalently, on the ordering of operations in a log). Consensus over single values is generalised to consensus over an ordered log of values via protocols like Multi-Paxos or Raft. Once all replicas agree on the order, determinism ensures they all reach the same final state.

What is the difference between basic Paxos and Multi-Paxos? How does leader election help?

Basic Paxos achieves consensus on a single value. Multi-Paxos extends this to an ordered sequence of values (a log) by running one Paxos instance per log entry. Multi-Paxos adds leader election: a distinguished proposer is chosen, which eliminates the possibility of conflicting proposals. With a stable leader, the Prepare phase can be skipped for subsequent log entries (since the leader is already known and trusted by a majority), reducing the protocol to a single message round-trip per entry. This makes Multi-Paxos efficient in the common case while preserving the same safety guarantees as basic Paxos.