Distributed systems are composed of multiple components that interact to achieve a common goal. However, each component has a partial and potentially divergent view of the system state: processes may observe events in different orders, communication delays create uncertainty, and failures introduce conflicting information.
The agreement problem captures this challenge: when the different components of a system have potentially divergent views over the system state, and about what is happening in the system, a coherent overall system behaviour can be achieved only when all the components come to agree about everything relevant for the system itself — state, events, and the ordering of events. The key point is that they reach the very same conclusion, independently of what that conclusion is about.
Agreement is about convergence to a shared understanding. It does not matter which specific value the system settles on — what matters is that all correct processes settle on the same value.
Agreement problems appear in many practical scenarios [Fischer, 1983]:
| Domain | Agreement needed |
|---|---|
| Distributed databases | Data managers must agree on whether to commit or abort a distributed transaction |
| Replicated file systems | Nodes must agree on where file copies reside |
| Flight control systems | Engine control and flight surface modules must agree on continue/abort for landing |
| Blockchain / ledgers | Validators must agree on the next block and its ordering |
| Clock synchronisation | Processes must agree on the current time despite drift |
What makes these problems non-trivial is the presence of faults: processes may crash, messages may be lost or delayed, and the system may be subject to unpredictable load. An agreement protocol must work correctly despite these faults — this is the essence of fault-tolerant distributed computing.
To reason precisely about consensus, we need a shared vocabulary for the key dimensions of distributed systems.
Non-trivial computational systems are concurrent or distributed — systems where more than one computation occurs either logically or physically at the same time. These systems are modelled as collections of computing (logical processes) or (physical processors) interconnected via communication channels: either message-passing (one-to-one direct channels) or shared-memory (shared communication channels). The literature often uses processes and processors interchangeably whenever the physical nature does not affect the problem.
The time model determines what a process can infer about the state of other processes:
| Model | Properties | Implication |
|---|---|---|
| Synchronous | Bound Δ on message delay; bound Φ on relative process speed | Accurate failure detection is possible: if no response within Δ + Φ, the process has likely crashed |
| Asynchronous | No bounds on message delays or process speeds | Impossible to distinguish a crashed process from an extremely slow one |
The asynchronous model is the most general and realistic for Internet-scale systems — networks can experience congestion, and processes can be slowed by load. However, it is also the model in which consensus becomes impossible (as FLP shows).
In an asynchronous system, you can never be certain whether a process has crashed or is merely slow. This single fact is the root of the FLP impossibility result.
Processes (or processors) may fail in diverse ways, and communication links may fail too. A fault-tolerant system is one that keeps working beyond faults of any sort. The key failure types relevant to consensus:
Paxos (and most practical consensus algorithms) assumes crash failures only — Byzantine failures require more heavyweight protocols.
Consensus is the process by which we reach agreement over system state between unreliable machines connected by (possibly) asynchronous networks. It is the most fundamental agreement problem in distributed systems.
When processes fail, the reliable processes need to agree on:
This ensures that the current system state and its evolution are both consistent across all non-faulty replicas. For example, when a crash makes replica synchronisation impossible, a recovery process requires consensus among the surviving replicas about the state of each replica and the events handled or still to handle.
A consensus protocol must satisfy three properties:
| Property | Meaning |
|---|---|
| Agreement | All non-faulty processes decide on the same value |
| Validity | If all non-faulty processes propose the same value v, then any decided value must be v |
| Termination | Every non-faulty process eventually decides some value |
Explain why termination is the property that makes consensus hard or even impossible in asynchronous systems. Hint: it requires detecting that other processes have finished, which needs some form of synchrony assumption.
Besides the core properties, we also require non-triviality: for both y = 0 and y = 1, there exists some initial configuration and protocol execution that leads to y. This rules out trivial protocols that always decide 0 regardless of input.
Additional dependency requirements can refine the problem:
The consensus problem appears in several forms depending on the specific requirements [Fischer, 1983]. Understanding these variants helps identify which solution applies to a given practical setting.
The most basic form: achieving consensus on a single bit. There are n processors, possibly faulty. Each processor i has an initial bit xi. The problem is for the non-faulty processes to agree on a bit y, called the consensus value. Solvability depends on whether a protocol exists such that all non-faulty processors terminate with the same value y.
Analogous to single-bit consensus, except the goal is for non-faulty processes to agree on a vector y. The dependency requirements are:
This is consensus about what every processor thinks the initial values are.
Here, one specific processor (the general) tries to send its initial bit x to all others. All reliable processes must reach consensus on bit y with:
This is also known as the reliable broadcast problem, where the general is called the transmitter.
In distributed databases [Dolev and Strong, 1982], all data manager processes that participated in a given transaction must agree on whether to install the transaction's results or to discard them. Discarding may be necessary if some data managers were unable to carry out the required processing. Whatever decision is made, all data managers must make the same decision to preserve database consistency.
This is the classic two-phase commit (2PC) problem — though 2PC can block under failures, which is why consensus-based approaches like Paxos improve upon it.
Many technical and real-world problems can be reduced to some version of the consensus problem. This abstract representation lets us reason about them in a general, synthetic way and provides general solutions.
Consensus is particularly important for providing a distributed system with some level of fault tolerance because it is essential to ensure the consistency of replicas.
When a crash makes it impossible to ensure replica synchronisation, a recovery process involves consensus among the many replicas about the state of each replica and the events handled or still to handle. In essence:
Reaching agreement is straightforward if the participating processes and the network are completely reliable. However, real systems face process crashes, network partitioning, lost, distorted, or duplicated messages — possibly including Byzantine failures where faulty processes go completely out of control or behave malevolently.
Any protocol can be overwhelmed by too frequent or severe faults. The best we can hope for is a protocol tolerant to a prescribed number of "expected" faults. Paxos, for example, tolerates up to f crash failures with n ≥ 2f + 1 processes.
Distributed computing is rich with impossibility results — "hundreds of impossibility results for distributed computing" [Fich and Ruppert, 2003]. These results are not just academic curiosities: they define the boundary of what is solvable and are essential for understanding the inherent difficulty of distributed problems.
The most famous impossibility result in distributed computing is the FLP theorem (Fischer-Lynch-Paterson, 1985):
In a fully asynchronous system, no consensus protocol can tolerate even one single crash failure under the mere requirement of non-triviality.
Let us be precise about what FLP does and does not say:
Explain the role of asynchrony in the FLP impossibility. Why does the theorem not apply to synchronous systems? Because in synchronous systems, timeouts allow failure detection, which enables the protocol to make progress despite crashes.
The core insight is captured by the concept of bivalence. A state is bivalent if both outcomes (0 and 1) are still possible from that state. A state is univalent if only one outcome is possible.
FLP proves that:
In essence: in an asynchronous system, a crashed process looks identical to a very slow one. The system can either wait forever (losing termination) or guess incorrectly (losing agreement). You cannot have both guarantees.
Explore the state space of a consensus protocol in an asynchronous system. The FLP theorem shows that once a process crashes, the system either blocks indefinitely (cannot tell crash from delay) or must guess — and a wrong guess leads to inconsistency. Click states to explore the transitions.
This state explorer abstracts the FLP proof. The key observation is the CRASH_OCCURS → TIMEOUT_WAIT transition: once a process is suspected of crashing in an asynchronous system, the remaining processes can never be certain, which forces the protocol into deadlock unless it can fall back on some synchrony assumption.
FLP tells us that consensus is impossible in a fully asynchronous system with even one crash. But real-world distributed systems do solve consensus every day. How?
In practice, we accept that sometimes the system will not be available and mitigate this with timers and backoffs. In theory, we make weaker assumptions about synchrony:
The FLP impossibility does not mean we should give up. It defines the boundary of what is possible: we need to relax either the asynchrony assumption or the fault-tolerance requirement. Practical consensus algorithms like Paxos relax asynchrony by assuming eventual synchrony — the system may be asynchronous for arbitrary periods, but eventually becomes synchronous long enough to reach agreement.
This is the theme of Howard's (2016) talk "Distributed Consensus: Making Impossible Possible" — how to construct resilient distributed systems on top of unreliable components, starting from Lamport's work on organising parliament for a Greek island (Paxos), through to today's datacenters powering Google, Amazon, and Microsoft.
Paxos [Lamport, 1998] is the most influential consensus algorithm. First presented in 1989 as a technical report by Leslie Lamport, it was recast in a more academically-viable form in 1998. It is used extensively inside Google (Chubby, Google File System, Cloud Bigtable) and many other large-scale systems.
| Feature | Detail |
|---|---|
| Fault tolerance | Tolerates up to f crash failures where n ≥ 2f + 1 |
| Network | Completely connected network of n processes |
| Failure model | Crash failures and message loss; Byzantine failures excluded |
| Guarantees | Always ensures agreement and validity |
| Termination | Ensured only if there is a sufficiently long interval with no protocol restarts |
Paxos defines three logical roles that a process may play:
| Role | Responsibility |
|---|---|
| Proposer | Submits proposed values on behalf of clients. Initiates the consensus protocol. |
| Acceptor | Decides the candidate values for the final decision. Acceptors form a quorum — a bare majority must agree for a value to be chosen. |
| Learner | Collects information from the acceptors and reports the final decision back to the clients. |
Using a majority quorum (not all processes) is what gives Paxos its fault tolerance: up to f processes can crash, and the remaining n - f ≥ f + 1 can still form a majority. Any two majorities always intersect, ensuring that a value once chosen remains the only possible choice.
Paxos is fundamentally a two-phase protocol:
The protocol uses monotonically increasing sequence numbers to label proposals, ensuring that later proposals can always override earlier ones if needed.
A proposal sent by a proposer is a pair (v, n) where v is a value and n is a sequence number. To deal with possible crashes, multiple acceptors are selected for the proposal. The sequence number distinguishes successive attempts to invoke the protocol.
sequenceDiagram
participant P as Proposer
participant A1 as Acceptor 1
participant A2 as Acceptor 2
participant A3 as Acceptor 3
P->>A1: Prepare(1)
P->>A2: Prepare(1)
P->>A3: Prepare(1)
A1-->>P: Promise(1, null)
A2-->>P: Promise(1, null)
Note over P: Majority (2 of 3) promises received
A3--xP: (crashed - no response)
Note over P: Can proceed with 2/3 promises
Why does the acceptor need to report a previously accepted value v when responding to a Prepare? Because if a value v was already accepted by some majority, the new proposal must carry forward v to preserve the agreement guarantee — otherwise, two different majorities might choose different values.
sequenceDiagram
participant P as Proposer
participant A1 as Acceptor 1
participant A2 as Acceptor 2
participant A3 as Acceptor 3
P->>A1: Accept("X", 1)
P->>A2: Accept("X", 1)
P->>A3: Accept("X", 1)
A1-->>P: Accepted("X", 1)
A2-->>P: Accepted("X", 1)
Note over P: Majority accepted! Value "X" is chosen.
A3--xP: (crashed)
sequenceDiagram
participant A1 as Acceptor 1
participant A2 as Acceptor 2
participant L1 as Learner 1
participant L2 as Learner 2
participant C as Client
A1->>L1: Accepted("X", 1)
A1->>L2: Accepted("X", 1)
A2->>L1: Accepted("X", 1)
A2->>L2: Accepted("X", 1)
Note over L1,L2: Majority detected: value "X"
L1->>C: Decision: X
L2->>C: Decision: X
Note over C: Both learners report same value
Theoretically, Paxos may suffer from livelock: if two proposers keep issuing Prepare messages with increasing sequence numbers before the other can complete, neither succeeds. In practice, leader election (choosing a single distinguished proposer) prevents this — this is what Multi-Paxos and Raft do.
Step through a complete Paxos consensus round. With 3 acceptors (one simulated as crashed), you'll need at least 2 promises and 2 accepted responses to constitute a majority. Click each process button to advance its step.
State Machine Replication (SMR) [Schneider, 1990] is the general method for implementing fault-tolerant services in distributed systems. It connects directly to consensus:
Replication relates to consensus because all (distributed) replicas of a resource need to agree on the same state over time. SMR shifts the focus from consensus over distributed state to consensus over distributed actions on state.
flowchart LR
C1[Client 1] -->|Request| SM[State Machine
Replication Layer]
C2[Client 2] -->|Request| SM
SM -->|Consensus on order| CO[Consensus Module]
CO -->|Ordered log| R1[Replica 1
State Machine]
CO -->|Ordered log| R2[Replica 2
State Machine]
CO -->|Ordered log| R3[Replica 3
State Machine]
R1 -->|Same output| C1
R2 -->|Same output| C1
R3 -->|Same output| C1
In SMR, the consensus problem shifts from "what is the current value?" to "in what order should operations be applied?" — this is called total order broadcast or atomic broadcast. Once all replicas agree on the order of operations, they produce identical state transitions deterministically.
Paxos as described so far solves consensus for a single value. But SMR needs consensus over an array of values — a log of operations. Multi-Paxos [Van Renesse and Altinbuken, 2015] extends basic Paxos in two ways:
With a stable leader, the Prepare phase can be skipped for subsequent log entries (the leader is already known and trusted), making Multi-Paxos efficient in the common case — only the Accept phase runs.
Multi-Paxos is "not really well-specified overall" [Van Renesse and Altinbuken, 2015]; this fuzziness motivated the development of Raft, which provides a clearer specification with the same guarantees.
| Algorithm | Key contribution |
|---|---|
| Fast Paxos [Lamport, 2006] | Reduces message delay by allowing acceptors to propose values directly |
| ZooKeeper / ZAB [Reed and Junqueira, 2008] | ZooKeeper Atomic Broadcast — used by Apache ZooKeeper for coordination |
| Egalitarian Paxos [Moraru et al., 2013] | All nodes are equal (no distinguished leader) for better load distribution |
| Raft [Ongaro and Ousterhout, 2014] | Designed for understandability; uses leader election and log replication |
Google's Chubby [Burrows, 2006] is a fault-tolerant distributed lock service that uses Paxos internally. It is used extensively inside Google systems as a reliable lock service and nameserver, powering Google File System (GFS), Cloud Bigtable, and other critical infrastructure. The engineering experience of building Chubby taught the industry many practical lessons about running Paxos in production [Chandra et al., 2007].
Consensus is at the heart of fault-tolerant distributed computing. The FLP impossibility tells us there is no free lunch: any practical solution must relax some assumption (synchrony, determinism, or failure model). Paxos does this by assuming eventual synchrony and tolerating only crash failures, and by guaranteeing safety under all conditions while liveness depends on sufficiently long periods of stability.
In the synchronous model, there is a known bound Δ on message transmission delay and a known bound Φ on relative process speed. This allows accurate failure detection: if no response arrives within Δ + Φ time, the process can be assumed to have crashed. In the asynchronous model, no such bounds exist — a process can never distinguish between a crashed process and an extremely slow one. This distinction matters because FLP proves that consensus is impossible in the asynchronous model with even one crash failure, whereas synchronous systems can solve consensus (e.g., using timeouts).
FLP Theorem (Fischer, Lynch, Paterson, 1985): In a fully asynchronous distributed system, no deterministic consensus protocol can tolerate even one crash failure under the non-triviality requirement. Assumptions: reliable message delivery (no loss, no duplication), no Byzantine failures, asynchronous communication (unbounded delays), processes can only fail by crashing. It proves that any protocol that guarantees agreement and validity must violate termination in some execution where a single process crashes at an inopportune moment. The proof uses the concept of bivalent states (where both outcomes are still possible) and shows that from any bivalent state, there is always a sequence of events leading to another bivalent state, preventing the protocol from ever deciding.
The three logical roles are: Proposer — submits proposed values on behalf of clients, initiates the protocol; Acceptor — decides which values become candidates for the final decision, forms majority quorums; Learner — collects accepted values from acceptors and reports the final decision to clients. These are logical roles because a single physical process may play multiple roles simultaneously (e.g., a server might act as both acceptor and learner). This separation of concerns allows each role to be optimised independently and allows the protocol to be deployed flexibly.
Paxos uses majority quorums. To tolerate f crash failures, there must be n - f processes still alive. For any two quorums to intersect (which is necessary for correctness — a value chosen by one quorum must be known to the next), we need 2(n - f) > n, which simplifies to n ≥ 2f + 1. Intuitively, each majority quorum has size ⌊n/2⌋ + 1, and with n ≥ 2f + 1, any two majorities always share at least one non-faulty process.
SMR is a general method for building fault-tolerant services. A deterministic state machine is replicated across multiple servers. The key challenge is ensuring that all replicas process the same sequence of operations in the same order. This is precisely where consensus comes in: replicas use a consensus protocol to agree on the next operation to execute (or, equivalently, on the ordering of operations in a log). Consensus over single values is generalised to consensus over an ordered log of values via protocols like Multi-Paxos or Raft. Once all replicas agree on the order, determinism ensures they all reach the same final state.
Basic Paxos achieves consensus on a single value. Multi-Paxos extends this to an ordered sequence of values (a log) by running one Paxos instance per log entry. Multi-Paxos adds leader election: a distinguished proposer is chosen, which eliminates the possibility of conflicting proposals. With a stable leader, the Prepare phase can be skipped for subsequent log entries (since the leader is already known and trusted by a majority), reducing the protocol to a single message round-trip per entry. This makes Multi-Paxos efficient in the common case while preserving the same safety guarantees as basic Paxos.