Distributed Systems — Prof. Omicini

Logging & Checkpointing

DS-C2 — Towards Recovery of Distributed SystemsAndrea Omicini — DISI, Univ. BolognaA.Y. 2025/2026

In this lesson

1. Interaction, Dependencies & Causality

Components of a distributed system play different roles: they have functions to perform, services to provide, tasks to execute, and goals to achieve. But no component works in isolation. Components interact — and through interaction, they depend on each other.

Interaction happens across both time (in concurrent systems) and space (in distributed systems). Two processes on different machines exchange messages; the sender's state before the send and the receiver's state after the receive are causally linked. If the sender crashes before sending, the receiver's subsequent behaviour becomes invalid. This is the core reason why recovery is hard.

Key insight

Dependencies between components span across process contexts — over time in concurrent systems, over space in distributed systems. Recovery mechanisms must respect these dependencies or risk reconstructing a state that never existed.

How do we model dependencies? One (mischievously) simple way is causality. Two events are causally related if one could have influenced the other. In distributed systems, causality is the fundamental building block for reasoning about what states are reachable and which checkpoints form a coherent global picture.

2. Causality in Science and Computing

The "cause-effect" link is a basic cognitive tool humans use to understand and predict the dynamics of reality. In daily life, we use motivations and goals to explain the course of actions — what is called mind reading. But can science — and computer science in particular — adopt a precise, operational notion of cause?

Note

Interestingly, the word "cause" was common in 19th-century physics paper titles but nearly absent in their bodies. Science struggles with causality too — which is why Judea Pearl's Causality: Models, Reasoning, and Inference (2009) is essential reading for understanding the formal foundations.

The famous warning "correlation is not causation" reminds us that mere association of effects can be a confounding factor. For example, suppose a carcinogenic genotype leads to both lung cancer and a predisposition to smoke. A strong statistical correlation between smoking and lung cancer would not, by itself, prove a causal link — the tobacco industry famously used this argument. In Pearl's framework, correlation is a statistical concept while causation is a probabilistic one, precisely defined through causal models and do-calculus.

For distributed systems, we fortunately do not need the full philosophical apparatus. We need a narrower, operational notion: one event causally precedes another if information flows from the first to the second. This gives us a practical handle on dependencies between processes.

3. Causality, Time, and Distributed Ordering

In distributed systems, a causal link determines a temporal relationship between two events: a cause temporally precedes its effect. This is an oversimplified notion — but it has its use.

Time is an essential issue in distributed systems. Without a shared clock, we cannot simply assign timestamps and sort events. The fundamental observation is:

This loss of total order is the defining technical challenge of distributed system recovery. When reconstructing a global state from individual process checkpoints, we must respect causality: we cannot combine a checkpoint where P0 has sent message m with a checkpoint where P1 has received m but P0 has not. That would produce a state that was never reachable — an inconsistent global state.

The examiner will ask

Explain why a global state constructed from arbitrary per-process checkpoints might be inconsistent. Use causality and the partial order of distributed events in your answer.

sequenceDiagram
    participant P0 as Process P0
    participant P1 as Process P1
    participant P2 as Process P2
    P0->>P1: m0 (causal link)
    P1->>P2: m1 (causal link)
    Note over P0,P2: A checkpoint set {C0, C1', C2}
where C0 captures "before m0"
and C1' captures "after receiving m0"
is INCONSISTENT
because C1' depends on information
not present in C0

4. Checkpointing: The Fundamental Dependability Technique

Checkpointing and logging are the most fundamental techniques for achieving dependability in distributed systems. They provide a path towards system recovery after failure, offering a form of fault tolerance that is:

Definition

A checkpoint of a distributed system is a copy of the system state saved to stable storage — storage that survives the faults the system is designed to tolerate. If the checkpoint is available after a failure, the system can be recovered to the state captured by that checkpoint.

Checkpointing refers to the action of periodically taking a copy of the system state and saving it to stable storage. The frequency and coordination of checkpoints determine both the runtime overhead and the amount of work that may be lost upon failure.

Limitation

If only checkpointing is used, some information may be lost when a fault occurs. For example, any state changes that happened after the last checkpoint are gone. The recovery time is typically larger than that of more sophisticated approaches. This is why we pair checkpointing with logging.

5. Recovery: Why Checkpoints Alone Are Not Enough

To recover a system to the state right before it fails, additional recovery information must be logged beyond periodic checkpoints. Specifically:

Together, checkpointing and logging provide a form of rollback recovery: they restore the system to a state prior to the failure, then replay logged events to reach the pre-failure state.

Editor's note

Mechanisms for roll-forward recovery exist — they reconstruct an alternative correct state rather than rewinding to a previous one. However, they typically incur higher runtime overhead and demand more physical resources, making rollback recovery the more common choice in practice.

flowchart LR
    S[System Running] --> CP[Periodic Checkpointing]
    CP --> LG[Message Logging]
    LG --> N[Normal Operation]
    N --> F[FAILURE]
    F --> RB[Rollback to Last Checkpoint]
    RB --> RP[Replay Logged Messages]
    RP --> S2[System Restored
to Pre-Failure State] S2 --> CP style F fill:#fca5a5,stroke:#991b1b style S2 fill:#86efac,stroke:#065f46

6. System Model and Fault Model

Before designing recovery protocols, we must define the system model precisely.

System Model

Fault Model

Key assumption

The fail-stop model is both realistic (a crashed process does stop) and convenient (we do not need to handle Byzantine behaviour). Under this model, the only loss on failure is the volatile state of the crashed process — messages may still be in transit, and checkpoints on stable storage survive.

The figure below illustrates a typical system: four processes P0 through P3 exchanging messages, with an input arriving from the outside world and an output emitted as the corresponding response.

flowchart LR
    subgraph Outside World
        I[Input] --> P0
        P3 --> O[Output]
    end
    subgraph Distributed System
        P0 -->|m0| P1
        P1 -->|m1| P2
        P2 -->|m2| P3
        P3 -->|m3| P0
        P1 -->|m4| P3
        P0 -->|m5| P2
    end
    P0[P0] --- P1[P1]
    P2[P2] --- P3[P3]
    P0 --- P3
    P1 --- P2

7. Process State and Global State

The state of a process is defined by its entire address space in the operating system. A generic checkpointing library simply saves the entire address space. However, application semantics can often be exploited to define a smaller, more specific notion of process state.

The global state of a distributed system includes the state of every process. But aggregation alone is not enough: the states of different processes are related through message exchanges. Information flows between processes, causing state changes — processes causally depend on each other. Therefore:

Critical point

Dependencies cannot be lost in the global state. A checkpoint set that ignores causal dependencies may be inconsistent and therefore useless for recovery.

The table below summarises the components of the system state:

ComponentDescription
Process stateEntire address space of a process (or a semantically-relevant subset)
Channel stateSet of messages in transit along a channel — sent but not yet received
Global stateSet of process states + set of channel states for all channels

8. Consistent vs. Inconsistent Global States

The slides present three critical scenarios for global states. Understanding the difference between them is essential for designing recovery protocols.

Inconsistent global state: Checkpoints taken by different processes are incompatible — they cannot be used to recover the system.

Example: C1 reflects reception of m0, but C0 does not show m0 as sent. If P0 crashes after m0, recovery from {C0, C1} would make $100 appear from nowhere.

A global state formed from a wrong set of checkpoints is not admissible — it is not reachable from the initial state of the system through any valid execution.

Consistent global state: Checkpoints are compatible and can be used for recovery.

Example: C1 reflects reception of m0, and C0 also records m0 as sent. If P0 crashes after C0 and P1 after C1, recovery from {C0, C1} correctly moves $100 from A to B. The same holds for C1, C2 and m1.

Consistency ensures that for every received message, its sending is also recorded — the global state respects causality.

Consistent yet unrecoverable: Checkpoints are compatible (they represent a reachable state) but cannot be used for recovery because messages in transit would be lost.

Example: $50 in transit via m1 would disappear on recovery. The problem is the loss of channel state.

To solve this, the system model must include channel state — the set of messages in transit. If m0 is saved in C0 as channel state and m1 in C1 the same way, recovery from {C0, C1} becomes possible.

sequenceDiagram
    participant P0 as P0 (Account A)
    participant P1 as P1 (Account B)
    P0->>P1: m0: $100 deposit
    Note over P0: Checkpoint C0
(A=$400, m0 sent) Note over P1: Checkpoint C1
(B=$400, m0 received) Note over P0,P1: Consistent: C0 records send, C1 records receive Note over P0,P1: Inconsistent: C0 does NOT record send,
C1 DOES record receive

9. Channel State and the Refined Model

To handle scenario (c) — consistent checkpoints with lost messages in transit — we refine the system model:

Example

A TCP connection between two processes comprises two channels — one in each direction. Each channel has its own state (messages in transit in that direction).

With channel state included in the global state, scenario (c) becomes recoverable: if m0 is saved in C0's channel state (for the channel P0→P1) and m1 in C1's channel state (for the channel P1→P2), then recovery from {C0, C1} correctly preserves all messages.

10. The Piecewise Deterministic Assumption

Checkpoint-based protocols recover the system up to the most recent consistent global state. All execution after that state is lost. This is a fundamental limitation — no matter how frequently checkpoints are taken, there is always a window of lost computation between the last checkpoint and the failure.

Log-based protocols overcome this limitation by assuming the piecewise deterministic (PWD) assumption:

Piecewise Deterministic (PWD) Assumption

Each state interval evolves deterministically until a non-deterministic event occurs (such as message reception). All non-deterministic events can be identified, and enough information about each is logged so that the interval can be deterministically replayed.

Under the PWD assumption, the execution of a process is modelled as consecutive state intervals. Each interval is initiated by a non-deterministic event and followed by a sequence of deterministic state changes. If the initiating event is logged, the entire interval can be replayed — allowing recovery to a state right before the failure, not just the last checkpoint.

Trade-off

Checkpoint-based protocols do not need the PWD assumption — they are simpler to implement and less restrictive. But they lose post-checkpoint execution. Log-based protocols recover more precisely but require identifying all non-deterministic events and logging them properly.

11. Output Commit and Stable Storage

Output Commit

A distributed system interacts with the outside world — clients, services, users. Once an output is emitted, a portion of the system state becomes observable externally, representing a sort of commitment or achievement. This introduces the output commit problem:

Output Commit Problem

If a failure occurs after an output is emitted, the outside world cannot be relied upon to "un-see" that output. Therefore, observable consistency requires that enough recovery information is logged before the system commits to an output — otherwise a rollback would make the output inconsistent with the recovered internal state.

Stable Storage

All checkpointing and logging protocols require stable storage — storage that survives process failures and remains available upon recovery. The choice of stable storage depends on the fault tolerance required:

Failure TypeStable Storage
Process failures onlyLocal disks
Disk failuresRedundant disks (RAID-1, RAID-5)
Site-wide failuresReplicated or cloud-based file systems
Note

Checkpoints and logged messages must always be stored in stable storage. The stability requirement is absolute — if a checkpoint is lost, the entire recovery mechanism fails.

12. Uncoordinated Checkpointing

In uncoordinated checkpointing, each process autonomously decides when to take a checkpoint. The intuition is straightforward: when a failure occurs, the recovery process searches backwards to select the most recent set of consistent checkpoints.

However, this autonomy creates problems:

The examiner will ask

Explain the domino effect in uncoordinated checkpointing: why might rolling back one process force a cascade of rollbacks across other processes? How does coordinated checkpointing prevent this?

13. Tamir & Sequin Coordinated Checkpointing

The Tamir and Sequin global checkpointing protocol (1984) is a coordinated, blocking protocol. It solves the consistency problem by having processes coordinate their checkpoints so that they always form a consistent global state.

Roles

Control Messages

MessagePurpose
CHECKPOINTInitiates a global checkpoint round; also used to establish a quiescent point where all processes have stopped normal execution
SAVEDInforms the coordinator that a participant has completed its local checkpoint
FAULTSignals a timeout — the current round should be aborted
RESUMETells participants they can resume normal execution

Protocol Phases (Two-Phase Commit Style)

  1. Phase 1 (Quiesce): The coordinator stops execution and sends CHECKPOINT to all processes. Processes stop, send CHECKPOINT through all outgoing channels, and wait for CHECKPOINT from all incoming channels. This creates a quiescent point.
  2. Phase 2 (Commit): Once the CHECKPOINT certificate is complete, each process takes a local checkpoint and sends SAVED. The coordinator waits for all SAVED messages, then atomically switches to the new checkpoint set and sends RESUME. Processes resume normal execution.
Blocking nature

Normal execution is suspended during each round of global checkpointing. This is the protocol's main drawback: it trades availability for consistency. If a participant fails to respond, the entire round is aborted with FAULT messages.

Finite State Machine

The coordinator and each participant follow a deterministic FSM during the protocol. Below is the annotated pseudocode for the coordinator:

And for the participant:

Interactive Simulation

Step through the protocol below to see how the coordinator and participants interact. Click each process to advance it through its state machine.

14. Chandy & Lamport Distributed Snapshot

The Chandy and Lamport distributed snapshot protocol (1985) is a non-blocking alternative: normal execution is not interrupted during global checkpointing. However, it only concerns itself with producing a consistent global snapshot — it prescribes no mechanism for determining the end of the checkpointing round or atomically switching to a new global checkpoint.

Protocol Description

sequenceDiagram
    participant P0 as P0 (Initiator)
    participant P1 as P1
    participant P2 as P2
    P0->>P0: Take local snapshot
    P0->>P1: Marker
    P0->>P2: Marker
    Note over P1: Record channel state
from P0 P1->>P2: Marker Note over P2: Record channel state
from P0 and P1 P1-->>P0: (implicit) done P2-->>P1: (implicit) done P1-->>P0: (implicit) done Note over P0: All certificates complete
Global snapshot recorded
Key difference

The Chandy-Lamport protocol does not block normal execution. Processes continue processing messages while the snapshot is being taken. Messages received after a Marker on a given channel are treated as "post-snapshot" and not included in that channel's state.

15. Checkpointing Protocols Compared

Both protocols share the same system model and use special control messages to propagate and coordinate global checkpointing. They both recognise the need to capture channel state to ensure recoverability, and the mechanism for capturing it is virtually identical. Their communication overhead is also the same.

AspectTamir & SequinChandy & Lamport
ExecutionBlocking — processes stop during checkpointingNon-blocking — normal execution continues
CompletenessFull: atomic commit/abort of the global checkpoint roundPartial: only produces the snapshot; no atomic switch mechanism
RobustnessMore robust — includes timeout and FAULT handlingLess robust — no built-in abort mechanism
InitiationCoordinator-driven (single initiator)Any process can initiate (autonomous)
Consistency guaranteeStrong — two-phase commit ensures atomicityStrong — Marker protocol ensures consistency, but lacks atomicity
The examiner will ask

Compare Tamir & Sequin vs. Chandy & Lamport. When would you choose one over the other? Consider: system availability requirements, the cost of blocking, and the need for atomic commit of checkpoints.

16. Log-Based Protocols: Pessimistic, Optimistic, Causal

While checkpoint-based protocols save state, log-based protocols save events. Under the piecewise deterministic assumption, logged events can be replayed to recover the exact pre-failure state. Three approaches exist, differing in when and how non-deterministic events are logged [Alvisi and Marzullo, 1998].

Pessimistic logging synchronously logs each received message to stable storage before executing it. This is the safest approach:

  • Non-deterministic events are logged to stable storage before they affect state
  • On failure, the log contains every event needed for recovery
  • No missing messages — recovery is straightforward and fast
  • However, the synchronous log write adds latency overhead to every message processing

Optimistic logging reduces latency by storing non-deterministic events in volatile memory first, then flushing them asynchronously to stable storage:

  • Lower runtime overhead — no synchronous writes
  • Risk: if a process fails before the log is flushed, some events are permanently lost
  • Forces rollback to a state earlier than the failure point
  • Recovery is more complex — the recovering process must determine which messages are missing

Causal logging piggybacks non-deterministic event information on each outgoing message:

  • Instead of writing to stable storage immediately, message logs travel with the messages themselves
  • A process receiving a piggybacked message gains access to all events that causally affect its state
  • Upon failure, a process can reconstruct missing events from its neighbours' message logs
  • Balances latency overhead with recovery completeness
  • However, dependency tracking and piggybacking increase protocol complexity
Trade-off summary

Pessimistic: safest recovery, highest latency overhead. Optimistic: lowest overhead, risk of lost events. Causal: balanced overhead, sophisticated recovery. The choice depends on whether the application can tolerate lost events (optimistic) or must always recover to the exact pre-failure state (pessimistic).

In both optimistic and causal logging, process dependencies must be tracked and sufficient dependency information piggybacked on messages. This increases complexity and can cause cascading recovery operations. Pessimistic logging is much simpler in design and implementation, and failure recovery can be made much faster — at the cost of synchronous log writes.

17. Checkpointing + Logging: The Complete Picture

For all practical purposes, logging is always used in conjunction with checkpointing. The combination gives two essential benefits:

  1. Limited recovery time: A process restarts from its last checkpoint (not from its initial state) and replays logged non-deterministic events to reach the pre-failure state. Without checkpoints, the log would have to be replayed from the beginning of time.
  2. Limited log size: By taking a checkpoint periodically, all log entries prior to the checkpoint can be garbage collected — they are no longer needed because the checkpoint captures that state.
flowchart LR
    subgraph Normal Operation
        A[Process Running] --> B[Checkpoint Taken]
        B --> C[Messages Logged
in State Interval] C --> D[Non-deterministic
Event Occurs] D --> E[New State Interval] E --> F[Next Checkpoint] F --> C end subgraph On Failure G[CRASH] --> H[Restart from
Last Checkpoint] H --> I[Replay Logged
Non-deterministic Events] I --> J[Process Restored to
Pre-Failure State] end style G fill:#fca5a5,stroke:#991b1b,color:#1a2233 style J fill:#86efac,stroke:#065f46,color:#1a2233
Garbage collection

Checkpoint period determines the balance: a short period means less replay after recovery but higher checkpointing overhead. A long period means less overhead but more replay and larger logs. Most practical systems choose a period that balances these costs based on the failure rate and recovery time objectives.

18. Conclusion

Lessons Learnt

Dependability requires a coherent notion of dependencies and causality. You cannot recover a distributed system to a correct state unless you understand how its components depend on each other through message exchanges. The partial order of events in a distributed system forces us to reason carefully about which combinations of process states are admissible.

The key takeaways from this lesson:

Simple dependability techniques such as checkpointing and logging can dramatically improve the chances of recovery in case of failures. While they have limitations — checkpoint-only protocols lose post-checkpoint execution; logging protocols require the PWD assumption — they are used at all levels of dependability mechanisms and are the bedrock on which more sophisticated fault tolerance approaches are built.

The examiner will ask

Explain why checkpointing and logging are described as "the most fundamental techniques" for dependability. What can they achieve together that neither can achieve alone?

Check Your Understanding

Define a consistent global state in a distributed system. Why is an inconsistent global state unusable for recovery?

A consistent global state is a set of per-process checkpoints where for every message reception recorded in any checkpoint, the corresponding sending is also recorded in some other checkpoint in the set. It respects causality: the state could have actually been reached through some valid execution of the system. An inconsistent global state, by contrast, is not reachable from the initial state — it represents a "never-existed" configuration. If used for recovery, it could lead to effects without causes (e.g., money appearing from nowhere in a bank transfer scenario).

Explain the domino effect in uncoordinated checkpointing. How does coordinated checkpointing avoid it?

In uncoordinated checkpointing, each process checkpoints independently. When a crash occurs, the recovery process selects the most recent set of consistent checkpoints. If no consistent set exists near the crash point, processes must roll back further. This can cascade: rolling back process A may invalidate process B's checkpoint (because B's checkpoint had received a message from A that A's rolled-back checkpoint no longer shows as sent), forcing B to roll back too. This chain reaction is the domino effect. Coordinated checkpointing avoids it by ensuring all processes checkpoint at mutually consistent points — every received message in the new checkpoint set has a corresponding send recorded in the same round — so recovery always finds a consistent set at the most recent round.

Describe the two phases of the Tamir and Sequin global checkpointing protocol. Why is it called a "blocking" protocol?

Phase 1 (Quiesce): The coordinator stops execution and sends CHECKPOINT to all participants. Each participant stops, sends CHECKPOINT on outgoing channels, and collects CHECKPOINT from incoming channels — establishing a quiescent point where the system has no messages in transit. Phase 2 (Commit): Each process takes a local checkpoint and sends SAVED. The coordinator waits for all SAVED messages, then atomically switches to the new checkpoints and sends RESUME. It is blocking because processes suspend normal execution during the entire checkpointing round. No application messages are processed while checkpoints are being taken, ensuring consistency but reducing availability.

Compare pessimistic, optimistic, and causal logging. Under what conditions would you choose each?

Pessimistic logging writes every non-deterministic event to stable storage synchronously before execution. Recovery is simple and complete, but latency is high. Choose it when correctness is paramount and the system can tolerate the write latency. Optimistic logging buffers events in volatile memory and flushes asynchronously — lower overhead, but failure before flush means lost events and rollback. Choose it when failures are rare and occasional loss of recent events is acceptable. Causal logging piggybacks event info on messages, avoiding stable storage writes for most events. Recovery uses piggybacked information from neighbours. Choose it when you want a balance of low overhead and minimal event loss.

Why is it important to capture channel state in a global checkpoint? Give a concrete example of unrecoverable state without it.

Channel state captures messages that are in transit — sent but not yet received. Without it, a consistent checkpoint set could still lose messages. Example: P0 sends $50 to P1, and P1 simultaneously sends $30 to P2. Both processes checkpoint after sending their messages but before receiving. The checkpoints are consistent (no message is received without a corresponding send, because neither has received yet), but the $50 and $30 in transit are lost on recovery. The system would appear to have "lost" money. By recording channel state (the in-transit messages), the checkpoint captures the complete global state including money in transit.

Explain the piecewise deterministic (PWD) assumption. Why is it required for log-based recovery but not for checkpoint-based recovery?

The PWD assumption states that process execution consists of deterministic intervals separated by non-deterministic events (like message receptions). Each interval's outcome is fully determined by the non-deterministic event that starts it. If that event is logged, the interval can be replayed. Log-based recovery needs this assumption because it replays events from logs — without it, replay would not faithfully reproduce the original execution. Checkpoint-based protocols do not need PWD because they restore state directly from saved checkpoints; they do not replay events at the non-determinism boundary. They simply accept losing any post-checkpoint execution.

What is the output commit problem and how does it affect recovery protocol design?

The output commit problem arises because once a distributed system emits an output (e.g., a response to a client), the outside world observes it and may act on it. If the system then fails and rolls back, it cannot "un-emit" the output. This means the recovered state must be consistent with all outputs already committed. The solution is to ensure that enough recovery information is logged before any output is committed — typically, the system logs all events leading to the output before sending it. This way, if a failure occurs, the system can always recover to a state where that output remains valid.