DS-M1: Dependability in Distributed Systems

2. The Economics of Dependability

The cost of failures

When a data centre is brought down by a system failure, the average cost of downtime ranges from tens to hundreds of thousands of euros per hour — summing wasted expenses and loss of revenue. This makes failure prevention and recovery a top economic priority for any organisation running distributed systems.

The cost of dependability

Dependability is not cheap either. The cost of data centres varies dramatically with the required availability level:

Availability level	Downtime / year	Cost per square foot
99.671% (three nines)	28.8 hours	$450
99.995% (four nines)	0.4 hours (~26 min)	$1,100

The data is striking: significantly higher availability roughly doubles the infrastructure cost. This partly explains why nearly 60% of Fortune 500 companies suffer from more than one and a half hours of downtime per week.

Key idea

There is a fundamental trade-off: the cost of failures (downtime) versus the cost of preventing them. Training more engineers who know how to design, implement, and maintain dependable distributed systems is one of the most effective ways to reduce both.

3. System Models for Dependability

To reason about dependability, we need two things: a way to model a distributed system appropriately, and a way to model the threats to that system.

What is a system?

A (distributed) system is designed to provide a set of services to its users (often called clients). Each service has an interface that a client uses to request the service. A functional specification defines what a service should do.

State of a distributed system

At each moment in time, a system is in a given state. The state of a distributed system is determined collectively by the state of its processes and threads — values of registers, stacks, heaps, file descriptors, kernel state, and so on. Part of this state is visible through user interaction (external state or observable state), as defined by the functional specification. The rest is the internal state.

Definition

The state of a system at time t is the (minimum amount of) information that, along with the knowledge about the dynamics of a (deterministic) system, allows an observer to completely describe the future system behaviour from time t on.

State for recovery

State can be used for recovery after a failure. If the state is serialised and written to stable storage before a failure, and it remains accessible and intact after the failure, the system can be recovered to the state it had before the failure occurred.

System boundaries

Before we can capture state, we must define the boundary of the system — the physical or logical line separating the system from its surrounding environment. Inside the boundary: the system's components. Outside: the system's environment (all other systems that affect it). Importantly, every definition of component versus system depends on the chosen level of abstraction. A system at one level may be a component of a larger system at another level.

Editor's note

The question "how can we capture the state of a distributed system at any time t?" leads directly to the complex issue of time in distributed systems — a topic for a future lesson. In a distributed system there is no global clock, so capturing a consistent global state is surprisingly hard.

4. Threat Models: Fault, Error, Failure

When a system is not compliant with its functional specification, we say that a (service) failure has occurred. The failure is caused by part of its state having wrong values — errors in its state. Errors, in turn, are caused by some faults.

This three-level causal model is the foundation of dependability analysis:

Fault — The root cause: a defect in a system component. A fault may be dormant (not yet causing problems) until specific conditions activate it.
Error — When a fault is activated, it creates an error: a part of the system state that deviates from the correct value. The error may remain internal or propagate.
Failure — When the error propagates to the service interface and causes the delivered service to deviate from its specification, a service failure occurs.

A fault might remain dormant for a long time. For example, a bug in software that is not executed, or a shared variable in a multithreaded application that is not lock-protected but is only accessed by one thread at a time. When the specific condition is met, the fault is activated, causing an error in the component. As the component interacts with others, the error propagates through the system. When the error reaches the interface, a service failure occurs.

Important

Due to the recursive nature of system composition, the failure of one system may cause a fault in a larger system when the former is a component of the latter. This recursive relationship is called the chain of threats — and it is why the terms "fault" and "failure" are often (improperly) used interchangeably in the literature.

flowchart LR
    F["Fault"] -->|Activation| E["Error"]
    E -->|Propagation| F2["Failure"]
    F2 -->|Causation in
larger system| F3["Fault (parent)"]

6. Taxonomies of Faults

Different sorts of faults require different treatments. Faults can be classified according to six criteria: source, intent, duration, manifestation, reproducibility, and relationship with other faults. Click through the tabs below to explore each classification.

Source of Faults

Based on their source, faults are classified as:

Hardware faults — caused by failure of hardware components: power outages, hard drive failures, bad memory chips, etc.
Software faults — caused by software bugs: race conditions, no-boundary-checks for arrays, logic errors, etc.
Operator faults — caused by human operators: misconfiguration, wrong upgrade procedures, accidental deletions, etc.

Hardware faults are typically random and independent; software faults are systematic and often correlated across replicas (all replicas run the same code). Operator faults are the hardest to predict and guard against.

Intent of Faults

Based on intent, faults are classified as:

Non-malicious faults — not caused with malicious intent. Examples: naturally-occurring hardware faults, unintended software bugs.
Malicious faults — caused by a person with intent to harm the system. Examples: denial of service attacks, integrity compromises, Byzantine faults (where a component may behave arbitrarily — sending conflicting information to different parts of the system).

The examiner will ask

Byzantine faults represent the most severe class of malicious faults: a faulty component may send arbitrary, possibly conflicting, values to different components. This models scenarios where a node is compromised and controlled by an attacker.

Duration of Faults

Based on duration, faults are classified as:

Transient faults — activated momentarily, then go dormant again. Example: a power spike that affects a hardware component and then disappears.
Intermittent faults — occur, vanish on their own, reappear, and so on. Example: a race condition that occurs only when two threads access the same shared variable at exactly the same time.
Permanent faults — once activated, remain until the faulty component is repaired or the source is addressed. Example: a power outage (a computer stays off until power is restored), or a process crash.

Transient and intermittent faults are particularly insidious because they are hard to reproduce and diagnose. Permanent faults are simpler to handle (the component is clearly broken), but they may cause extended downtime if redundancy is lacking.

Manifestation of Faults

Based on their manifestation, faults are classified as:

Content faults — the faulty component passes wrong values to other components. It may always pass the same wrong values, or it may return different values to different components (the latter is a Byzantine fault).
Timing faults — the faulty component returns a reply too early or too late. The extreme case: the component stops responding entirely (infinite response time), as when it crashes or hangs due to an infinite loop or deadlock.

Reproducibility of Faults

Based on reproducibility, faults are classified as:

Reproducible (deterministic) faults — happen deterministically and are easy to reproduce. Example: accessing a null pointer. These are typically easy to identify and repair.
Nondeterministic faults — appear to happen nondeterministically and are hard to reproduce. Example: a fault caused by a specific interleaving of threads accessing a shared variable. These are sometimes called Heisenbugs, highlighting their uncertainty (they seem to disappear when you try to observe them).

Editor's note

The name "Heisenbug" is a pun on Werner Heisenberg's uncertainty principle: the act of observing the bug (e.g., by adding logging or attaching a debugger) changes the timing and thread interleaving, making the bug disappear.

Relationship with Other Faults

Based on their relationship with other faults:

Independent faults — no causal relationship between them. Fault A does not cause fault B and vice versa.
Correlated faults — causally related. Fault B is caused by fault A, or vice versa. When multiple components fail due to a common reason, this is specifically called a common mode failure.

Common mode failures are particularly dangerous for redundant systems: if all replicas run the same software, a software bug can cause all of them to fail simultaneously, defeating the purpose of replication.

Failure Models: The Tanenbaum and van Steen Classification

An alternative classification focuses on how a system behaves when it fails:

Failure model	Description
Crash failure	A server halts but was working correctly until it stopped
Omission failure	A server fails to respond to incoming requests
Timing failure	A server's response lies outside the specified time interval
Response failure	A server's response is incorrect (value failure) or the server transitions through incorrect states (state transition failure)
Byzantine failure	A server produces arbitrary responses at arbitrary times (the most general and severe failure model)

7. Failure Models: Fail-Stop, Fail-Safe, Fail-Fast

Fail-stop systems

When a system fails, it is desirable to avoid catastrophic consequences. A fail-stop system is enhanced with dependability mechanisms so that when it fails, it simply stops responding to requests. This clean halt makes the failure obvious to other components and prevents the propagation of corrupted data.

Fail-safe systems

Some systems are designed so that their failure does not cause great harm to human life or the environment. A fail-safe system defines a set of safe states. When it can no longer operate according to its specification, it transitions to one of these predefined safe states. A nuclear power plant control system must be fail-safe: if it cannot operate correctly, it must shut down safely rather than continue in an undefined state.

Fail-fast

Fail-fast is a software engineering practice where a system halts its operation immediately when it enters an error state or encounters an unexpected condition. This enables early detection and diagnosis of faults: once a fault propagates to many components, pinpointing the source becomes much harder. Fail-fast is a design principle that prioritises immediate failure over silent corruption.

Key insight

Fail-stop and fail-fast are about making failures visible and contained so that other parts of the system can react appropriately. Fail-safe is about ensuring that even when the system fails, the consequences remain bounded and non-catastrophic.

8. Dependability Attributes: The Five Pillars

A dependable system is characterised by five key attributes. Some can be quantified as evaluation metrics; others are qualitative. Some are fundamental to all distributed systems; others are secondary or application-specific.

Attribute	Definition	Quantifiable?	Fundamental?
Availability	The system is ready for immediate use	Yes (MTTF, MTTR)	Fundamental
Reliability	The system runs continuously without failure	Yes (R(t))	Fundamental
Integrity	The system's state cannot be compromised	Difficult	Fundamental
Maintainability	A failed system can be easily repaired	Difficult	Secondary
Safety	Failure does not cause catastrophic harm	Difficult	Secondary

The examiner will ask

Availability vs. reliability: are they the same? No. Availability measures the probability that the system is ready at a given instant. Reliability measures the probability that the system runs continuously without failure over a time interval. A system that fails frequently but recovers instantly has high availability but low reliability.

9. Availability: Quantitative View

Key metrics

MTTF — Mean Time To Failure: the average time until a system fails.
MTTR — Mean Time To Repair: the average time needed to repair a system after failure.
MTBF — Mean Time Between Failures: simply MTTF + MTTR.

Formal definition of availability

Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF

Availability is commonly expressed in terms of nines:

Nines	Availability	Downtime / year
2 nines	99%	87.6 hours (3.65 days)
3 nines	99.9%	8.76 hours
4 nines	99.99%	52.6 minutes
5 nines	99.999%	5.26 minutes
6 nines	99.9999%	31.5 seconds

For example, a system claiming "five nines" availability means it has probability 0.99999 of being available at any instant — equivalently, at most 5.256 minutes of downtime per year.

Key insight

To improve availability, you can either increase MTTF (make the system more reliable, so it fails less often) or decrease MTTR (make the system recover faster). The latter is often more cost-effective: it is frequently cheaper to build fast recovery mechanisms than to eliminate every possible fault.

11. Reliability: Continuous Operation

Definition

Reliability is a measure of the system's capability to provide correct services continuously for a period of time. It is often represented as the probability that the system does so for a given time interval Δt:

R(Δt) = e^{(-λ · Δt)}

where λ ∈ [0, ∞) is the failure rate, approximately proportional to 1/MTTF.

Availability does not equal reliability

This is a critical distinction. Consider a system that fails frequently but recovers almost instantly:

Availability is high because at any given instant the system is very likely to be up (it is only briefly down after each failure).
Reliability is low because the system cannot run for a long period without encountering a failure.

The examiner will ask

Give an example of a system with high availability and low reliability — and one with the opposite. High availability, low reliability: a web server that crashes every 10 minutes but reboots in 2 seconds (99.97% available but unreliable over any 30-minute window). Low availability, high reliability: a scientific computation server that runs flawlessly for 6 months but then requires 2 weeks of maintenance.

Key insight

While availability is an instantaneous measure, reliability is a durational measure. Availability is about being ready now; reliability is about staying correct over time.

12. Integrity, Maintainability, Safety

Integrity

Integrity refers to the capability of a system to protect its state from being compromised under various threats. In dependable computing research, integrity typically translates into the consistency of server replicas when redundancy is used. As long as the number of faulty replicas does not exceed a pre-defined threshold, the consistency of the correct replicas naturally implies system integrity.

Maintainability

Maintainability refers to the capability of a system to evolve after it is deployed. Once a fault is detected, applying a repair patch should not involve uninstalling the existing system and reinstalling an updated one. The same patching or software update mechanism may also be used to add new features or improve performance. A highly-maintainable system supports live upgrades — updates applied without taking the system offline. Maintainability is closely related to availability: a system that is easy to repair tends to have higher availability (lower MTTR).

Safety

Safety means that when a system fails, it does not cause catastrophic consequences — the system must be fail-safe. Safety is crucial for systems that operate in environments where failure could endanger human lives or cause massive damage: nuclear power plant control systems, hospital operating room systems, aircraft flight control systems. Safety is less important for systems that do not operate in such environments, such as e-commerce platforms.

Important distinction

Safety is about the consequences of failure, not the frequency. A system can be highly available and reliable but still unsafe if a failure — however rare — leads to catastrophic outcomes. Conversely, a system can be safe even with low availability if it fails into a harmless state.

Approach	When applied	Goal
Fault avoidance	Before deployment	Prevent faults from being introduced
Fault detection & diagnosis	During operation	Discover faults and locate their source
Fault removal	After detection	Isolate and eliminate faulty components
Fault tolerance	During operation	Continue correct service despite faults

14. Fault Avoidance, Detection, and Diagnosis

Fault avoidance

Fault avoidance aims to prevent faults from ever being introduced. For software, this means ensuring correct design specification and correct implementation before the system is deployed. Techniques include:

Rigorous software design, possibly using formal methods to assess and verify system properties.
Rigorous software testing, to identify and remove bugs due to design deficiencies or implementation errors.

For hardware, it means using quality components from reputable manufacturers, with proper burn-in testing and environmental controls.

Fault detection and diagnosis

Detection can be tricky. Some faults are trivial to detect — for instance, crash faults can be caught by periodically probing each component to check its health (heartbeat monitoring). However, components may fail in ways other than crashing, so probing alone is insufficient.

Once a fault is detected, diagnosis is required to confirm that a fault has indeed occurred and to localise its source — that is, to pinpoint the faulty component. Formal models and statistical tools are used for this purpose.

Example

Exception handling in modern programming languages is a concrete example of fault detection and handling at the code level. An exception is raised when an error is detected; it propagates up the call stack until a handler catches it, containing the fault before it causes a failure.

15. Fault Removal and Fault Tolerance

Fault removal

Once a fault is detected and localised, it should be isolated and removed from the system. The faulty component is either repaired or replaced, then reintroduced into the system — which typically requires reconfiguration. In a distributed system, this requires a notion of membership: the faulty component is excluded from the system, and the repaired component becomes part of the membership again.

Special case

Software updates are a special case of fault removal (and sometimes fault avoidance). The patching mechanism must itself be dependable: a failed update could leave the system in an inconsistent state.

Fault tolerance

Because hardware failures are inevitable, robust software alone is not enough to deliver high dependability. Unless a system is fully stateless, simply restarting after a failure will not automatically restore its state before the failure. Fault tolerance techniques are therefore essential to raise dependability to the next level.

Different techniques target different dependability requirements:

For applications that need high availability but not necessarily high reliability, logging and checkpointing may be sufficient. These techniques periodically record the system state; after a failure, the system rolls back to the most recent checkpoint.
More demanding applications may adopt recovery-oriented computing techniques, which focus on minimising recovery time rather than preventing failures.

Both approaches rely on rollback recovery — returning to the most recent recorded correct state.

16. Masking Failure by Redundancy

The key technique for masking faults — hiding failures from other processes — is redundancy. Redundancy comes in three forms:

Type of redundancy	Description	Example
Information redundancy	Adding extra bits to detect or correct errors	Error-correcting codes (ECC memory), checksums, CRCs, parity bits
Time redundancy	Repeating operations to recover from transient faults	Retrying a failed transaction, redoing after transaction abort
Physical redundancy	Using additional hardware or software components	Server replication, RAID storage, multiple network paths

Physical redundancy is the most powerful form — it is the principle behind biological systems (two eyes, two kidneys) and engineered reliable systems alike. In distributed systems, physical redundancy typically means replication: running multiple copies of a service so that if one fails, others can take over.

Key insight

Redundancy is not free. It adds cost (more hardware, more complex software), complexity (keeping replicas consistent), and potential for correlated failures (if all replicas share the same software bug). The goal is to add just enough redundancy to meet the dependability requirements without overspending.

The examiner will ask

Explain why information redundancy alone is insufficient to mask a Byzantine fault. Hint: a Byzantine node can arbitrarily modify both data and checksums. Information redundancy works for random bit errors but not against malicious adversaries who control the computation itself.

Check Your Understanding

Explain the difference between a fault, an error, and a failure. How are they causally related?

A fault is the root cause — a defect in a system component. When activated, a fault produces an error: an incorrect value in the system state. When the error propagates to the service interface and causes the delivered service to deviate from its specification, a failure occurs. The causal chain is: fault → error (through activation), error → failure (through propagation). Due to system composition, a failure in one component becomes a fault for the larger system containing it.

Give one example of each: transient fault, intermittent fault, permanent fault.

Transient: a power spike affecting a hardware component momentarily. Intermittent: a race condition that occurs only when two specific threads interleave in a particular way. Permanent: a power outage that keeps the system off until power is restored, or a process crash.

What is a Heisenbug? Why is it called that?

A Heisenbug is a nondeterministic fault that is hard to reproduce, typically caused by specific thread interleavings when accessing shared variables. The name plays on Heisenberg's uncertainty principle: the act of observing the bug (e.g., by adding logging or attaching a debugger) changes timing and thread interleaving, making the bug disappear or change behaviour.

Derive the relationship between availability, MTTF, and MTTR. Why can two systems with the same MTTF have different availability?

Availability = MTTF / (MTTF + MTTR). Two systems can have the same MTTF (fail equally often) but different MTTR (recovery time). A system that recovers in minutes has much higher availability than one that takes hours, even if they fail at the same rate. This is why fast recovery mechanisms are often more cost-effective than trying to eliminate every fault.

Give a concrete scenario where a system has high availability but low reliability. Explain why.

A web server that crashes every 10 minutes but reboots in 2 seconds. At any given instant, the probability it is up is (600 - 2)/600 = 99.67%, which is reasonably high availability. However, the probability it runs without failure for even 30 minutes is essentially zero — very low reliability. It is ready most of the time but cannot sustain correct operation.

What are the four main approaches to improving dependability? Give an example of each.

(1) Fault avoidance: using formal methods to verify correctness before deployment. (2) Fault detection & diagnosis: heartbeat monitoring to detect crash faults. (3) Fault removal: excluding a faulty node from the system membership and replacing it. (4) Fault tolerance: using replication so that if one server fails, another continues to provide the service.

Explain the three types of redundancy and give a distributed-systems example of each.

Information redundancy: checksums on messages to detect corruption during transmission. Time redundancy: retrying a failed RPC call after a timeout. Physical redundancy: running three replicas of a stateful service so the system continues if any one replica fails.

What is a common mode failure and why is it dangerous for redundant systems?

A common mode failure occurs when multiple components fail due to the same underlying cause. If all replicas run identical software, a single software bug can crash all of them simultaneously, defeating the purpose of replication entirely. This is why diversity (different implementations, different platforms) is sometimes used for critical systems.

What is the difference between fail-stop and fail-safe?

Fail-stop: the system stops responding when it fails, making the failure obvious and preventing propagation of corrupted data. Fail-safe: the system transitions to a predefined safe state so that even in failure, no catastrophic harm occurs. A fail-stop system may still cause harm if it stops at a bad time; a fail-safe system guarantees the consequences are bounded.

If a system has MTBF = 500 hours and MTTR = 2 hours, what is its availability? Express in nines.

MTTF = MTBF - MTTR = 498 hours. Availability = 498 / 500 = 0.996 = 99.6%. Nines = -log10(1 - 0.996) = -log10(0.004) = 2.40, so between 2 and 3 nines.

In this lesson

1. Prologue: Why Dependability Matters

2. The Economics of Dependability

The cost of failures

The cost of dependability

3. System Models for Dependability

What is a system?

State of a distributed system

State for recovery

System boundaries

4. Threat Models: Fault, Error, Failure

5. Interactive: Chain of Threats Explorer

6. Taxonomies of Faults

Source of Faults

Intent of Faults

Duration of Faults

Manifestation of Faults

Reproducibility of Faults

Relationship with Other Faults

Failure Models: The Tanenbaum and van Steen Classification

7. Failure Models: Fail-Stop, Fail-Safe, Fail-Fast

Fail-stop systems

Fail-safe systems

Fail-fast

8. Dependability Attributes: The Five Pillars

9. Availability: Quantitative View

Key metrics

Formal definition of availability

10. Interactive: Availability Nines Calculator

Availability Nines Calculator

11. Reliability: Continuous Operation

Definition

Availability does not equal reliability

12. Integrity, Maintainability, Safety

Integrity

Maintainability

Safety

13. Means to Achieve Dependability

14. Fault Avoidance, Detection, and Diagnosis

Fault avoidance

Fault detection and diagnosis

15. Fault Removal and Fault Tolerance

Fault removal

Fault tolerance

16. Masking Failure by Redundancy

Check Your Understanding