DS-M4

2. Classic Definitions

User’s View (Computer Scientist Definition)

A distributed system is a collection of independent computers that appears to its users as a single coherent system. — Tanenbaum & van Steen, 2017

This is an observational, a posteriori definition: it describes what the user sees. The user does not care about the multiplicity of machines, the network topology, or the location of data. What matters is that the system “just works” as one coherent whole. This definition captures the illusion of unity that well-designed distributed systems provide.

Engineer’s View (Computer Engineer Definition)

A distributed system is a collection of autonomous computational entities conceived as a single coherent system by its designer. — Tanenbaum & van Steen, 2017

This is a constructive, a priori definition: it describes how the system is designed. The key word is conceived: the designer intentionally builds multiple autonomous entities with the explicit goal that they work together coherently. This perspective emphasises that distributedness is a design choice, not an accident.

Exam tip

Compare and contrast the user view vs. the engineer view. The user view is observational (what users experience); the engineer view is constructive (how designers build it). The user view is a posteriori (after the fact); the engineer view is a priori (before the fact).

5. Middleware Architecture

How do we make many different computers work together as one coherent system? The classic architectural answer is middleware.

graph TB
    subgraph "Machine A"
        AppA["Application A"]
        MidA["Middleware"]
        OSA["OS + Hardware"]
    end
    subgraph "Machine B"
        AppB["Application B"]
        MidB["Middleware"]
        OSB["OS + Hardware"]
    end
    AppA --> MidA
    MidA --> OSA
    AppB --> MidB
    MidB --> OSB
    OSA <==> OSB

The middleware layer extends over multiple machines and offers every application the same interface. It achieves two things:

Collaboration (coherence) — by enabling meaningful interaction between autonomous components through well-defined communication mechanisms (message formats, protocols, semantics).
Amalgamation (uniformity) — by hiding differences in technology, structure, and behaviour behind a common shared interface, just as an operating system hides hardware differences from applications.

Key idea

Middleware is the principled solution to both collaboration (working together as one) and amalgamation (looking together as one). It provides separation of concerns: applications do not deal with distribution directly; middleware handles it.
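
The separation of concerns described above can be sketched in a few lines. This is a minimal illustration, not a real middleware: the names `Transport`, `InMemoryTransport`, and `Middleware` are hypothetical, and JSON is only assumed as the agreed message format. The application calls the same `send`/`receive` interface whatever carries the messages underneath.

```python
import json
from abc import ABC, abstractmethod
from collections import deque


class Transport(ABC):
    """Hypothetical low-level channel; each machine may use a different one."""

    @abstractmethod
    def put(self, raw: bytes) -> None: ...

    @abstractmethod
    def get(self) -> bytes: ...


class InMemoryTransport(Transport):
    """Stand-in for a real network link, so the sketch runs without sockets."""

    def __init__(self) -> None:
        self._queue = deque()

    def put(self, raw: bytes) -> None:
        self._queue.append(raw)

    def get(self) -> bytes:
        return self._queue.popleft()


class Middleware:
    """Offers every application the same interface, whatever the transport."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def send(self, operation: str, payload: dict) -> None:
        # Collaboration: one agreed message format (JSON is an assumption here).
        message = {"op": operation, "payload": payload}
        self._transport.put(json.dumps(message).encode("utf-8"))

    def receive(self):
        # Amalgamation: the application never sees bytes or transport details.
        message = json.loads(self._transport.get().decode("utf-8"))
        return message["op"], message["payload"]


link = InMemoryTransport()
machine_a = Middleware(link)
machine_b = Middleware(link)
machine_a.send("print", {"document": "report.pdf"})
operation, payload = machine_b.receive()
```

Swapping `InMemoryTransport` for a socket-based transport would not change the two application-level calls at the bottom, which is the point of the layer.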

6. Pitfalls of Distributed Systems

Building a distributed system looks easy: hardware, software, and networking components are readily available. At first, however, distribution introduces more problems than it solves. L. Peter Deutsch identified eight false assumptions that first-time distributed system developers tend to make.

Reality check

These eight assumptions are all false in real distributed systems. Every experienced distributed systems engineer has been burned by each of them at some point. Recognising them is the first step to building robust systems.

These false assumptions relate to properties unique to distributed systems: network reliability, security, heterogeneity, topology, latency, bandwidth, transport cost, and administrative domains. In non-distributed (centralised) systems, these problems typically do not surface because the entire system runs under a single administrative domain with predictable hardware and no network in the critical path.
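
As one illustration of what dropping the first assumption ("the network is reliable") looks like in code, the sketch below retries a failing call with exponential backoff. Everything here is hypothetical: `call_with_retries`, `FlakyService`, and the delay values are made up for the example, and the `sleep` parameter is injected only so the sketch can run without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an unreliable call, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Give up: the caller must handle the failure.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class FlakyService:
    """Hypothetical remote service that drops the first two requests."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.calls <= 2:
            raise ConnectionError("request lost")
        return "payload"


service = FlakyService()
waits = []  # records the delays instead of sleeping
result = call_with_retries(service.fetch, attempts=3, sleep=waits.append)
```

Blind retries are only safe when the remote operation is idempotent; otherwise a request that was executed but whose reply was lost would be applied twice.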

7. Goal: Resource Availability

Resources in any real-world setting are physically distributed: printers sit in different rooms, storage servers live in different datacentres, sensors are scattered across a geographic area. A fundamental reason to build a distributed system is to make these distributed resources available to users as if they belonged to a single system.

What is a resource?

Anything that could be connected to a computational system and that anyone could legitimately use: printers, scanners, storage devices, distributed sensors, databases, compute nodes, network bandwidth, and so on.

By enabling interaction between users and resources, distributed systems become enablers of sharing, information exchange, and collaboration. A prime example is grid computing, where computational resources across institutional boundaries are pooled to solve large-scale problems.

Editor’s note

Resource availability is the most basic goal: a distributed system must first work before we worry about how transparently or scalably it works.

10. Access & Location Transparency

Access Transparency

Different resources represent data differently, have different internal structures, and expose different access protocols. Access transparency hides all of this heterogeneity behind a uniform, homogeneous interface. Whether a file is stored on a Linux server, a Windows NAS, or a cloud object store, the user accesses it through the same API.

Examples: NFS (Network File System) provides access transparency by making remote files appear as local files; SQL databases hide storage engine differences behind a standard query language.
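
A small sketch of the idea, under the assumption of two hypothetical backends that represent the same value differently (raw bytes versus decimal text). The class names are invented for illustration; the point is that `increment` is written once against the uniform interface.

```python
from abc import ABC, abstractmethod


class CounterResource(ABC):
    """Uniform interface: callers only ever see Python ints."""

    @abstractmethod
    def read(self) -> int: ...

    @abstractmethod
    def write(self, value: int) -> None: ...


class BinaryCounter(CounterResource):
    """Hypothetical backend storing the value as 4 big-endian bytes."""

    def __init__(self) -> None:
        self._raw = (0).to_bytes(4, "big")

    def read(self) -> int:
        return int.from_bytes(self._raw, "big")

    def write(self, value: int) -> None:
        self._raw = value.to_bytes(4, "big")


class TextCounter(CounterResource):
    """Hypothetical backend storing the value as decimal text."""

    def __init__(self) -> None:
        self._text = "0"

    def read(self) -> int:
        return int(self._text)

    def write(self, value: int) -> None:
        self._text = str(value)


def increment(resource: CounterResource) -> int:
    # The caller is identical for both backends: that is access transparency.
    resource.write(resource.read() + 1)
    return resource.read()


results = [increment(backend) for backend in (BinaryCounter(), TextCounter())]
```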

Location Transparency

The physical location of a resource is often irrelevant to its use. Location transparency hides where a resource is physically situated. Resources are accessed through logical identifiers (names) that are not bound to physical network addresses.

The classic example is the URL: https://example.com/paper.pdf identifies a resource without revealing which server, datacentre, or country it lives in. DNS transparently resolves the logical name to a physical address.
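
The role of logical names can be sketched with a toy registry. `NameService` and `fetch` are hypothetical, and the addresses are made-up private ones; this is not how DNS is implemented, only an illustration of callers depending on a name rather than an address.

```python
class NameService:
    """Hypothetical registry mapping logical names to (host, port) addresses."""

    def __init__(self) -> None:
        self._bindings = {}

    def bind(self, name: str, address: tuple) -> None:
        self._bindings[name] = address

    def resolve(self, name: str) -> tuple:
        return self._bindings[name]


def fetch(names: NameService, name: str) -> str:
    # The caller supplies only the logical name; the address is looked up here.
    host, port = names.resolve(name)
    return f"GET {name} via {host}:{port}"


names = NameService()
names.bind("paper.pdf", ("10.0.0.5", 8080))
before = fetch(names, "paper.pdf")

# The resource is rebound to another machine; the caller's code is unchanged.
names.bind("paper.pdf", ("10.0.1.9", 8080))
after = fetch(names, "paper.pdf")
```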

Exam tip

Be able to distinguish access transparency (hiding how to access) from location transparency (hiding where the resource is). Access transparency is about homogeneous interfaces; location transparency is about logical naming.

11. Migration, Relocation & Replication Transparency

Migration Transparency

Resources and users can be mobile. A resource might move from one node to another, or a user might change their point of attachment. Migration transparency hides changes of location so that the system maintains coherence and functionality without disruption. Modern examples include virtual machine live migration in cloud environments and mobile users roaming between cell towers.

Relocation Transparency

While migration transparency handles the fact that a resource can move, relocation transparency handles the fact that a resource is moving right now. It is a specialised, online version of migration transparency: resources remain accessible even while they are changing location. This is significantly harder because both the old and new locations may need to cooperate during the transition.

Replication Transparency

Replication is used to improve performance (local copies reduce latency and bandwidth consumption) and fault tolerance (redundancy masks failures). Replication transparency hides the existence of multiple copies from users: all replicas share the same name and are kept in a sufficiently consistent state that users perceive a single resource. Users should never need to know they are accessing a replica.

Challenge

Replication transparency is in tension with consistency. Keeping replicas identical requires synchronisation, which costs time and bandwidth. The level of transparency (how tightly replicas must match) is a design decision.
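
The trade-off can be made concrete with a toy store. `ReplicatedStore` is a hypothetical class written for this note: a synchronous write updates every copy before returning, while an asynchronous write returns early and leaves the other copies temporarily stale. The replica index is exposed in `read` only so that the divergence can be observed; a transparent system would choose the replica itself.

```python
from typing import Optional


class ReplicatedStore:
    """Hypothetical store that hides its copies behind one interface."""

    def __init__(self, replica_count: int) -> None:
        self._replicas = [{} for _ in range(replica_count)]
        self._pending = []  # updates not yet propagated to secondaries

    def write(self, key: str, value: str, synchronous: bool) -> None:
        self._replicas[0][key] = value  # replica 0 acts as the primary
        if synchronous:
            for replica in self._replicas[1:]:
                replica[key] = value  # the caller waits for every copy
        else:
            self._pending.append((key, value))  # faster, but copies diverge

    def propagate(self) -> None:
        """Apply queued updates to the secondary replicas."""
        for key, value in self._pending:
            for replica in self._replicas[1:]:
                replica[key] = value
        self._pending.clear()

    def read(self, key: str, replica_index: int) -> Optional[str]:
        return self._replicas[replica_index].get(key)


store = ReplicatedStore(3)
store.write("x", "1", synchronous=True)
store.write("x", "2", synchronous=False)
fresh = store.read("x", 0)      # the primary already has the new value
stale = store.read("x", 2)      # a secondary still serves the old one
store.propagate()
converged = store.read("x", 2)  # after propagation the copies agree again
```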

12. Concurrency & Failure Transparency

Concurrency Transparency

In a distributed system, users and resources work autonomously and concurrently. Two users might try to access the same resource at the same time. Concurrency transparency hides the fact that resources are shared among concurrent users. The system manages access policies and ensures that concurrent accesses do not corrupt resource state — and does so transparently.

The fundamental problem is consistency under concurrency: when multiple users read and write the same data simultaneously, the system must guarantee that the resource remains in a consistent state. This is typically achieved through locking, transactions, or optimistic concurrency control, all hidden from the user.
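
Of the three mechanisms named above, optimistic concurrency control is perhaps the easiest to sketch. The version below assumes a hypothetical `VersionedCell` resource: every write must state which version it was based on, and a write based on stale data is rejected instead of silently overwriting someone else's update.

```python
import threading


class VersionConflict(Exception):
    """Raised when the data changed since the caller last read it."""


class VersionedCell:
    """Hypothetical shared resource guarded by a version number."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # makes check-and-write atomic

    def read(self):
        with self._lock:
            return self._value, self._version

    def write(self, new_value: int, expected_version: int) -> None:
        with self._lock:
            if expected_version != self._version:
                raise VersionConflict("stale read, retry")
            self._value = new_value
            self._version += 1


def add(cell: VersionedCell, delta: int) -> None:
    """Read, compute, write; start over if another writer got in first."""
    while True:
        value, version = cell.read()
        try:
            cell.write(value + delta, version)
            return
        except VersionConflict:
            continue


balance = VersionedCell(100)
_, stale_version = balance.read()      # one client reads version 0
add(balance, 10)                       # another client updates first
try:
    balance.write(105, stale_version)  # the stale write is rejected
    lost_update = True
except VersionConflict:
    lost_update = False
add(balance, 5)                        # retrying on fresh data succeeds
final_value, final_version = balance.read()
```

Without the version check, the stale write would have set the balance to 105 and the first client's update would have been lost.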

Failure Transparency

In a distributed system, anything can fail: a network link, a disk, a server process, a power supply. As Leslie Lamport famously noted: “You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.”

Failure transparency aims to mask failures and hide the recovery of resources from users. This is perhaps the hardest type of transparency because of the latency problem: how do you distinguish between a dead resource and a very slow one? Is “silence” from a resource caused by processing delay, by a deliberate choice to not respond, by resource failure, or by network failure? Without a bound on message delivery time, these are indistinguishable.
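
The ambiguity shows up as soon as one tries to write a failure detector. The sketch below is a hypothetical timeout-based detector (`TimeoutFailureDetector` is an invented name, and timestamps are passed in explicitly so the example is deterministic). It can only report suspicion: a node that is merely slow is suspected exactly like one that has crashed.

```python
class TimeoutFailureDetector:
    """Hypothetical detector: suspects any node silent for longer than timeout."""

    def __init__(self, timeout: float) -> None:
        self._timeout = timeout
        self._last_heard = {}  # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        self._last_heard[node] = now

    def suspected(self, now: float) -> set:
        return {
            node
            for node, last in self._last_heard.items()
            if now - last > self._timeout
        }


detector = TimeoutFailureDetector(timeout=2.0)
detector.heartbeat("crashed", now=0.0)
detector.heartbeat("slow", now=0.0)

# At t=3 both nodes have been silent for too long: they look identical.
suspects_at_3 = detector.suspected(now=3.0)

# The slow node's delayed heartbeat finally arrives; the suspicion was wrong.
detector.heartbeat("slow", now=3.5)
suspects_at_4 = detector.suspected(now=4.0)
```

A longer timeout reduces false suspicions but delays the detection of real crashes; no fixed value removes the ambiguity when message delays are unbounded.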

Key idea

Distribution turns failure from a binary (working/broken) into a spectrum (partial failure). While any single component can fail, other parts of the system likely keep working. The challenge is to mask partial failures so the overall system remains available.

14. Goal: Openness

An open distributed system is one that can work with a number and variety of components that is not determined once and for all at design time. Open systems are designed to be extensible and are fundamentally unpredictable in terms of what components may join later.

How to Design for Unpredictability

Designing for openness requires predictable items that all components agree on:

Standard rules for service syntax and semantics
Standard message interchange formats and protocols
Interface Definition Languages (IDLs) to specify how interfaces are defined

IDLs capture syntax (the signatures of operations) but often do not specify semantics or the protocol (the expected order of interactions). This is a limitation: two components may agree on the interface syntax but still interoperate incorrectly if they expect different interaction sequences.
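
The gap between syntax and semantics can be illustrated with Python's `typing.Protocol`, used here as a stand-in for an IDL. The sensor classes and the 30-degree threshold are invented for the example. Both sensors satisfy the declared interface, yet the client misinterprets one of them because the interface says nothing about units.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Thermometer(Protocol):
    """Interface definition: only the operation signature is specified."""

    def read(self) -> float: ...


class CelsiusSensor:
    def read(self) -> float:
        return 25.0  # degrees Celsius


class FahrenheitSensor:
    def read(self) -> float:
        return 77.0  # the same temperature, in degrees Fahrenheit


def is_too_hot(sensor: Thermometer) -> bool:
    # The client silently assumes Celsius; the interface cannot express this.
    return sensor.read() > 30.0


both_conform = all(
    isinstance(s, Thermometer) for s in (CelsiusSensor(), FahrenheitSensor())
)
celsius_alarm = is_too_hot(CelsiusSensor())        # correct: 25 is below 30
fahrenheit_alarm = is_too_hot(FahrenheitSensor())  # wrong: 77 F equals 25 C
```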

Non-Functional Properties of Open Systems

Property	Definition
Interoperability	How easily can one component or system work with different implementations based on the same standard specifications?
Portability	How easily can an application (or part of it) be moved to a different distributed system and keep working?
Extensibility	How easily can new components and functionality be added to an existing distributed system?

15. Goal: Scalability

A distributed system is scalable if it can handle growth along any relevant dimension without a noticeable loss of performance or increase in administrative complexity.

Dimensions of Scalability (Neuman, 1994)

A system grows in scale when:

The number of users and resources grows (size scalability)
The geographical distribution of users and resources extends (geographical scalability)
The system spans a growing number of distinct administrative domains (administrative scalability)

What Hinders Scalability?

Centralisation is the enemy of scalability:

Centralised Element	Why it hurts scalability
Centralised services	A single server for all users becomes a bottleneck as load grows
Centralised data	A single database for all components creates contention and limited throughput
Centralised algorithms	Algorithms that assume complete information available in one place do not scale

Sometimes centralisation is necessary (security requirements, normative constraints, or optimal theoretical efficiency), but it should be avoided whenever possible.

Decentralised vs. Centralised Algorithms

The trouble with centralised algorithms in distributed systems (Raynal, 2013):

Data must flow from the whole network to and from a single location, overloading the network
Any transmission problem affects the entire algorithm

Features of decentralised algorithms (Kshemkalyani & Singhal, 2011):

Feature	Description
No complete information	No machine has complete knowledge of the system state
Local decision-making	Machines make decisions based only on locally available information
No single point of failure	Failure of one machine does not ruin the algorithm
No global clock	No implicit assumption that a global clock exists
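
Gossip-based averaging is a common textbook illustration of these features; it is used here as an example and is not taken from the cited sources. In the sketch, each node holds one number and repeatedly averages it with a ring neighbour. No node ever sees all values, decisions use only local information, and exchanges happen in arbitrary order, yet every node ends up close to the global average. The shared list and the seeded random generator are artefacts of simulating the nodes in one process, and node failure is not modelled.

```python
import random


def gossip_step(values: list, rng: random.Random) -> None:
    """One exchange: a random node averages its value with its ring neighbour."""
    i = rng.randrange(len(values))
    j = (i + 1) % len(values)
    mean = (values[i] + values[j]) / 2
    values[i] = values[j] = mean  # both nodes adopt the pairwise average


rng = random.Random(0)  # fixed seed so the run is repeatable
values = [10.0, 20.0, 30.0, 40.0, 50.0]  # one value per node, average 30
for _ in range(300):
    gossip_step(values, rng)
```

Each pairwise exchange preserves the sum of all values, which is why the nodes converge on the true average rather than on some arbitrary number.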

16. Scaling Techniques

Three fundamental scaling techniques (Neuman, 1994). The first mainly addresses geographical scalability:

1. Hiding Communication Latency

The basic idea: avoid wasting time waiting for remote responses. Use asynchronous communication whenever possible: send a request, continue working, and handle the response when it arrives via an interrupt or callback. This prevents the application from stalling.
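
The effect can be sketched with Python's `asyncio`. The `remote_call` coroutine and the 50 ms delay are assumptions standing in for a real network round trip. Issuing the three requests concurrently means the waits overlap instead of adding up.

```python
import asyncio
import time

LATENCY = 0.05  # assumed round-trip delay in seconds


async def remote_call(name: str) -> str:
    await asyncio.sleep(LATENCY)  # stands in for waiting on the network
    return f"reply from {name}"


async def sequential(names):
    # Each request waits for the previous reply before being sent.
    return [await remote_call(name) for name in names]


async def concurrent(names):
    # All requests are in flight at once; replies are handled as they arrive.
    return list(await asyncio.gather(*(remote_call(name) for name in names)))


def timed(strategy, names):
    start = time.perf_counter()
    replies = asyncio.run(strategy(names))
    return replies, time.perf_counter() - start


servers = ["a", "b", "c"]
seq_replies, seq_time = timed(sequential, servers)
con_replies, con_time = timed(concurrent, servers)
```

With these assumed values the sequential version takes roughly three round trips and the concurrent one roughly one, while both return the same replies in the same order.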

Limitation: Some interactions are inherently synchronous (a web user waiting for a page). In these cases, code shipping is an alternative — send executable code to the client (e.g., JavaScript form validation) so the interaction can proceed locally without round trips.

sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: Submit form
    Note right of C: (a) Server checks form<br/>Round-trip latency for each error
    S-->>C: Error - retry
    C->>S: Resubmit
    activate S
    S-->>C: OK
    deactivate S

sequenceDiagram
    participant C as Client
    participant S as Server
    Note right of C: (b) Client checks form<br/>via shipped JavaScript
    C->>C: Validate locally
    C->>S: Submit valid form
    activate S
    S-->>C: OK
    deactivate S

2. Distribution

Take a component, split it into parts, and spread the parts across the system. The classic example is the Domain Name System (DNS):

The DNS namespace is hierarchically organised into a tree of domains
Domains are divided into non-overlapping zones
Each zone is served by a single (authoritative) server
The naming service is distributed across many machines without centralisation

graph TD
    Root["."]
    com["com"]
    org["org"]
    it["it"]
    example["example.com"]
    unibo["unibo.it"]
    apice["apice.unibo.it"]
    Root --> com
    Root --> org
    Root --> it
    com --> example
    it --> unibo
    unibo --> apice
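
How a lookup walks such a tree can be sketched as follows. This is a simplified model, not the DNS protocol: `ZoneServer` and `resolve` are hypothetical, the zone layout is assumed from the diagram, and the address is taken from a range reserved for documentation. Each server knows only its own records and which child zones it delegates.

```python
class ZoneServer:
    """Hypothetical authoritative server for a single zone."""

    def __init__(self, zone: str) -> None:
        self.zone = zone
        self.records = {}      # name -> address, for names in this zone
        self.delegations = {}  # child zone -> ZoneServer responsible for it

    def lookup(self, name: str):
        if name in self.records:
            return self.records[name]  # authoritative answer
        for child_zone, server in self.delegations.items():
            if name == child_zone or name.endswith("." + child_zone):
                return server  # referral to the child zone's server
        raise KeyError(name)


def resolve(root: ZoneServer, name: str):
    """Follow referrals from the root until some server gives an address."""
    server, visited = root, [root.zone]
    while True:
        answer = server.lookup(name)
        if isinstance(answer, str):
            return answer, visited
        server = answer
        visited.append(server.zone)


root = ZoneServer(".")
it_zone = ZoneServer("it")
unibo_zone = ZoneServer("unibo.it")
root.delegations["it"] = it_zone
it_zone.delegations["unibo.it"] = unibo_zone
unibo_zone.records["apice.unibo.it"] = "192.0.2.10"  # documentation address

address, visited = resolve(root, "apice.unibo.it")
```

No single server holds the whole namespace: the root only knows who handles `it`, and the `it` server only knows who handles `unibo.it`.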

3. Replication

When performance degrades, replicate components across the system to increase availability and reduce latency. Copies are placed near potential users. Caching is a special form of replication, but with a key difference:

Aspect	Replication	Caching
Who decides	Owner of the resource	Client of the resource
Lifetime	Long-lived, managed	Short-lived, may be evicted
Consistency	Stronger guarantees	Weaker guarantees

The consistency problem

Any form of duplication (replication or caching) introduces the consistency problem: copies can diverge. Inconsistency is technically unavoidable in a distributed setting because updates take time to propagate. The question is how much inconsistency the system can tolerate and how to hide it from users and components.
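
A client-side cache with a time-to-live makes the trade-off visible. `TTLCache` below is a hypothetical sketch, with time passed in explicitly so the behaviour is deterministic. Until an entry expires, the client keeps seeing the old value even though the origin has changed.

```python
class TTLCache:
    """Hypothetical client-side cache: entries expire after ttl seconds."""

    def __init__(self, fetch, ttl: float) -> None:
        self._fetch = fetch  # function that reads from the origin
        self._ttl = ttl
        self._entries = {}   # key -> (value, expiry time)

    def get(self, key: str, now: float) -> str:
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # served locally, possibly stale
        value = self._fetch(key)  # expired or missing: go to the origin
        self._entries[key] = (value, now + self._ttl)
        return value


origin = {"page": "v1"}
cache = TTLCache(origin.__getitem__, ttl=10.0)

first = cache.get("page", now=0.0)   # fetched from the origin
origin["page"] = "v2"                # the origin is updated
stale = cache.get("page", now=5.0)   # still within the TTL: old value
fresh = cache.get("page", now=10.0)  # TTL expired: refetched
```

The TTL is the tolerated inconsistency window: a shorter value gives fresher data at the cost of more requests to the origin.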

17. Goal: Situatedness

Editor’s note

Situatedness is not a classic distributed systems goal in the traditional literature. It appears here because modern systems (mobile, pervasive, IoT) have made it essential. The slides cite Suchman (2007) as the foundational reference.

Situatedness is the property of a system being immersed in its environment: capable of perceiving and producing environmental change in a timely manner, and of dealing suitably with environmental events. Mobile, adaptive, and pervasive computing systems have made situatedness a key concern.

Context-Awareness: Space & Time

Any non-trivial system needs to know where it is working and when, in order to perform its function effectively. This is the spatio-temporal context. Examples: a GPS-guided drone, a smart traffic system that adjusts signals based on time of day and congestion, a disaster response system that coordinates rescue teams based on their locations.

Context-Awareness: Environment

Beyond space and time, systems need awareness of the broader environment: its nature, structure, available resources, and potential hazards. This includes understanding the physical environment (temperature, humidity, obstacles) and the social environment (people, organisations, policies).

Knowledge-Intensive Environments (KIE)

A particularly relevant class is knowledge-intensive environments: environments where large amounts of distributed knowledge are essential for system activity. Systems must access, understand, and potentially inject knowledge while interacting locally within the working environment.

Situatedness & Distributed Systems

Physical distribution of computational systems is essential to cope with the distributed nature of many working environments. When requirements mandate situated computation within a distributed physical environment, situated distributed systems are the only viable option. Examples:

Disaster recovery scenarios (rescue teams, sensors, drones)
Environmental monitoring (distributed sensors collecting data)
Crowd steering and management
Live event coordination

Openness of distributed situated systems is essential to deal with the unpredictability of complex environments. Scalability allows them to cope with working environments of growing complexity.

18. Conclusion & Open Questions

Lessons Learned

There are several converging definitions for distributed systems, each highlighting different aspects (user view, engineer view, message-passing view).
Distributed systems are easy to build — which means they are easy to misbuild. Deutsch’s pitfalls remind us that assumptions that work in centralised systems break in distributed ones.
The five design goals — resource availability, transparency, openness, scalability, and situatedness — capture what makes a distributed system good, not just functional.
Modern concerns increasingly emphasise robustness, reliability, and fault tolerance alongside the classical goals.

Open Questions

Questions to reflect on

What is middleware? We have seen it as an architectural layer, but how does it relate to modern microservices, service meshes, and serverless computing?

Are we ready to deal with resources and situatedness? The notion of “resource” is very general — can our current tools handle the full spectrum from hardware devices to knowledge artefacts in KIE?

What is the place of IoT here? Billions of devices, highly heterogeneous, geographically dispersed, resource-constrained — IoT is a challenging test case for all five goals.

Check Your Understanding

Define a distributed system from the user’s perspective. How does it differ from the engineer’s perspective?

User’s perspective (Tanenbaum & van Steen): “a collection of independent computers that appears to its users as a single coherent system” — observational, a posteriori. Engineer’s perspective: “a collection of autonomous computational entities conceived as a single coherent system by its designer” — constructive, a priori. The user view focuses on what is experienced; the engineer view focuses on how it is designed.

Explain the role of middleware in a distributed system. What two problems does it solve?

Middleware solves (1) collaboration — enabling meaningful interaction between autonomous distributed components by providing communication mechanisms; and (2) amalgamation — hiding differences in technology, structure, and behaviour by providing a common shared interface across all machines. It extends over multiple machines and offers each application the same interface.

List Deutsch’s eight false assumptions about distributed systems. Why are they dangerous?

(1) The network is reliable. (2) The network is secure. (3) The network is homogeneous. (4) The topology does not change. (5) Latency is zero. (6) Bandwidth is infinite. (7) Transport cost is zero. (8) There is one administrator. These are dangerous because they seem reasonable based on experience with non-distributed systems, but they are all false in real distributed deployments — building on them leads to fragile, unreliable systems.

Define the seven types of distribution transparency. Which one is the hardest to achieve and why?

Access, location, migration, relocation, replication, concurrency, failure. Failure transparency is arguably the hardest because of the latency problem: without a bound on message delivery time, a silent node could be dead or merely slow. These are indistinguishable in an asynchronous system, making perfect failure detection impossible.

Explain why full transparency is not always desirable. Give a concrete example.

Sometimes location awareness is useful. Example: a user downloading a large file might prefer to know whether the server is in the US, Japan, or Europe to estimate download time and cost. If the system hides the server location entirely, the user cannot make an informed choice. Another example: if a system hides time zone changes, a download might appear to finish before it started when a user crosses time zone boundaries.

What is openness in a distributed system? How do IDLs contribute to openness?

Openness is the property of a system to work with a number and variety of components that is not fixed at design time. IDLs (Interface Definition Languages) contribute by providing a standard way to specify the syntax of interfaces that components expose. However, IDLs typically capture syntax but not semantics or protocol, so interoperability still requires agreement on behaviour, not just signatures.

Describe the three dimensions of scalability. Give an example of a centralised design choice that hinders each dimension.

(1) Size scalability — number of users/resources grows. Hindered by a single central server for all users. (2) Geographical scalability — geographical distribution extends. Hindered by synchronous communication that requires waiting for remote responses across long distances. (3) Administrative scalability — number of administrative domains grows. Hindered by having a single administrator or a single security policy.

What are the four features of decentralised algorithms according to Kshemkalyani & Singhal?

(1) No machine has complete information about the system state. (2) Machines make decisions based only on local information. (3) Failure of one machine does not ruin the algorithm. (4) There is no implicit assumption that a global clock exists.

Define situatedness. Why is it relevant to modern distributed systems?

Situatedness is the property of a system being immersed in its environment: capable of perceiving and producing environmental change in a timely manner, and of dealing with environmental events. It is relevant to modern distributed systems because mobile, adaptive, pervasive, and IoT systems must operate within and respond to their physical environment. Situatedness requires at minimum spatial and temporal context awareness.

Distinguish between caching and replication. Why do both introduce a consistency problem?

Caching is a client-driven copy decision (short-lived, may be evicted); replication is an owner-driven copy decision (long-lived, managed). Both introduce inconsistency because updates take time to propagate, so copies can diverge from the original. The question is how much inconsistency the application can tolerate and how to hide it from users.
