Distributed Systems — Prof. Omicini — University of Bologna, ISI LM

Definitions & Goals for Distributed Systems

DS-M4 • Academic Year 2025/2026

In this lesson

1. Prologue

Distributed systems are everywhere: the Web, cloud computing, IoT, peer-to-peer networks, online banking, multiplayer gaming. But what exactly is a distributed system? This lesson establishes a solid conceptual foundation by examining the most influential definitions and the engineering goals that have shaped the field over the last decades.

You will see that there is no single definition — different communities emphasise different aspects. The computer scientist looks at what users observe; the computer engineer looks at how the system is built; the middleware architect looks at how components communicate. Each perspective is valid and each reveals something essential.

Key idea

The goal of this lesson is exposure to the classical general view on distributed systems that the technical and scientific community has developed, and to understand the goals that guide distributed systems engineers.

2. Classic Definitions

User’s View (Computer Scientist Definition)

A distributed system is a collection of independent computers that appears to its users as a single coherent system. — Tanenbaum & van Steen, 2017

This is an observational, a posteriori definition: it describes what the user sees. The user does not care about the multiplicity of machines, the network topology, or the location of data. What matters is that the system “just works” as one coherent whole. This definition captures the illusion of unity that well-designed distributed systems provide.

Engineer’s View (Computer Engineer Definition)

A distributed system is a collection of autonomous computational entities conceived as a single coherent system by its designer. — Tanenbaum & van Steen, 2017

This is a constructive, a priori definition: it describes how the system is designed. The key word is conceived: the designer intentionally builds multiple autonomous entities with the explicit goal that they work together coherently. This perspective emphasises that distributedness is a design choice, not an accident.

Exam tip

Compare and contrast the user view vs. the engineer view. The user view is observational (what users experience); the engineer view is constructive (how designers build it). The user view is a posteriori (after the fact); the engineer view is a priori (before the fact).

3. Remarks on the Definitions

Several important observations cut across both definitions:

Watch out

Heterogeneity is not a bug to be eliminated — it is an inherent property of distributed systems. The goal is to manage heterogeneity, not to remove it.

4. Message-Passing Definition

A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages. — Coulouris et al., 2012

This definition foregrounds three essential features:

Editor’s note

The message-passing definition is particularly important because it highlights what is not available: shared memory, shared clock, reliable instantaneous communication. These absences are the root cause of most challenges in distributed systems design.

5. Middleware Architecture

How do we make many different computers work together as one coherent system? The classic architectural answer is middleware.

graph TB
    subgraph "Machine A"
        AppA["Application A"]
        MidA["Middleware"]
        OSA["OS + Hardware"]
    end
    subgraph "Machine B"
        AppB["Application B"]
        MidB["Middleware"]
        OSB["OS + Hardware"]
    end
    AppA --> MidA
    MidA --> OSA
    AppB --> MidB
    MidB --> OSB
    OSA <==> OSB

The middleware layer extends over multiple machines and offers every application the same interface. It achieves two things:

Key idea

Middleware is the principled solution to both collaboration (working together as one) and amalgamation (looking together as one). It provides separation of concerns: applications do not deal with distribution directly; middleware handles it.

6. Pitfalls of Distributed Systems

Building a distributed system looks easy — hardware, software, and networking components are readily available. However, distribution introduces problems rather than solving them at first glance. L. Peter Deutsch identified eight false assumptions that first-time distributed system developers make.

Reality check

These eight assumptions are all false in real distributed systems. Every experienced distributed systems engineer has been burned by each of them at some point. Recognising them is the first step to building robust systems.

These false assumptions relate to properties unique to distributed systems: network reliability, security, heterogeneity, topology, latency, bandwidth, transport cost, and administrative domains. In non-distributed (centralised) systems, these problems typically do not surface because the entire system runs under a single administrative domain with predictable hardware and no network in the critical path.

7. Goal: Resource Availability

Resources in any real-world setting are physically distributed: printers sit in different rooms, storage servers live in different datacentres, sensors are scattered across a geographic area. A fundamental reason to build a distributed system is to make these distributed resources available to users as if they belonged to a single system.

What is a resource?

Anything that could be connected to a computational system and that anyone could legitimately use: printers, scanners, storage devices, distributed sensors, databases, compute nodes, network bandwidth, and so on.

By enabling interaction between users and resources, distributed systems become enablers of sharing, information exchange, and collaboration. A prime example is grid computing, where computational resources across institutional boundaries are pooled to solve large-scale problems.

Editor’s note

Resource availability is the most basic goal: a distributed system must first work before we worry about how transparently or scalably it works.

8. Goal: Distribution Transparency

Physical distribution is often not a feature from the user’s perspective. Users want to use resources without caring where those resources are physically located or how they are implemented internally.

Transparency is the hiding of non-relevant properties of a system’s components and structure from the user’s perception. By hiding irrelevant details, distributed systems provide users with a higher level of abstraction.

Key idea

Transparency means the distributed system masks its own distributed nature whenever the user does not need to be aware of it. The system appears as a single, centralised system even though it is not.

9. Types of Transparency

There are several distinct types of transparency, each targeting a different aspect of distribution. Explore them in the interactive widget below.

These seven types of transparency are not independent — access transparency is foundational to most others, and failure transparency often depends on replication transparency. The ISO Reference Model for Open Distributed Processing (RM-ODP) formalises these into a standard taxonomy.

10. Access & Location Transparency

Access Transparency

Different resources represent data differently, have different internal structures, and expose different access protocols. Access transparency hides all of this heterogeneity behind a uniform, homogeneous interface. Whether a file is stored on a Linux server, a Windows NAS, or a cloud object store, the user accesses it through the same API.

Examples: NFS (Network File System) provides access transparency by making remote files appear as local files; SQL databases hide storage engine differences behind a standard query language.

Location Transparency

The physical location of a resource is often irrelevant to its use. Location transparency hides where a resource is physically situated. Resources are accessed through logical identifiers (names) that are not bound to physical network addresses.

The classic example is the URL: https://example.com/paper.pdf identifies a resource without revealing which server, datacentre, or country it lives in. The DNS system resolves the logical name to a physical address transparently.

Exam tip

Be able to distinguish access transparency (hiding how to access) from location transparency (hiding where the resource is). Access transparency is about homogeneous interfaces; location transparency is about logical naming.

11. Migration, Relocation & Replication Transparency

Migration Transparency

Resources and users can be mobile. A resource might move from one node to another, or a user might change their point of attachment. Migration transparency hides changes of location so that the system maintains coherence and functionality without disruption. Modern examples include virtual machine live migration in cloud environments and mobile users roaming between cell towers.

Relocation Transparency

While migration transparency handles the fact that a resource can move, relocation transparency handles the fact that a resource is moving right now. It is a specialised, online version of migration transparency: resources remain accessible even while they are changing location. This is significantly harder because both the old and new locations may need to cooperate during the transition.

Replication Transparency

Replication is used to improve performance (local copies reduce latency and bandwidth consumption) and fault tolerance (redundancy masks failures). Replication transparency hides the existence of multiple copies from users: all replicas share the same name and are kept in sufficiently consistent state that users perceive a single resource. Users should never need to know they are accessing a replica.

Challenge

Replication transparency is in tension with consistency. Keeping replicas identical requires synchronisation, which costs time and bandwidth. The level of transparency (how tightly replicas must match) is a design decision.

12. Concurrency & Failure Transparency

Concurrency Transparency

In a distributed system, users and resources work autonomously and concurrently. Two users might try to access the same resource at the same time. Concurrency transparency hides the fact that resources are shared among concurrent users. The system manages access policies and ensures that concurrent accesses do not corrupt resource state — and does so transparently.

The fundamental problem is consistency under concurrency: when multiple users read and write the same data simultaneously, the system must guarantee that the resource remains in a consistent state. This is typically achieved through locking, transactions, or optimistic concurrency control, all hidden from the user.

Failure Transparency

In a distributed system, anything can fail: a network link, a disk, a server process, a power supply. As Leslie Lamport famously noted: “You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.”

Failure transparency aims to mask failures and hide the recovery of resources from users. This is perhaps the hardest type of transparency because of the latency problem: how do you distinguish between a dead resource and a very slow one? Is “silence” from a resource caused by processing delay, by a deliberate choice to not respond, by resource failure, or by network failure? Without a bound on message delivery time, these are indistinguishable.

Key idea

Distribution turns failure from a binary (working/broken) into a spectrum (partial failure). While any single component can fail, other parts of the system likely keep working. The challenge is to mask partial failures so the overall system remains available.

13. Degree of Transparency

Hiding distribution is not always the best idea. There are situations where awareness of distribution is valuable:

Exam tip

The transparency trade-off: too much transparency hides useful information; too little transparency burdens the user with irrelevant details. Every engineer must find the right degree for their system, balancing performance, understandability, and user needs.

14. Goal: Openness

An open distributed system is one that can work with a number and variety of components that is not determined once and for all at design time. Open systems are designed to be extensible and are fundamentally unpredictable in terms of what components may join later.

How to Design for Unpredictability

Designing for openness requires predictable items that all components agree on:

IDLs capture syntax (the signatures of operations) but often do not specify semantics or the protocol (the expected order of interactions). This is a limitation: two components may agree on the interface syntax but still interoperate incorrectly if they expect different interaction sequences.

Non-Functional Properties of Open Systems

PropertyDefinition
InteroperabilityHow easily can one component or system work with different implementations based on the same standard specifications?
PortabilityHow easily can an application (or part of it) be moved to a different distributed system and keep working?
ExtensibilityHow easily can new components and functionality be added to an existing distributed system?

15. Goal: Scalability

A distributed system is scalable if it can handle growth along any relevant dimension without breaking.

Dimensions of Scalability (Neuman, 1994)

A system scales up when:

What Hinders Scalability?

Centralisation is the enemy of scalability:

Centralised ElementWhy it hurts scalability
Centralised servicesA single server for all users becomes a bottleneck as load grows
Centralised dataA single database for all components creates contention and limited throughput
Centralised algorithmsAlgorithms that assume complete information available in one place do not scale

Sometimes centralisation is necessary (security requirements, normative constraints, or optimal theoretical efficiency), but it should be avoided whenever possible.

Decentralised vs. Centralised Algorithms

The trouble with centralised algorithms in distributed systems (Raynal, 2013):

Features of decentralised algorithms (Kshemkalyani & Singhal, 2011):

FeatureDescription
No complete informationNo machine has complete knowledge of the system state
Local decision-makingMachines make decisions based only on locally available information
No single point of failureFailure of one machine does not ruin the algorithm
No global clockNo implicit assumption that a global clock exists

16. Scaling Techniques

Three fundamental techniques for achieving geographical scalability (Neuman, 1994):

1. Hiding Communication Latency

The basic idea: avoid wasting time waiting for remote responses. Use asynchronous communication whenever possible: send a request, continue working, and handle the response when it arrives via an interrupt or callback. This prevents the application from stalling.

Limitation: Some interactions are inherently synchronous (a web user waiting for a page). In these cases, code shipping is an alternative — send executable code to the client (e.g., JavaScript form validation) so the interaction can proceed locally without round trips.

sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: Submit form
    Note right of C: (a) Server checks form
Round-trip latency for each error S-->>C: Error - retry C->>S: Resubmit activate S S-->>C: OK deactivate S
sequenceDiagram
    participant C as Client
    participant S as Server
    Note right of C: (b) Client checks form
via shipped JavaScript C->>C: Validate locally C->>S: Submit valid form activate S S-->>C: OK deactivate S

2. Distribution

Take a component, split it into parts, and spread the parts across the system. The classic example is the Domain Name System (DNS):

graph TD
    Root["."]
    com["com"]
    org["org"]
    it["it"]
    example["example.com"]
    unibo["unibo.it"]
    apice["apice.unibo.it"]
    Root --> com
    Root --> org
    Root --> it
    com --> example
    it --> unibo
    unibo --> apice

3. Replication

When performance degrades, replicate components across the system to increase availability and reduce latency. Copies are placed near potential users. Caching is a special form of replication, but with a key difference:

AspectReplicationCaching
Who decidesOwner of the resourceClient of the resource
LifetimeLong-lived, managedShort-lived, may be evicted
ConsistencyStronger guaranteesWeaker guarantees
The consistency problem

Any form of duplication (replication or caching) introduces the consistency problem: copies can diverge. Inconsistency is technically unavoidable in a distributed setting because updates take time to propagate. The question is how much inconsistency the system can tolerate and how to hide it from users and components.

17. Goal: Situatedness

Editor’s note

Situatedness is not a classic distributed systems goal in the traditional literature. It appears here because modern systems (mobile, pervasive, IoT) have made it essential. The slides cite Suchman (2007) as the foundational reference.

Situatedness is the property of a system being immersed in its environment: capable of (timely) perceiving and producing environmental change, and suitably dealing with environmental events. Mobile, adaptive, and pervasive computing systems have made situatedness a key concern.

Context-Awareness: Space & Time

Any non-trivial system needs to know where it is working and when, in order to perform its function effectively. This is the spatio-temporal context. Examples: a GPS-guided drone, a smart traffic system that adjusts signals based on time of day and congestion, a disaster response system that coordinates rescue teams based on their locations.

Context-Awareness: Environment

Beyond space and time, systems need awareness of the broader environment: its nature, structure, available resources, and potential hazards. This includes understanding the physical environment (temperature, humidity, obstacles) and the social environment (people, organisations, policies).

Knowledge-Intensive Environments (KIE)

A particularly relevant class is knowledge-intensive environments: environments where large amounts of distributed knowledge are essential for system activity. Systems must access, understand, and potentially inject knowledge while interacting locally within the working environment.

Situatedness & Distributed Systems

Physical distribution of computational systems is essential to cope with the distributed nature of many working environments. When requirements mandate situated computation within a distributed physical environment, situated distributed systems are the only way out. Examples:

Openness of distributed situated systems is essential to deal with the unpredictability of complex environments. Scalability allows them to cope with working environments of growing complexity.

18. Conclusion & Open Questions

Lessons Learned

Open Questions

Questions to reflect on

What is middleware? We have seen it as an architectural layer, but how does it relate to modern microservices, service meshes, and serverless computing?

Are we ready to deal with resources and situatedness? The notion of “resource” is very general — can our current tools handle the full spectrum from hardware devices to knowledge artefacts in KIE?

What is the place of IoT here? Billions of devices, highly heterogeneous, geographically dispersed, resource-constrained — IoT is a challenging test case for all five goals.

Check Your Understanding

Define a distributed system from the user’s perspective. How does it differ from the engineer’s perspective?

User’s perspective (Tanenbaum & van Steen): “a collection of independent computers that appears to its users as a single coherent system” — observational, a posteriori. Engineer’s perspective: “a collection of autonomous computational entities conceived as a single coherent system by its designer” — constructive, a priori. The user view focuses on what is experienced; the engineer view focuses on how it is designed.

Explain the role of middleware in a distributed system. What two problems does it solve?

Middleware solves (1) collaboration — enabling meaningful interaction between autonomous distributed components by providing communication mechanisms; and (2) amalgamation — hiding differences in technology, structure, and behaviour by providing a common shared interface across all machines. It extends over multiple machines and offers each application the same interface.

List Deutsch’s eight false assumptions about distributed systems. Why are they dangerous?

(1) The network is reliable. (2) The network is secure. (3) The network is homogeneous. (4) The topology does not change. (5) Latency is zero. (6) Bandwidth is infinite. (7) Transport cost is zero. (8) There is one administrator. These are dangerous because they seem reasonable based on experience with non-distributed systems, but they are all false in real distributed deployments — building on them leads to fragile, unreliable systems.

Define the seven types of distribution transparency. Which one is the hardest to achieve and why?

Access, location, migration, relocation, replication, concurrency, failure. Failure transparency is arguably the hardest because of the latency problem: without a bound on message delivery time, a silent node could be dead or merely slow. These are indistinguishable in an asynchronous system, making perfect failure detection impossible.

Explain why full transparency is not always desirable. Give a concrete example.

Sometimes location awareness is useful. Example: a user downloading a large file might prefer to know whether the server is in the US, Japan, or Europe to estimate download time and cost. If the system hides the server location entirely, the user cannot make an informed choice. Another example: if a system hides time zone changes, a download might appear to finish before it started when a user crosses time zone boundaries.

What is openness in a distributed system? How do IDLs contribute to openness?

Openness is the property of a system to work with a number and variety of components that is not fixed at design time. IDLs (Interface Definition Languages) contribute by providing a standard way to specify the syntax of interfaces that components expose. However, IDLs typically capture syntax but not semantics or protocol, so interoperability still requires agreement on behaviour, not just signatures.

Describe the three dimensions of scalability. Give an example of a centralised design choice that hinders each dimension.

(1) Size scalability — number of users/resources grows. Hindered by a single central server for all users. (2) Geographical scalability — geographical distribution extends. Hindered by synchronous communication that requires waiting for remote responses across long distances. (3) Administrative scalability — number of administrative domains grows. Hindered by having a single administrator or a single security policy.

What are the four features of decentralised algorithms according to Kshemkalyani & Singhal?

(1) No machine has complete information about the system state. (2) Machines make decisions based only on local information. (3) Failure of one machine does not ruin the algorithm. (4) There is no implicit assumption that a global clock exists.

Define situatedness. Why is it relevant to modern distributed systems?

Situatedness is the property of a system being immersed in its environment: capable of (timely) perceiving and producing environmental change, and dealing with environmental events. It is relevant to modern distributed systems because mobile, adaptive, pervasive, and IoT systems must operate within and respond to their physical environment. Situatedness requires at minimum spatial and temporal context awareness.

Distinguish between caching and replication. Why does both introduce a consistency problem?

Caching is a client-driven copy decision (short-lived, may be evicted); replication is an owner-driven copy decision (long-lived, managed). Both introduce inconsistency because copies take time to propagate updates, so replicas can diverge from the original. The question is how much inconsistency the application can tolerate and how to hide it from users.