Distributed systems are everywhere: the Web, cloud computing, IoT, peer-to-peer networks, online banking, multiplayer gaming. But what exactly is a distributed system? This lesson establishes a solid conceptual foundation by examining the most influential definitions and the engineering goals that have shaped the field over the last decades.
You will see that there is no single definition — different communities emphasise different aspects. The computer scientist looks at what users observe; the computer engineer looks at how the system is built; the middleware architect looks at how components communicate. Each perspective is valid and each reveals something essential.
The goal of this lesson is exposure to the classical general view on distributed systems that the technical and scientific community has developed, and to understand the goals that guide distributed systems engineers.
A distributed system is a collection of independent computers that appears to its users as a single coherent system. — Tanenbaum & van Steen, 2017
This is an observational, a posteriori definition: it describes what the user sees. The user does not care about the multiplicity of machines, the network topology, or the location of data. What matters is that the system “just works” as one coherent whole. This definition captures the illusion of unity that well-designed distributed systems provide.
A distributed system is a collection of autonomous computational entities conceived as a single coherent system by its designer. — Tanenbaum & van Steen, 2017
This is a constructive, a priori definition: it describes how the system is designed. The key word is conceived: the designer intentionally builds multiple autonomous entities with the explicit goal that they work together coherently. This perspective emphasises that distributedness is a design choice, not an accident.
Compare and contrast the user view vs. the engineer view. The user view is observational (what users experience); the engineer view is constructive (how designers build it). The user view is a posteriori (after the fact); the engineer view is a priori (before the fact).
Several important observations cut across both definitions:
Heterogeneity is not a bug to be eliminated — it is an inherent property of distributed systems. The goal is to manage heterogeneity, not to remove it.
A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages. — Coulouris et al., 2012
This definition foregrounds three essential features:
The message-passing definition is particularly important because it highlights what is not available: shared memory, shared clock, reliable instantaneous communication. These absences are the root cause of most challenges in distributed systems design.
How do we make many different computers work together as one coherent system? The classic architectural answer is middleware.
graph TB
subgraph "Machine A"
AppA["Application A"]
MidA["Middleware"]
OSA["OS + Hardware"]
end
subgraph "Machine B"
AppB["Application B"]
MidB["Middleware"]
OSB["OS + Hardware"]
end
AppA --> MidA
MidA --> OSA
AppB --> MidB
MidB --> OSB
OSA <==> OSB
The middleware layer extends over multiple machines and offers every application the same interface. It achieves two things:
Middleware is the principled solution to both collaboration (working together as one) and amalgamation (looking together as one). It provides separation of concerns: applications do not deal with distribution directly; middleware handles it.
Building a distributed system looks easy — hardware, software, and networking components are readily available. However, distribution introduces problems rather than solving them at first glance. L. Peter Deutsch identified eight false assumptions that first-time distributed system developers make.
These eight assumptions are all false in real distributed systems. Every experienced distributed systems engineer has been burned by each of them at some point. Recognising them is the first step to building robust systems.
These false assumptions relate to properties unique to distributed systems: network reliability, security, heterogeneity, topology, latency, bandwidth, transport cost, and administrative domains. In non-distributed (centralised) systems, these problems typically do not surface because the entire system runs under a single administrative domain with predictable hardware and no network in the critical path.
Resources in any real-world setting are physically distributed: printers sit in different rooms, storage servers live in different datacentres, sensors are scattered across a geographic area. A fundamental reason to build a distributed system is to make these distributed resources available to users as if they belonged to a single system.
Anything that could be connected to a computational system and that anyone could legitimately use: printers, scanners, storage devices, distributed sensors, databases, compute nodes, network bandwidth, and so on.
By enabling interaction between users and resources, distributed systems become enablers of sharing, information exchange, and collaboration. A prime example is grid computing, where computational resources across institutional boundaries are pooled to solve large-scale problems.
Resource availability is the most basic goal: a distributed system must first work before we worry about how transparently or scalably it works.
Physical distribution is often not a feature from the user’s perspective. Users want to use resources without caring where those resources are physically located or how they are implemented internally.
Transparency is the hiding of non-relevant properties of a system’s components and structure from the user’s perception. By hiding irrelevant details, distributed systems provide users with a higher level of abstraction.
Transparency means the distributed system masks its own distributed nature whenever the user does not need to be aware of it. The system appears as a single, centralised system even though it is not.
There are several distinct types of transparency, each targeting a different aspect of distribution. Explore them in the interactive widget below.
These seven types of transparency are not independent — access transparency is foundational to most others, and failure transparency often depends on replication transparency. The ISO Reference Model for Open Distributed Processing (RM-ODP) formalises these into a standard taxonomy.
Different resources represent data differently, have different internal structures, and expose different access protocols. Access transparency hides all of this heterogeneity behind a uniform, homogeneous interface. Whether a file is stored on a Linux server, a Windows NAS, or a cloud object store, the user accesses it through the same API.
Examples: NFS (Network File System) provides access transparency by making remote files appear as local files; SQL databases hide storage engine differences behind a standard query language.
The physical location of a resource is often irrelevant to its use. Location transparency hides where a resource is physically situated. Resources are accessed through logical identifiers (names) that are not bound to physical network addresses.
The classic example is the URL: https://example.com/paper.pdf identifies a resource without revealing which server, datacentre, or country it lives in. The DNS system resolves the logical name to a physical address transparently.
Be able to distinguish access transparency (hiding how to access) from location transparency (hiding where the resource is). Access transparency is about homogeneous interfaces; location transparency is about logical naming.
Resources and users can be mobile. A resource might move from one node to another, or a user might change their point of attachment. Migration transparency hides changes of location so that the system maintains coherence and functionality without disruption. Modern examples include virtual machine live migration in cloud environments and mobile users roaming between cell towers.
While migration transparency handles the fact that a resource can move, relocation transparency handles the fact that a resource is moving right now. It is a specialised, online version of migration transparency: resources remain accessible even while they are changing location. This is significantly harder because both the old and new locations may need to cooperate during the transition.
Replication is used to improve performance (local copies reduce latency and bandwidth consumption) and fault tolerance (redundancy masks failures). Replication transparency hides the existence of multiple copies from users: all replicas share the same name and are kept in sufficiently consistent state that users perceive a single resource. Users should never need to know they are accessing a replica.
Replication transparency is in tension with consistency. Keeping replicas identical requires synchronisation, which costs time and bandwidth. The level of transparency (how tightly replicas must match) is a design decision.
In a distributed system, users and resources work autonomously and concurrently. Two users might try to access the same resource at the same time. Concurrency transparency hides the fact that resources are shared among concurrent users. The system manages access policies and ensures that concurrent accesses do not corrupt resource state — and does so transparently.
The fundamental problem is consistency under concurrency: when multiple users read and write the same data simultaneously, the system must guarantee that the resource remains in a consistent state. This is typically achieved through locking, transactions, or optimistic concurrency control, all hidden from the user.
In a distributed system, anything can fail: a network link, a disk, a server process, a power supply. As Leslie Lamport famously noted: “You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.”
Failure transparency aims to mask failures and hide the recovery of resources from users. This is perhaps the hardest type of transparency because of the latency problem: how do you distinguish between a dead resource and a very slow one? Is “silence” from a resource caused by processing delay, by a deliberate choice to not respond, by resource failure, or by network failure? Without a bound on message delivery time, these are indistinguishable.
Distribution turns failure from a binary (working/broken) into a spectrum (partial failure). While any single component can fail, other parts of the system likely keep working. The challenge is to mask partial failures so the overall system remains available.
Hiding distribution is not always the best idea. There are situations where awareness of distribution is valuable:
The transparency trade-off: too much transparency hides useful information; too little transparency burdens the user with irrelevant details. Every engineer must find the right degree for their system, balancing performance, understandability, and user needs.
An open distributed system is one that can work with a number and variety of components that is not determined once and for all at design time. Open systems are designed to be extensible and are fundamentally unpredictable in terms of what components may join later.
Designing for openness requires predictable items that all components agree on:
IDLs capture syntax (the signatures of operations) but often do not specify semantics or the protocol (the expected order of interactions). This is a limitation: two components may agree on the interface syntax but still interoperate incorrectly if they expect different interaction sequences.
| Property | Definition |
|---|---|
| Interoperability | How easily can one component or system work with different implementations based on the same standard specifications? |
| Portability | How easily can an application (or part of it) be moved to a different distributed system and keep working? |
| Extensibility | How easily can new components and functionality be added to an existing distributed system? |
A distributed system is scalable if it can handle growth along any relevant dimension without breaking.
A system scales up when:
Centralisation is the enemy of scalability:
| Centralised Element | Why it hurts scalability |
|---|---|
| Centralised services | A single server for all users becomes a bottleneck as load grows |
| Centralised data | A single database for all components creates contention and limited throughput |
| Centralised algorithms | Algorithms that assume complete information available in one place do not scale |
Sometimes centralisation is necessary (security requirements, normative constraints, or optimal theoretical efficiency), but it should be avoided whenever possible.
The trouble with centralised algorithms in distributed systems (Raynal, 2013):
Features of decentralised algorithms (Kshemkalyani & Singhal, 2011):
| Feature | Description |
|---|---|
| No complete information | No machine has complete knowledge of the system state |
| Local decision-making | Machines make decisions based only on locally available information |
| No single point of failure | Failure of one machine does not ruin the algorithm |
| No global clock | No implicit assumption that a global clock exists |
Three fundamental techniques for achieving geographical scalability (Neuman, 1994):
The basic idea: avoid wasting time waiting for remote responses. Use asynchronous communication whenever possible: send a request, continue working, and handle the response when it arrives via an interrupt or callback. This prevents the application from stalling.
Limitation: Some interactions are inherently synchronous (a web user waiting for a page). In these cases, code shipping is an alternative — send executable code to the client (e.g., JavaScript form validation) so the interaction can proceed locally without round trips.
sequenceDiagram
participant C as Client
participant S as Server
C->>S: Submit form
Note right of C: (a) Server checks form
Round-trip latency for each error
S-->>C: Error - retry
C->>S: Resubmit
activate S
S-->>C: OK
deactivate S
sequenceDiagram
participant C as Client
participant S as Server
Note right of C: (b) Client checks form
via shipped JavaScript
C->>C: Validate locally
C->>S: Submit valid form
activate S
S-->>C: OK
deactivate S
Take a component, split it into parts, and spread the parts across the system. The classic example is the Domain Name System (DNS):
graph TD
Root["."]
com["com"]
org["org"]
it["it"]
example["example.com"]
unibo["unibo.it"]
apice["apice.unibo.it"]
Root --> com
Root --> org
Root --> it
com --> example
it --> unibo
unibo --> apice
When performance degrades, replicate components across the system to increase availability and reduce latency. Copies are placed near potential users. Caching is a special form of replication, but with a key difference:
| Aspect | Replication | Caching |
|---|---|---|
| Who decides | Owner of the resource | Client of the resource |
| Lifetime | Long-lived, managed | Short-lived, may be evicted |
| Consistency | Stronger guarantees | Weaker guarantees |
Any form of duplication (replication or caching) introduces the consistency problem: copies can diverge. Inconsistency is technically unavoidable in a distributed setting because updates take time to propagate. The question is how much inconsistency the system can tolerate and how to hide it from users and components.
Situatedness is not a classic distributed systems goal in the traditional literature. It appears here because modern systems (mobile, pervasive, IoT) have made it essential. The slides cite Suchman (2007) as the foundational reference.
Situatedness is the property of a system being immersed in its environment: capable of (timely) perceiving and producing environmental change, and suitably dealing with environmental events. Mobile, adaptive, and pervasive computing systems have made situatedness a key concern.
Any non-trivial system needs to know where it is working and when, in order to perform its function effectively. This is the spatio-temporal context. Examples: a GPS-guided drone, a smart traffic system that adjusts signals based on time of day and congestion, a disaster response system that coordinates rescue teams based on their locations.
Beyond space and time, systems need awareness of the broader environment: its nature, structure, available resources, and potential hazards. This includes understanding the physical environment (temperature, humidity, obstacles) and the social environment (people, organisations, policies).
A particularly relevant class is knowledge-intensive environments: environments where large amounts of distributed knowledge are essential for system activity. Systems must access, understand, and potentially inject knowledge while interacting locally within the working environment.
Physical distribution of computational systems is essential to cope with the distributed nature of many working environments. When requirements mandate situated computation within a distributed physical environment, situated distributed systems are the only way out. Examples:
Openness of distributed situated systems is essential to deal with the unpredictability of complex environments. Scalability allows them to cope with working environments of growing complexity.
What is middleware? We have seen it as an architectural layer, but how does it relate to modern microservices, service meshes, and serverless computing?
Are we ready to deal with resources and situatedness? The notion of “resource” is very general — can our current tools handle the full spectrum from hardware devices to knowledge artefacts in KIE?
What is the place of IoT here? Billions of devices, highly heterogeneous, geographically dispersed, resource-constrained — IoT is a challenging test case for all five goals.
User’s perspective (Tanenbaum & van Steen): “a collection of independent computers that appears to its users as a single coherent system” — observational, a posteriori. Engineer’s perspective: “a collection of autonomous computational entities conceived as a single coherent system by its designer” — constructive, a priori. The user view focuses on what is experienced; the engineer view focuses on how it is designed.
Middleware solves (1) collaboration — enabling meaningful interaction between autonomous distributed components by providing communication mechanisms; and (2) amalgamation — hiding differences in technology, structure, and behaviour by providing a common shared interface across all machines. It extends over multiple machines and offers each application the same interface.
(1) The network is reliable. (2) The network is secure. (3) The network is homogeneous. (4) The topology does not change. (5) Latency is zero. (6) Bandwidth is infinite. (7) Transport cost is zero. (8) There is one administrator. These are dangerous because they seem reasonable based on experience with non-distributed systems, but they are all false in real distributed deployments — building on them leads to fragile, unreliable systems.
Access, location, migration, relocation, replication, concurrency, failure. Failure transparency is arguably the hardest because of the latency problem: without a bound on message delivery time, a silent node could be dead or merely slow. These are indistinguishable in an asynchronous system, making perfect failure detection impossible.
Sometimes location awareness is useful. Example: a user downloading a large file might prefer to know whether the server is in the US, Japan, or Europe to estimate download time and cost. If the system hides the server location entirely, the user cannot make an informed choice. Another example: if a system hides time zone changes, a download might appear to finish before it started when a user crosses time zone boundaries.
Openness is the property of a system to work with a number and variety of components that is not fixed at design time. IDLs (Interface Definition Languages) contribute by providing a standard way to specify the syntax of interfaces that components expose. However, IDLs typically capture syntax but not semantics or protocol, so interoperability still requires agreement on behaviour, not just signatures.
(1) Size scalability — number of users/resources grows. Hindered by a single central server for all users. (2) Geographical scalability — geographical distribution extends. Hindered by synchronous communication that requires waiting for remote responses across long distances. (3) Administrative scalability — number of administrative domains grows. Hindered by having a single administrator or a single security policy.
(1) No machine has complete information about the system state. (2) Machines make decisions based only on local information. (3) Failure of one machine does not ruin the algorithm. (4) There is no implicit assumption that a global clock exists.
Situatedness is the property of a system being immersed in its environment: capable of (timely) perceiving and producing environmental change, and dealing with environmental events. It is relevant to modern distributed systems because mobile, adaptive, pervasive, and IoT systems must operate within and respond to their physical environment. Situatedness requires at minimum spatial and temporal context awareness.
Caching is a client-driven copy decision (short-lived, may be evicted); replication is an owner-driven copy decision (long-lived, managed). Both introduce inconsistency because copies take time to propagate updates, so replicas can diverge from the original. The question is how much inconsistency the application can tolerate and how to hide it from users.