Distributed Systems — Prof. Omicini — University of Bologna, ISI LM

Sorts of Distributed Systems

DS-M5 • Academic Year 2025/2026

1. Prologue: Three Generations

Distributed systems are not new. Distributed computing was already being studied decades ago, arguably before the notion of computation itself was fully understood. Today, distributed systems are everywhere, and they are so diverse in nature, structure, organisation, and dynamics that classifying them all is neither easy nor particularly useful.

Nevertheless, over the first decades of their existence, changing needs and available resources promoted an evolution of distributed systems that generated, over time, three recognisable fundamental classes. Each class emerged from a different combination of technological constraints, application requirements, and user expectations. Understanding these classes is essential because they provide a vocabulary for talking about what a distributed system does and how it does it.

Key idea

The three classes are not strict categories with rigid boundaries. They represent evolutionary phases. Modern distributed systems often borrow features from all three classes simultaneously.

2. The Three Fundamental Classes

The slides identify three main classes of distributed systems, each corresponding to a different era and a different set of concerns:

| Class | Primary Concern | Era | Example |
|---|---|---|---|
| Distributed Computing Systems | High-performance computation via aggregated resources | 1990s | Beowulf clusters, computational grids |
| Distributed Information Systems | Integration of separate, heterogeneous applications | 2000s | TP monitors, EAI middleware |
| Distributed Pervasive Systems | Systems embedded in unstable, mobile environments | 2010s | Sensor networks, health care systems |

These three classes differ not just in technology but in their fundamental assumptions: computing systems assume homogeneity and controlled environments; information systems assume heterogeneity but stable infrastructure; pervasive systems assume instability and constant change.

3. Distributed Computing Systems: Clusters

The defining characteristic of distributed computing systems is that they use many networked computers to perform high-performance tasks. The goal is aggregate power: combining many ordinary machines to do, at a fraction of the cost, what a single supercomputer would do.

Cluster Computing Systems

Cluster computing is built around a simple idea: take a collection of similar workstations or PCs, running the same OS, located in the same area, and interconnect them through a high-speed LAN. The result is a single, unified computing resource.

Key idea

The motivation for cluster computing is economic: the steadily improving price/performance ratio of commodity computers makes it cheaper to build a supercomputer from many simple computers than to buy a dedicated high-performance machine. Additionally, clusters offer better robustness, easier maintenance, and incremental addition of computing power.

The primary use of cluster computers is parallel computing: a single compute-intensive program is split into parts that run simultaneously on multiple machines. This works well only when the problem can be decomposed into independent or loosely coupled sub-problems. Problems whose parts need no communication with each other are called embarrassingly parallel; tightly coupled problems are much harder to parallelise.

graph LR
    subgraph "Master Node"
        M["Scheduler<br/>Job Queue<br/>Data Distributor"]
    end
    subgraph "Compute Nodes"
        C1["Node 1<br/>Worker"]
        C2["Node 2<br/>Worker"]
        C3["Node 3<br/>Worker"]
        CN["Node N<br/>Worker"]
    end
    M --> C1
    M --> C2
    M --> C3
    M --> CN
    C1 -.-> M
    C2 -.-> M
    C3 -.-> M
    CN -.-> M

The scheduler (or master node) is responsible for decomposing the computation, distributing work units to worker nodes, collecting results, and handling failures. This introduces a single point of failure and a potential bottleneck — a tension that grid computing will address by removing the assumption of centralised control.
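The master's three duties (decompose, distribute, collect) can be illustrated with a minimal Python sketch. It is not how any particular cluster scheduler is implemented: threads stand in for compute nodes, the workload (a sum of squares) is an arbitrary embarrassingly parallel example, and the names `split`, `worker`, and `master` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Master step 1: decompose the input into roughly equal work units."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk):
    """Worker step: process one work unit independently of all others."""
    return sum(x * x for x in chunk)


def master(data, n_workers=4):
    """Master steps 2 and 3: hand out work units, then combine the results."""
    chunks = split(data, n_workers)
    # Assumption: threads play the role of compute nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)


total = master(list(range(1, 101)))
print(total)  # 338350, the sum of the squares of 1..100
```

Even in this toy version every work unit and every result passes through `master`, which is where the bottleneck and the single point of failure come from.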

4. Beowulf: A Cluster Case Study

The most famous example of cluster computing is the Beowulf architecture. Beowulf clusters are Linux-based: each cluster is a collection of computing nodes controlled and accessed by a single master node. The master node provides the user interface, storage, and job scheduling; the worker nodes are typically diskless and boot from the network, which reduces cost and simplifies management.

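Beowulf demonstrated that commodity hardware plus open-source software could compete with proprietary supercomputers. The key design decisions, as described above, were commodity machines as nodes, Linux as the common operating system, and a single master node that controls the workers.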

Editor’s note

The name "Beowulf" comes from the Old English epic poem — a reference to the legendary strength of the hero. The name was chosen to evoke the idea of many ordinary components combining to achieve extraordinary power.

5. Cluster vs. Grid: Homogeneity and Heterogeneity

The critical distinction between cluster computing and grid computing lies along the homogeneity-heterogeneity axis:

| Property | Cluster Computing | Grid Computing |
|---|---|---|
| Hardware | Similar or identical machines | Diverse, different architectures |
| Operating System | Same OS across all nodes | Multiple OSes, versions |
| Network | Single high-speed LAN | Wide-area, heterogeneous links |
| Administration | Single administrative domain | Multiple domains, organisations |
| Failure Model | Nodes fail, network is reliable | Nodes and network both unreliable |

In essence, cluster computing systems are homogeneous while grid computing systems are heterogeneous by design. This difference drives radically different architectural approaches: clusters can rely on uniform interfaces and predictable performance, while grids must handle diversity and unpredictability at every layer.

6. Grid Computing and Virtual Organisations

Grid computing extends the cluster concept across organisational boundaries. The main idea is that resources from different organisations are brought together to promote collaboration between individuals, groups, or institutions, bypassing organisational boundaries.

Collaboration takes the form of a virtual organisation: a new virtual organisational entity that includes people from existing organisations, who access resources made available by participating organisations — including servers, databases, hard disks, and other infrastructure.

Key idea

A virtual organisation is a dynamic, temporary alliance of participants who pool resources to achieve a common goal. It is not a legal entity; it is a computational one. By their very nature, grid computing systems must deal with different administrative domains — each with its own policies, security requirements, and management practices.

This introduces several difficult problems:

7. Grid Middleware Architecture

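Grid computing systems are built on a layered architecture defined by Foster, Kesselman, and Tuecke (2001). Each layer handles a specific set of concerns:

  1. Fabric: the local resources at a specific site.

  2. Connectivity: communication and security protocols.

  3. Resource: management of a single resource.

  4. Collective: coordination of multiple resources, such as discovery and allocation.

  5. Application: the user-facing programs.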

The grid middleware layer — consisting of the connectivity, resource, and collective layers — is the core of any grid system. Together, these layers provide uniform access to otherwise dispersed resources, hiding the heterogeneity of the underlying fabric and the complexity of multi-domain coordination.

Exam tip

Be able to name the five layers of the grid architecture (fabric, connectivity, resource, collective, application) and explain what each layer does. The three middle layers together form the grid middleware.

8. Distributed Information Systems: Transactions

The second major class — distributed information systems — originated from a different problem: many separate networked applications need to be integrated. The core difficulty is interoperability: how do you make independently-designed applications work together meaningfully?

The slides identify two sub-sorts: transaction processing systems and enterprise application integration (EAI).

The first sub-sort starts from the database world: when data is distributed across multiple servers, operations on that data must be distributed transactions that span servers while preserving correctness.

sequenceDiagram
    participant C as Client
    participant TM as Transaction Manager
    participant DB1 as Database A
    participant DB2 as Database B
    C->>TM: BEGIN TRANSACTION
    C->>TM: UPDATE account SET balance=balance-100 WHERE id=1
    TM->>DB1: Lock row, deduct 100
    DB1-->>TM: OK
    C->>TM: UPDATE account SET balance=balance+100 WHERE id=2
    TM->>DB2: Lock row, add 100
    DB2-->>TM: OK
    C->>TM: COMMIT
    TM->>DB1: Prepare
    DB1-->>TM: Ready
    TM->>DB2: Prepare
    DB2-->>TM: Ready
    TM->>DB1: Commit
    TM->>DB2: Commit
    Note over TM,DB2: Two-Phase Commit (2PC)
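
The prepare/commit exchange at the end of the diagram can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a production protocol: it omits logging, timeouts, and coordinator recovery, and the `Participant` class with its `can_commit` switch is a hypothetical stand-in for a real database server.

```python
class Participant:
    """One database server taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical switch to simulate a failure
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether the local changes can be made permanent.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit everywhere only if every participant votes yes."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): a single "no" vote aborts the whole transaction.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


db_a, db_b = Participant("Database A"), Participant("Database B")
print(two_phase_commit([db_a, db_b]))  # committed
```

Constructing one participant with `can_commit=False` makes the same call return `"aborted"` and leaves every participant in the aborted state, which is the all-or-nothing behaviour the protocol exists to provide.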

9. ACID Properties in a Distributed Setting

Operations on databases are usually performed in terms of transactions. When databases are distributed, transactions should be distributed. This requires special primitives from the distributed system or the runtime system, and it relies on the ACID properties:

| Property | Meaning | Distributed Challenge |
|---|---|---|
| Atomic | To the outside world, the transaction happens indivisibly: either all steps happen or none do | Ensuring atomicity across multiple servers requires a coordination protocol (2PC). Without one, partial failures can leave some servers committed and others aborted. |
| Consistent | The transaction does not violate system invariants | Invariants may be distributed across servers. Checking consistency globally requires distributed snapshots or locking. |
| Isolated | Concurrent transactions do not interfere with each other | Isolation in a distributed system requires distributed locking or multi-version concurrency control across nodes. |
| Durable | Once a transaction commits, its effects are permanent | Durability requires writing to stable storage at multiple sites. If one site crashes after commit but before persisting, the guarantee is violated. |

Key idea

The ACID properties were designed for single-node databases. Achieving them in a distributed setting adds enormous complexity: atomicity requires a commit protocol such as two-phase commit (which can block when failures occur), isolation requires distributed locking (which risks deadlocks), and durability requires synchronous replication (which costs latency).

10. Nested Transactions

A nested transaction is a transaction made of a number of subtransactions. Nesting in transactions can be arbitrarily deep — each subtransaction may itself contain subtransactions. This structure is a natural way to distribute transactions: "leaf" subtransactions are ordinary transactions over single servers, and distributed transactions are nested transactions.

The Durability Problem with Nested Transactions

A whole nested transaction should exhibit ACID properties. However, this creates a tension: if a subtransaction commits but the top-level transaction later fails, the subtransaction’s effects should be undone even though they already committed. Durability at the subtransaction level would contradict atomicity at the top level.

The solution is the concept of a private copy of the world: all transactions are performed over a copy of the data. Subtransactions can maintain ACID properties within this "local world." The effects of a successful nested transaction are propagated to the real world only after the top-level transaction succeeds. If it fails, the private copy is simply discarded.
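
A minimal sketch of the private-copy idea follows, assuming the data fits in a Python dictionary. The `Transaction` class and its method names are hypothetical, and concurrency control is ignored entirely; the only point is to show where committed effects become visible.

```python
import copy


class Transaction:
    """A transaction that works on a private copy of its parent's data."""

    def __init__(self, parent_data):
        self._parent_data = parent_data          # the world as the parent sees it
        self.data = copy.deepcopy(parent_data)   # private copy of that world

    def begin_sub(self):
        # A subtransaction treats this transaction's private copy as its world.
        return Transaction(self.data)

    def commit(self):
        # Propagate effects exactly one level up; nothing further changes.
        self._parent_data.clear()
        self._parent_data.update(self.data)

    def abort(self):
        # Discard the private copy; the parent's data was never touched.
        self.data = None


real_world = {"acct1": 500, "acct2": 100}

top = Transaction(real_world)
sub = top.begin_sub()
sub.data["acct1"] -= 100
sub.data["acct2"] += 100
sub.commit()        # the transfer is now visible inside `top` only
top.abort()         # the top level fails, so the real world is unchanged
print(real_world)   # {'acct1': 500, 'acct2': 100}
```

Had `top.commit()` been called instead of `top.abort()`, the transfer would have reached `real_world` at that moment and not before.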

Exam tip

Understand the trade-off: subtransactions appear durable within the nested transaction, but their effects are only provisional from the perspective of the outside world. Durability refers to the top-level transaction, not the subtransactions.

11. TP Monitors and Enterprise Application Integration

Transaction Processing Monitors

An early solution for managing distributed transactions was the Transaction Processing (TP) Monitor. A TP monitor sits between clients and multiple database servers, allowing applications to access multiple DB servers with transactional semantics. It handles routing, distributed transaction coordination (2PC), load balancing, and recovery after server failures.

graph TB
    subgraph "Clients"
        C1["Client 1"]
        C2["Client 2"]
        C3["Client 3"]
    end
    subgraph "TP Monitor"
        TM["Transaction Manager<br/>Queue Manager<br/>Scheduler"]
    end
    subgraph "Databases"
        DB1["DB Server A"]
        DB2["DB Server B"]
        DB3["DB Server C"]
    end
    C1 --> TM
    C2 --> TM
    C3 --> TM
    TM --> DB1
    TM --> DB2
    TM --> DB3

Enterprise Application Integration

The second sub-sort of distributed information systems goes beyond database integration. It is not only a matter of accessing distributed databases — integration should happen at the application level too. Beyond data integration, what is needed is process integration: applications should interact and communicate meaningfully with each other as part of coordinated business processes.

The key architectural technique is using middleware as a communication facilitator. Middleware provides a common communication backbone that applications use to exchange messages, invoke services, and coordinate workflows — without each application needing to know the details of every other application’s implementation.
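
To make the "communication facilitator" role concrete, here is a toy in-process publish/subscribe broker in Python. It is only a sketch of the decoupling idea: real EAI middleware adds persistence, routing, and delivery guarantees, and the `MessageBroker` class, the topic name, and the order payload are all hypothetical.

```python
from collections import defaultdict


class MessageBroker:
    """A toy backbone: applications talk to topics, not to each other."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver to every handler registered for the topic; return the count.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = MessageBroker()
shipped, invoiced = [], []

# Two hypothetical applications react to the same business event.
broker.subscribe("order.placed", lambda order: shipped.append(order["id"]))
broker.subscribe("order.placed", lambda order: invoiced.append(order["id"]))

# The ordering application publishes without knowing who is listening.
delivered = broker.publish("order.placed", {"id": 42, "total": 99.0})
print(delivered, shipped, invoiced)  # 2 [42] [42]
```

Adding a third consumer requires one more `subscribe` call and no change to the publisher, which is the property that makes middleware attractive for integrating independently designed applications.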

Editor’s note

EAI middleware is the conceptual ancestor of modern message queues (RabbitMQ, Kafka), service meshes (Istio, Linkerd), and API gateways. The problems are the same; the technologies have evolved.

12. Distributed Pervasive Systems

The third class — distributed pervasive systems — comes from asking a different question: what happens when instability is the default condition? Mobile devices with batteries, sporadic network connections, and no permanent human administrator — these are the defining characteristics of pervasive systems.

Key idea

Unlike computing and information systems that assume stable infrastructure, pervasive systems accept instability as the normal mode of operation. Devices come and go, networks partition and merge, batteries drain — and the system must keep working.

Two main features distinguish pervasive systems:

13. Requirements for Pervasive Systems

Grimm et al. (2004) identify three fundamental requirements that pervasive systems must satisfy:

  1. Embrace contextual changes: A device must be continually aware that its environment may change at any time. Network bandwidth fluctuates, services appear and disappear, user location changes. The system must adapt, not fail.

  2. Encourage ad hoc composition: Many devices in a pervasive system will be used in different ways by different users. There is no pre-planned "correct" way to use them. The system should support spontaneous, opportunistic composition — devices should be able to discover and use each other’s capabilities on the fly.

  3. Recognise sharing as the default: Devices generally join the system to access or provide information. Information should therefore be easy to read, store, manage, and share. The system should not require explicit permission for basic sharing — it should be the natural mode of operation.
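
One common way to support the first two requirements is lease-based discovery: devices announce their capabilities for a limited time and silently drop out when they stop renewing. The sketch below illustrates that idea only; it is not taken from Grimm et al., and the `Registry` class, the device names, and the 30-second lease are assumptions. Time is passed in explicitly so the example stays deterministic.

```python
class Registry:
    """Devices announce capabilities under a lease; stale entries disappear."""

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds
        self._entries = {}  # device id -> (capabilities, expiry time)

    def announce(self, device, capabilities, now):
        # Announcing again renews the lease, so a live device stays visible.
        self._entries[device] = (set(capabilities), now + self.lease_seconds)

    def discover(self, capability, now):
        # Drop devices whose lease ran out: they left without saying goodbye.
        self._entries = {d: e for d, e in self._entries.items() if e[1] > now}
        return sorted(d for d, (caps, _) in self._entries.items()
                      if capability in caps)


registry = Registry(lease_seconds=30.0)
registry.announce("tv", {"display", "audio"}, now=0.0)
registry.announce("phone", {"display", "camera"}, now=20.0)

print(registry.discover("display", now=25.0))  # ['phone', 'tv']
print(registry.discover("display", now=40.0))  # ['phone']
```

No administrator removes the TV from the registry: its lease simply expires, so clients adapt to the changed context the next time they look up a capability.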

Exam tip

These three requirements — contextual awareness, ad hoc composition, sharing as default — represent a fundamental shift from traditional systems design. Be prepared to explain how each differs from assumptions made by computing and information systems.

14. Home Systems and Health Care

The slides present two concrete application domains for pervasive systems:

Home Systems

Home systems are built around home networks. The defining constraint is that there is no way to ask people to act as competent network or system administrators — home systems must be self-configuring and self-maintaining.

Beyond network management, home systems must handle a huge amount of heterogeneous personal information coming from many sources, both inside and outside the home system: media files, documents, photos, sensor data, appliance status, energy usage, security camera feeds. The system must organise, protect, and make accessible all of this data without requiring users to understand its complexities.

Health Care Systems

Health care systems are personal systems built around a Body Area Network (BAN) — a collection of wearable sensors that monitor a patient’s vital signs (heart rate, blood pressure, glucose levels, etc.). The system must minimise impact on the person — preventing free motion would defeat the purpose of ambulatory monitoring.

Several critical questions must be addressed:

graph TB
    subgraph "Body Area Network"
        HR["Heart Rate Sensor"]
        BP["Blood Pressure"]
        GLU["Glucose Monitor"]
        ACT["Activity Tracker"]
    end
    subgraph "Local Hub"
        PHONE["Smartphone Hub"]
    end
    subgraph "Cloud Infrastructure"
        STORE["Data Storage"]
        ALERT["Alert Service"]
        DASH["Physician Dashboard"]
    end
    HR --> PHONE
    BP --> PHONE
    GLU --> PHONE
    ACT --> PHONE
    PHONE --> STORE
    PHONE --> ALERT
    STORE --> DASH
    ALERT --> DASH

15. Sensor Networks

Sensor networks are an enabling technology for pervasive systems. They consist of clouds of spatially-distributed sensors — from tens to thousands of nodes with a sensing device — acquiring, processing, and transmitting environmental information.

A useful way to view sensor networks is as distributed databases: distributed sources of information that can be queried over time. This perspective raises a fundamental design question: where does the computation happen?

Two Extremes (Both Bad)

| Extreme | Description | Problem |
|---|---|---|
| All data to base | Sensors send raw information to a central base station without cooperating | Excessive network consumption: thousands of sensors constantly transmitting data saturates the network |
| All computation at node | Sensors do all processing locally and return only results | Excessive node power consumption: sensors have limited battery life and processing complex queries drains them quickly |

The Solution: In-Network Data Processing

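The answer lies between the two extremes: in-network data processing. The key ideas are to organise the sensors into a tree, pass queries down through it, and aggregate results at the intermediate levels on the way back to the root.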

This approach balances network consumption and power consumption: intermediate nodes combine partial results, reducing both the total data transmitted and the computation required at any single node.

graph TB
    BS["Base Station<br/>(Root)"] --> R1["Region Router A"]
    BS --> R2["Region Router B"]
    R1 --> S1["Sensor 1"]
    R1 --> S2["Sensor 2"]
    R1 --> S3["Sensor 3"]
    R2 --> S4["Sensor 4"]
    R2 --> S5["Sensor 5"]
    S1 -.->|"temp=22.1"| R1
    S2 -.->|"temp=22.4"| R1
    S3 -.->|"temp=21.9"| R1
    S4 -.->|"temp=23.0"| R2
    S5 -.->|"temp=22.8"| R2
    R1 -.->|"avg=22.1"| BS
    R2 -.->|"avg=22.9"| BS
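
The aggregation in the diagram can be sketched in Python using the same five readings. One detail is an assumption of this sketch and not something the diagram shows: each router forwards a partial (sum, count) pair instead of a bare average, so that the root can compute the exact global mean even when regions contain different numbers of sensors. The nested-list encoding of the tree is likewise hypothetical.

```python
def aggregate(node):
    """Return (sum, count) for the subtree rooted at `node`.

    A node is either a number (one sensor reading) or a list of child
    nodes (a router). Partial (sum, count) pairs combine exactly, whereas
    averaging the children's averages would weight regions unequally.
    """
    if isinstance(node, (int, float)):
        return float(node), 1
    total, count = 0.0, 0
    for child in node:
        s, c = aggregate(child)
        total += s
        count += c
    return total, count


region_a = [22.1, 22.4, 21.9]   # sensors 1 to 3
region_b = [23.0, 22.8]         # sensors 4 and 5
base_station = [region_a, region_b]

total, count = aggregate(base_station)
print(round(total / count, 2))  # 22.44
```

With this scheme the base station receives two small messages instead of five raw readings, and averaging the two regional averages directly would give 22.5, not the true mean of 22.44.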

Open questions that the research community continues to address:

16. Distributed Systems Nowadays: CNCF and Cloud Native

The increasing complexity of today’s distributed systems is only manageable by groups of engineers and practitioners. Community efforts are essential, particularly for critical infrastructure components. Open software has become the default — collections of components that provide a foundation for designing complex distributed systems are freely available.

It is fundamental to:

Key foundations include:

| Foundation | Focus | URL |
|---|---|---|
| Apache Software Foundation (ASF) | General-purpose open-source projects | https://www.apache.org |
| Cloud Native Computing Foundation (CNCF) | Cloud native technologies | https://www.cncf.io/ |
| Free Software Foundation (FSF) | Free software advocacy and licensing | https://www.fsf.org/ |
| Python Software Foundation (PSF) | Python ecosystem | https://www.python.org/psf-landing/ |

The Cloud Native Computing Foundation

The CNCF, part of the Linux Foundation, is particularly significant for modern distributed systems. It collects and promotes projects that form the foundation of cloud native computing, organised by maturity stage:

| Stage | Description | Examples |
|---|---|---|
| Graduated | Production-ready, widely adopted | Kubernetes, Prometheus, Envoy, etcd, Vitess, Cilium |
| Incubating | Growing adoption, active development | Chaos Mesh, gRPC |
| Sandbox | Early-stage innovations | Various experimental projects |
| Archived | No longer actively maintained | Projects that did not gain traction |

Key idea

The CNCF Charter defines cloud native technologies as those that "empower organisations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds." The key techniques are containers, service meshes, microservices, immutable infrastructure, and declarative APIs. These enable loosely coupled systems that are resilient, manageable, and observable.

The full landscape of projects is visualised at landscape.cncf.io — a constantly-evolving map of the modern distributed systems ecosystem.

17. Conclusion: Convergence of the Three Classes

Diverse sorts of distributed systems exist, roughly telling the story of the evolution of distributed systems over the years: distributed computing systems, distributed information systems, and distributed pervasive systems.

Each class depends on the environment where it was developed, the goals it must achieve, and the level of available technologies. Different models, methodologies, and technologies are used to design and develop different sorts of distributed systems.

Key idea

Today, the complexity of distributed systems means they typically borrow features from all three classes. A cloud-native application might use Kubernetes for cluster-like resource management (class 1), rely on a distributed database with ACID transactions (class 2), and run on mobile devices with intermittent connectivity (class 3). The boundaries are blurring.

Looking forward, the complexity of distributed systems engineering calls for community efforts. Hundreds of critical software components are available to be looked up, selected, and used. The modern engineer must be both a designer and a curator, choosing the right components for the task at hand and understanding how they interact.

Check Your Understanding

What are the three fundamental classes of distributed systems, and what drives the differences between them?

The three classes are distributed computing systems (high-performance computing via clusters and grids), distributed information systems (transaction processing and enterprise application integration), and distributed pervasive systems (systems embedded in unstable, mobile environments). The differences are driven by the environment, goals, and available technologies at the time each class emerged.

Compare and contrast cluster computing with grid computing. What is the key distinguishing property?

The key distinguishing property is homogeneity vs. heterogeneity. Clusters are homogeneous: similar hardware, same OS, single LAN, single administrative domain. Grids are heterogeneous: diverse hardware, multiple OSes, wide-area networks, multiple administrative domains forming virtual organisations.

Describe the five layers of the grid computing architecture. Which layers form the grid middleware?

Fabric (local resources), Connectivity (communication and security protocols), Resource (single resource management), Collective (multi-resource coordination: discovery, allocation), Application (user-facing programs). The connectivity, resource, and collective layers together form the grid middleware.

Explain the ACID properties. Why is achieving them harder in a distributed setting?

Atomicity (all-or-nothing), Consistency (invariant preservation), Isolation (non-interference of concurrent transactions), Durability (permanence after commit). In a distributed setting: atomicity requires two-phase commit across servers (blocking under failures), isolation requires distributed locking (deadlock risk), durability requires synchronous writes to multiple sites (latency cost), and consistency must hold across distributed invariants.

What is a nested transaction? Why does it create a tension with durability?

A nested transaction is a transaction composed of subtransactions, possibly to arbitrary depth. The tension: if a subtransaction commits but the top-level transaction later fails, the subtransaction’s effects must be undone. Durability at the subtransaction level contradicts atomicity at the top level. The solution is a private copy of the world — subtransactions run on a copy, and results propagate only on top-level success.

What is a TP monitor, and what problems does it solve?

A Transaction Processing Monitor is middleware that sits between clients and multiple DB servers, allowing applications to access them with transactional semantics. It handles routing, distributed transaction coordination (2PC), load balancing, and recovery after server failures.

List and explain the three requirements for pervasive systems according to Grimm et al. (2004).

(1) Embrace contextual changes: devices must be continually aware of changes in their environment. (2) Encourage ad hoc composition: devices should support spontaneous, opportunistic discovery and use of each other’s capabilities. (3) Recognise sharing as the default: information should be easy to read, store, manage, and share without requiring explicit permissions for basic operations.

Why are both extremes of sensor network design (all data to base vs. all computation at node) suboptimal?

All-data-to-base saturates the network with raw transmissions from thousands of sensors. All-computation-at-node drains sensor batteries because each sensor must process complex queries locally. The solution is in-network data processing: build a tree, pass queries through it, aggregate results at intermediate levels — balancing network and power consumption.

What is the CNCF, and what are its four project maturity stages?

The Cloud Native Computing Foundation, part of the Linux Foundation, collects and promotes cloud native computing projects. The four stages are: Graduated (production-ready), Incubating (growing adoption), Sandbox (early-stage), and Archived (no longer maintained). Examples: Kubernetes (graduated), Chaos Mesh (incubating).

Explain why modern distributed systems "borrow features from all three classes." Give an example.

Modern systems are so complex that no single class addresses all needs. Example: a cloud-native IoT platform uses Kubernetes for resource management (class 1, computing), a distributed SQL database with ACID transactions (class 2, information), and runs on edge devices with intermittent connectivity (class 3, pervasive). The boundaries between classes have blurred as technology has matured.