The relationship between computing and time is far more delicate than it first appears. Almost all computational devices today are not just general-purpose computers: cars, medical devices, industrial robots, toys, communication systems, and the exploding Internet of Things (IoT) embed computational power into physical contexts. These special-purpose computers are increasingly networked and intelligent, yet their growing complexity can come at the expense of dependability.
Conversely, general-purpose computers are ever more required to interact with physical processes: they integrate video and audio, sense physical dynamics through personal devices, and control physical actuators in pervasive systems. Computing is no longer confined to the abstract manipulation of symbols — it is embedded in the real world, and the real world runs on physical time.
Time has always been abstracted away from computing. The Turing-Church-von Neumann foundations are about the transformation of data, not about the dynamics of physical processes. As computing merges with the physical world, this abstraction becomes a liability.
The central observation, due to Edward A. Lee [Lee, 2009], is that computing needs time as a first-class semantic property, not merely as a performance metric. The core abstractions of computing should be rethought to include time.
Cyber-Physical Systems (CPS) emerge from the deep integration of physical systems and processes with networked computing. CPS embed computations and communication into physical processes to add new capabilities to physical systems. The fundamental challenge is that the foundations of computing — Turing, Church, von Neumann — do not deal with the dynamics of physical processes.
CPS are about merging computing and networking with physical systems to create new science, techniques, and products. The passage of time is essentially absent in traditional computing — and this absence is the core problem CPS must address.
High-confidence medical devices, assisted living systems, traffic control and safety, advanced automotive systems, process control, energy conservation, environmental control, avionics and instrumentation, critical infrastructure control (electric power, water, communications), distributed robotics (telepresence, telemedicine), defence systems, manufacturing, and smart structures.
The technological basis that engineers and computer scientists have chosen for general-purpose computing and networking does not support well the applications dealing with physical processes [Lee, 2009]. The passage of time is essentially absent in computing — time is abstracted away, modelled out, considered a non-functional concern. CPS force us to confront this absence directly.
A program that produces the correct result at any speed is functionally correct in the classical sense — but in a CPS, a result delivered too late is equivalent to a wrong result. Time is not a performance issue; it is a correctness issue.
Edward A. Lee [Lee, 2009] identifies four pervasive misconceptions about time in computing. Each of these hides a deeper truth that becomes critical in distributed and embedded systems.
| Misconception | Reality |
|---|---|
| Computing takes time — efficiency has limits, but time is abstracted away. | Time is always abstracted away from computing, so computing always omits time. It is not just that efficiency matters; time is removed from the semantic model entirely. |
| Time is a resource — like memory or CPU cycles, but a different kind. | Time is practically unbounded and is expended regardless of whether we use it. While generic resource management works for memory or energy, time needs to be a semantic property of programs, not a resource to be managed. |
| Time is a nonfunctional property — functions are computed by abstracting from time. | With a Turing Machine, a function can be computed by abstracting from time. But in a CPS, the function of a computation is defined through actions occurring in physical time with a duration. Time defines the function, not just its performance. |
| Real time is a QoS problem — about efficiency and meeting deadlines. | Time in CPS is mostly not about efficiency. It is about predictability and repeatability. Precision and variability in timing are quality-of-service issues, but time itself is much more than that: time should be part of the semantics of programs. |
The subtitle says it all: core computing abstractions should be re-thought to include time. This lesson explores what that means for distributed systems.
Before we ask how time should enter computing, we must ask a more fundamental question: what notion of time are we talking about? The Britannica encyclopedia defines time as “a measured or measurable period, a continuum that lacks spatial dimensions” — but this definition is deeply intertwined with philosophical, mathematical, and scientific investigation.
Our common-sense notion of time — uniform, universal, ticking away independently of everything else — is precisely the notion that modern physics has shown to be incorrect. If computing must deal with time, it must deal with the real nature of time, not with a naive idealisation.
The slides ask “what about our common sense notion of time?” as a rhetorical question. The answer is that common sense is a poor guide: we think of time as a single, universal, flowing river, but physics tells us otherwise.
Einstein's relativity [Einstein, 1920] shattered the classical notion of a single universal time. Carlo Rovelli [Rovelli, 2017] articulates the consequences with striking clarity:
You may be asked: “Why is there no universal notion of time in a distributed system?” The foundational reason is not just engineering — it is physics. There is no universal time, period. Distributed systems inherit this fundamental property of the universe.
A precise understanding of time in distributed systems requires that we recast the distinctions among parallel, concurrent, and distributed computing in terms of temporal and spatial contexts.
A computational system performs parallel computation when the temporal context is the same for all computational processes: all processes share the same notion of time T. This is the classical multi-core or SIMD model, where all processing elements are synchronised to a common clock.
graph LR
subgraph "Same Temporal Context T"
P1["Process 1"] --- P2["Process 2"]
P2 --- PN["Process N"]
end
A system performs concurrent computation when at least two computational processes have a different temporal context: T ≠ T'. Each process has its own local notion of time, and these notions are not inherently synchronised. This is the natural model for multi-threading on separate cores, processes on the same machine with independent scheduling, or any scenario where processes advance at their own rate.
graph LR
subgraph "Temporal Context T"
P1["Process 1"]
end
subgraph "Temporal Context T'"
P2["Process 2"]
end
A system performs distributed computation when at least two computational processes have a different spatial context: S ≠ S'. The processes run on physically separate machines connected by a network. Distribution is defined by space, not time.
graph LR
subgraph "Spatial Context S"
P1["Process 1"]
end
subgraph "Spatial Context S'"
P2["Process 2"]
end
P1 -.->|Network| P2
| System Type | Spatial Contexts | Temporal Contexts |
|---|---|---|
| Distributed parallel | S ≠ S' | Same T |
| Distributed concurrent | S ≠ S' | T ≠ T' |
Distributed concurrent computing — where both spatial and temporal contexts differ — is the most natural and general model for distributed systems. It combines the challenges of independent clocks (concurrency) with physically separate nodes (distribution). Most real distributed systems operate in this regime.
In centralised, non-distributed systems, time is unambiguous: there is a single clock, a single temporal context. A process obtains the time through a system call to the kernel, and any subsequent call returns a strictly larger value. This gives a total ordering of events with no effort.
In a distributed system, there is no natural notion of time. Each machine has its own local clock, and these clocks drift apart. There is no shared memory, no shared clock, no global synchronisation primitive. This leads to the fundamental questions:
The dictionary defines “synchronous” as “existing or occurring at the same time.” In a distributed system, there is no way to determine whether two events on different machines occurred at exactly the same time — because there is no universal “same time” to measure them against. This is the root of the synchronisation problem.
A clock in a computer is actually a timer. Typically, it consists of an oscillating quartz crystal, a counter, and a holding register. The quartz oscillates at a precise frequency; the counter decrements with each oscillation; when the counter reaches zero, an interrupt is generated and the counter is reloaded from the holding register. Each interrupt is a clock tick.
In a centralised system with a single clock, a process gets the time by issuing a system call; another process doing the same later will get a strictly larger value. This provides a total ordering of events reliably and without effort.
In a distributed system, each process has its own internal clock and its own notion of time. Each clock ticks independently, and they drift apart — potentially by several seconds per day. Even if all clocks are synchronised at system start (set to the same value), differences in tick rates cause them to diverge over time.
The naive solution — “just synchronise the clocks” — turns out to be deep and subtle. The course asks: do we actually need synchronisation? The answer, as we will see, is: it depends on what you need time for.
Given two machines a and b in a distributed system, with their clocks Ca and Cb, the following precise terminology [Kshemkalyani and Singhal, 2011] lets us reason about clock behaviour:
| Term | Definition | Formula | Intuition |
|---|---|---|---|
| Time | The value a clock returns at real time t | Ca(t) | For a perfect clock, Ca(t) = t |
| Frequency | The rate at which a clock progresses | C'a(t) | First derivative of the clock function |
| Offset | Difference between a clock and real time (or between two clocks) | Ca(t) − t (absolute) Ca(t) − Cb(t) (relative) |
How far apart two clocks are at a given instant |
| Skew | Frequency difference between a clock and a perfect clock (or between two clocks) | C'a(t) − C'b(t) | How fast the offset is growing or shrinking |
| Drift | The second derivative of the clock function | C''a(t) | How the skew itself changes over time (e.g., due to temperature) |
Offset is a value (the difference in readings now), skew is a rate (how fast the offset grows), and drift is a change in rate (acceleration). Understanding these three levels is essential to designing synchronisation algorithms.
Manufacturers typically specify the maximum skew rate ρ for a clock, also known as the maximum drift rate. A timer (clock) is said to be working within its specification if:
1 − ρ ≤ dC/dt ≤ 1 + ρ
This inequality says that a clock's frequency can deviate from perfect by at most ρ. For typical quartz clocks, ρ is about 10−6 (1 part per million), corresponding to a drift of roughly 86 milliseconds per day. Temperature, voltage, age, and manufacturing tolerances all contribute to ρ.
Relative to UTC, a clock can be:
The clock offset grows linearly with time for a constant skew. After time Δt, the offset of a fast clock with skew ρ is approximately ρ × Δt. For example, a clock with ρ = 10−5 (10 ppm) will drift about 0.86 seconds per day.
Clock drift is a physical phenomenon, not a bug. Quartz crystals change their oscillation frequency with temperature, pressure, and age. A clock that is perfect at 25°C may be significantly slow at 0°C or fast at 50°C. In space or high-altitude applications, radiation can also affect timing.
Physical clock synchronisation is the process of ensuring that physically distributed processors have a common notion of time. It is needed for three main purposes: knowing the time of day at which an event happened on a specific machine, measuring the time interval between two events on different machines, and determining the relative ordering of events across machines.
UTC is the international standard for timekeeping, maintained by the BIPM (Bureau International des Poids et Mesures) in Sevres, France. BIPM combines and averages the atomic time standards of member nations to create a single official time. UTC is broadcast as short-wave radio pulses (WWV) by NIST every UTC second, and also via GPS and other satellites.
If one machine in a distributed system has access to a UTC service, an algorithm can synchronise all other machines based on this reference.
NTP (RFC 5905 for v.4) is the standard protocol for clock synchronisation over the Internet. A hierarchy of time servers, from Stratum 0 (atomic clocks) to Stratum 1 (directly connected to Stratum 0) to Stratum 2 and below, propagates accurate time. Key design constraint: clocks can only run forward — corrections cannot bring clocks backward, as this could break causality (e.g., file timestamps).
sequenceDiagram
participant C as Client
participant S as NTP Server
Note over C: T1 = local time
C->>S: NTP request (T1)
Note over S: T2 = receive time (server)
Note over S: T3 = transmit time (server)
S->>C: NTP response (T1, T2, T3)
Note over C: T4 = receive time (client)
Note over C: delay = (T4-T1)-(T3-T2)
offset = ((T2-T1)+(T3-T4))/2
The key insight is that NTP estimates both the one-way network delay and the clock offset by measuring the round-trip time and assuming symmetric network delays. The client computes:
Simulate how two clocks drift apart. The UTC reference runs perfectly; the local clock can run fast or slow.
Understand the NTP offset formula: offset = ((T2 − T1) + (T3 − T4)) / 2. This assumes symmetric network delays. Why is this assumption problematic? Because in real networks, the forward and reverse paths can have very different latencies (asymmetric routing, congestion).
Not all applications need UTC. The required temporal properties vary widely, and sometimes a UTC service is not available at all. This leads to a crucial distinction:
| Global Absolute Time | Global Relative Time | |
|---|---|---|
| Reference | UTC (or another external standard) | An internal agreement among processes |
| Accuracy | Bounded by NTP/Cesium clocks | Bounded by the synchronisation protocol |
| External meaning | Matches real-world time | Meaningful only within the system |
| Examples | NTP, GPS time sync | Berkeley Algorithm, RBS |
The Berkeley Algorithm [Gusella and Zatti, 1989] synchronises machines using only their mutual agreement, without requiring any machine to have UTC. A time daemon polls all participating machines, collects their local times, computes an average (discarding outliers if necessary), and sends each machine the required correction. Since no machine has the “true” time, the algorithm creates a relative global time that all machines share.
RBS [Elson et al., 2002] is designed for wireless networks. Instead of a client synchronising with a server, all receivers synchronise with each other based on a reference broadcast message. The key insight: the broadcast arrives at all receivers at approximately the same time (since radio propagation is near light-speed). Receivers record their local time upon reception and exchange these timestamps to compute relative offsets.
Absolute time is needed when events must be correlated with the real world (e.g., logging for legal compliance, coordinating with external systems). Relative time suffices when only internal consistency matters (e.g., maintaining consistent ordering in a distributed database, coordinating distributed sensors in the same field).
Physical time is not always necessary. In many cases, the only actual requirement is a shared notion of time among processes, with no need for it to correspond to real time. A step further: often the only need is a shared ordering of events, not a shared clock at all. This observation leads to logical time — a notion of time that is both possible and useful without any physical clock.
A classic example [Tanenbaum and van Steen, 2017] motivates the need for logical time in the absence of synchronised physical clocks:
sequenceDiagram
participant LA as LA Replica
participant NY as NY Replica
Note over LA: Account: $1000
Note over NY: Account: $1000
Note over LA: User adds $100
Note over NY: Bank applies 1%
LA->>NY: Update: +$100
NY->>LA: Update: +1%
Note over LA: LA result: $1110
(1000+100+1% of 1000)
Note over NY: NY result: $1111
(1000+1%+100)
Two concurrent updates — a customer adds $100 to the account while a bank employee applies a 1% interest increment — produce inconsistent results at the two replicas. The LA replica ends up with $1110; the NY replica with $1111. The problem is that the updates are concurrent and their ordering is ambiguous without a shared notion of time.
The bank example reveals that inconsistency arises not from clock imprecision but from clock absence. Even perfect physical clocks would not solve the problem if the updates arrive at different replicas in different orders — we need to order the updates, not just timestamp them.
Synchronisation is possible with no need to be absolute. If two processes do not interact, there is no need for synchronisation — the lack would not be observable. What really matters is not the exact time when events occur, but the order in which events occur. This is precisely the insight behind make and Gradle, which use file modification timestamps to decide which targets to rebuild: they need relative ordering, not absolute time.
Leslie Lamport's 1978 paper “Time, Clocks, and the Ordering of Events in a Distributed System” introduced logical clocks as a mechanism to capture causal order (the “happened-before” relation) without physical synchronisation. Each process p maintains an integer counter Lp, its logical clock, updated as follows:
| Event Type | Rule |
|---|---|
| Internal event | Lp ← Lp + 1 |
| Send event | Lp ← Lp + 1; send message with timestamp Lp |
| Receive event | Lp ← max(Lp, tsmsg) + 1 |
These three rules ensure the clock condition: if event a happens before event b (a → b), then L(a) < L(b). Note that the converse does not hold: L(a) < L(b) does not imply a → b; the clocks can produce false positives. This is the price of a lightweight, decentralised mechanism.
Step through the processes below to see how Lamport clocks maintain causal ordering. Each process can execute internal events, send messages, or receive them. The shared state shows the logical clock values.
Lamport clocks give us a consistent ordering: if two events are causally related, their logical timestamps reflect that. But they do not give us a total ordering consistent with causality — concurrent events may be ordered arbitrarily. Vector clocks extend Lamport clocks to capture causality exactly, at the cost of larger timestamps.
Sometimes more articulated policies are required beyond ordering. For instance, ensuring that concurrent accesses to a shared resource do not corrupt its consistency requires mutual exclusion. Many distributed algorithms address this problem:
The slides do not review mutual exclusion algorithms in detail; the key point is that some are based on a coordinator and all of them are coordination algorithms. This is a conceptual bridge to the next topic.
Many distributed algorithms require a coordinator to be elected among the participating processes. Election algorithms (e.g., Bully algorithm, Ring algorithm) select one process as the leader. These are not reviewed in detail here either, but they are coordination algorithms used by other coordination algorithms — a recurring pattern in distributed systems.
Why distinguish synchronisation from coordination? Synchronisation is about when things happen; coordination is about what happens, based on both time and the nature of the actions. Synchronisation alone cannot solve problems where the type of action matters — such as ensuring mutual exclusion or electing a leader.
The final slide section frames the overarching problem that goes beyond time:
It is not merely a matter of time. Synchronisation is about when things happen. Actions are more than sending messages. Interaction does not merely translate into suitably-ordered distributed actions — undifferentiated actions. Actions have a nature, and meaningful interaction within a distributed system typically depends on such a nature.
Coordination [Omicini et al., 2001] is the problem of governing interaction based on both time and the nature of actions, aimed at the achievement of some global objective for the distributed system. It is the union of:
Coordination = synchronisation + action semantics. The “nature” of actions matters: a read and a write to the same variable are not interchangeable, and a coordinator cannot be replaced by just any process. Time gives us ordering; coordination gives us meaningful ordering.
This lesson has traced the relationship between computing and time from philosophical foundations to practical distributed algorithms:
| Topic | Key Takeaway |
|---|---|
| Time & computation | Computing abstracts time away, but CPS require time as a semantic property. The traditional notion of a single universal time is physically inaccurate. |
| Time in distributed systems | No global clock, no shared memory, no total ordering. Each machine has its own clock with offset, skew, and drift. |
| Physical time / clock | Synchronisation to UTC via NTP is possible but expensive and not always needed. The Berkeley algorithm provides relative time without UTC. |
| Logical time / clock | Lamport clocks capture causal ordering with simple integer counters, sufficient for many distributed algorithms at very low cost. |
| Toward coordination | Events have a nature: synchronisation is not enough. Coordination governs interaction based on both time and action semantics. |
The next logical step is vector clocks, which extend Lamport clocks to capture causality exactly (not just partially), and causality-based algorithms for distributed debugging, checkpointing, and consistent snapshots.
Those foundations are about the transformation of data in the abstract; they deliberately abstract away from time and physical dynamics. A CPS merges computation with physical processes, where actions occur in physical time with duration. A program that produces the correct result but too late is incorrect in a CPS. Time must be a first-class semantic property, not a non-functional concern.
(1) “Computing takes time” → No, time is abstracted away, computing omits it. (2) “Time is a resource” → It is unbounded and expended anyway; it needs to be a semantic property. (3) “Time is a nonfunctional property” → In CPS, time defines the function, not just its performance. (4) “Real time is a QoS problem” → It is about predictability and repeatability, not just efficiency.
Given timestamps T1 (request sent), T2 (request received), T3 (response sent), T4 (response received): if forward and reverse delays are equal (= δ/2 where δ is the round-trip delay), then offset θ = T2 − (T1 + δ/2) = ((T2 − T1) + (T3 − T4))/2. If the assumption is violated (asymmetric delays), the computed offset includes a systematic error up to half the difference between forward and reverse delays.
Offset is the instantaneous difference between two clock readings (like the distance between two runners at a specific moment). Skew is the rate of change of offset (the difference in their speeds). Drift is the change in skew over time (their acceleration or deceleration). A quartz clock might have offset of 5 ms (it reads 5 ms ahead), skew of 10 ppm (it gains 10 μs per second), and drift that changes with temperature.
The clock condition L(a) < L(b) whenever a → b is guaranteed by the increment-on-event rule combined with the max-plus-one rule on receive. But the converse can fail because concurrent events can be assigned arbitrary logical times that happen to satisfy L(a) < L(b) even though neither happened before the other. The logical clock adds enough ordering information to enforce the forward direction of causality but not enough to identify concurrency precisely.
Essential: high-frequency trading — regulators need to know the exact UTC time of each trade to detect fraud, and millisecond differences can determine which order was placed first. Logical clocks suffice: the bank database example — what matters is the order of updates (which came first), not their absolute time. A Lamport clock can ensure all replicas apply updates in the same order, even without knowing the actual wall-clock time.
NTP provides global absolute time synchronised to UTC, using a hierarchical server architecture. Berkeley provides global relative time among a group of machines without requiring any UTC reference. Use NTP when external time correlation matters (logging, legal compliance, coordination with external systems). Use Berkeley when only internal consistency is needed (sensor networks, cluster-internal coordination) or when UTC access is unavailable.
Synchronisation ensures that events occur in a particular order (ordering messages by logical clock). Mutual exclusion ensures that conflicting actions (e.g., two writes to the same variable) do not overlap. Coordination is the general problem of governing interaction based on both time and the nature of actions, aimed at a global system objective. Mutual exclusion is a form of coordination; synchronisation is a tool for coordination.
Relativity tells us there is no universal “now” — different observers in different frames of reference experience time differently. Distributed systems face an analogous problem: processes on different machines have different clocks, no shared “present,” and no way to define simultaneity absolutely. The distributed systems problem is not just an engineering inconvenience; it reflects a fundamental property of the universe we inhabit.
It bounds the clock frequency deviation from perfect: 1−ρ ≤ dC/dt ≤ 1+ρ. It gives a worst-case bound on how fast clocks can drift apart. Synchronisation algorithms use ρ to determine how frequently they must re-synchronise: if the required precision is ε and the worst-case relative skew between two clocks is 2ρ, then re-synchronisation must occur at intervals of at most ε/(2ρ).