Most software systems deployed today are distributed. They span multiple processes, multiple machines, and often multiple data centres. Containerization frameworks such as Docker opened the door to packaging and shipping these systems in a portable, reproducible way — but they did not solve the operational problem: once you have dozens or hundreds of containers, how do you keep them running?
Docker gave us a uniform packaging format. What it did not give us was a way to schedule containers across a fleet of machines, restart them when they fail, route traffic to healthy instances, scale them under load, or roll out updates without downtime. That is the job of a container orchestrator — and Kubernetes is now the de facto standard for that job.
Non-functional requirements (availability, scalability, reliability, recoverability, observability) acquire even more importance when a system is distributed. A single-container application that crashes is an annoyance. A hundred-container application where 5% of containers crash every hour is a disaster without automation. Kubernetes builds most of these quality attributes into its architecture as declarative controls, and provides integration points for the rest (observability, for instance, relies on external tools such as Prometheus).
Why did the industry need a container orchestrator on top of container runtimes? Explain the specific operational gaps that Docker alone does not address and how Kubernetes fills them with concrete mechanisms.
Before Kubernetes, the dominant deployment model was the virtual machine; containers then emerged as a lighter alternative. Containers offer a lightweight form of virtualization that shares the host operating system kernel while providing process and filesystem isolation through Linux kernel features such as namespaces (which limit what a process can see) and cgroups (which limit what it can use).
The VM era taught us that full OS-level virtualization carries heavy overhead: every VM runs its own kernel, consumes gigabytes of RAM, and boots slowly. Containers flip this model: they share the kernel but isolate user-space, achieving near-native performance with millisecond startup times. For a deeper treatment of Linux isolation primitives, see Giovanni Ciatto's lecture on Virtuale.
Docker became the most popular containerization platform, providing a user-friendly CLI, a layered image format, and a registry (Docker Hub) for distributing images. The key insight is that a container image is an immutable snapshot of an application and all its dependencies — the same image runs identically on a developer's laptop, a CI server, and a production cluster. This reproducibility is the precondition for everything Kubernetes builds on top.
Kubernetes describes itself as "a portable, extensible, open source platform for managing containerized workloads and services." The emphasis on platform is deliberate: Kubernetes is not a single tool but an ecosystem that provides primitives (scheduling, service discovery, scaling, self-healing) that other tools compose into complete solutions.
Equally important is understanding the boundaries. Kubernetes is not:
| Kubernetes is not... | What fills that gap |
|---|---|
| A containerization platform | Container runtimes such as containerd and CRI-O (any CRI-compliant runtime), with tools like Docker for building images |
| A PaaS (Platform as a Service) | It provides lower-level primitives; PaaS layers like OpenShift build on top |
| A cloud provider | Kubernetes runs on AWS, GCP, Azure, on-premises — it is cloud-agnostic |
| A CI/CD deployment tool | CI/CD pipelines (GitHub Actions, Jenkins) and GitOps tools (Argo CD) deploy to Kubernetes |
| A logging or monitoring tool | It offers integrations and APIs; tools like Prometheus, Grafana, and the EFK stack handle observability |
The separation of concerns is important. Kubernetes manages the runtime plane — keeping your containers running, healthy, and reachable. It deliberately leaves the development pipeline, logging storage, and monitoring dashboards to other tools in the cloud-native ecosystem. This modularity is a strength, not a limitation.
Think of Kubernetes as the distributed-system kernel of your production environment. It provides scheduling (like a process scheduler), networking (like a virtual network stack), and storage (like a volume manager) — but for containers spread across a cluster of machines rather than processes on a single host.
Docker Swarm is the native clustering and orchestration tool bundled with Docker. It is simpler to set up and operates within the familiar Docker ecosystem — the docker-compose.yml file is its primary configuration interface. For small clusters and less complex applications, Swarm can be a pragmatic choice.
Kubernetes, by contrast, is a full-featured orchestration platform. The trade-off is real: more complexity at setup time, but dramatically more automation at runtime.
Compare Kubernetes and Docker Swarm along the axes of complexity, automation, scaling, and access control. In what scenarios would Swarm still be a reasonable choice, and at what point does the complexity investment in Kubernetes pay off?
Two philosophical pillars distinguish Kubernetes from imperative orchestration tools: immutability and declarative configuration.
In Kubernetes, running workloads are not meant to be mutated in place after creation. If a change is needed, for example a new image version, the old Pods are deleted and new ones are created from the updated specification rather than being patched. Containers themselves are designed to be ephemeral and stateless: deleting and recreating containers is a standard part of their lifecycle, not an exceptional event. This principle ensures consistency and reliability: the cluster always converges toward the declared desired state rather than accumulating drift from ad-hoc patches.
Kubernetes adheres to the principle "everything is an object." Many kinds of objects — Pods, Deployments, Services, ConfigMaps, Secrets, and dozens more — are available to shape the production environment. Configuration files are written in YAML (or JSON), and an external tool named kubectl manages the environment by applying these files:
```shell
kubectl create -f configuration-file.yaml
```
The declarative model works as follows: you describe the desired state in a manifest file, and Kubernetes continuously reconciles the actual state to match it. If a Pod crashes, Kubernetes notices and creates a replacement — not because you told it to, but because the desired state says "there should be N replicas" and the actual state is "there are N-1." This control loop is the engine of Kubernetes' automation.
Imperative systems (like shell scripts or Ansible playbooks) describe how to reach a state. Declarative systems describe what state you want. The difference matters at scale: when a declarative system encounters an unexpected condition, it keeps trying to converge. An imperative script just stops where it was written to stop. Declarative configuration directly improves configurability and maintainability.
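The control loop described above can be sketched in a few lines of Python. This is a toy model, not Kubernetes code: the `Pod` record and the `reconcile` function are hypothetical stand-ins for what a ReplicaSet controller does, and the in-memory list replaces the API server. The point is that the function compares desired and actual state and closes the gap, whatever caused it.

```python
from dataclasses import dataclass
import itertools

@dataclass
class Pod:
    name: str
    healthy: bool = True

_next_id = itertools.count(1)  # source of fresh names for replacement Pods

def reconcile(desired_replicas: int, pods: list[Pod]) -> list[Pod]:
    """One pass of a toy control loop: move the actual Pod list toward the desired count.

    Unhealthy Pods are dropped, missing Pods are created, surplus Pods are removed.
    Running it again on its own output changes nothing (the loop is idempotent).
    """
    alive = [p for p in pods if p.healthy]
    while len(alive) < desired_replicas:
        alive.append(Pod(name=f"app-{next(_next_id)}"))
    return alive[:desired_replicas]

# Desired state: 3 replicas. Actual state: one of the three Pods has crashed.
pods = [Pod("app-a"), Pod("app-b", healthy=False), Pod("app-c")]
pods = reconcile(3, pods)
print([p.name for p in pods])  # ['app-a', 'app-c', 'app-1']
```

A real controller runs this comparison continuously, triggered by watch events from the API server, and acts by creating or deleting Pod objects through the API rather than editing a local list.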
Kubernetes supports automatic scaling based on resource usage or custom metrics, with two dimensions:
| Type | Mechanism | What it scales |
|---|---|---|
| Horizontal Scaling | Horizontal Pod Autoscaler (HPA) | Number of Pod replicas (up and down) |
| Vertical Scaling | Vertical Pod Autoscaler (VPA) | CPU and memory requests (and limits) of a Pod's containers; the VPA is an add-on installed separately from core Kubernetes |
Autoscaling directly improves availability (more replicas handle more traffic) and scalability while optimizing resource usage — you do not provision for peak load 24/7.
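Kubernetes continuously monitors the health of containers and nodes, and takes corrective action automatically:

- it restarts containers that fail, according to the Pod's restart policy;
- it replaces Pods from a failed node with new Pods scheduled on healthy nodes;
- it kills and restarts containers that fail a user-defined liveness probe;
- it withholds traffic from Pods until they pass their readiness probe.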
These behaviours directly improve recoverability and reliability without operator intervention.
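Restarting a failed container is not an immediate retry loop. According to the Kubernetes Pod-lifecycle documentation, in the default configuration a container that keeps failing is restarted with an exponential back-off: 10 seconds, then 20, 40, and so on, capped at five minutes, with the timer reset once the container has run for 10 minutes without problems. That wait is what a Pod in CrashLoopBackOff is sitting out. The hypothetical helper below computes only the default delay schedule and ignores the reset.

```python
def restart_delays(restarts: int, initial: float = 10.0, cap: float = 300.0) -> list[float]:
    """Delay in seconds before each of the first `restarts` container restarts.

    Models the default kubelet crash back-off: start at `initial`, double each
    time, never exceed `cap`. Simplified: ignores the reset after a long healthy run.
    """
    delays = []
    delay = min(initial, cap)
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays

print(restart_delays(7))  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```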
The Container Runtime Interface (CRI) is a plugin interface that decouples Kubernetes from any specific container runtime. A CRI-compliant runtime must be running on each cluster node. Supported runtimes include containerd, CRI-O, and Mirantis Container Runtime, so Kubernetes is not tied to Docker. This decoupling was a key architectural decision: when Kubernetes deprecated its built-in Docker adapter (dockershim) in v1.20 and removed it in v1.24, clusters could move to other CRI runtimes, and images built with Docker kept working because they follow the OCI image format.
Explain self-healing in Kubernetes with concrete examples. How does it differ from simply having a process supervisor like systemd at the node level? Why is the CRI abstraction important for the long-term health of the Kubernetes project?
A Kubernetes cluster is composed of a control plane plus a set of worker machines called nodes. This is a classic master-worker architecture, but with an important twist: every component on the control plane is itself designed for high availability and can be replicated.
```mermaid
graph TB
  subgraph "Control Plane"
    API[kube-apiserver]
    ETCD[(etcd)]
    SCHED[kube-scheduler]
    CM[kube-controller-manager]
    CCM[cloud-controller-manager]
  end
  subgraph "Worker Node 1"
    K1[kubelet]
    KP1[kube-proxy]
    CR1[container runtime]
  end
  subgraph "Worker Node 2"
    K2[kubelet]
    KP2[kube-proxy]
    CR2[container runtime]
  end
  subgraph "Worker Node N"
    K3[kubelet]
    KP3[kube-proxy]
    CR3[container runtime]
  end
  API --- ETCD
  API --- SCHED
  API --- CM
  API --- CCM
  API --- K1
  API --- K2
  API --- K3
```
Control plane components:

| Component | Role |
|---|---|
| kube-apiserver | Exposes the Kubernetes HTTP API. Every interaction with the cluster — from kubectl commands to internal component communication — goes through the API server. It is the front door. |
| etcd | A distributed key-value database that stores all cluster metadata and state. It is the source of truth for the entire cluster. |
| kube-scheduler | Watches for newly created Pods with no assigned node and selects a node for them to run on, based on resource availability, affinity rules, and policies. |
| kube-controller-manager | Runs controller processes — each controller is a control loop that watches the shared state through the API server and makes changes to move the current state toward the desired state. |
| cloud-controller-manager | Embeds cloud-specific control logic, allowing the core Kubernetes code to remain cloud-agnostic. |
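The scheduler's job can be pictured as filter-then-score: discard nodes that cannot fit the Pod's resource requests, then rank the rest. The real kube-scheduler runs many filter and score plugins (affinity, taints, topology spread, and more); this hypothetical sketch keeps only a resource-fit filter and a "most spare capacity left" score, and its node names and numbers are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    cpu_free_m: int   # allocatable CPU not yet requested, in millicores
    mem_free_mi: int  # allocatable memory not yet requested, in MiB

def schedule(cpu_req_m: int, mem_req_mi: int, nodes: list[Node]) -> Optional[str]:
    """Return the chosen node's name, or None if the Pod must stay Pending."""
    # Filter: keep only nodes whose free capacity covers the Pod's requests.
    feasible = [n for n in nodes if n.cpu_free_m >= cpu_req_m and n.mem_free_mi >= mem_req_mi]
    if not feasible:
        return None
    # Score: prefer the most CPU left after placement (ties broken by memory left).
    best = max(feasible, key=lambda n: (n.cpu_free_m - cpu_req_m, n.mem_free_mi - mem_req_mi))
    return best.name

nodes = [Node("node-1", 150, 4096), Node("node-2", 900, 512), Node("node-3", 400, 2048)]
print(schedule(100, 128, nodes))   # node-2: fits and has the most spare CPU
print(schedule(100, 1024, nodes))  # node-3: node-2 is filtered out on memory
print(schedule(2000, 128, nodes))  # None: nothing fits, the Pod stays Pending
```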
Node components:

| Component | Role |
|---|---|
| kubelet | The agent that runs on every node. It ensures that containers described in PodSpecs are running and healthy. |
| kube-proxy | A network proxy that maintains network rules on nodes, implementing the Service abstraction through iptables or IPVS rules. |
| container runtime | The software responsible for running containers (e.g., containerd, CRI-O). Must implement the Container Runtime Interface. |
The only component that talks directly to etcd is the API server. All other components read and write cluster state through the API server. This centralised data path enforces consistency, puts authentication and authorization at a single choke point, and makes the system auditable; because the API server itself is stateless, it can run as several replicas so that the path does not become a bottleneck.
Objects are persistent entities in the Kubernetes system. They represent the state of the cluster: what containerized applications are running and on which nodes, the resources available to those applications, and the policies that govern their behaviour (restarts, upgrades, fault tolerance).
A Kubernetes object is a "record of intent." Once you create an object, the Kubernetes system will constantly work to ensure that the object exists — matching the actual state to the desired state you declared. This is the core reconciliation loop.
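Almost every Kubernetes object includes two nested object fields that govern its behaviour:

- spec: the desired state, which you declare when you create the object;
- status: the current state of the object, supplied and updated by Kubernetes components.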
The control plane continually manages every object's actual state to match the desired state supplied in the spec. This control loop is at the heart of Kubernetes' automation.
When creating an object via a manifest file (conventionally YAML), four fields are required:
| Field | Purpose |
|---|---|
| apiVersion | The version of the Kubernetes API to use (e.g., v1, apps/v1) |
| kind | The type of object being created (e.g., Pod, Deployment, Service) |
| metadata | Data that helps uniquely identify the object (a name, a UID, and an optional namespace), plus labels and annotations |
| spec | The desired state of the object; its format varies per kind and contains nested fields specific to that object |
Here is a minimal manifest that creates a Pod running nginx:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
```
Applied with: kubectl apply -f manifest.yaml
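Since manifests can also be written in JSON, the same Pod can be built programmatically. The sketch below uses only the Python standard library; the `required_fields_present` helper is hypothetical and merely checks the four top-level fields from the table above, whereas the API server performs full schema validation.

```python
import json

# The nginx Pod from the YAML manifest, expressed as a plain Python dict.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "nginx"},
    "spec": {
        "containers": [
            {"name": "nginx", "image": "nginx:1.14.2", "ports": [{"containerPort": 80}]}
        ]
    },
}

def required_fields_present(manifest: dict) -> bool:
    """Shallow check: the four required top-level fields exist and metadata has a name."""
    if not all(key in manifest for key in ("apiVersion", "kind", "metadata", "spec")):
        return False
    return bool(manifest["metadata"].get("name"))

manifest_json = json.dumps(pod, indent=2)
print(required_fields_present(pod))  # True
print(manifest_json)
```

Saved as manifest.json, this output can be applied with kubectl apply -f manifest.json exactly like the YAML version.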
A Pod is the smallest deployable unit in Kubernetes. It represents a group of one or more containers with shared storage and network resources, and a specification for how to run them. The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation — exactly the same primitives that containers themselves use, but now shared among a co-located group of containers.
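Containers within the same Pod share:

- the network namespace: a single IP address and port space, so they can reach each other on localhost;
- the storage volumes declared in the Pod spec, which each container can mount.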
Why not manage containers directly? The Pod abstraction gives Kubernetes a uniform unit of scheduling and lifecycle management, regardless of whether the container runtime uses Docker, containerd, or CRI-O. It also enables advanced patterns (sidecars, init containers) without changing the scheduler.
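A fuller Pod manifest, with labels and resource requests and limits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-app
  labels:
    app: node
spec:
  containers:
  - name: node-container
    image: node:18
    # Serve "Hello World!" on port 3000 so the declared containerPort has a
    # listener; a bare console.log would exit at once and be restarted repeatedly.
    command: ["node", "-e", "require('http').createServer((req, res) => res.end('Hello World!')).listen(3000)"]
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "256Mi"
        cpu: "500m"
    ports:
    - containerPort: 3000
      protocol: TCP
```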
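The `100m` and `128Mi` values in the Pod manifest above are Kubernetes resource quantities. The hypothetical helpers below convert the forms used in this document into plain numbers; they cover only the millicore suffix and the binary suffixes Ki, Mi, and Gi (real quantities also allow decimal suffixes such as M and G, exponents, and more).

```python
def parse_cpu(q: str) -> float:
    """CPU quantity -> cores. '100m' is 100 millicores = 0.1 core; '2' is 2 cores."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000
    return float(q)

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def parse_memory(q: str) -> int:
    """Memory quantity -> bytes, for plain integers and the Ki/Mi/Gi suffixes only."""
    for suffix, factor in _BINARY.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)

print(parse_cpu("100m"), parse_cpu("500m"))  # 0.1 0.5
print(parse_memory("128Mi"))                 # 134217728
```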
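Key resource fields:

- requests: the CPU and memory the scheduler reserves for the container when choosing a node;
- limits: the maximum the container may use (a container that exceeds its CPU limit is throttled; one that exceeds its memory limit is OOM-killed);
- 100m = 0.1 CPU core = 10% of a core;
- 128Mi = 128 mebibytes (128 × 2^20 = 134,217,728 bytes).

A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. It guarantees the availability of a specified number of identical Pods. When a ReplicaSet needs to create new Pods, it uses its Pod template. However, ReplicaSets are usually not managed directly by the user: they are controlled by a higher-level object called a Deployment.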
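A Deployment manages a set of Pods to run an application workload. It provides declarative updates for Pods and ReplicaSets, enabling:

- rolling updates to a new Pod template without downtime;
- rollback to an earlier revision;
- scaling the number of replicas up or down;
- pausing and resuming a rollout.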
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
```
The replicas field specifies the desired number of Pods. The selector tells the Deployment which Pods to manage (via label matching). The template contains the Pod specification used to create new Pods.
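`matchLabels` is an equality-based, AND-ed match: a Pod is selected when every key/value pair in the selector appears in the Pod's labels, and the Pod may carry extra labels (the Deployment controller itself adds a `pod-template-hash` label to the Pods it creates). A minimal sketch of that rule; the `matches` function and the Pod names are hypothetical, and `matchExpressions` with In/NotIn/Exists operators is not modelled.

```python
def matches(match_labels: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Equality-based selector: every selector pair must appear in the Pod's labels."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())

selector = {"app": "nginx"}  # spec.selector.matchLabels from the Deployment above
pods = {
    "nginx-a": {"app": "nginx", "pod-template-hash": "abc123"},  # extra labels are fine
    "nginx-b": {"app": "nginx"},
    "redis-a": {"app": "redis"},
}
owned = sorted(name for name, labels in pods.items() if matches(selector, labels))
print(owned)  # ['nginx-a', 'nginx-b']
```

This is also why a Deployment's template labels must satisfy its own selector: the API server rejects a Deployment whose selector does not match its template labels, since it could never find the Pods it creates.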
Deployments are not the only solution for managing workloads. Kubernetes provides specialised controllers for different use cases:
- Job: runs short-lived, one-off tasks. A Job creates one or more Pods and ensures that a specified number of them terminate successfully. Ideal for batch processing, data migration, or any finite-duration computation.
- CronJob: handles regularly scheduled actions such as backups, report generation, log rotation, or periodic cleanup. A CronJob creates Jobs on a repeating schedule defined using cron syntax.
- StatefulSet: manages Pods like a Deployment but maintains a sticky identity for each Pod. Pods are created from the same spec but are not interchangeable: each has a persistent identifier and stable network identity. Essential for stateful applications like databases (MySQL, PostgreSQL, MongoDB) where each instance has unique data.
- DaemonSet: ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them; as nodes are removed, those Pods are garbage collected. Common use cases: log collection daemons (Fluentd, Filebeat), monitoring agents (Node Exporter), and storage daemons.
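The sticky identity of StatefulSet Pods follows fixed naming rules: Pod names are the StatefulSet name plus an ordinal, each claim created from a volumeClaimTemplate is named `<template>-<statefulset>-<ordinal>`, and, with a governing headless Service, each Pod gets the DNS name `<pod>.<service>.<namespace>.svc.<cluster-domain>`. The sketch below only derives those names; `postgres`, `data`, and `default` are illustrative values, and `cluster.local` is the default cluster domain, which some clusters change.

```python
def stateful_identity(sts: str, service: str, claim_template: str, namespace: str,
                      replicas: int, domain: str = "cluster.local") -> list[dict[str, str]]:
    """Stable per-replica identity: a given ordinal always maps to the same names."""
    return [
        {
            "pod": f"{sts}-{i}",
            "pvc": f"{claim_template}-{sts}-{i}",
            "dns": f"{sts}-{i}.{service}.{namespace}.svc.{domain}",
        }
        for i in range(replicas)
    ]

for ident in stateful_identity("postgres", "postgres", "data", "default", 3):
    print(ident["pod"], ident["pvc"], ident["dns"])
# postgres-0 data-postgres-0 postgres-0.postgres.default.svc.cluster.local
# postgres-1 data-postgres-1 postgres-1.postgres.default.svc.cluster.local
# postgres-2 data-postgres-2 postgres-2.postgres.default.svc.cluster.local
```

If postgres-1 is deleted, its replacement is created with the same ordinal, so it gets the same name, binds the same claim data-postgres-1, and answers at the same DNS name; a Deployment's replacement Pod would instead get a new random suffix.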
When would you use a StatefulSet instead of a Deployment? Explain the concept of "sticky identity" and why stateless and stateful workloads require different controllers in Kubernetes.
Pods are ephemeral — they can be created, destroyed, and moved between nodes, which means their IP addresses are subject to change. A Service provides a stable endpoint (a fixed IP address and DNS name) to access a set of Pods, using label selectors to identify the target Pods.
A Service is the entry point of the application. It defines a logical set of Pods and a policy by which to access them, redirecting requests to the appropriate Pods regardless of where they are running in the cluster.
| Service Type | Behaviour |
|---|---|
| ClusterIP (default) | Exposes the Service on a cluster-internal IP. Only reachable from within the cluster. Suitable for internal communication between microservices. |
| NodePort | Exposes the Service on each Node's IP at a static port (by default in the range 30000-32767). Accessible from outside the cluster via `<NodeIP>:<NodePort>`. |
| LoadBalancer | Exposes the Service externally using a cloud provider's load balancer (or MetalLB for on-premises deployments). Creates a NodePort and ClusterIP Service automatically, then provisions an external load balancer that forwards to the NodePort. |
```yaml
apiVersion: v1
kind: Service
metadata:
  name: node-service
  labels:
    app: node
spec:
  selector:
    app: node
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
  type: ClusterIP
```
The port is the Service port (client-facing); the targetPort is the container port inside the Pod. The selector must match the labels on the target Pods — this is how the Service discovers its backends dynamically as Pods come and go.
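Put together, a Service forwards traffic on its `port` to the `targetPort` of every ready Pod whose labels match its selector, spreading connections across them. The toy model below is hypothetical (kube-proxy actually programs iptables, IPVS, or nftables rules on each node rather than running Python code), and the Pod IPs are made up; it shows the selector-to-endpoints step and what happens when nothing matches.

```python
import random
from typing import Optional

pods = [
    {"ip": "10.244.1.5", "labels": {"app": "node"}, "ready": True},
    {"ip": "10.244.2.7", "labels": {"app": "node"}, "ready": False},  # failing readiness
    {"ip": "10.244.3.9", "labels": {"app": "redis"}, "ready": True},
]

def endpoints(selector: dict[str, str], target_port: int, pods: list[dict]) -> list[str]:
    """Ready Pods whose labels satisfy the selector, as ip:targetPort strings."""
    return [
        f"{p['ip']}:{target_port}"
        for p in pods
        if p["ready"] and all(p["labels"].get(k) == v for k, v in selector.items())
    ]

def route(eps: list[str]) -> Optional[str]:
    """Pick a backend at random, roughly as kube-proxy's iptables mode does."""
    return random.choice(eps) if eps else None  # None: no endpoints, the connection fails

eps = endpoints({"app": "node"}, 3000, pods)
print(eps, route(eps))                               # ['10.244.1.5:3000'] 10.244.1.5:3000
print(route(endpoints({"app": "web"}, 3000, pods)))  # None
```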
Services decouple who needs to be reached (the application) from where it currently runs (the Pod IPs). This is exactly the same pattern as DNS for physical machines: a hostname stays constant even when the underlying IP changes. In Kubernetes, a Service provides a virtual IP and DNS name that survive individual Pod restarts and rescheduling.
Understanding the Pod lifecycle is essential for debugging and for designing applications that gracefully handle Kubernetes' operational model. A Pod traverses several phases from creation to termination, and the transitions are driven by the kubelet on the node where the Pod is scheduled.
Pods are never "repaired" in place. If a container crashes, the kubelet restarts it (subject to the restart policy), but if a node fails entirely, Pods are rescheduled on a different node as new Pods with new IPs. Applications must be designed to tolerate this: store state externally (in a volume or database), use Services for discovery, and never assume a fixed IP or hostname survives a rescheduling event.
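The lifecycle phases are:

- Pending: the Pod has been accepted by the cluster, but one or more containers are not yet running (for example, it is waiting to be scheduled or its images are still being pulled).
- Running: the Pod is bound to a node, all containers have been created, and at least one is running, starting, or restarting.
- Succeeded: all containers terminated successfully and will not be restarted.
- Failed: all containers terminated, and at least one terminated in failure.
- Unknown: the Pod's state could not be obtained, typically because the node cannot be reached.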
The lecture presents a concrete migration of a simple web application — a visit counter composed of a Flask backend and a Redis in-memory store. The full source is available at github.com/Mala1180/kubernetes-hpa-example.
The application is minimal: a GET / endpoint increments a hit counter in Redis and returns the count along with the serving Pod's hostname. A GET /busy-wait endpoint performs a CPU-intensive operation for stress testing. The Docker Compose file defines two services — backend and redis — with the backend depending on Redis:
```yaml
services:
  backend:
    image: username/kubernetes-example-backend:latest
    build: .
    ports:
      - "3000:3000"
    environment:
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    depends_on:
      - redis
  redis:
    image: redis:7-alpine
```
The application is developed locally using Docker Compose. The depends_on directive makes Compose start the Redis container before the backend (on its own it does not wait until Redis accepts connections), and environment variables wire the components together. The backend connects to Redis at redis:6379; Compose's built-in DNS resolves the name redis to the Redis container's IP.
Each Compose service is mapped to two Kubernetes objects: a Deployment and a Service. An additional Horizontal Pod Autoscaler object is defined for the backend.
Backend Deployment — defines replicas, the container image (pushed to Docker Hub), environment variables for Redis connectivity, and resource requests/limits:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
      - name: backend
        image: mala1180/kubernetes-example-backend:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 3000
        env:
        - name: REDIS_HOST
          value: "redis"
        - name: REDIS_PORT
          value: "6379"
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
```
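Assuming the Redis Service is named `redis` (which the REDIS_HOST value implies), the cluster DNS resolves that name to the Service's ClusterIP, so the backend can keep exactly the configuration it used under Compose. A hypothetical sketch of how such a backend might read its settings; the lecture's actual Flask code is in the linked repository and may differ.

```python
import os
from collections.abc import Mapping

def redis_address(env: Mapping[str, str] = os.environ) -> tuple[str, int]:
    """Read the Redis location injected through the Deployment's env section."""
    host = env.get("REDIS_HOST", "localhost")  # "redis" under both Compose and Kubernetes
    port = int(env.get("REDIS_PORT", "6379"))
    return host, port

print(redis_address({"REDIS_HOST": "redis", "REDIS_PORT": "6379"}))  # ('redis', 6379)
```

Inside the cluster, the short name redis expands through the Pod's DNS search path to redis.<namespace>.svc.cluster.local (on a cluster using the default domain).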
Backend Service — exposes the backend externally via NodePort:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend
  labels:
    app: backend
spec:
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
    protocol: TCP
  selector:
    app: backend
```
The Redis service follows an analogous pattern (Deployment + Service with ClusterIP type, since Redis should not be exposed externally).
The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of Pod replicas in a Deployment based on observed resource utilisation or custom metrics. In the example, the HPA is configured to scale out when average CPU usage across the Pods exceeds 50% of the requested CPU, or average memory usage exceeds 70% of the requested memory, keeping between 1 and 10 replicas:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```
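The Kubernetes documentation gives the HPA's core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). With several metrics, the HPA computes a proposal per metric and takes the largest, and it skips changes when the ratio is within a tolerance (0.1 by default). The sketch below implements only that arithmetic plus the min/max clamp; stabilisation windows, scaling policies, and the handling of missing or not-yet-ready Pods are omitted.

```python
import math

def desired_replicas(current: int, metrics: dict[str, tuple[float, float]],
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Core HPA arithmetic (simplified).

    metrics maps a metric name to (current average utilization %, target utilization %),
    where utilization is measured relative to the containers' resource requests.
    """
    proposals = []
    for current_util, target_util in metrics.values():
        ratio = current_util / target_util
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current)  # close enough to target: no change
        else:
            proposals.append(math.ceil(current * ratio))
    # With several metrics the largest proposal wins; then clamp to [min, max].
    return max(min_replicas, min(max_replicas, max(proposals)))

# Targets from the manifest above: CPU 50%, memory 70%.
print(desired_replicas(1, {"cpu": (200.0, 50.0), "memory": (60.0, 70.0)}))  # 4
print(desired_replicas(4, {"cpu": (10.0, 50.0), "memory": (30.0, 70.0)}))   # 2
print(desired_replicas(8, {"cpu": (200.0, 50.0), "memory": (60.0, 70.0)}))  # 10
```

In the second case CPU alone would shrink the Deployment to 1 replica, but the memory proposal keeps it at 2; a real HPA would also delay that scale-down by its stabilisation window. The first case reflects the backend's 200m CPU limit against a 100m request, which keeps CPU utilization from climbing much above 200%.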
To validate the autoscaling, a stress-test script generates load on the backend through the CPU-intensive /busy-wait endpoint, which exists for this purpose.
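The lecture's script is not reproduced here. As a hedged stand-in, the sketch below fires concurrent GET requests using only the Python standard library; the target URL, request count, and concurrency level are assumptions to adjust (for example to the address printed by minikube service backend, or http://localhost:8080/busy-wait when using kubectl port-forward).

```python
import concurrent.futures
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable

def http_get(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code, or -1 if no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # the server answered with 4xx/5xx
        return err.code
    except OSError:                        # refused, reset, timed out, DNS failure, etc.
        return -1

def stress(url: str, total: int = 500, workers: int = 20,
           get: Callable[[str], int] = http_get) -> Counter:
    """Send `total` GET requests from `workers` threads and tally the status codes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(lambda _: get(url), range(total)))
```

Run it, for example, as stress("http://localhost:8080/busy-wait") while kubectl get hpa backend-hpa --watch shows the replica count reacting in another terminal.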
The practical workflow for a local deployment (using Minikube):
```shell
# Start a local Kubernetes cluster
minikube start
# Enable the metrics server (required for HPA)
minikube addons enable metrics-server
# Apply all manifests
kubectl apply -f k8s/
# Access the application
minikube service backend
# or: kubectl port-forward svc/backend 8080:3000
# Open the monitoring dashboard
minikube dashboard
```
Walk through what happens when a stress-test script hits a Kubernetes-deployed application behind an HPA. Describe the sequence of events from initial overload to recovery, naming each Kubernetes component involved and the data it observes or acts upon.
So far we have focused on scalability, availability, and reliability — but another critical quality attribute is observability. A production system will eventually go down for some reason; an observable system allows operators to understand what happened and what is happening.
The standard monitoring stack for Kubernetes combines two open-source tools:
| Tool | Role |
|---|---|
| Prometheus | Open-source systems monitoring and alerting toolkit. It scrapes metrics from instrumented applications and Kubernetes components, stores them as time-series data, and provides a powerful query language (PromQL). It can also fire alerts based on threshold conditions. |
| Grafana | Open-source web interface for analytics and monitoring. It connects to Prometheus (and many other data sources) to visualise time-series data through dashboards, allowing operators to see CPU usage, memory consumption, request rates, error counts, and HPA activity at a glance. |
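"Instrumented" means the application exposes an HTTP endpoint (conventionally /metrics) that Prometheus scrapes, in Prometheus's plain-text exposition format. In practice you would use an official client library such as prometheus_client; the hypothetical helper below renders that text format by hand only to show what Prometheus actually reads, using an invented request counter for the example app.

```python
def render_counter(name: str, help_text: str,
                   samples: dict[tuple[tuple[str, str], ...], float]) -> str:
    """Render one counter family in Prometheus's text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Simplified: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value:g}")
    return "\n".join(lines) + "\n"

text = render_counter(
    "app_requests_total",
    "Requests served by this Pod, by path.",
    {(("path", "/"),): 42, (("path", "/busy-wait"),): 7},
)
print(text, end="")
# # HELP app_requests_total Requests served by this Pod, by path.
# # TYPE app_requests_total counter
# app_requests_total{path="/"} 42
# app_requests_total{path="/busy-wait"} 7
```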
The cloud-native observability stack is often described through three pillars: metrics (Prometheus — quantitative data over time), logs (Elasticsearch/Fluentd/Kibana or Loki — structured event records), and traces (Jaeger or OpenTelemetry — end-to-end request flows across services). Kubernetes integrates with all three but does not mandate any specific tool — the CRI and the metrics API provide the integration points.
During the stress test, the Grafana dashboard visualises metrics collected by Prometheus, showing the spike in CPU/memory that triggers the HPA, the creation of new Pods, and the eventual stabilisation as load is distributed across replicas.
Kompose is a conversion tool that automates the translation of Docker Compose files into Kubernetes manifests, saving time during migration. It supports Kompose-specific labels within the compose.yml file to explicitly control how resources are generated:
```yaml
services:
  backend:
    ...
    labels:
      kompose.service.type: nodeport
      kompose.controller.type: deployment
      kompose.image-pull-policy: always
```
The conversion is a single command:
```shell
kompose convert -f docker-compose.yml -o k8s/
```
Kompose generates Deployment, Service, and other object manifests in the output directory. While automated conversion is a useful starting point, production Kubernetes manifests typically require manual refinement: resource requests and limits, health checks (liveness/readiness probes), security contexts, ConfigMaps for configuration, and Secrets for sensitive data are all Kubernetes-native concepts with no direct Docker Compose equivalent.
Kompose handles the mechanical translation of Compose syntax to Kubernetes syntax, but it does not add Kubernetes-specific best practices. A generated manifest lacks probes, resource tuning, pod disruption budgets, affinity rules, and network policies. Treat Kompose output as a starting template, not a production-ready configuration.
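Let's synthesise the complete picture. Moving a distributed application from development to production on Kubernetes involves a deliberate pipeline that maps architectural concerns to Kubernetes abstractions:

1. Containerise each component and push its image to a registry (Docker Hub in the example).
2. Describe and test the local wiring with Docker Compose.
3. Map each Compose service to a Deployment and a Service, by hand or starting from Kompose output.
4. Add a Horizontal Pod Autoscaler for the components that must scale, together with the resource requests it measures against.
5. Apply the manifests with kubectl and observe the system through Prometheus and Grafana.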
The result is a system in which each quality attribute is backed by a concrete Kubernetes mechanism:
| Quality Attribute | Kubernetes Mechanism |
|---|---|
| Availability | ReplicaSets ensure multiple Pod instances; Services route around failed instances |
| Scalability | Horizontal Pod Autoscaler adjusts replica count based on real metrics |
| Reliability | Self-healing replaces failed containers and reschedules Pods from failed nodes |
| Recoverability | Declarative desired state + reconciliation loop = automatic recovery from failures |
| Observability | Prometheus metrics scraping + Grafana dashboards + alerting rules |
| Maintainability | Declarative YAML manifests under version control; rollout/rollback for zero-downtime updates |
Kubernetes is best understood as a distributed-system kernel — it provides process scheduling (Pods on nodes), inter-process communication (Services + DNS), storage (PersistentVolumes), and resource isolation (namespaces, cgroups, RBAC) across a cluster of machines. The difference from a traditional OS kernel is that the "processes" are containers, the "machines" are nodes, and failure is treated as a normal condition rather than an exceptional one.
Docker alone provides container packaging, running, and basic networking. Kubernetes adds: (1) automated scheduling — deciding which node runs which Pod based on resource availability and constraints; (2) self-healing — continuously replacing failed containers and rescheduling Pods from failed nodes to healthy ones; (3) declarative autoscaling — automatically adjusting the number of replicas based on CPU, memory, or custom metrics via the HPA; (4) service discovery and load balancing — providing stable IPs/DNS names for ephemeral Pods through Services; (5) rolling updates and rollbacks — deploying new versions with zero downtime and the ability to revert.
The Spec is the user-declared desired state. The Status is the current observed state, continuously updated by Kubernetes components. The distinction is fundamental because it enables the reconciliation loop: Kubernetes controllers continuously compare Status to Spec and take actions to close the gap. This is what makes Kubernetes self-healing and declarative — the user does not write imperative scripts; they declare intent, and the system works to make it true. Without this split, Kubernetes would be a static deployment tool rather than a continuously operating control system.
Pods are ephemeral: their IP addresses change every time they are recreated or rescheduled, so hard-coded IPs would break whenever a Pod restarts. Label selectors provide an indirection layer: a Service targets "all Pods with label app=backend" rather than "Pod at 10.244.1.5." The EndpointSlice controller dynamically updates the Service's list of backend endpoints as Pods matching the selector are created, destroyed, or change readiness, ensuring the Service routes only to current, ready Pods. This is the same pattern as DNS for physical hosts: a name stays constant while the underlying address changes.
1. The user runs kubectl apply -f deployment.yaml, which sends the manifest to kube-apiserver.
2. The API server authenticates, authorizes, and validates the request, then persists the Deployment object in etcd.
3. The Deployment controller (part of kube-controller-manager), watching through the API server, sees the new Deployment and creates a ReplicaSet object via the API server, which stores it in etcd.
4. The ReplicaSet controller sees the new ReplicaSet and creates Pod objects with no node assigned, again via the API server.
5. The kube-scheduler watches for unassigned Pods, selects a suitable worker node for each based on resource availability and constraints, and records its choice by posting a binding to the API server.
6. The kubelet on the assigned node sees that a Pod has been bound to its node and instructs the container runtime (via CRI) to pull the image and start the container.
7. The kubelet reports the Pod's status back to the API server, which updates the Pod's status field in etcd.
A StatefulSet is the right choice when Pods need stable, unique identities that persist across rescheduling. Concretely: a three-instance PostgreSQL cluster with one primary and two replicas, each holding its own copy of the data on its own volume. With a Deployment, a replacement Pod gets a new random name and IP and no guaranteed link to the volume (and therefore the data and role) of the Pod it replaces. A StatefulSet assigns each Pod a stable ordinal name (postgres-0, postgres-1, postgres-2), a stable network identity, and stable storage: when postgres-1 dies, its replacement is also named postgres-1 and reattaches the same PersistentVolumeClaim. Deployments treat all Pods as interchangeable; StatefulSets preserve identity.
A Deployment has replicas: 3. One worker node crashes entirely. Describe what Kubernetes does, step by step, and why.

1. The kubelet on the crashed node stops sending heartbeats to the API server. After a timeout (by default roughly 40 seconds), the node controller marks the node NotReady and taints it as unreachable.
2. The Pods on that node are marked not Ready, so they stop receiving Service traffic; their status may show as Unknown or, later, Terminating.
3. After a further grace period (by default 300 seconds, from the node.kubernetes.io/unreachable toleration that Kubernetes adds to Pods automatically), the Pods are evicted.
4. The ReplicaSet controller observes that fewer active Pods exist than the 3 declared in the Deployment spec (Pods that are being deleted no longer count).
5. The ReplicaSet controller creates one replacement Pod object for each evicted Pod.
6. The scheduler assigns them to healthy nodes (assuming sufficient capacity).
7. The kubelet on each target node starts the new containers, and the system converges back to 3 replicas.

If a Service fronts these Pods, the EndpointSlice controller removes the lost Pods from its endpoints and adds the replacements once they are ready, and kube-proxy updates each node's forwarding rules to match. Replicas on surviving nodes keep serving throughout; only if all three replicas ran on the crashed node does the Service have no ready endpoints until the replacements start.
The CRI is a plugin interface that decouples Kubernetes from any specific container runtime. Before the CRI, Docker support was built directly into the kubelet; once the CRI was introduced, Docker was kept working through a built-in adapter called dockershim. The CRI allows any runtime that implements the interface (containerd, CRI-O, Mirantis Container Runtime) to be used interchangeably. This mattered for three reasons: (1) it prevented vendor lock-in to Docker; (2) it let the community develop purpose-built runtimes optimized for Kubernetes (like CRI-O); (3) when Kubernetes deprecated dockershim in v1.20 and removed it in v1.24, the ecosystem could transition smoothly because all major runtimes already supported the CRI.
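What happens during a rolling update of a Deployment from nginx:1.14 to nginx:1.16? How does Kubernetes prevent downtime?

The Deployment controller creates a new ReplicaSet for the updated Pod template. It then gradually scales up the new ReplicaSet while scaling down the old one, within the bounds set by maxSurge (how many Pods may exist above the desired count) and maxUnavailable (how many may be unavailable below it). Typically, a new Pod is created; once it passes its readiness checks, an old Pod is terminated; this repeats until all Pods belong to the new ReplicaSet. The Service keeps routing traffic to every ready Pod, old or new, during the transition, and because at most maxUnavailable Pods are unavailable at any moment (none with maxUnavailable: 0), clients see no downtime, provided readiness probes reflect when the application can actually serve. If the new Pods never become ready, the rollout stalls instead of removing more old Pods, and the Deployment reports that it is not progressing once progressDeadlineSeconds expires. Kubernetes does not roll back automatically, but you can revert with kubectl rollout undo.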
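Both bounds are Deployment fields that, per the Kubernetes documentation, default to 25%; a percentage maxSurge is rounded up and a percentage maxUnavailable is rounded down. The hypothetical helper below computes the resulting Pod-count window for a given replica count.

```python
import math
from typing import Union

def rollout_window(replicas: int, max_surge: Union[int, str] = "25%",
                   max_unavailable: Union[int, str] = "25%") -> tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rolling update."""
    def resolve(value: Union[int, str], round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            raw = replicas * int(value[:-1]) / 100
            return math.ceil(raw) if round_up else math.floor(raw)
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge

print(rollout_window(4))         # (3, 5)
print(rollout_window(2))         # (2, 3)
print(rollout_window(10, 1, 0))  # (10, 11)
```

With the defaults and a small replica count, maxUnavailable rounds down to 0, so the ready count never drops below the desired count: the earlier nginx Deployment with replicas: 2 rolls out within a window of 2 to 3 Pods.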
Kompose performs a mechanical syntax translation from Compose to Kubernetes. It typically misses:

1. Liveness and readiness probes: without them, Kubernetes cannot tell a hung container from a healthy one, or know when a new Pod is ready for traffic.
2. Resource requests and limits: without them, the scheduler cannot make informed placement decisions, and the HPA cannot compute utilization.
3. Security contexts: Compose offers only a few container-level settings (such as user and read_only), so generated Pods rarely carry a reviewed securityContext (non-root user, read-only root filesystem, dropped capabilities).
4. Pod disruption budgets, which protect availability during voluntary disruptions such as node drains.
5. ConfigMaps and Secrets: Compose setups usually pass configuration as environment variables or files, whereas Kubernetes has first-class objects for configuration and secrets management.
6. Network policies: Compose networks are permissive by default; production Kubernetes should restrict inter-service communication.
- Metrics: quantitative data over time (CPU usage, request rate, error count). Collected by Prometheus, which scrapes endpoints and stores time-series data, and visualised in Grafana dashboards.
- Logs: structured or semi-structured event records emitted by applications and system components. The lecture references monitoring integration points; a common Kubernetes logging stack is Fluentd (collection) + Elasticsearch (storage) + Kibana (visualisation), the EFK stack, though it is not covered in detail in this lecture.
- Traces: end-to-end request flows across services. While not covered in the lecture slides, tools like Jaeger or OpenTelemetry integrate with Kubernetes to provide distributed tracing, showing latency breakdowns per service hop.
Not always. If the selector matches zero Pods, the Service's endpoint list is empty — any request to the ClusterIP will be rejected (connection refused or timeout), so it cannot forward traffic. However, the Service still provides a stable DNS name and ClusterIP that can exist before Pods are ready. This is useful for bootstrapping: you can create all Services first, then deploy Pods later. Once Pods matching the selector appear, the Service starts routing to them automatically — no reconfiguration needed. So the Service is not useless; it is an inert name reservation waiting for backends. This is standard practice in GitOps workflows where manifests are applied in dependency order.
Several possibilities:

1. Downstream bottleneck: the backend may be waiting on Redis or a database that cannot scale, so adding more backend replicas does not improve throughput.
2. Node resource exhaustion: if all nodes are at capacity and a cluster autoscaler cannot add nodes fast enough (or none is installed), new Pods remain Pending.
3. HPA metric lag: metrics-server scrapes resource usage on an interval, the HPA controller re-evaluates every 15 seconds by default, and scale-down is damped by a stabilisation window (300 seconds by default), so scaling reacts with a delay; new Pods also need time to start and pass readiness checks.
4. Connection pool saturation: each Pod may hold its own connection pool to an external resource, so adding Pods can saturate the downstream faster.
5. CPU is not the bottleneck: the real constraint might be I/O, network bandwidth, or file descriptors, and the HPA is watching the wrong metric.