Kubernetes doesn’t just use etcd; it’s fundamentally built on etcd’s ability to reliably store and serve distributed state.
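Let’s watch it in action. Imagine you have a Deployment that should run three nginx pods. The Deployment controller, which runs inside kube-controller-manager, watches Deployment objects through the API server. When it sees a Deployment with no matching ReplicaSet, it creates a ReplicaSet object, and the ReplicaSet controller in turn creates three Pod objects. Each of these objects, like all Kubernetes objects, is persisted to etcd by the kube-apiserver, the only component that talks to etcd directly.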
# Example Deployment object (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
The kube-scheduler then notices the new Pod objects that have no node assigned and picks a node for each. The chosen nodeName is recorded on the Pod object through the API server, which persists it to etcd. Finally, the kubelet on each node watches the API server for pods bound to its node. When it sees one, it starts the containers and reports their status back. None of these components reads etcd directly, but every change, every desired state, and every actual state update passes through the API server and ends up in etcd.
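To make the hand-offs concrete, here is a minimal sketch of that chain as a toy simulation. It is not real Kubernetes code: a plain dict stands in for the API server and its etcd backing store, each control loop runs exactly once, and every function name, key layout, and node name is a hypothetical chosen for illustration.

```python
# Toy model of the control flow. A plain dict stands in for the API server
# and its etcd backing store; every function name here is hypothetical.


def deployment_controller(store):
    """Create one ReplicaSet for each Deployment that does not have one yet."""
    for obj in list(store.values()):
        if obj["kind"] != "Deployment":
            continue
        rs_name = obj["name"] + "-rs"
        store.setdefault(
            "replicasets/" + rs_name,
            {"kind": "ReplicaSet", "name": rs_name, "replicas": obj["replicas"]},
        )


def replicaset_controller(store):
    """Create missing Pod objects until each ReplicaSet has its replica count."""
    for obj in list(store.values()):
        if obj["kind"] != "ReplicaSet":
            continue
        for i in range(obj["replicas"]):
            pod_name = f"{obj['name']}-{i}"
            store.setdefault(
                "pods/" + pod_name,
                {"kind": "Pod", "name": pod_name, "nodeName": None, "phase": "Pending"},
            )


def scheduler(store, nodes):
    """Bind unscheduled Pods to nodes round-robin (real scheduling is far richer)."""
    unscheduled = [
        o for o in store.values() if o["kind"] == "Pod" and o["nodeName"] is None
    ]
    for i, pod in enumerate(unscheduled):
        pod["nodeName"] = nodes[i % len(nodes)]


def kubelet(store, node):
    """Start every Pending Pod that is bound to this node."""
    for obj in store.values():
        if obj["kind"] == "Pod" and obj["nodeName"] == node and obj["phase"] == "Pending":
            obj["phase"] = "Running"


def simulate(replicas, nodes):
    """Run each loop once, in order, and return the resulting store."""
    store = {
        "deployments/nginx-deployment": {
            "kind": "Deployment",
            "name": "nginx-deployment",
            "replicas": replicas,
        }
    }
    deployment_controller(store)
    replicaset_controller(store)
    scheduler(store, nodes)
    for node in nodes:
        kubelet(store, node)
    return store


if __name__ == "__main__":
    for key, obj in simulate(3, ["node-a", "node-b"]).items():
        if obj["kind"] == "Pod":
            print(key, obj["nodeName"], obj["phase"])
```

The point of the sketch is that no loop calls another directly. Each one only reads and writes shared state, which is why the store behind that state has to be trustworthy.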
The core problem etcd solves for Kubernetes is giving many independent components a single source of truth for the cluster’s state. Without a consistent, highly available key-value store like etcd, Kubernetes would devolve into a chaotic mess of competing agents, each acting on its own view of the cluster. etcd provides the foundation of distributed consensus, leader election, and reliable state replication that makes a container orchestrator possible.
Internally, etcd uses the Raft consensus algorithm to ensure that all members of the etcd cluster agree on the order of operations and the current state. When a client (in Kubernetes, the API server) writes data to etcd, the write is proposed to the etcd leader. The leader appends it to its log and replicates it to its followers. Once a majority of the cluster’s members (a quorum, counting the leader itself) have persisted the entry, it is considered committed and applied to the state machine. This makes etcd strongly consistent: reads are linearizable by default, so a read returns the most recently committed write. Clients can opt into cheaper serializable reads, which may return slightly stale data.
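The majority rule is easy to work through by hand. The sketch below computes the quorum size and the number of member failures a cluster can survive; the function names are hypothetical helpers, not part of any etcd client library.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a quorum is still reachable."""
    return members - quorum(members)


def is_committed(acks, members):
    """True once a majority has persisted the entry (acks includes the leader)."""
    return acks >= quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Note that a four-member cluster tolerates only one failure, the same as a three-member cluster, which is why etcd clusters are normally sized with an odd number of members.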
The "keys" in etcd look like hierarchical paths, much like a filesystem, although the etcd v3 keyspace is actually flat and the hierarchy is only a naming convention. For example, a pod definition might be stored at /registry/pods/default/my-pod-name. Kubernetes uses these paths to organize its objects. The "values" are the serialized representations of Kubernetes objects: Protocol Buffers for built-in types by default, and JSON for custom resources. etcd also supports watch notifications. When a client watches a specific key or key prefix, etcd streams an event to the client whenever a matching key is created, updated, or deleted. The API server builds its own watch API on top of this mechanism, which is how Kubernetes controllers stay informed about state changes and react accordingly.
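The following sketch illustrates the idea of path-like keys and prefix watches with a tiny in-memory store. It is an assumption-laden toy, not the etcd API: the class `ToyStore`, its methods, and the service key are all hypothetical, and a real watch is a stream delivered over gRPC rather than a synchronous callback.

```python
class ToyStore:
    """In-memory stand-in for a flat keyspace with prefix watches (not the etcd API)."""

    def __init__(self):
        self.data = {}      # key -> serialized value
        self.revision = 0   # bumped on every change, loosely like an etcd revision
        self.watchers = []  # list of (prefix, callback) pairs

    def watch_prefix(self, prefix, callback):
        """Register a callback for every change to a key starting with prefix."""
        self.watchers.append((prefix, callback))

    def put(self, key, value):
        self.revision += 1
        self.data[key] = value
        self._notify("PUT", key, value)

    def delete(self, key):
        if key in self.data:
            self.revision += 1
            del self.data[key]
            self._notify("DELETE", key, None)

    def _notify(self, event_type, key, value):
        # The keyspace is flat: "children" are simply keys sharing a prefix.
        for prefix, callback in self.watchers:
            if key.startswith(prefix):
                callback(event_type, key, value, self.revision)


def demo():
    """Watch the pods prefix and return the events that were delivered."""
    store = ToyStore()
    events = []
    store.watch_prefix("/registry/pods/default/", lambda *event: events.append(event))
    store.put("/registry/pods/default/my-pod-name", b'{"kind":"Pod"}')
    store.put("/registry/services/specs/default/my-svc", b'{"kind":"Service"}')
    store.delete("/registry/pods/default/my-pod-name")
    return events


if __name__ == "__main__":
    for event in demo():
        print(event)
```

Only the two pod events reach the watcher. The service write bumps the revision but does not match the prefix, so the watcher never hears about it.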
The levers you control fall into two groups: how you configure and operate etcd itself, and how much load you place on it through the Kubernetes API. On the etcd side, this means giving it fast disks, sizing its storage quota, compacting and defragmenting its history, and restricting access with TLS client certificates and network rules (direct access by anything other than the API server is rare and discouraged). On the Kubernetes side, object-count quotas limit how many objects can accumulate in etcd. It also means understanding how etcd’s performance affects cluster operations. For instance, if etcd becomes a bottleneck, operations like scaling up deployments or creating new services will slow down dramatically because every write through the API server waits for etcd to commit it.
A crucial, often overlooked, aspect of etcd’s operation within Kubernetes is its role in optimistic concurrency control. When multiple controllers might try to modify the same resource simultaneously, etcd’s transactional compare-and-swap (CAS) operations are essential. A controller reads a resource, makes a modification, and then attempts to write it back only if the resource version (the object’s resourceVersion field) hasn’t changed since it was read. If the version has changed (meaning another writer modified the resource first), the API server rejects the write with a 409 Conflict, and the controller must re-read, re-apply its logic, and try again. No lock is ever held. Instead, this optimistic scheme prevents lost updates by guaranteeing that a write based on stale data never silently overwrites a newer one, which maintains data integrity.
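The read, modify, write-if-unchanged cycle can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the Kubernetes client API: `VersionedStore`, `Conflict`, and `update_with_retry` are hypothetical names, and an integer counter stands in for the opaque resourceVersion string.

```python
class Conflict(Exception):
    """Raised when a write is based on a stale version (like an HTTP 409)."""


class VersionedStore:
    """Toy store where every object carries a version that gates updates."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def create(self, key, value):
        self._objects[key] = (1, dict(value))

    def get(self, key):
        version, value = self._objects[key]
        return version, dict(value)  # copy so callers cannot mutate stored state

    def update(self, key, expected_version, new_value):
        """Compare-and-swap: write only if the version is still the one we read."""
        current_version, _ = self._objects[key]
        if current_version != expected_version:
            raise Conflict(f"{key}: expected {expected_version}, found {current_version}")
        self._objects[key] = (current_version + 1, dict(new_value))
        return current_version + 1


def update_with_retry(store, key, mutate, max_attempts=5):
    """Re-read and re-apply mutate until the write lands or attempts run out."""
    for _ in range(max_attempts):
        version, value = store.get(key)
        try:
            return store.update(key, version, mutate(value))
        except Conflict:
            continue  # someone else won the race; start over from fresh state
    raise RuntimeError(f"gave up updating {key} after {max_attempts} attempts")


if __name__ == "__main__":
    store = VersionedStore()
    store.create("deployments/nginx-deployment", {"replicas": 3})

    # Two writers read the same version; only the first write succeeds.
    stale_version, _ = store.get("deployments/nginx-deployment")
    store.update("deployments/nginx-deployment", stale_version, {"replicas": 5})
    try:
        store.update("deployments/nginx-deployment", stale_version, {"replicas": 4})
    except Conflict as err:
        print("rejected:", err)

    # The loser retries from fresh state instead of overwriting blindly.
    update_with_retry(
        store, "deployments/nginx-deployment", lambda obj: {**obj, "paused": True}
    )
    print(store.get("deployments/nginx-deployment"))
```

The retry helper re-runs `mutate` on freshly read state each time, so the second writer’s change is layered on top of the first writer’s result rather than replacing it.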
The next concept you’ll grapple with is how Kubernetes leverages etcd’s watch mechanism for efficient event-driven reconciliation.