StatefulSets on Kubernetes are more than just pods that remember their names; they’re the bedrock for building robust distributed applications by providing stable identities and persistent storage that survives restarts.
Imagine you have a distributed database like etcd. You need to ensure that:
- Each instance of etcd has a consistent network identity (its hostname) so other instances can find and talk to it.
- Each instance has its own persistent storage, so if a pod dies and is rescheduled, it gets its data back.
- You can manage these instances as a group, ensuring a specific number are always running and that they start up in a predictable order.
This is precisely what StatefulSets are designed for. Let’s look at a minimal etcd StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: "etcd-headless" # headless Service that gives each pod a DNS name
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.4.13
          ports:
            - containerPort: 2379 # client port
            - containerPort: 2380 # peer port
          command:
            - etcd
            - --name=$(ETCD_NAME)
            # Keep data in a subdirectory of the mounted volume (path is an illustrative choice).
            - --data-dir=/var/lib/etcd/data
            - --initial-advertise-peer-urls=http://$(POD_NAME).etcd-headless.$(POD_NAMESPACE).svc.cluster.local:2380
            - --listen-peer-urls=http://0.0.0.0:2380
            # 0.0.0.0 already covers the loopback address.
            - --listen-client-urls=http://0.0.0.0:2379
            - --advertise-client-urls=http://$(POD_NAME).etcd-headless.$(POD_NAMESPACE).svc.cluster.local:2379
            # Every member lists all bootstrap members, not just itself.
            - --initial-cluster=etcd-0=http://etcd-0.etcd-headless.$(POD_NAMESPACE).svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.$(POD_NAMESPACE).svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.$(POD_NAMESPACE).svc.cluster.local:2380
            - --initial-cluster-state=new
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ETCD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: data # must match the volumeClaimTemplates name below
              mountPath: /var/lib/etcd
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
        storageClassName: standard
Here’s how it works:
- serviceName: "etcd-headless": This is crucial. It names the headless Service that governs the StatefulSet; the Service itself has to be created separately. A headless Service (clusterIP: None) doesn’t get a cluster IP. Instead, DNS lookups for etcd-headless.your-namespace.svc.cluster.local return the IP addresses of the pods behind it, and each pod also gets its own DNS record. This allows pods to discover each other by hostname.
- replicas: 3: We want three etcd instances. Kubernetes will ensure this number is maintained.
- template: This is the standard pod template.
  - command: This is where the magic happens for distributed consensus.
    - --name=$(ETCD_NAME): Each etcd instance is given a unique name corresponding to its pod name (e.g., etcd-0, etcd-1, etcd-2).
    - --initial-advertise-peer-urls: This tells other etcd peers how to reach this instance on port 2380. Notice the fully qualified domain name (FQDN) pattern: pod-name.service-name.namespace.svc.cluster.local. This is how etcd members find each other.
    - --initial-cluster: This bootstraps the initial cluster configuration as a comma-separated list of name=peer-URL pairs. For a three-member cluster it must list all three members, etcd-0 through etcd-2; a member that lists only itself bootstraps a separate single-node cluster.
- volumeClaimTemplates: This is the core of persistent identity. For each pod created by the StatefulSet, a corresponding PersistentVolumeClaim (PVC) will be generated. The naming convention is volume-claim-name-pod-name. So, for etcd-0, you’ll get a PVC named data-etcd-0. This PVC will bind to a PersistentVolume (PV) and provide stable storage. When etcd-0 is rescheduled, it will re-attach to the same PV, preserving its data. The data only lands on that volume if the container mounts the claim through a volumeMounts entry named data and etcd’s --data-dir points inside the mount.
- Stable Network Identity: Pods get predictable hostnames: etcd-0.etcd-headless.your-namespace.svc.cluster.local, etcd-1.etcd-headless.your-namespace.svc.cluster.local, and so on. This is key for distributed systems where members need to address each other by name.
- Ordered Deployment and Scaling: StatefulSets create pods in order (0, 1, 2…), both on initial deployment and when scaling up, and remove them in reverse order (2, 1, 0…) when scaling down. This is critical for systems that require a specific startup or shutdown sequence, like many consensus protocols.
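The StatefulSet refers to etcd-headless through serviceName, but it does not create that Service for you. A minimal sketch of a matching headless Service might look like the following. The port names client and peer are illustrative choices, and publishNotReadyAddresses is an optional setting, often used so that peers can resolve each other before they are marked ready.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd-headless            # must match spec.serviceName in the StatefulSet
spec:
  clusterIP: None                # "None" is what makes the Service headless
  publishNotReadyAddresses: true # optional: publish DNS records for pods that are not ready yet
  selector:
    app: etcd                    # same label as the StatefulSet's pod template
  ports:
    - name: client               # illustrative port name
      port: 2379
    - name: peer                 # illustrative port name
      port: 2380
```

Apply this Service in the same namespace as the StatefulSet, ideally before it, so the per-pod DNS records exist by the time the etcd members try to reach each other.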
One thing that often surprises people is that StatefulSets don’t provide leader election themselves. They provide the stable identity and storage that let distributed systems implement leader election reliably.
Let’s see how this plays out with leader election in a distributed system like etcd. When etcd starts, it uses its stable identity and the FQDNs of its peers to form a cluster. Each etcd member participates in the Raft consensus algorithm. Raft handles leader election automatically: if the current leader fails, the remaining members elect a new leader from among themselves, as long as a majority of them is still running. The StatefulSet cannot keep pods from failing, but it ensures that a replacement pod comes back with the same name and the same persistent data. If a pod dies, Kubernetes recreates it, the new pod reattaches its persistent volume, and it rejoins the cluster as the same member to participate in consensus.
Consider the command section again. The POD_NAME and POD_NAMESPACE environment variables, populated by fieldRef, are referenced as $(POD_NAME) and $(POD_NAMESPACE) to construct the FQDNs. Kubernetes only expands a $(VAR) reference when a variable with exactly that name is declared under env; an undeclared name is left in the command as literal text. This dynamic construction ensures that each pod correctly identifies itself and its peers within the cluster, regardless of which node it lands on or whether it’s rescheduled. The storageClassName: standard is a placeholder; you’d replace standard with the actual name of a StorageClass available in your cluster that provisions your desired storage type (e.g., a class backed by gp2 volumes on AWS, pd-ssd disks on GCE, or local storage).
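To make the storageClassName placeholder concrete, here is a rough sketch of a StorageClass that volumeClaimTemplates could reference. It assumes an AWS cluster with the EBS CSI driver installed; the name fast-ssd is hypothetical, and the provisioner and parameters differ on other platforms.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                          # hypothetical name; use it as storageClassName
provisioner: ebs.csi.aws.com              # assumes the AWS EBS CSI driver is installed
parameters:
  type: gp2                               # EBS volume type
reclaimPolicy: Delete                     # delete the backing volume when the PV is released
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the pod is scheduled
```

With this class in place, the StatefulSet would set storageClassName: fast-ssd instead of standard. Running kubectl get storageclass shows which classes your cluster already offers.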
The volumeClaimTemplates section is where the persistence is defined. accessModes: [ "ReadWriteOnce" ] means the volume can be mounted read-write by a single node at a time. For most block storage (like EBS, GCE PD), this is the standard. resources: requests: storage: 1Gi specifies that each pod’s claim requests 1Gi of storage. With dynamic provisioning, the StorageClass creates a PV of at least this size, which is bound to the claim and mounted into the pod.
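For illustration, the claim that the StatefulSet controller derives from this template for the first pod would look roughly like the following. The default namespace is an assumption, and the real object carries additional metadata and status fields that are omitted here.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-etcd-0        # <template name>-<pod name>
  namespace: default       # assumption: the StatefulSet runs in "default"
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi
```

By default these claims are not deleted when the StatefulSet is scaled down or deleted, so the data is still there if etcd-0 is created again later.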
The next concept you’ll grapple with is how to manage distributed locks on top of this stable foundation, often using tools like ZooKeeper or etcd itself.