The hardest part of cloud-native distributed systems isn’t orchestrating containers; it’s reliably managing stateful services that need to survive pod restarts and scaling events.
Imagine a distributed key-value store running on Kubernetes. We want to see how it handles a node failure and subsequent recovery.
Here’s a simplified setup using etcd, a popular distributed key-value store, deployed as a StatefulSet in Kubernetes.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd  # pods are named <StatefulSet name>-<ordinal>: etcd-0, etcd-1, etcd-2
spec:
  serviceName: etcd  # headless Service, defined separately
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.9
          ports:
            - containerPort: 2379 # Client port
            - containerPort: 2380 # Peer port
          command:
            - /usr/local/bin/etcd
          args:
            - --name=$(POD_NAME)
            - --data-dir=/etcd-data/data  # keep member state on the persistent volume
            - --initial-advertise-peer-urls=http://$(POD_NAME).etcd.default.svc.cluster.local:2380
            - --listen-peer-urls=http://0.0.0.0:2380
            - --advertise-client-urls=http://$(POD_NAME).etcd.default.svc.cluster.local:2379,http://localhost:2379
            - --listen-client-urls=http://0.0.0.0:2379
            - --initial-cluster-token=etcd-cluster-1
            # Same value on every member: comma-separated <name>=<peer URL> pairs
            - --initial-cluster=etcd-0=http://etcd-0.etcd.default.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd.default.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd.default.svc.cluster.local:2380
            - --initial-cluster-state=new
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: etcd-data
              mountPath: /etcd-data
  volumeClaimTemplates:
    - metadata:
        name: etcd-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
This StatefulSet defines three etcd pods. Notice serviceName: etcd, which refers to a headless Service named etcd. The StatefulSet does not create that Service; it has to exist separately. With it in place, each pod gets a stable DNS entry under etcd.default.svc.cluster.local, so pods can discover each other using predictable hostnames like etcd-0.etcd.default.svc.cluster.local. The volumeClaimTemplates ensure each pod gets its own persistent volume, which is crucial for stateful applications, provided etcd’s data directory is actually located on that volume.
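The headless Service that serviceName points to is not part of the manifest above. A minimal sketch of it might look like the following, wrapped in a shell function so it can be piped to kubectl. The port names and the publishNotReadyAddresses setting are assumptions, not taken from the original setup; the latter is included so peers can resolve each other while the cluster is still bootstrapping.

```shell
# Prints a headless Service manifest for the etcd pods.
# clusterIP: None is what makes the Service headless, giving each pod its own DNS record.
headless_service_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  clusterIP: None
  publishNotReadyAddresses: true  # assumption: resolve peers before they are Ready
  selector:
    app: etcd
  ports:
    - name: client
      port: 2379
    - name: peer
      port: 2380
EOF
}
```

To create the Service, run headless_service_manifest | kubectl apply -f - before applying the StatefulSet.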
When a pod (etcd-0) is terminated (e.g., by deleting the pod or draining the node), the StatefulSet controller creates a replacement. Because it’s a StatefulSet, the replacement keeps the same name (etcd-0) and therefore the same DNS entry, and it is bound to the same PersistentVolumeClaim, so it comes back with its previous data.
Let’s simulate a failure. We’ll delete etcd-0:
kubectl delete pod etcd-0
Kubernetes will terminate etcd-0. The other two etcd instances (etcd-1, etcd-2) will notice that etcd-0 has stopped responding to peer traffic. etcd uses the Raft consensus algorithm, so the cluster continues to operate and serve requests as long as a majority of members (2 out of 3 in this case) remains available. If etcd-0 happened to be the leader, the two survivors first elect a new leader, which briefly delays writes.
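The majority rule is simple arithmetic: a cluster of n voting members needs floor(n/2) + 1 of them to make progress. The sketch below (the function names are illustrative) prints the quorum size and the number of tolerated failures for a few cluster sizes.

```shell
# Majority needed for a Raft cluster of $1 voting members: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can be lost while a majority is still reachable.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For three members the quorum is two, so exactly one failure is tolerated. Four members also tolerate only one failure, which is why odd cluster sizes are the usual choice.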
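Once etcd-0 is recreated, it will start up with its persistent volume re-attached and attempt to rejoin the cluster. Because its data directory already contains its member ID and the cluster membership, etcd ignores the --initial-cluster* bootstrap flags on this restart, reconnects to its peers, and catches up on the log entries it missed from the current leader.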
We can check the cluster status:
ETCDCTL_API=3 etcdctl --endpoints=http://etcd-0.etcd.default.svc.cluster.local:2379,http://etcd-1.etcd.default.svc.cluster.local:2379,http://etcd-2.etcd.default.svc.cluster.local:2379 endpoint health
You’ll see all endpoints reporting healthy. If you inserted data before the failure, it will still be there, demonstrating state persistence.
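The whole experiment can be scripted end to end. The following is a rough sketch, not a hardened test: the key and value are made up, etcdctl is assumed to be on the PATH inside the container image, and the two-minute polling limit is arbitrary.

```shell
# Writes a key, deletes etcd-0, waits for it to come back, then reads the key
# through the recreated member. Key and value are hypothetical demo data.
check_persistence() {
  local key="demo-key"
  local value="demo-value"
  local attempts=0
  local got

  # Write through a member that stays up during the failure.
  kubectl exec etcd-1 -- etcdctl put "$key" "$value" || return 1

  # Simulate the failure; the StatefulSet controller recreates etcd-0.
  kubectl delete pod etcd-0 || return 1

  # Poll until the recreated etcd-0 answers a health check (about 2 minutes max).
  until kubectl exec etcd-0 -- etcdctl endpoint health >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 60 ]; then
      echo "etcd-0 did not become healthy in time" >&2
      return 1
    fi
    sleep 2
  done

  # Read the key back via etcd-0; --print-value-only omits the key line.
  got="$(kubectl exec etcd-0 -- etcdctl get "$key" --print-value-only)" || return 1

  if [ "$got" = "$value" ]; then
    echo "persisted: $key=$got"
  else
    echo "expected '$value', got '$got'" >&2
    return 1
  fi
}
```

Call check_persistence against a disposable test cluster only, since it deletes a pod. Note that etcdctl get performs a linearizable read by default, so a successful result shows that the cluster as a whole kept the key and that etcd-0 rejoined it, rather than proving what is on etcd-0’s disk in isolation.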
The core problem this solves is providing stable identity and persistent storage to distributed applications on a dynamic, ephemeral platform like Kubernetes. Without StatefulSets, pods are treated as interchangeable: a replacement pod gets a new name and no per-replica volume, so anything written to the container’s filesystem is lost on restart. The stable network identifiers (etcd-0.etcd.default.svc.cluster.local) and dedicated persistent volumes are the bedrock upon which reliable distributed state is built in this environment.
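A common pitfall is misconfiguring initial-cluster. The value must be the same on every member and must list every member as a name=peer-URL pair, where each name matches that member’s --name and each URL matches its --initial-advertise-peer-urls. If the list is wrong, members typically fail to bootstrap or refuse to join, leaving the cluster unresponsive. The $(POD_NAME) variable, combined with the headless service, is what makes this workable: it gives each pod a --name and advertised URLs that are predictable in advance, so the static list can be written before any pod exists. Also note that initial-cluster is only consulted when a member bootstraps with an empty data directory. Changing replicas later does not resize the etcd cluster by itself; members have to be added or removed through etcdctl member add and etcdctl member remove, and a newly added member must start with --initial-cluster-state=existing.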
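For the static bootstrap case, the --initial-cluster value can be generated instead of typed by hand, which avoids mismatches between member names and URLs. This sketch assumes pods named etcd-0 through etcd-(N-1), a headless Service called etcd, and the default namespace, as in the example; the function name and its optional parameters are illustrative.

```shell
# Builds the --initial-cluster value as comma-separated <name>=<peer URL> pairs.
# Usage: initial_cluster REPLICAS [POD_PREFIX] [SERVICE] [NAMESPACE]
initial_cluster() {
  local replicas="$1"
  local prefix="${2:-etcd}"
  local service="${3:-etcd}"
  local namespace="${4:-default}"
  local out=""
  local i=0
  local name

  while [ "$i" -lt "$replicas" ]; do
    name="${prefix}-${i}"
    # ${out:+,} adds a comma only when out is already non-empty.
    out="${out}${out:+,}${name}=http://${name}.${service}.${namespace}.svc.cluster.local:2380"
    i=$((i + 1))
  done

  printf '%s\n' "$out"
}

initial_cluster 3
```

The printed string can be pasted into the --initial-cluster argument of the StatefulSet. It only helps with the first bootstrap; growing or shrinking a running cluster still goes through etcdctl member add and etcdctl member remove.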
The next challenge is typically implementing robust leader election and failover for services built on top of this stateful infrastructure.