etcd’s quorum requirement means that a majority of nodes must be available for the cluster to operate, and this doesn’t change when you stretch it across regions.
Let’s say you have a three-member etcd cluster with one member in each of us-east-1, us-west-2, and eu-central-1. A manifest for it might look like this:
apiVersion: etcd.io/v1beta2
kind: EtcdCluster
metadata:
  name: example-cluster
spec:
  replicas: 3
  version: 3.5.9
  image: quay.io/coreos/etcd:v3.5.9
  tls:
    static:
      member:
        caBundle: |
          -----BEGIN CERTIFICATE-----
          ...
          -----END CERTIFICATE-----
      peer:
        caBundle: |
          -----BEGIN CERTIFICATE-----
          ...
          -----END CERTIFICATE-----
  pod:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: etcd
            topologyKey: kubernetes.io/hostname
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: etcd
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/region
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: etcd
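This configuration spreads etcd pods across availability zones and regions, provided the Kubernetes cluster's nodes actually span those regions and carry the standard topology labels. The topologySpreadConstraints are crucial here. The first one keeps the number of etcd pods per zone within one of each other, and the second one does the same per region. Because both use whenUnsatisfiable: DoNotSchedule, they are hard requirements: a pod that would violate them stays pending. The podAntiAffinity rule, by contrast, is only a preference for keeping members on separate hosts. The spread constraints are the core mechanism for achieving multi-region resilience. With one member in each of three regions, losing any single region leaves two of three members, which is still a quorum.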
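To make the quorum arithmetic concrete, here is a minimal sketch that checks whether a given placement of members keeps quorum after losing any one region. The function names and the second, two-region placement are illustrative assumptions, not part of etcd or of the manifest above.

```python
from collections import Counter
from typing import Dict, List


def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def survives_region_loss(placement: List[str]) -> Dict[str, bool]:
    """Map each region to whether quorum remains if that whole region is lost.

    `placement` lists the region of every etcd member, one entry per member.
    """
    total = len(placement)
    per_region = Counter(placement)
    return {
        region: total - count >= quorum(total)
        for region, count in per_region.items()
    }


# One member per region, as in the example cluster.
three_regions = ["us-east-1", "us-west-2", "eu-central-1"]
# Hypothetical placement with two members sharing a region.
two_regions = ["us-east-1", "us-east-1", "us-west-2"]

print(survives_region_loss(three_regions))
print(survives_region_loss(two_regions))
```

The two-region placement shows why three members alone are not enough: if the region holding two of them fails, the single survivor cannot form a majority.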
The problem etcd solves is providing a distributed, consistent key-value store. For Kubernetes, this means it’s the single source of truth for cluster state. When you stretch etcd across regions, you’re extending that resilience to your entire Kubernetes control plane. If an entire AWS region becomes unavailable, the Kubernetes API servers and other control plane components running in the surviving regions can continue to function as long as a quorum of etcd members remains reachable.
Internally, etcd uses the Raft consensus algorithm. Each etcd member has a copy of the data. When a write operation occurs, it’s proposed to the leader. The leader appends the write to its own log and sends it to all followers. Once a majority of members, the leader included, have persisted the entry, it’s committed and applied to the state machine. Followers that were not part of that majority catch up afterwards, so every member eventually applies the same writes in the same order. In a multi-region setup, "majority" still means a majority of the total etcd members, regardless of their geographic location. Network latency between regions becomes a critical factor in performance.
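The majority rule can be sketched in a few lines. This is a simplified model rather than etcd's actual code: it assumes the leader tracks, for every member including itself, the highest log index known to be persisted there, and it ignores Raft's additional rule that the entry must belong to the leader's current term. The member names and index values are made up for illustration.

```python
from typing import Dict


def commit_index(match_index: Dict[str, int]) -> int:
    """Highest log index that is persisted on a majority of members.

    `match_index` maps each member (leader included) to the highest log
    index known to be stored on that member.
    """
    ranked = sorted(match_index.values(), reverse=True)
    majority = len(ranked) // 2 + 1
    # The majority-th highest value is stored on at least `majority` members.
    return ranked[majority - 1]


# Leader in us-east-1 has written entry 12; us-west-2 has acknowledged it,
# eu-central-1 is still at entry 9.
print(commit_index({"us-east-1": 12, "us-west-2": 12, "eu-central-1": 9}))

# If us-west-2 had only reached entry 10, nothing past 10 could be committed.
print(commit_index({"us-east-1": 12, "us-west-2": 10, "eu-central-1": 9}))
```

Note that the lagging member never holds back the commit index as long as a majority is ahead of it.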
The most surprising thing about stretching etcd across regions is that write latency is set by the follower that completes the quorum, not by the slowest member. The leader commits as soon as a majority of members, itself included, has persisted the entry, so in a three-member cluster it only waits for the faster of its two followers. A single node in a far-flung region does not slow down writes while it is merely a spare vote. It does slow down the entire cluster as soon as it becomes the leader, or as soon as a closer member is down and its acknowledgement is needed to achieve quorum.
Consider a scenario where your etcd cluster has members in us-east-1, us-west-2 (10ms one-way latency from us-east-1), and eu-central-1 (80ms one-way latency from us-east-1). If the leader is in us-east-1, a write needs an acknowledgement from one follower to reach a majority (2 out of 3 nodes), and us-west-2 answers after a round trip of about 20ms. If us-west-2 is unavailable, or if leadership moves to eu-central-1 (assuming us-west-2 is no closer to eu-central-1 than us-east-1 is), every write has to cross the Atlantic, and the round-trip time for replication will be at least 160ms (80ms out, 80ms back) before disk sync time is added. This is why careful placement, including where the leader runs, and understanding of network topology are paramount.
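A rough model of the network component of write latency: the leader commits once a quorum that includes itself has the entry, so it waits for the follower that completes that quorum. The sketch below is an illustration under stated assumptions, not a measurement. It treats the 10ms and 80ms figures from the scenario as one-way latencies (so 20ms and 160ms round trips), assumes the leader is in us-east-1, and ignores disk sync and processing time. The function name is hypothetical.

```python
from typing import Dict


def commit_rtt_ms(cluster_size: int, reachable_follower_rtt_ms: Dict[str, float]) -> float:
    """Round-trip time the leader waits for before it can commit a write.

    The leader counts toward the majority itself, so it needs
    cluster_size // 2 follower acknowledgements and waits for the
    slowest of the fastest followers that provide them.
    """
    acks_needed = cluster_size // 2
    if acks_needed == 0:
        return 0.0  # single-member cluster: no network wait
    ranked = sorted(reachable_follower_rtt_ms.values())
    if len(ranked) < acks_needed:
        raise RuntimeError("quorum unreachable: writes cannot be committed")
    return ranked[acks_needed - 1]


# All three members healthy: the nearby follower completes the quorum.
healthy = commit_rtt_ms(3, {"us-west-2": 20.0, "eu-central-1": 160.0})
print(healthy)

# us-west-2 is down: the distant follower is now required for every write.
degraded = commit_rtt_ms(3, {"eu-central-1": 160.0})
print(degraded)
```

With both followers unreachable the function raises, which mirrors what etcd does in practice: without a quorum, writes are not committed at all.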
The topologySpreadConstraints, a standard Kubernetes pod scheduling feature that the operator passes through to the etcd pods, are what actively enforce this distribution. Other deployment methods need an equivalent mechanism. Without them, the default scheduler might place all your etcd pods in the same region or even the same availability zone, defeating the purpose of multi-region deployment. The maxSkew parameter controls how unevenly pods can be distributed. A maxSkew of 1 means that no zone (or region) may hold more than one pod above the zone (or region) with the fewest matching pods.
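The maxSkew check can be approximated as follows. This is a simplified sketch of the rule, not the scheduler's implementation: it assumes every listed domain has at least one eligible node, and it computes skew as the candidate domain's matching-pod count after placement minus the current minimum across domains. The function names are hypothetical.

```python
from typing import Dict, List


def skew_if_placed(pods_by_domain: Dict[str, int], candidate: str) -> int:
    """Skew that would result from placing one more matching pod in `candidate`."""
    return pods_by_domain[candidate] + 1 - min(pods_by_domain.values())


def allowed_domains(pods_by_domain: Dict[str, int], max_skew: int = 1) -> List[str]:
    """Domains where a new pod may be scheduled under DoNotSchedule."""
    return [
        domain
        for domain in pods_by_domain
        if skew_if_placed(pods_by_domain, domain) <= max_skew
    ]


# Two members already placed; the third may only go to the empty region.
print(allowed_domains({"us-east-1": 1, "us-west-2": 1, "eu-central-1": 0}))

# Evenly spread: a fourth pod could go to any region.
print(allowed_domains({"us-east-1": 1, "us-west-2": 1, "eu-central-1": 1}))
```

With replicas: 3 and three regions, this rule forces exactly one member per region, which is the placement the latency and quorum reasoning in this article relies on.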
The next challenge you’ll face is managing backups and disaster recovery strategies that account for the distributed nature of your data and the potential for regional outages.