etcd’s primary failure mode is losing quorum, meaning a majority of its members can no longer communicate with each other.

Let’s say you have a 3-member etcd cluster, and one member dies. The remaining two members can still talk to each other, so you have quorum. If a second member dies, the remaining single member is now isolated. It can’t reach a majority of the cluster (which would be at least 2 out of 3), so it stops accepting writes to prevent data divergence. This is quorum loss.
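The majority arithmetic above can be sketched in a few lines of shell. This is only an illustration; the function names quorum and tolerance are made up for this example and are not part of etcd.

```shell
#!/bin/sh
# quorum N: smallest number of members that forms a majority of N.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerance N: how many members can fail before quorum is lost.
tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerance "$n")"
done
```

A 4-member cluster needs 3 members for quorum and still tolerates only 1 failure, which is why odd cluster sizes are the usual choice.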

Here’s how to recover:

Cause 1: Network Partition

Diagnosis: Check network connectivity between etcd nodes. A successful ping only shows that ICMP gets through; test the TCP ports etcd actually uses (by default 2379 for clients and 2380 for peers).

# On etcd-1, try to connect to etcd-2's peer port
nc -zv etcd-2 2380

# On etcd-1, try to connect to etcd-2's client port
nc -zv etcd-2 2379

If these fail, a firewall rule, security group, or routing issue is likely.
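To run the same check against every peer at once, a small loop helps. This is a minimal sketch: check_peers is a hypothetical helper name, the host names are placeholders, and it assumes your nc supports -z and -w.

```shell
#!/bin/sh
# check_peers PORT HOST...: report which hosts accept TCP connections on PORT.
# Returns the number of hosts that could not be reached.
check_peers() {
  port=$1
  shift
  failed=0
  for host in "$@"; do
    # -z: only probe the port, -w 3: give up after 3 seconds
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "ok   $host:$port"
    else
      echo "FAIL $host:$port"
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}
```

For example, check_peers 2380 etcd-2 etcd-3 run on etcd-1 shows at a glance which peer links are down.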

Fix: Reconfigure firewall rules or security groups so that every etcd member can reach every other member on port 2380, and so that clients can reach port 2379. Repeat the rules on each member for each peer’s IP. For example, if using ufw on Ubuntu:

sudo ufw allow from <etcd-2-ip> to any port 2380 proto tcp
sudo ufw allow from <etcd-2-ip> to any port 2379 proto tcp

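Why it works: etcd members exchange Raft messages (heartbeats, votes, log replication) over TCP port 2380 and serve client requests on port 2379. Quorum depends on the peer port: once a majority of members can reach each other on 2380 again, they can elect a leader and resume accepting writes.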

Cause 2: etcd Member Crashed or Unresponsive

Diagnosis: Check the status of the etcd process on the affected node.

# On the etcd node
systemctl status etcd

Look for recent error messages or a dead process.

Fix: Restart the etcd service.

# On the etcd node
systemctl restart etcd

Why it works: If the etcd process crashed due to a transient error, restarting it allows it to rejoin the cluster and potentially re-establish quorum.
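After a restart, it helps to ask each member directly how it sees the cluster. The sketch below wraps two etcdctl commands; the endpoint URLs are placeholders and the function names are invented for this example. As far as I know, endpoint status is answered locally by each member, so it should still respond without quorum, while endpoint health issues a request that needs quorum.

```shell
#!/bin/sh
# Placeholder endpoints; replace with your members' client URLs.
ENDPOINTS=${ENDPOINTS:-http://etcd-1:2379,http://etcd-2:2379,http://etcd-3:2379}

# member_status: per-member view (leader flag, Raft term, DB size).
member_status() {
  etcdctl --endpoints="$ENDPOINTS" endpoint status --write-out=table
}

# member_health: succeeds for a member only if it can serve a quorum request.
member_health() {
  etcdctl --endpoints="$ENDPOINTS" endpoint health
}
```

If member_status answers for two of three members but member_health fails everywhere, the members are up but have not yet agreed on a leader.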

Cause 3: Corrupted etcd Data Directory

Diagnosis: Examine etcd logs for errors related to data corruption or inability to read the Raft state.

# On the etcd node
journalctl -u etcd -f

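Look for panics or errors at startup that mention the write-ahead log or snapshot files, such as "wal: crc mismatch". A message like "etcdserver: mvcc: database space exceeded" is not corruption: it means the backend quota was reached, which is covered under Cause 5.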

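Fix: This is the most destructive fix. You’ll need to restore from a backup, and the restore creates a new logical cluster with new cluster and member IDs. First, stop the etcd service on all remaining nodes to prevent further writes. Then, on the node you’re restoring to, remove its data directory (typically /var/lib/etcd) and restore from a known good backup using etcdctl snapshot restore (on etcd 3.5 and later, etcdutl snapshot restore is the preferred tool and takes the same flags). The other members then have to come back with clean data directories, either by restoring the same snapshot on each of them with the same --initial-cluster value, or by adding them to the restored cluster one at a time with etcdctl member add.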

# On all etcd nodes
systemctl stop etcd

# On the node where you're restoring
rm -rf /var/lib/etcd
# --name must match this node's entry in --initial-cluster
etcdctl snapshot restore <path-to-snapshot-file> \
  --name <your-node-name> \
  --data-dir /var/lib/etcd \
  --initial-cluster <new-initial-cluster-config> \
  --initial-cluster-token <your-cluster-token> \
  --initial-advertise-peer-urls <your-node-peer-url>

# Then, start etcd on this node and reconfigure others to join.

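Why it works: The data directory holds the Raft write-ahead log and the key-value backend; if either is unreadable, the member cannot start. Restoring from a snapshot rewrites the data directory with a known consistent state, at the cost of any writes made after the snapshot was taken.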
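Before wiping anything, it is worth confirming that the snapshot file itself is readable. The sketch below is one way to do that, assuming either etcdutl (etcd 3.5 and later) or an older etcdctl that still has snapshot status is installed; snapshot_check is a hypothetical helper name.

```shell
#!/bin/sh
# snapshot_check FILE: print hash, revision, key count and size of a snapshot.
snapshot_check() {
  snap=$1
  # Refuse to continue if the file is missing or zero bytes long.
  if [ ! -s "$snap" ]; then
    echo "snapshot missing or empty: $snap" >&2
    return 1
  fi
  # Prefer etcdutl where available; fall back to etcdctl on older installs.
  if command -v etcdutl >/dev/null 2>&1; then
    etcdutl snapshot status "$snap" --write-out=table
  else
    etcdctl snapshot status "$snap" --write-out=table
  fi
}
```

If this fails or reports zero keys for a cluster you know held data, pick a different backup before removing /var/lib/etcd.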

Cause 4: Incorrect etcd Configuration

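Diagnosis: Verify the configuration file (often /etc/etcd/etcd.conf.yml or similar) on all members. A mismatched initial-cluster, listen-peer-urls, or initial-advertise-peer-urls stops members from connecting to each other; a wrong advertise-client-urls sends clients to the wrong address. Note that the initial-* settings are only read when a member bootstraps with an empty data directory. On an existing member, peer URLs are stored in the cluster itself and are changed with etcdctl member update.

Fix: Ensure initial-cluster lists every member by name with its peer URL, that each name matches that member’s name setting, and that listen-peer-urls, initial-advertise-peer-urls, and advertise-client-urls use the node’s own IP address and the right ports. For example, on etcd-1: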

# /etc/etcd/etcd.conf.yml on etcd-1
name: etcd-1
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://<etcd-1-ip>:2379
listen-peer-urls: http://<etcd-1-ip>:2380
initial-advertise-peer-urls: http://<etcd-1-ip>:2380
initial-cluster: etcd-1=http://<etcd-1-ip>:2380,etcd-2=http://<etcd-2-ip>:2380,etcd-3=http://<etcd-3-ip>:2380

After correcting the config, restart etcd.

Why it works: etcd relies on accurate peer discovery and advertisement to form its cluster. Correcting these URLs ensures nodes can find and communicate with each other correctly.
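To compare the config files with what the cluster has actually recorded, list the members and, if needed, correct a stored peer URL. This is a minimal sketch: the endpoint is a placeholder and the function names are invented here. Updating a member requires the cluster to have quorum.

```shell
#!/bin/sh
# Placeholder endpoint; replace with a reachable member's client URL.
ENDPOINTS=${ENDPOINTS:-http://etcd-1:2379}

# show_members: IDs, names, peer URLs and client URLs as stored in the cluster.
show_members() {
  etcdctl --endpoints="$ENDPOINTS" member list --write-out=table
}

# fix_peer_url MEMBER_ID NEW_PEER_URL: replace the stored peer URL of one member.
fix_peer_url() {
  etcdctl --endpoints="$ENDPOINTS" member update "$1" --peer-urls="$2"
}
```

The member ID for fix_peer_url comes from the first column of the show_members output.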

Cause 5: Resource Exhaustion (Disk Space, Memory, CPU)

Diagnosis: Monitor system resources on the etcd nodes.

# Check disk space
df -h /var/lib/etcd

# Check memory and CPU usage
top
htop

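etcd is especially sensitive to disk write latency: every Raft entry is synced to the write-ahead log before it is acknowledged, so a slow or full disk delays heartbeats and can trigger leader elections. Memory pressure has a similar effect.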

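Fix: Free up disk space, add memory, or move etcd to faster storage. If the database has hit its space quota, compact old revisions, defragment each member to return the freed space to the filesystem, and then clear the NOSPACE alarm with etcdctl alarm disarm. Compaction alone does not shrink the file on disk. To keep history from piling up again, enable automatic compaction and set an explicit quota: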

# etcd server flags (or the equivalent keys in the config file); these are not etcdctl options
--auto-compaction-retention 168h # Keep 1 week of revisions
--quota-backend-bytes 8589934592 # 8 GiB backend quota

Why it works: etcd needs reliable disk access and sufficient memory to store its Raft log and state machine. Resource starvation can lead to timeouts and member instability.
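When the backend quota has been exceeded, the usual recovery sequence is compact, defragment, then disarm the alarm. The sketch below follows that sequence for a single endpoint; reclaim_space is a hypothetical helper name and the endpoint is a placeholder. Compaction and alarm disarm need quorum, and defrag blocks the member while it runs, so treat this as something to adapt rather than run blindly.

```shell
#!/bin/sh
# Placeholder endpoint; run once per member for the defrag step.
ENDPOINT=${ENDPOINT:-http://127.0.0.1:2379}

reclaim_space() {
  # Current keyspace revision, parsed from the JSON status output.
  rev=$(etcdctl --endpoints="$ENDPOINT" endpoint status --write-out=json |
    grep -o '"revision":[0-9]*' | grep -o '[0-9][0-9]*' | head -n 1)
  if [ -z "$rev" ]; then
    echo "could not read current revision" >&2
    return 1
  fi

  # Drop history before $rev, release the freed pages, then clear the alarm.
  etcdctl --endpoints="$ENDPOINT" compact "$rev" &&
    etcdctl --endpoints="$ENDPOINT" defrag &&
    etcdctl --endpoints="$ENDPOINT" alarm disarm
}
```

Defragment one member at a time so the cluster keeps a majority of responsive members throughout.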

Cause 6: Clock Skew Between Nodes

Diagnosis: Check the time on all etcd nodes.

date

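etcd’s Raft timers run off each node’s local clock, so a small offset does not usually break elections by itself. Drift of more than a few seconds is still worth fixing: etcd logs warnings when the clock difference against a peer is too high, and large skew can make TLS certificates look expired or not yet valid, which blocks peer connections.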

Fix: Configure all etcd nodes to synchronize their clocks using NTP.

# Example for systemd-timesyncd on Ubuntu
sudo timedatectl set-ntp true

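Why it works: Raft’s election and heartbeat timeouts are measured locally, but certificate validity checks on peer and client TLS connections compare against wall-clock time. Keeping clocks in sync removes a source of rejected connections and misleading log timestamps while you debug the cluster.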
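Comparing date output by eye gets tedious with more than two nodes. The sketch below estimates the offset of each node against the machine you run it from. It assumes passwordless ssh to placeholder host names, and the result includes ssh round-trip time, so treat a difference of a second or two as noise. Both function names are made up for this example.

```shell
#!/bin/sh
# skew_seconds A B: absolute difference between two epoch timestamps.
skew_seconds() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then
    d=$(( 0 - d ))
  fi
  echo "$d"
}

# check_skew HOST...: compare each host's clock with the local clock.
check_skew() {
  for host in "$@"; do
    remote=$(ssh "$host" date +%s) || { echo "unreachable $host"; continue; }
    local_now=$(date +%s)
    echo "$host skew=$(skew_seconds "$local_now" "$remote")s"
  done
}
```

For example, check_skew etcd-1 etcd-2 etcd-3 prints one line per node with its approximate offset in seconds.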

Once quorum is restored, the cluster may stay unstable for a short while: expect leader-election errors or slow responses until a leader is elected and lagging members catch up on the Raft log.
