The etcd leader is dropping heartbeats to followers, causing them to fall behind and potentially triggering a leader election, which is a major disruption.
Cause 1: Network Latency/Packet Loss
- Diagnosis: Use `ping` and `traceroute` between the leader and follower nodes. Look for consistently high latency (>5ms) or any packet loss. For example, run `ping -c 100 <follower-ip>` and `traceroute <follower-ip>`.
- Fix: Identify and resolve network issues. This might involve:
  - Checking physical cabling and network switches.
  - Configuring Quality of Service (QoS) on network devices to prioritize etcd traffic.
  - Adjusting TCP/IP parameters on the nodes (e.g., `net.ipv4.tcp_keepalive_time`, `net.ipv4.tcp_keepalive_intvl`).
- Why it works: etcd relies on fast, reliable communication. Reducing latency and eliminating packet loss ensures that heartbeats reach followers and that proposal acknowledgments reach the leader promptly, keeping followers in sync.
- Example Fix: If latency is high, investigate upstream network hops or faulty NICs. If packet loss is observed, replacing a suspect switch or network cable could resolve it.
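The ping check above can be scripted so the same thresholds are applied to every leader/follower pair. The following is a minimal sketch, not part of etcd: `check_ping_summary` is a hypothetical helper name, it assumes the Linux iputils `ping` summary format, and the 5ms ceiling simply mirrors the rule of thumb used in the diagnosis step.

```shell
# check_ping_summary [MAX_AVG_MS]
# Reads `ping` output on stdin and flags any packet loss or a slow average RTT.
# Assumes Linux iputils-style summary lines; the threshold is illustrative.
check_ping_summary() (
  max_avg_ms="${1:-5}"
  awk -v max_avg="$max_avg_ms" '
    /packet loss/ {
      # The loss percentage is the only field ending in "%"
      for (i = 1; i <= NF; i++) if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
    }
    /min\/avg\/max/ {
      # Line looks like: rtt min/avg/max/mdev = 0.2/0.5/7.3/0.8 ms
      split($0, halves, "=")
      split(halves[2], rtt, "/")
      avg = rtt[2]
    }
    END {
      if (loss == "") status = "UNKNOWN"
      else if (loss + 0 > 0) status = "SUSPECT"
      else if (avg == "") status = "UNKNOWN"
      else if (avg + 0 > max_avg + 0) status = "SUSPECT"
      else status = "OK"
      printf "%s loss=%s%% avg_rtt=%sms\n", status, loss, avg
    }'
)
```

A typical invocation would be `ping -c 100 <follower-ip> | check_ping_summary 5`, run from the leader toward each follower.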
Cause 2: Insufficient Disk I/O Performance
- Diagnosis: Monitor disk I/O on the leader and follower nodes. etcd writes its WAL (Write-Ahead Log) and snapshots synchronously, so slow disks bottleneck these operations. Run `iostat -xz 5` and look for high `%util`, high `await`, or deep queues on the device etcd uses.
- Fix: Migrate the etcd data directory to faster storage (e.g., SSDs, NVMe). Ensure the filesystem is tuned (e.g., the `noatime` mount option). For example, remount with `mount -o remount,noatime /var/lib/etcd` (this assumes `/var/lib/etcd` is its own mount point).
- Why it works: etcd’s performance is heavily dependent on its ability to quickly write to disk for durability. Faster I/O reduces the time it takes to commit log entries, allowing the leader to respond faster to followers and process more requests.
- Example Fix: If `iostat` shows sustained `%util` near 100% and high `await` times on `/dev/sda1` where `/var/lib/etcd` resides, migrating to an NVMe drive (`/dev/nvme0n1`) and configuring etcd to use that path should significantly improve write performance.
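Beyond `iostat`, etcd reports its own WAL fsync latency in the `etcd_disk_wal_fsync_duration_seconds` histogram on its `/metrics` endpoint, and the etcd documentation suggests keeping the 99th percentile under 10ms. The sketch below estimates an upper bound for that percentile from the histogram buckets. `wal_fsync_p99_bound` is a hypothetical helper name, and it assumes the buckets arrive in ascending order, as Prometheus text output normally does.

```shell
# wal_fsync_p99_bound
# Reads Prometheus text-format metrics on stdin and prints the smallest
# histogram bucket bound that covers 99% of the WAL fsync samples.
wal_fsync_p99_bound() (
  awk '
    /^etcd_disk_wal_fsync_duration_seconds_bucket/ {
      split($0, parts, "\"")   # parts[2] holds the le="..." upper bound
      n++
      le[n] = parts[2]
      count[n] = $NF + 0       # cumulative sample count for this bucket
    }
    END {
      if (n == 0 || count[n] == 0) { print "UNKNOWN"; exit }
      target = 0.99 * count[n] # the last bucket is le="+Inf", i.e. the total
      for (i = 1; i <= n; i++) {
        if (count[i] >= target) {
          # 0.01 s is the 10 ms guideline; "+Inf" means slower than every bucket
          status = (le[i] != "+Inf" && le[i] + 0 <= 0.01) ? "OK" : "SLOW"
          printf "%s p99<=%ss\n", status, le[i]
          exit
        }
      }
    }'
)
```

One possible invocation is `curl -s http://127.0.0.1:2379/metrics | wal_fsync_p99_bound`. The URL is an assumption: clusters with TLS enabled need the appropriate client certificates, and some deployments expose metrics on a separate listener.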
Cause 3: CPU Starvation/High Load on Leader Node
- Diagnosis: Monitor CPU utilization on the etcd leader node. etcd can become CPU-bound when processing proposals and handling client requests. Run `top -H -n 5 -d 1` and look for consistently high CPU usage (>80%) across cores, especially by etcd processes.
- Fix:
  - Reduce the load on the leader node by offloading client requests to other nodes if possible (the leader remains responsible for consensus).
  - Ensure the etcd nodes are not running other heavy applications.
  - Consider dedicated nodes for etcd.
  - Tune etcd’s `heartbeat-interval` and `election-timeout` (increase cautiously), for example `etcd --heartbeat-interval=200 --election-timeout=1000`.
- Why it works: Giving etcd more CPU cycles allows it to process incoming requests and outgoing heartbeats more quickly. Increasing timeouts provides more buffer for slow processing without triggering elections prematurely.
- Example Fix: If `top` shows etcd processes consuming over 90% CPU, and other applications like Prometheus exporters are also running on the same node, moving those applications to separate nodes or increasing the node’s CPU resources will alleviate the bottleneck.
Cause 4: Etcd Configuration Issues (Timings)
- Diagnosis: Review etcd configuration parameters, specifically `heartbeat-interval` and `election-timeout`. If these are too aggressive (low values), followers might time out waiting for heartbeats even under normal load.
- Fix: Increase `heartbeat-interval` and `election-timeout`. The default `heartbeat-interval` is 100ms, and the default `election-timeout` is 1000ms (10 times the heartbeat interval). A common starting point for troubled clusters is `heartbeat-interval=200` and `election-timeout=1000` or `2000`. In an etcd config file, this reads `heartbeat-interval: 200` and `election-timeout: 1000`.
- Why it works: A larger `heartbeat-interval` means the leader sends heartbeats less frequently, reducing network traffic and leader processing load. A larger `election-timeout` gives followers more time to receive heartbeats before initiating an election, accommodating transient network or processing delays.
- Example Fix: If the cluster is configured with `heartbeat-interval: 50` and `election-timeout: 500`, changing these to `heartbeat-interval: 200` and `election-timeout: 1000` provides more breathing room.
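Because the two settings only make sense relative to each other, it can help to check their ratio before rolling out a change. The sketch below is a hypothetical helper (`check_etcd_timing` is not an etcd command). It assumes, to the best of my understanding, that etcd rejects an `election-timeout` smaller than 5 times the `heartbeat-interval`, and it treats the 10x ratio of the defaults as the comfortable baseline.

```shell
# check_etcd_timing HEARTBEAT_MS ELECTION_MS
# Classifies a heartbeat/election pair by the ratio between the two values.
check_etcd_timing() (
  hb="$1"
  el="$2"
  if [ "$el" -lt $((hb * 5)) ]; then
    # Assumption: etcd refuses to start below a 5x ratio
    echo "INVALID: election-timeout should be at least 5x heartbeat-interval"
  elif [ "$el" -lt $((hb * 10)) ]; then
    echo "TIGHT: below the 10x ratio of the defaults (100ms/1000ms)"
  else
    echo "OK"
  fi
)
```

For instance, `check_etcd_timing 200 1000` reports `TIGHT`, since that pair sits exactly at the 5x floor, while `check_etcd_timing 200 2000` reports `OK`. This is one reason to prefer the larger `election-timeout` when raising the heartbeat interval to 200ms.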
Cause 5: Under-provisioned etcd Nodes (Memory/CPU)
- Diagnosis: General system resource monitoring. Are the etcd nodes meeting the recommended minimums for CPU and RAM? Check with `free -h` and `nproc`. etcd generally recommends at least 2 CPU cores and 4GB of RAM per node, with more needed for larger clusters or higher request loads.
- Fix: Scale up the nodes running etcd by adding CPU cores or RAM.
- Why it works: etcd’s Raft consensus algorithm, while efficient, still requires sufficient computational and memory resources to operate smoothly, especially under load. Under-provisioning leads to general sluggishness that manifests as slow follower sync.
- Example Fix: If etcd nodes only have 1 CPU core and 2GB RAM, upgrading them to 4 cores and 8GB RAM will provide ample headroom.
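The resource check can be turned into a small pass/fail helper for use across all etcd nodes. This is a sketch under stated assumptions: `check_etcd_resources` is a hypothetical name, the default minimums are the 2 cores and 4GB mentioned above, and the memory floor is set slightly below 4096MB because the kernel reports a little less than the installed RAM.

```shell
# check_etcd_resources CORES MEM_MB [MIN_CORES] [MIN_MEM_MB]
# Compares a node's core count and memory against illustrative minimums.
check_etcd_resources() (
  cores="$1"
  mem_mb="$2"
  min_cores="${3:-2}"
  min_mem_mb="${4:-3800}"   # a little under 4096: MemTotal excludes reserved memory
  status="OK"
  [ "$cores" -ge "$min_cores" ] || status="UNDER-PROVISIONED"
  [ "$mem_mb" -ge "$min_mem_mb" ] || status="UNDER-PROVISIONED"
  echo "$status cores=$cores mem_mb=$mem_mb"
)
```

On a Linux node, the inputs could come from `nproc` and `/proc/meminfo`, for example `check_etcd_resources "$(nproc)" "$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)"`.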
Cause 6: Network Interface Card (NIC) Offloading Issues or Driver Bugs
- Diagnosis: Check system logs (`dmesg`, `syslog`) for NIC-related errors or warnings. Examine NIC offloading settings with `ethtool -k <interface-name>`. Look for `tx-checksumming` or `rx-checksumming`, which can sometimes cause issues with high-throughput network traffic.
- Fix: Disable problematic offloading features or update NIC drivers. For example, disable tx/rx checksumming with `sudo ethtool -K <interface-name> tx off rx off`.
- Why it works: Certain NIC offloading features, especially checksum offloading, can sometimes be buggy or interact poorly with specific network stacks or high-speed traffic, leading to corrupted packets or dropped traffic that disrupts etcd’s communication.
- Example Fix: If `dmesg` shows warnings about checksum offload errors and `ethtool -k eth0` shows `tx-checksumming: on`, disabling it with `ethtool -K eth0 tx off` might resolve intermittent packet loss.
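When several nodes need checking, the two checksum settings can be pulled out of the `ethtool -k` output in one line per interface. The sketch below is illustrative only: `checksum_offload_state` is a hypothetical helper, and it assumes the usual `ethtool -k` layout in which the top-level `rx-checksumming` and `tx-checksumming` lines are unindented.

```shell
# checksum_offload_state
# Reads `ethtool -k` output on stdin and prints the rx/tx checksum offload state.
checksum_offload_state() (
  awk -F': *' '
    # Indented sub-features (e.g. tx-checksum-ipv4) do not match these exact names
    $1 == "rx-checksumming" { rx = $2 }
    $1 == "tx-checksumming" { tx = $2 }
    END {
      printf "rx=%s tx=%s\n", (rx == "" ? "unknown" : rx), (tx == "" ? "unknown" : tx)
    }'
)
```

For example, `ethtool -k eth0 | checksum_offload_state` prints a single line such as `rx=on tx=on`, which is easy to collect and compare across the cluster before and after changing the settings.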
If disk I/O was the bottleneck and the database grew too large during the slow periods, the next error you’ll likely see is `etcdserver: mvcc: database space exceeded`, which etcd raises once the database exceeds its configured space quota.