The etcd leader's heartbeats are not reaching followers in time. Followers fall behind, and if a follower's election timeout expires it starts a leader election, which is a major disruption because writes stall until a new leader is chosen.

Cause 1: Network Latency/Packet Loss

  • Diagnosis: Use ping and traceroute between the leader and follower nodes. Look for sustained round-trip latency (consistently above 5ms) or any packet loss.
    ping -c 100 <follower-ip>
    traceroute <follower-ip>
    
  • Fix: Identify and resolve network issues. This might involve:
    • Checking physical cabling and network switches.
    • Configuring Quality of Service (QoS) on network devices to prioritize etcd traffic.
    • Adjusting TCP/IP parameters on the nodes (e.g., net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl).
    Why it works: etcd relies on fast, reliable communication between members. Reducing latency and eliminating packet loss ensures heartbeats reach followers and acknowledgments reach the leader promptly, keeping followers in sync.
  • Example Fix: If latency is high, you might investigate upstream network hops or faulty NICs. If packet loss is observed, replacing a suspect switch or network cable could resolve it.
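
  The ping check above can be turned into a pass/fail test. The following is a minimal sketch, not an official etcd tool: check_ping_summary is a hypothetical helper name, it assumes the summary format printed by Linux iputils ping, and the 5ms default simply mirrors the threshold mentioned in the diagnosis step.

```shell
# check_ping_summary [MAX_AVG_MS]
# Reads `ping` output on stdin, prints packet loss and average RTT, and
# returns 1 if there is any loss or the average RTT exceeds MAX_AVG_MS.
# Assumption: summary lines look like those printed by Linux iputils ping.
check_ping_summary() {
    max_avg_ms="${1:-5}"
    awk -v max_avg="$max_avg_ms" '
        /packet loss/ {
            # The loss figure is the only field that ends in "%".
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^(rtt|round-trip)/ {
            # Format: "rtt min/avg/max/mdev = a/b/c/d ms"; avg is the 2nd value.
            split($0, halves, " = ")
            split(halves[2], parts, "/")
            avg = parts[2]
        }
        END {
            if (loss == "") { print "no ping summary found"; exit 2 }
            if (avg == "") avg = "n/a"
            printf "loss=%s%% avg_ms=%s\n", loss, avg
            if (loss + 0 > 0 || avg == "n/a" || avg + 0 > max_avg + 0) exit 1
        }'
}
```

  Usage: ping -c 100 <follower-ip> | check_ping_summary 5 || echo "network looks suspect"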

Cause 2: Insufficient Disk I/O Performance

  • Diagnosis: Monitor disk I/O on the leader and follower nodes. etcd writes its WAL (Write-Ahead Log) and snapshots synchronously. Slow disks will bottleneck these operations.
    iostat -xz 5
    # Look for high %util, await, or high queue depths on the device etcd uses.
    
  • Fix: Migrate etcd data directory to faster storage (e.g., SSDs, NVMe). Ensure the filesystem is optimized (e.g., noatime mount option).
    # Example: remounting with noatime (assumes /var/lib/etcd is its own mount point)
    mount -o remount,noatime /var/lib/etcd
    
    Why it works: etcd’s performance is heavily dependent on its ability to quickly write to disk for durability. Faster I/O reduces the time it takes to commit log entries, allowing the leader to respond faster to followers and process more requests.
  • Example Fix: If iostat shows sustained %util near 100% and high await times on /dev/sda1 where /var/lib/etcd resides, migrating to an NVMe drive (/dev/nvme0n1) and pointing etcd's --data-dir at a filesystem on that drive should significantly improve write performance.
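
  etcd also reports its own view of disk latency, which can be more direct than iostat. The sketch below reads Prometheus-format metrics on stdin and reports how many WAL fsyncs finished within 8ms. It rests on several assumptions: the histogram is named etcd_disk_wal_fsync_duration_seconds and has an le="0.008" bucket, the commonly cited target is a 99th percentile under 10ms (8ms is the nearest bucket boundary below that), and wal_fsync_within_8ms is a hypothetical helper name.

```shell
# wal_fsync_within_8ms
# Reads etcd /metrics output on stdin and returns 1 if fewer than 99% of
# WAL fsyncs completed within 8ms.
# Assumption: the histogram exposes an le="0.008" bucket; adjust if yours differs.
wal_fsync_within_8ms() {
    awk '
        $1 == "etcd_disk_wal_fsync_duration_seconds_bucket{le=\"0.008\"}" { fast = $2 }
        $1 == "etcd_disk_wal_fsync_duration_seconds_count" { total = $2 }
        END {
            if (fast == "" || total == "" || total + 0 == 0) {
                print "no fsync samples found"
                exit 2
            }
            pct = 100 * fast / total
            printf "%.1f%% of WAL fsyncs completed within 8ms\n", pct
            if (pct < 99) exit 1
        }'
}
```

  Usage, assuming the metrics endpoint is reachable without TLS on the default client port: curl -s http://127.0.0.1:2379/metrics | wal_fsync_within_8ms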

Cause 3: CPU Starvation/High Load on Leader Node

  • Diagnosis: Monitor CPU utilization on the etcd leader node. The leader needs CPU time to process proposals, serve client requests, and send heartbeats on schedule.
    top -H -n 5 -d 1
    # Look for consistently high CPU usage (>80%) across cores, especially by etcd processes.
    
  • Fix:
    • Reduce the load on the leader node by pointing client requests at other members where possible (though the leader still handles every write and all consensus traffic).
    • Ensure the etcd nodes are not running other heavy applications.
    • Consider dedicated nodes for etcd.
    • Tune etcd’s heartbeat-interval and election-timeout (increase cautiously).
    # Example etcd command-line flags (values in milliseconds)
    etcd --heartbeat-interval=200 --election-timeout=1000
    
    Why it works: Giving etcd more CPU cycles allows it to process incoming requests and outgoing heartbeats more quickly. Increasing timeouts provides more buffer for slow processing without triggering elections prematurely.
  • Example Fix: If top shows etcd processes consuming over 90% CPU, and other applications like Prometheus exporters are also running on the same node, moving those applications to separate nodes or increasing the node’s CPU resources will alleviate the bottleneck.

Cause 4: Etcd Configuration Issues (Timings)

  • Diagnosis: Review etcd configuration parameters, specifically heartbeat-interval and election-timeout. If these are too aggressive (low values), followers might time out waiting for heartbeats even under normal load.
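  • Fix: Increase heartbeat-interval and election-timeout (both are in milliseconds). The default heartbeat-interval is 100ms and the default election-timeout is 1000ms (10 * heartbeat interval). A common starting point for troubled clusters is heartbeat-interval=200 and election-timeout=1000 or 2000; 2000 preserves the default 10:1 ratio, while 1000 sits at the 5:1 minimum that etcd accepts. Use the same values on every member.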
    # Example etcd config file snippet
    heartbeat-interval: 200
    election-timeout: 1000
    
    Why it works: A larger heartbeat-interval means the leader sends heartbeats less frequently, reducing network traffic and leader processing load. A larger election-timeout gives followers more time to receive heartbeats before initiating an election, accommodating transient network or processing delays.
  • Example Fix: If the cluster is configured with heartbeat-interval: 50 and election-timeout: 500, changing these to heartbeat-interval: 200 and election-timeout: 1000 provides more breathing room.
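
  Before rolling out new timings, it can help to sanity-check them against the measured peer round-trip time. The sketch below encodes three rules of thumb as I understand them from the etcd tuning guidance: election-timeout at least 5 times heartbeat-interval, election-timeout at least 10 times the peer RTT, and heartbeat-interval no lower than about half the RTT. check_etcd_timings is a hypothetical helper, and all arguments are whole milliseconds.

```shell
# check_etcd_timings HEARTBEAT_MS ELECTION_MS RTT_MS
# Prints any rule-of-thumb violations and returns 1 if there are any.
# Assumption: the 5x, 10x, and 0.5x ratios reflect etcd tuning guidance.
check_etcd_timings() {
    hb="$1"; et="$2"; rtt="$3"; status=0
    if [ "$et" -lt $((hb * 5)) ]; then
        echo "election-timeout ${et}ms is below 5x heartbeat-interval (${hb}ms)"
        status=1
    fi
    if [ "$et" -lt $((rtt * 10)) ]; then
        echo "election-timeout ${et}ms is below 10x peer RTT (${rtt}ms)"
        status=1
    fi
    if [ $((hb * 2)) -lt "$rtt" ]; then
        echo "heartbeat-interval ${hb}ms is below half the peer RTT (${rtt}ms)"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "timings look consistent"
    fi
    return "$status"
}
```

  Usage: check_etcd_timings 200 1000 5 checks the values from the example fix against a 5ms peer RTT.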

Cause 5: Under-provisioned etcd Nodes (Memory/CPU)

  • Diagnosis: General system resource monitoring. Are the etcd nodes meeting the recommended minimums for CPU and RAM?
    free -h
    nproc
    
    The etcd hardware guidance suggests two to four CPU cores and around 8GB of RAM for typical clusters, with more needed for larger clusters or higher request loads. Treat 2 cores and 4GB of RAM as a floor rather than a target.
  • Fix: Scale up the nodes running etcd. This means adding more CPU cores or RAM.
    Why it works: etcd’s Raft consensus algorithm, while efficient, still requires sufficient computational and memory resources to operate smoothly, especially under load. Under-provisioning leads to general sluggishness that manifests as slow follower sync.
  • Example Fix: If etcd nodes only have 1 CPU core and 2GB RAM, upgrading them to 4 cores and 8GB RAM will provide ample headroom.
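
  The provisioning check can be scripted so it runs the same way on every node. This is a minimal sketch: check_provisioning is a hypothetical helper, and the minimums are passed in as arguments so you can use whatever floor you have settled on rather than values baked into the script.

```shell
# check_provisioning CORES MEM_MIB MIN_CORES MIN_MEM_MIB
# Compares a node's resources with the given minimums; returns 1 if short.
check_provisioning() {
    cores="$1"; mem_mib="$2"; min_cores="$3"; min_mem_mib="$4"; status=0
    if [ "$cores" -lt "$min_cores" ]; then
        echo "CPU cores: ${cores} (minimum ${min_cores})"
        status=1
    fi
    if [ "$mem_mib" -lt "$min_mem_mib" ]; then
        echo "memory: ${mem_mib}MiB (minimum ${min_mem_mib}MiB)"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "provisioning meets the stated minimums"
    fi
    return "$status"
}
```

  Usage on Linux: check_provisioning "$(nproc)" "$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)" 2 3800. The 3800MiB figure is an assumption that leaves room for the memory the kernel reserves on a nominal 4GB machine.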

Cause 6: Network Interface Card (NIC) Offloading Issues or Driver Bugs

  • Diagnosis: Check system logs (dmesg, syslog) for NIC-related errors or warnings. Examine NIC offloading settings.
    ethtool -k <interface-name>
    
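    Note whether tx-checksumming or rx-checksumming is on; these offloads can sometimes cause issues with high-throughput network traffic.
  • Fix: Disable problematic offloading features or update NIC drivers. Changes made with ethtool -K do not persist across reboots, so add them to your network configuration once they prove helpful.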
    # Example: Disable tx/rx checksumming
    sudo ethtool -K <interface-name> tx off rx off
    
    Why it works: Certain NIC offloading features, especially checksum offloading, can sometimes be buggy or interact poorly with specific network stacks or high-speed traffic, leading to corrupted packets or dropped traffic that disrupts etcd’s communication.
  • Example Fix: If dmesg shows warnings about checksum offload errors and ethtool -k eth0 shows tx-checksumming: on, disabling it with ethtool -K eth0 tx off might resolve intermittent packet loss.
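
  To compare offload settings across several nodes, the relevant lines can be pulled out of the ethtool -k output. A small sketch, assuming the top-level feature lines have the form "rx-checksumming: on" and that sub-features are indented; checksum_offload_state is a hypothetical helper name.

```shell
# checksum_offload_state
# Reads `ethtool -k <interface>` output on stdin and prints only the two
# top-level checksum offload settings as name=value pairs.
# Assumption: indented sub-features (e.g. tx-checksum-ipv4) should be skipped.
checksum_offload_state() {
    awk -F': ' '
        $1 == "rx-checksumming" || $1 == "tx-checksumming" { print $1 "=" $2 }
    '
}
```

  Usage: ethtool -k eth0 | checksum_offload_state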

A separate error you may encounter next is etcdserver: mvcc: database space exceeded. It means the backend database has reached its space quota (--quota-backend-bytes), so it calls for compaction and defragmentation rather than faster disks.
