Debugging Distributed Consensus: The Hard Parts

Consensus failures in distributed systems occur when nodes cannot agree on the state of the system, which leads to data inconsistencies or service outages.

Here’s a typical scenario: consumers of a Kafka cluster report that messages are being processed twice or not at all, and the controller logs are full of entries like these:

    [Error] Controller 0-0: Error sending fetch request to broker 2: (error code 5, LEADER_NOT_AVAILABLE)
    [Error] Controller 0-0: Error sending produce request to broker 1: (error code 5, LEADER_NOT_AVAILABLE)

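This specific error, LEADER_NOT_AVAILABLE, means that the partition being addressed has no usable leader at that moment: either a leader election is still in progress, or the broker that was expected to be the leader didn’t respond or reported that it is no longer the leader. The Kafka controller (the brain of the cluster) is responsible for electing a replacement; until it does, the chain of command for producing and consuming messages on that partition is broken. The error is often transient, so it matters mainly when it persists or keeps recurring.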

Common Causes and Fixes

Broker Crashing or Unresponsive:
- Diagnosis: Check the broker logs (e.g., /var/log/kafka/server.log) for signs of crashes (OutOfMemoryError, StackOverflowError, fatal error) or long garbage collection pauses. Monitor broker resource utilization (CPU, memory, disk I/O) using tools like top, htop, or Prometheus/Grafana.
- Fix: If a broker is consistently crashing due to OOM, increase its JVM heap size. For example, in kafka-server-start.sh or a systemd unit file, find the KAFKA_HEAP_OPTS environment variable and adjust it, e.g., export KAFKA_HEAP_OPTS="-Xmx8g -Xms8g". If it’s resource contention, scale up the underlying hardware or optimize other services on the same machine.
- Why it works: Providing more memory or reducing contention allows the broker process to run without crashing or being starved of resources, enabling it to respond to controller requests.
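
The log check in the diagnosis step can be sketched in a few lines of Python. This is a minimal sketch, not a monitoring tool: the marker list is taken from the diagnosis above, and the function name count_crash_markers is made up for this example.

```python
from collections import Counter

# Markers taken from the diagnosis step; extend the tuple for your own logs.
CRASH_MARKERS = ("OutOfMemoryError", "StackOverflowError", "fatal error")


def count_crash_markers(lines):
    """Count how many lines mention each crash marker (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for marker in CRASH_MARKERS:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts
```

To use it, pass any iterable of lines, for example an open file handle on /var/log/kafka/server.log, and print the resulting counts. A non-zero OutOfMemoryError count is the case where raising KAFKA_HEAP_OPTS is worth trying.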

Network Partition:
- Diagnosis: Use ping and traceroute between controller nodes and affected brokers to check for packet loss or high latency. Examine firewall logs on both client and server sides for dropped connections. Tools like tcpdump can reveal if packets are even reaching the destination.
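- Fix: If a firewall is blocking traffic, open the ports your deployment actually uses: by default 9092 for Kafka listeners and, if you run ZooKeeper, 2181 for client connections plus 2888 and 3888 for quorum traffic. Inter-broker traffic uses whichever listener inter.broker.listener.name (or security.inter.broker.protocol) selects; 9093 is a common choice for a second listener but is not a built-in default. If it’s a network misconfiguration, correct the routing or switch configuration. Ensure listeners and advertised.listeners in server.properties match the interfaces and hostnames that brokers and clients can actually reach.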
- Why it works: Restoring network connectivity allows the controller to communicate with the brokers as expected, resolving the LEADER_NOT_AVAILABLE error by enabling the controller to find and interact with the actual leader.
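
ping only proves that ICMP gets through, so it helps to test the TCP ports themselves. The following is a minimal sketch using only Python’s standard library; the helper names can_connect and check_ports are made up for this example, and the ports to test depend on your own listener configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Run it from the controller host against each affected broker, for example check_ports("broker-2.example.internal", [9092]) (a hypothetical hostname). A False result narrows the problem to the firewall, routing, or listener configuration rather than to Kafka itself.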

ZooKeeper Issues (if using ZooKeeper for Kafka metadata):
- Diagnosis: Check the ZooKeeper server logs (zookeeper.out) for errors such as a missing myid file or OutOfMemoryError. Verify quorum health: every node should report Mode: leader or Mode: follower, with exactly one leader. Use echo srvr | nc <zookeeper_host> 2181 to check an individual node; stat gives more detail, but on ZooKeeper 3.5 and later it must first be enabled through 4lw.commands.whitelist.
- Fix: Ensure ZooKeeper nodes can communicate with each other. If myid is missing, recreate it in the ZooKeeper data directory (dataDir); it must contain that server’s id as listed in the server.N entries of zoo.cfg. If ZooKeeper runs out of memory, increase its JVM heap size (for example through SERVER_JVMFLAGS in conf/java.env, or in the systemd unit). Ensure ZooKeeper’s tickTime, initLimit, and syncLimit are appropriately configured for your network latency.
- Why it works: Kafka relies heavily on ZooKeeper for leader election and metadata. A healthy ZooKeeper ensemble ensures that Kafka brokers can correctly register, discover leaders, and maintain cluster state.
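
The per-node check can be automated across the whole ensemble. The sketch below sends the srvr four-letter command over a plain socket and reads the Mode line from the reply. It assumes srvr is allowed on your servers; the function names are made up for this example.

```python
import socket


def send_four_letter_word(host, port=2181, command="srvr", timeout=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Extract the value of the 'Mode:' line, or None if it is absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def quorum_summary(modes):
    """Summarise a {host: mode} mapping; healthy means one leader, no unknowns."""
    leaders = [host for host, mode in modes.items() if mode == "leader"]
    unknown = [host for host, mode in modes.items()
               if mode not in ("leader", "follower")]
    return {"leaders": leaders, "unknown": unknown,
            "healthy": len(leaders) == 1 and not unknown}
```

Build the modes mapping with parse_mode(send_four_letter_word(host)) for each ensemble member, catching OSError for nodes that are down and recording None for them. A single-node ZooKeeper reports Mode: standalone, which this sketch deliberately treats as unknown because it is written for a quorum.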

Broker Disk Full or I/O Throttling:
- Diagnosis: Monitor disk space on broker nodes (df -h). Check broker logs for I/O failures such as java.io.IOException: No space left on device or KafkaStorageException. Use iostat -xz 1 to observe disk utilization and await times.
- Fix: Free up disk space or increase storage capacity. Prefer tightening Kafka’s retention policies (log.retention.hours, log.retention.bytes) and letting the broker delete old segments over removing files from the data directories by hand, which risks leaving partitions in an inconsistent state; application logs such as server.log can be rotated or deleted safely. If I/O is the bottleneck, upgrade to faster disks (SSDs).
- Why it works: Kafka needs to write data to disk for durability and to serve requests. Full disks or slow I/O prevent these operations, making brokers appear unresponsive and causing leader election failures.
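
A disk-space check is easy to script so that it can run before the disk actually fills. This is a minimal sketch using Python’s standard library; the 85% warning threshold and the function name disk_report are assumptions for illustration, not Kafka recommendations.

```python
import shutil


def disk_report(path, warn_fraction=0.85):
    """Report how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)
    # Fraction of the filesystem that is not available to this process.
    used_fraction = 1.0 - (usage.free / usage.total) if usage.total else 0.0
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "used_fraction": used_fraction,
        "warn": used_fraction >= warn_fraction,
    }
```

Call it once for each directory listed under log.dirs in server.properties and alert when warn is True.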
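
Incorrect replica.lag.time.max.ms Configuration:
- Diagnosis: This one is more subtle. If brokers are healthy but intermittently slow to replicate, the partition leader removes any follower that has not caught up within replica.lag.time.max.ms from the in-sync replica set (ISR). A shrunken ISR leaves fewer candidates for leader election, so a later leader failure can leave the partition without a leader. Check server.properties for replica.lag.time.max.ms, and look for frequent ISR shrink and expand messages in the broker logs.
- Fix: Increase replica.lag.time.max.ms. The default is 30000 ms since Kafka 2.5 and was 10000 ms before that, so on an older cluster try 30000 ms, and on a newer one a higher value such as 60000 ms. This gives replicas more time to catch up before being considered out of sync.
- Why it works: This setting is the length of time a follower may lag before it is dropped from the ISR. Increasing it provides a longer grace period for temporary network glitches or brief broker slowdowns, which keeps more replicas eligible for leadership. The trade-off is that a genuinely stuck replica is detected later.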
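
The lag rule can be illustrated with a small model. This is a simplified sketch of the idea, not Kafka’s actual implementation: a follower counts as out of sync when it is behind the leader and has not been fully caught up for longer than the threshold. All names here are made up for this example.

```python
def is_out_of_sync(now_ms, last_caught_up_ms, follower_end_offset,
                   leader_end_offset, replica_lag_time_max_ms=30000):
    """Simplified model of the ISR lag check for one follower."""
    behind = follower_end_offset < leader_end_offset
    stale = (now_ms - last_caught_up_ms) > replica_lag_time_max_ms
    return behind and stale


# A follower that stalled for 15 seconds while 10 messages behind:
stalled = dict(now_ms=15000, last_caught_up_ms=0,
               follower_end_offset=90, leader_end_offset=100)

dropped_at_10s = is_out_of_sync(replica_lag_time_max_ms=10000, **stalled)
dropped_at_30s = is_out_of_sync(replica_lag_time_max_ms=30000, **stalled)
```

With a 10000 ms threshold the stalled follower is dropped from the ISR; with 30000 ms it survives the same 15-second stall, which is the grace period the fix relies on.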

Controller Overload or Misconfiguration:
- Diagnosis: If you have multiple Kafka brokers, check the logs of the controller broker (you can often identify it by a [Controller id=X] tag). Is it overwhelmed with requests? Are its logs showing similar network errors to the ones affecting other brokers?
- Fix: Ensure the controller broker has sufficient resources. If it’s a dedicated controller, it should have good network connectivity and CPU. Restarting the controller broker forces the cluster to elect a new controller, which can clear transient issues. If you suspect a ZooKeeper interaction issue, ensure ZooKeeper is healthy.
- Why it works: The controller is responsible for managing partitions, leaders, and replicas. If the controller itself is unhealthy or struggling, it can’t accurately track partition leaders, leading to widespread LEADER_NOT_AVAILABLE errors across the cluster.
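
One quick signal of controller trouble is the controller role moving between brokers. The sketch below counts log lines per controller id using the [Controller id=X] tag mentioned in the diagnosis; the function name is made up for this example, and the exact tag format may differ between Kafka versions.

```python
import re
from collections import Counter

# Matches tags such as "[Controller id=1]" and "[Controller id=1, ...]".
CONTROLLER_TAG = re.compile(r"\[Controller id=(\d+)")


def controller_ids_seen(lines):
    """Count log lines per controller id found in the given lines."""
    counts = Counter()
    for line in lines:
        match = CONTROLLER_TAG.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feed it the combined controller logs from all brokers for the incident window. More than one id in the result suggests the controller role changed hands during that period, which is worth correlating with the LEADER_NOT_AVAILABLE errors.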

The next error you’ll likely hit after fixing these is related to partition reassignments or unclean leader elections, as the system tries to recover from the prior state of instability.