Distributed systems don’t get slow because they’re doing too much work; they get slow because of the communication overhead of coordinating that work.

Let’s watch a simulated distributed key-value store, like Redis or etcd, handle a burst of writes. Imagine we have three nodes: kv-node-1, kv-node-2, and kv-node-3. A client sends a SET key1 value1 request.

# Client sends SET key1 value1 to kv-node-1
# kv-node-1 is the leader for the 'key1' shard.
# kv-node-1 writes to its local log.
# kv-node-1 appends an entry to its Raft log: {term: 3, index: 105, command: SET key1 value1}

# kv-node-1, as leader, forwards this log entry to followers kv-node-2 and kv-node-3.
# kv-node-2 receives entry {term: 3, index: 105, command: SET key1 value1}
# kv-node-3 receives entry {term: 3, index: 105, command: SET key1 value1}

# kv-node-2 appends to its log and acknowledges kv-node-1.
# kv-node-3 appends to its log and acknowledges kv-node-1.

# Once kv-node-1 receives acknowledgements from a majority (e.g., kv-node-2 and kv-node-3),
# it commits the entry.
# kv-node-1 applies the command: stores key1=value1 locally.
# kv-node-1 responds to the client: "OK".

# kv-node-2 and kv-node-3, on their next heartbeat or log replication cycle,
# will also commit and apply the command to their local state.

This process, even for a single write, involves network hops, serialization, deserialization, consensus (Raft, Paxos), and state application. Latency is the time for one request to complete. Throughput is the number of requests completed per unit of time. Bottlenecks are points where resources are exhausted, limiting throughput.

The common culprits for performance degradation in distributed systems are:

  1. Network Saturation/High Latency: If the network links between nodes are oversubscribed or have high inherent latency, Raft/Paxos heartbeats and log replication will slow to a crawl. This directly impacts commit times and thus client latency.

    • Diagnosis: Use iperf3 between nodes to test raw bandwidth. Monitor network interface statistics (ifconfig or ip -s a) for errors, drops, or excessive utilization. Use ping to check round-trip times.
    • Fix: Increase bandwidth (e.g., upgrade from 1Gbps to 10Gbps NICs/switches). Reduce latency by co-locating nodes in the same availability zone or data center, or by optimizing routing.
    • Why it works: Faster, more reliable data transfer between nodes allows consensus protocols to achieve quorum and commit operations much quicker.
  2. CPU Contention on Leader/Followers: The Raft/Paxos leader is responsible for receiving client requests, appending to its log, replicating to followers, and applying committed entries. Followers also need CPU to receive entries, append, and apply. If any node is CPU-bound, it will fall behind.

    • Diagnosis: Monitor CPU utilization per core (top, htop, mpstat -P ALL 1). Look for sustained high utilization (above 80-90%) on critical nodes, especially the leader. Check for high iowait if disk I/O is also a factor.
    • Fix: Scale up the instance size (more vCPUs). Optimize application code or system processes consuming CPU. Offload non-critical tasks to separate instances.
    • Why it works: More CPU cycles allow nodes to process incoming requests, perform Raft/Paxos operations, and apply state changes more rapidly, keeping up with the workload.
  3. Disk I/O Bottlenecks: Raft/Paxos logs are typically written to disk. If the disk subsystem cannot keep up with the rate of log appends, replication will stall. Applying state changes (e.g., writing to a database or updating an in-memory map) can also be disk-bound.

    • Diagnosis: Monitor disk I/O metrics: IOPS, throughput (MB/s), and latency (iostat -xz 1). Look for high %util, high await times, and low rkB/s or wkB/s relative to available bandwidth.
    • Fix: Use faster storage (e.g., provisioned IOPS SSDs, NVMe drives). Optimize the filesystem or database configuration. Reduce the rate of writes if possible.
    • Why it works: Faster disk I/O allows log entries to be persisted quickly, enabling the leader to acknowledge and commit them, and followers to catch up without blocking replication.
  4. Garbage Collection Pauses (for JVM-based systems): If your distributed system runs on the JVM (e.g., Kafka, Elasticsearch), long garbage collection (GC) pauses can halt application threads, including those involved in network I/O and consensus.

    • Diagnosis: Enable GC logging (-Xlog:gc*:file=gc.log). Analyze GC logs for long pause times (e.g., > 100ms). Monitor application thread states (e.g., using jstack or APM tools) to see if threads are frequently in a "GC" state.
    • Fix: Tune GC parameters (e.g., use G1GC or ZGC, adjust heap size -Xmx, -Xms). Optimize object allocation patterns to reduce GC pressure.
    • Why it works: Minimizing GC pause times ensures that application threads remain responsive, allowing them to participate in network communication and consensus protocols without significant delays.
  5. Inefficient Serialization/Deserialization: The data exchanged between nodes (log entries, heartbeats, client requests) needs to be serialized and deserialized. Inefficient formats or implementations can consume significant CPU and add latency.

    • Diagnosis: Profile the application to identify time spent in serialization/deserialization libraries. Monitor CPU usage patterns.
    • Fix: Switch to a more performant serialization format (e.g., Protocol Buffers, Avro instead of JSON or XML). Optimize the implementation of custom serializers.
    • Why it works: Faster serialization and deserialization reduce the CPU overhead per message, allowing nodes to process more messages in the same amount of time and reducing overall request latency.
  6. Resource Starvation/Throttling: If the system runs in a noisy neighbor environment (e.g., shared cloud instances) or has resource limits applied (e.g., Kubernetes limits), it might be throttled externally.

    • Diagnosis: Check cloud provider metrics for instance throttling (CPU, network, disk). Examine Kubernetes pod metrics for CPU/memory throttling. Review system logs for any indications of being rate-limited.
    • Fix: Increase resource limits or quotas. Migrate to dedicated instances or a less contended environment. Adjust application behavior to stay within limits.
    • Why it works: Removing external resource constraints allows the system to utilize its allocated hardware resources fully, preventing artificial slowdowns.
  7. Lock Contention within the Application: Even if the consensus protocol is fast, the actual application of the state change might involve internal locks. If multiple threads contend for the same locks while applying updates, this becomes a bottleneck.

    • Diagnosis: Use thread dumps (jstack for Java) or profiling tools to identify threads waiting on locks. Analyze the code path for critical sections and lock acquisition patterns.
    • Fix: Reduce the scope of locks. Use more granular locks or lock-free data structures. Parallelize the application of state changes if the operation is idempotent or can be applied in any order.
    • Why it works: Reducing internal lock contention allows more threads to make progress concurrently, speeding up the application of committed state changes and reducing the overall time to complete a transaction.

The next error you’ll likely encounter after fixing these issues is a Request Timeout on a different component, often a downstream service or a caching layer, because now your distributed system is fast enough to expose their limitations.

Want structured learning?

Take the full Distributed Systems course →