Heartbeat protocols are the unsung heroes of distributed systems, and their most surprising trick is that they don’t actually detect failures; they detect liveness.

Imagine two servers, web-01 and web-02, that need to know if the other is still alive and kicking. If web-01 stops sending its "I’m alive!" signal, web-02 doesn’t know web-01 has crashed. It only knows it hasn’t heard from web-01. This distinction is crucial because network glitches, temporary packet loss, or a brief garbage collection pause on web-01 can all cause a heartbeat to arrive late or not at all, even though web-01 is not actually dead.

Let’s see this in action with a common tool for implementing heartbeats: Redis. We’ll use two nodes, redis-a and redis-b, that monitor each other through keys stored in a Redis instance both of them can reach. The key has to live somewhere both nodes can see; a key written to one node’s private instance would be invisible to the other.

First, redis-a sets a key that represents its liveness, and gives it a Time-To-Live (TTL) so the key disappears automatically if redis-a goes silent.

# On redis-a
redis-cli SET server:redis-a:heartbeat 1 EX 5

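This command sets the key server:redis-a:heartbeat to the value 1 and makes it expire in 5 seconds. To stay "alive", redis-a has to re-run the command on a schedule shorter than the TTL (for example every 1 to 2 seconds), so the key only lapses when redis-a stops refreshing it.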

Now, on redis-b, we’ll periodically check for the existence of redis-a’s heartbeat.

# On redis-b
redis-cli EXISTS server:redis-a:heartbeat

If redis-b receives a 1 (meaning the key exists), redis-a is considered alive. If it receives a 0, the key has expired, and redis-b infers that redis-a is likely down.
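The refresh-and-check cycle can be sketched in Python without a live server. The `TTLStore` class below is a hypothetical in-memory stand-in for a Redis instance that both nodes can reach; it only imitates `SET key 1 EX seconds` and `EXISTS key`, and the clock is passed in explicitly so the expiry is easy to follow. The refresh times (every 2 seconds, stopping after t=4) are assumptions chosen for illustration.

```python
class TTLStore:
    """Hypothetical in-memory stand-in for a Redis instance shared by both nodes."""

    def __init__(self):
        self._expires_at = {}  # key -> absolute expiry time in seconds

    def set_ex(self, key, ttl, now):
        """Imitates `SET key 1 EX ttl`: (re)creates the key and restarts its TTL."""
        self._expires_at[key] = now + ttl

    def exists(self, key, now):
        """Imitates `EXISTS key`: 1 while the TTL has not run out, else 0."""
        expiry = self._expires_at.get(key)
        return 1 if expiry is not None and now < expiry else 0


KEY = "server:redis-a:heartbeat"
store = TTLStore()

# redis-a refreshes its key every 2 seconds, then goes silent after t=4.
for t in (0, 2, 4):
    store.set_ex(KEY, ttl=5, now=t)

# redis-b polls the key at a few points in time.
observations = {t: store.exists(KEY, now=t) for t in (1, 5, 8, 9, 10)}
```

In this run `observations` is 1 at t = 1, 5 and 8, and 0 at t = 9 and 10: the last refresh at t=4 keeps the key alive until t=9, after which redis-b sees it as expired.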

In a real-world scenario, this check would be part of an automated script or a load balancer. If redis-b detects that redis-a’s heartbeat has expired, it might trigger an alert, stop sending traffic to redis-a, or even attempt to fail over a service that redis-a was managing.
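As a rough illustration of how such a script might be wired up, the sketch below turns a stream of EXISTS-style results into actions, firing only when the observed state changes so that one outage does not produce a flood of alerts. The function name `react_to_heartbeats`, the callbacks and the sample sequence are all hypothetical.

```python
def react_to_heartbeats(samples, on_down, on_up):
    """Call on_down/on_up only when the observed state changes.

    `samples` is an iterable of EXISTS-style results (1 = key present,
    0 = key expired). The peer is assumed alive before the first sample.
    """
    alive = True
    for result in samples:
        now_alive = result == 1
        if alive and not now_alive:
            on_down()
        elif not alive and now_alive:
            on_up()
        alive = now_alive


events = []
react_to_heartbeats(
    [1, 1, 0, 0, 0, 1],  # made-up poll results for redis-a's key
    on_down=lambda: events.append("remove redis-a from pool"),
    on_up=lambda: events.append("return redis-a to pool"),
)
```

Here `events` ends up with exactly two entries, one removal and one return, even though three consecutive polls saw the key missing.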

The core problem heartbeats solve is maintaining a shared understanding of system state in the face of unreliable networks and potential component failures. Without them, a server might continue to accept requests or perform actions that its peers believe it’s incapable of, leading to data inconsistencies or service disruptions.

The liveness signal is typically a simple message or a regularly updated timestamp. The two critical tuning parameters are the interval at which this signal is sent and the timeout period, which is how long a peer waits before declaring the sender dead. A short interval and timeout mean faster detection of failures but also increase the risk of false positives due to transient network issues. A longer interval and timeout reduce false positives but delay failure detection.
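The trade-off can be made concrete with a little arithmetic. The sketch below assumes an idealized model that is not part of the text above: heartbeats arrive instantly, the receiver checks continuously, and a heartbeat arriving exactly at the deadline counts as on time. The function names are illustrative.

```python
import math


def detection_window(interval, timeout):
    """Return (shortest, longest) delay between a crash and its detection.

    A crash just before the next heartbeat was due is noticed after roughly
    timeout - interval seconds; a crash just after a heartbeat was sent is
    noticed after roughly timeout seconds.
    """
    return (timeout - interval, timeout)


def tolerated_losses(interval, timeout):
    """Consecutive lost heartbeats that still do not trigger a false positive."""
    return max(math.floor(timeout / interval) - 1, 0)


aggressive = (detection_window(1, 3), tolerated_losses(1, 3))
conservative = (detection_window(5, 30), tolerated_losses(5, 30))
```

With a 1-second interval and 3-second timeout, a crash is noticed within 2 to 3 seconds but only 2 lost heartbeats in a row are tolerated. With a 5-second interval and 30-second timeout, 5 losses in a row are tolerated, at the cost of a 25 to 30 second detection delay.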

The mechanism of a heartbeat often involves a dedicated network channel or a shared resource (like a database entry or a distributed lock). The sender periodically updates this resource to indicate its presence. The receiver monitors this resource, expecting regular updates. If the updates stop for longer than a predefined threshold, the receiver assumes the sender has failed.
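A minimal sketch of the receiver side, assuming heartbeats are recorded as timestamps in a plain dictionary rather than in a real database or lock service. The class name `HeartbeatMonitor` and the 5-second timeout are illustrative choices.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer and flags overdue ones."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> time of most recent heartbeat

    def record(self, peer, now):
        """Called whenever a heartbeat from `peer` is observed."""
        self.last_seen[peer] = now

    def suspected(self, now):
        """Peers whose last heartbeat is older than the timeout.

        These are only suspects: the monitor knows it has not heard from
        them, not that they have crashed.
        """
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=5)
monitor.record("web-01", now=0)
monitor.record("web-02", now=0)
monitor.record("web-02", now=4)  # web-02 keeps reporting in, web-01 does not
```

Calling `monitor.suspected(now=6)` returns `["web-01"]`, because web-01 was last heard from 6 seconds ago while web-02 reported in at t=4.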

One critical aspect often overlooked is the network path between the heartbeating nodes. If node A’s heartbeats reach node B, but node B’s heartbeats cannot reach node A, then node B appears to be down from node A’s perspective, even though it’s perfectly functional. This asymmetry in communication can lead to split-brain scenarios where both nodes believe they are the primary and start acting independently, causing data corruption. This is why bi-directional heartbeats, ideally backed by a robust quorum mechanism, are essential for critical systems.
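The quorum idea can be sketched as a simple majority check. This is a deliberately simplified illustration, not a consensus protocol: the `has_quorum` function and the `heard_from` maps are hypothetical, and a real system would also require two-way communication before counting a peer.

```python
def has_quorum(node, heard_from, cluster):
    """True if `node`, plus the peers whose heartbeats reach it, form a majority."""
    visible = {node} | (heard_from[node] & set(cluster))
    return len(visible) > len(cluster) // 2


# heard_from[x] is the set of peers whose heartbeats currently reach x.

# Two nodes, one-way fault: b's heartbeats do not reach a.
pair = ["a", "b"]
pair_view = {"a": set(), "b": {"a"}}

# Three nodes, same fault on a: nothing reaches a, b and c still hear everyone.
trio = ["a", "b", "c"]
trio_view = {"a": set(), "b": {"a", "c"}, "c": {"a", "b"}}

pair_result = {n: has_quorum(n, pair_view, pair) for n in pair}
trio_result = {n: has_quorum(n, trio_view, trio) for n in trio}
```

In both cases node a, which hears nobody, fails the check and so should not promote itself, while the nodes that can still see a majority pass it. The check only decides who is eligible to act; choosing a single primary among the eligible nodes is a separate step.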

The next challenge is ensuring that a detected "failure" actually triggers the correct recovery or failover action without introducing new problems.
