Distributed Locking: Consensus & Failure Modes

Redis, ZooKeeper, and etcd are all capable of implementing distributed locks, but they do so with fundamentally different guarantees and operational complexities.

Let’s see how this plays out in practice with a simple scenario: acquiring a lock named my_resource_lock.

Redis

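Redis is often the go-to for its speed and simplicity. The common pattern is a single SET command with the NX (only set if the key does not exist) and EX (expire after N seconds) options. The older SETNX command (SET if Not eXists) cannot set an expiration in the same step, so pairing it with a separate EXPIRE call leaves a window in which a crashed client can hold the lock forever.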

# Attempt to acquire the lock
SET my_resource_lock some_unique_value NX EX 30

# If successful, you get 'OK'. If not, you get nil.
# 'some_unique_value' should be a unique identifier for your client.
# 'EX 30' sets an expiration of 30 seconds.

If the command succeeds, you hold the lock for at most 30 seconds, so the protected work must finish (or the expiration must be extended) within that window. To release the lock, use a Lua script that deletes the key only if it still contains some_unique_value. This prevents you from accidentally releasing a lock that another client acquired after yours expired.

# Release the lock (Lua script); pass the same value that was used to acquire it
EVAL "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end" 1 my_resource_lock some_unique_value

The core problem Redis solves here is preventing simultaneous access to a shared resource. The primary mechanism is atomicity: the SET NX EX command is a single, indivisible operation. If the key doesn’t exist, it’s created with the value and expiration. If it does, the command fails. The expiration handles the case where a client holding the lock crashes, preventing a permanent deadlock.
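The acquire and release semantics described above can be modeled in a few lines. The following is a minimal in-memory sketch, not a Redis client: the Store type, its method names, and the logical clock passed in as now are all assumptions made for illustration.

```go
package main

import "fmt"

// entry is one key in the toy store: a value plus an absolute expiry time.
type entry struct {
	value     string
	expiresAt int64 // logical seconds
}

// Store is a single-threaded, in-memory stand-in for one Redis instance.
type Store struct {
	keys map[string]entry
}

func NewStore() *Store { return &Store{keys: map[string]entry{}} }

// SetNXEX models `SET key value NX EX ttl`: it writes the key only if the key
// is absent (or already expired) at time now, and reports whether it did.
func (s *Store) SetNXEX(key, value string, ttl, now int64) bool {
	if e, ok := s.keys[key]; ok && now < e.expiresAt {
		return false
	}
	s.keys[key] = entry{value: value, expiresAt: now + ttl}
	return true
}

// ReleaseIfOwner models the compare-and-delete Lua script: the key is removed
// only if it still exists and still holds the caller's value.
func (s *Store) ReleaseIfOwner(key, value string, now int64) bool {
	e, ok := s.keys[key]
	if !ok || now >= e.expiresAt || e.value != value {
		return false
	}
	delete(s.keys, key)
	return true
}

func main() {
	s := NewStore()
	fmt.Println(s.SetNXEX("my_resource_lock", "client-a", 30, 0))     // A acquires
	fmt.Println(s.SetNXEX("my_resource_lock", "client-b", 30, 10))    // B is rejected
	fmt.Println(s.SetNXEX("my_resource_lock", "client-b", 30, 31))    // A's lock expired, B acquires
	fmt.Println(s.ReleaseIfOwner("my_resource_lock", "client-a", 32)) // A's late release is refused
	fmt.Println(s.ReleaseIfOwner("my_resource_lock", "client-b", 33)) // B releases its own lock
}
```

The fourth call is the case the Lua script exists for: without the value check, client A's late release would delete the lock that client B now holds.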

The surprising truth about Redis locks is that they do not provide the safety guarantees of a lock built on a consensus protocol such as Paxos or Raft. They are better described as "best-effort" locks: good enough when the lock only avoids duplicate work, risky when correctness depends on strict mutual exclusion. The trouble shows up in failure scenarios involving process pauses, network delays, clock drift, and failover. A client can stall (for example, in a long garbage-collection pause or behind a slow network) until its key expires. A second client then acquires the lock while the first still believes it holds it, and both act on the shared resource. Because Redis replication is asynchronous, a failover can also promote a replica that never received the lock key, so the same lock is granted twice. A common mitigation is a fencing token: a number that increases with every lock grant and that the protected resource checks before accepting a request.
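To make the stale-holder problem concrete, here is a minimal sketch of a fencing-token check, in which the resource remembers the highest token it has accepted and rejects anything older. The FencedResource type and the token values are hypothetical, and how tokens are issued alongside the lock is outside the sketch.

```go
package main

import "fmt"

// FencedResource is a hypothetical stand-in for the protected resource. It
// remembers the highest fencing token it has accepted so far.
type FencedResource struct {
	highestToken int64
	data         string
}

// Write applies the update only if the token is not older than one the
// resource has already accepted.
func (r *FencedResource) Write(token int64, data string) error {
	if token < r.highestToken {
		return fmt.Errorf("stale token %d (latest seen: %d)", token, r.highestToken)
	}
	r.highestToken = token
	r.data = data
	return nil
}

func main() {
	res := &FencedResource{}

	// Client A acquired the lock with token 33, then stalled past its expiration.
	// Client B acquired the lock with token 34 and writes first.
	fmt.Println(res.Write(34, "from B"))

	// Client A wakes up, still believing it holds the lock. Its write is rejected.
	fmt.Println(res.Write(33, "from A"))

	fmt.Println(res.data) // still the value written by B
}
```

The check lives in the resource rather than in the lock service, which is why it still works when a client is wrong about holding the lock.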

ZooKeeper

ZooKeeper guarantees strong consistency and provides a hierarchical, znode-based namespace. Distributed locks in ZooKeeper are typically implemented using ephemeral, sequential znodes.

Here’s the conceptual flow:

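Create an ephemeral, sequential znode: A client creates an ephemeral, sequential znode under a designated lock path (e.g., /locks/my_resource_lock/lock-, to which ZooKeeper appends a sequence number). Because the znode is ephemeral, ZooKeeper deletes it automatically when the client’s session ends, whether the session is closed explicitly or expires after the client stops responding.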
Get children and sort: The client then retrieves all children of the lock path and sorts them by their sequence numbers.
Acquire the lock: If the client’s znode has the lowest sequence number among all children, it holds the lock.
Watch the predecessor: If the client’s znode is not the lowest, it watches the znode immediately preceding its own in the sorted list. When that predecessor znode is deleted (meaning the client holding that lock has released it or crashed), the client holding the watch is notified and can re-evaluate if it now holds the lock.
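The decision made in the last three steps can be sketched as a pure function over the list of child names. This is only the sorting and predecessor logic, with no ZooKeeper client involved: the nextStep and seqOf helpers and the lock- prefix are assumptions, while the 10-digit zero-padded suffix is the format ZooKeeper appends to sequential znodes.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// child pairs a znode name with its parsed sequence number.
type child struct {
	name string
	seq  int
}

// seqOf extracts the 10-digit suffix of a sequential znode name,
// e.g. "lock-0000000007" yields 7.
func seqOf(name string) (int, error) {
	if len(name) < 10 {
		return 0, fmt.Errorf("znode name %q has no sequence suffix", name)
	}
	return strconv.Atoi(name[len(name)-10:])
}

// nextStep decides what a client should do given its own znode name and the
// current children of the lock path. It returns held=true if the client owns
// the lock; otherwise watch names the predecessor znode to set a watch on.
func nextStep(mine string, children []string) (held bool, watch string, err error) {
	parsed := make([]child, 0, len(children))
	for _, name := range children {
		seq, err := seqOf(name)
		if err != nil {
			return false, "", err
		}
		parsed = append(parsed, child{name: name, seq: seq})
	}
	sort.Slice(parsed, func(i, j int) bool { return parsed[i].seq < parsed[j].seq })

	for i, c := range parsed {
		if c.name != mine {
			continue
		}
		if i == 0 {
			return true, "", nil // lowest sequence number: lock is held
		}
		return false, parsed[i-1].name, nil // wait on the immediate predecessor
	}
	// Our znode is gone, for example because the session expired.
	return false, "", fmt.Errorf("znode %q not found among children", mine)
}

func main() {
	children := []string{"lock-0000000012", "lock-0000000010", "lock-0000000011"}
	for _, me := range children {
		held, watch, err := nextStep(me, children)
		fmt.Println(me, held, watch, err)
	}
}
```

Each waiting client watches only its immediate predecessor, so a release wakes one client rather than every client in the queue.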

The strength of ZooKeeper’s approach comes from Zab (ZooKeeper Atomic Broadcast), the consensus protocol that replicates every write through a quorum of servers. All updates are applied in a single total order, and every client observes updates in that order, although a read served by a lagging server can be slightly stale. The ephemeral and sequential nature of znodes is crucial:

Ephemeral: Guarantees that if a client crashes, its lock attempt (the znode) is cleaned up automatically once its session times out.
Sequential: Assigns a unique, monotonically increasing sequence number to each znode created. This allows clients to reliably determine which client is "next" in line.

The entire process ensures that only one client can be the "first" (lowest sequence number) at any given time, effectively granting exclusive access. The watch mechanism ensures that clients are notified promptly when their turn arrives without constant polling.

The most counterintuitive aspect of ZooKeeper locks is how they behave during leader election and failover. When the leader fails, or loses contact with a quorum of followers, the ensemble (the cluster of ZooKeeper servers) elects a new leader. Until the election completes, ZooKeeper cannot process writes. If your lock acquisition or release coincides with an election, the operation is delayed until a new leader is in place. This preserves correctness, but it is a latency cost that many users underestimate when choosing ZooKeeper for high-throughput, low-latency locking.

etcd

etcd, like ZooKeeper, is a strongly consistent distributed store built on a consensus protocol. In etcd’s case it is a key-value store replicated with the Raft algorithm. Its API is simpler than ZooKeeper’s, and it’s often favored in Kubernetes environments.

A common pattern for distributed locks in etcd involves using leases.

Create a lease: A client first creates a lease with a TTL (Time To Live). This lease is associated with a unique ID.
Attempt to create a key with the lease: The client then attempts to create a key (e.g., /locks/my_resource_lock) and attach the lease to it. This operation is typically done with a condition that the key must not already exist.
Acquire the lock: If the key creation succeeds, the client holds the lock. The lease’s TTL acts as the lock’s expiration. The client must periodically "keep alive" the lease to prevent it from expiring.
Release the lock: To release the lock, the client revokes the lease. If the client crashes or the lease expires, etcd automatically deletes the key associated with that lease.

// Conceptual Go snippet using the etcd client library
import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// ... client initialization (cli is a *clientv3.Client) ...

// tryLock makes one attempt to take the lock, runs work while holding it,
// and reports whether the lock was acquired.
func tryLock(ctx context.Context, cli *clientv3.Client, work func(context.Context)) (bool, error) {
	// Create a lease with a 10-second TTL
	resp, err := cli.Grant(ctx, 10)
	if err != nil {
		return false, err
	}
	leaseID := resp.ID
	// Revoking the lease releases the lock. It also cleans up a lease that
	// was granted but never attached because the lock was already taken.
	// The revoke error is ignored here; the TTL is the fallback.
	defer cli.Revoke(context.Background(), leaseID)

	// The key represents the resource being locked;
	// the value identifies the lock holder.
	lockKey := "/locks/my_resource_lock"
	lockValue := "unique_client_id_123"

	// Create the key with the lease attached, only if it doesn't exist yet.
	// The transaction makes the check and the put atomic.
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Version(lockKey), "=", 0)). // Version 0 means the key doesn't exist
		Then(clientv3.OpPut(lockKey, lockValue, clientv3.WithLease(leaseID))).
		Commit()
	if err != nil {
		return false, err
	}
	if !txnResp.Succeeded {
		// Lock is held by someone else
		return false, nil
	}

	// Lock acquired. Renew the lease in the background until the work is
	// done, and cancel the work's context if a renewal fails.
	workCtx, cancel := context.WithCancel(ctx)
	defer cancel()
	go func() {
		ticker := time.NewTicker(3 * time.Second) // well below the 10-second TTL
		defer ticker.Stop()
		for {
			select {
			case <-workCtx.Done():
				return
			case <-ticker.C:
				if _, err := cli.KeepAliveOnce(workCtx, leaseID); err != nil {
					// The lease may have expired or the client disconnected,
					// so the lock can no longer be assumed to be held.
					cancel()
					return
				}
			}
		}
	}()

	work(workCtx) // ... do work ...
	return true, nil
}

etcd’s lease mechanism is elegant. It leverages Raft for strong consistency, meaning all nodes in the etcd cluster agree on the state of keys and leases. The lease TTL provides automatic cleanup upon client failure. The transaction API (Txn) ensures that checking for the key’s existence and creating it with the lease is an atomic operation. This prevents race conditions where two clients might simultaneously see that the lock isn’t held and both attempt to acquire it.

The most overlooked aspect of etcd leases is their interaction with the KeepAlive mechanism. KeepAlive is designed to prevent leases from expiring, but it is not infallible. Network latency between the client and the etcd cluster, or a long pause in the client process, can delay KeepAlive requests. If no KeepAlive reaches the etcd server before the TTL runs out, the lease expires, etcd deletes the attached key, and the lock is released even though the client intended to hold it longer. The client may not notice this immediately. Your KeepAlive interval therefore needs to be significantly shorter than the lease TTL (for example, a third of it) to leave a buffer against network jitter.
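A rough way to size that buffer is to count how many renewal attempts can reach the server before the lease runs out. The sketch below is a simplified model under stated assumptions: renewals are sent at a fixed interval after the last successful one, each takes a fixed one-way latency to arrive, and the renewalChances helper is hypothetical rather than part of the etcd client.

```go
package main

import "fmt"

// renewalChances returns how many keep-alive attempts can arrive at the server
// strictly before the lease expires. Attempt k is sent at k*intervalMs after
// the last successful renewal and arrives latencyMs later. All values are in
// milliseconds.
func renewalChances(ttlMs, intervalMs, latencyMs int64) int64 {
	budget := ttlMs - latencyMs
	if intervalMs <= 0 || budget <= 0 {
		return 0
	}
	// Largest k with k*intervalMs + latencyMs < ttlMs.
	return (budget - 1) / intervalMs
}

func main() {
	const ttl = 10_000 // 10-second lease

	for _, interval := range []int64{5_000, 3_000} {
		for _, latency := range []int64{0, 1_500} {
			n := renewalChances(ttl, interval, latency)
			fmt.Printf("interval=%dms latency=%dms: %d chance(s) to renew\n",
				interval, latency, n)
		}
	}
}
```

With a 10-second TTL, a 5-second interval leaves a single chance to renew, so one lost or late request costs the lock. A 3-second interval leaves three chances with no latency and two with 1.5 seconds of latency.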