ZooKeeper’s ZAB protocol is how a cluster of ZooKeeper servers agree on the order of operations, ensuring consistency across all nodes even if some fail.

Let’s watch it in action. Imagine a ZooKeeper ensemble of three nodes: zk1, zk2, and zk3. A client connects to zk1 and requests to create a znode at /mydata.

# Client connects to zk1
> create /mydata "hello"

zk1, if it’s the leader, proposes this write operation to its followers (zk2 and zk3). The proposal includes a transaction ID (zxid) and a sequence number within that epoch.

zk1 sends a PROPOSE message to zk2 and zk3: PROPOSE(zxid=256, epoch=1, data=/mydata, value="hello").

zk2 and zk3 receive this PROPOSE message. If they agree on the leader’s epoch and that this is a valid transaction, they acknowledge it by sending a ACK back to zk1.

zk2: ACK(zxid=256, epoch=1) zk3: ACK(zxid=256, epoch=1)

Once zk1 receives acknowledgments from a majority of the ensemble (in this case, zk2 and zk3), it commits the transaction. It then broadcasts a COMMIT message to zk2 and zk3.

zk1: COMMIT(zxid=256, epoch=1) zk2: COMMIT(zxid=256, epoch=1) zk3: COMMIT(zxid=256, epoch=1)

Upon receiving the COMMIT message, zk1, zk2, and zk3 all apply the change to their local state. The znode /mydata with value "hello" now exists consistently across the ensemble. The client receives a success response.

This process is the core of ZAB: the leader proposes, followers acknowledge, and upon majority acknowledgment, the leader commits and followers apply. It’s a two-phase commit-like mechanism optimized for distributed systems.

What problem does this solve? In distributed systems, multiple nodes need to agree on a shared state. Without ZAB, if zk1 crashed after receiving the client request but before replicating it, the creation of /mydata would be lost. Other clients connecting to zk2 or zk3 wouldn’t see /mydata, leading to an inconsistent view. ZAB ensures that a committed transaction is durable and visible to all clients interacting with any server in the ensemble, regardless of which server initially received the request.

The leader election process is also crucial. If the leader fails, the remaining nodes engage in a new election to choose a new leader. This new leader will be the one that has the most up-to-date transaction log, ensuring that no committed transactions are lost. ZAB uses a form of leader-based consensus where the leader drives the transaction ordering.

The zxid (ZooKeeper Transaction ID) is a 64-bit number. The upper 32 bits represent the epoch (a period of leadership), and the lower 32 bits represent the transaction count within that epoch. This structure ensures that transactions from older epochs are never considered newer than transactions from the current epoch. When a new leader is elected, it increments the epoch number.

When a follower receives a PROPOSE message, it checks if the zxid’s epoch matches its understanding of the current leadership epoch. If the follower’s epoch is older, it will request the missing transactions from the leader (or another follower that has them) to catch up before it can acknowledge the proposal. This log synchronization is vital for maintaining consistency.

The heart of ZAB’s fault tolerance lies in its ability to recover. If a follower crashes, when it restarts, it will sync its transaction log with the current leader. If the leader crashes, the ensemble will elect a new leader. This new leader will be the one that has committed the most recent transactions. Any transactions proposed by the old leader but not yet committed by a majority will be discarded. Any transactions committed by the old leader will be re-proposed by the new leader to the rest of the ensemble.

The leader’s role is to maintain a proposal queue and ensure that all transactions are ordered and replicated. It doesn’t simply forward client requests; it assigns a unique zxid and sequence number to each transaction and orchestrates the commit process. This strict ordering is what guarantees consistency.

A common point of confusion is that ZooKeeper’s read operations are not typically part of the ZAB consensus protocol. Reads can be served by any server, including followers, without waiting for a majority acknowledgment. This is because reads only need to reflect the latest committed state, not necessarily the absolute latest proposed transaction. However, reads are strongly consistent after a write has been committed and propagated.

The sequence number within an epoch is what ensures that within a single leader’s tenure, transactions are applied in the exact order they were proposed. If a leader were to propose transaction A and then transaction B, but a follower only received and acknowledged A, and then the leader crashed and a new leader proposed C, ZAB’s recovery mechanism would ensure that C is not committed until A and B are accounted for, maintaining the strict ordering.

The next concept you’ll likely encounter is how ZooKeeper uses these ordered transactions to manage its hierarchical data structure (znodes) and the watch mechanism for clients.

Want structured learning?

Take the full Distributed Systems course →