CockroachDB doesn’t just replicate data; it uses the Raft consensus algorithm to ensure that every replica of a piece of data agrees on its state, even when the network is unreliable or nodes fail.
Let’s see this in action with a simple transaction. Imagine we have three nodes (1, 2, 3) and a key-value pair {"key": "value"}. This pair is replicated across all three nodes. Now, we want to update it to {"key": "new_value"}.
- Client sends write request: A client sends a `PUT` request for `key` with `new_value` to any of the nodes, say Node 1.
- Leader election (if needed): For the range containing `key`, one node is designated the Raft leader. If Node 1 isn’t the leader, it forwards the request to the current leader, say Node 2.
- Leader proposes the change: Node 2, as the leader, proposes the `PUT` operation to its Raft followers (Node 1 and Node 3). This proposal is an AppendEntries RPC containing the new log entry.
- Followers acknowledge: Nodes 1 and 3 receive the AppendEntries RPC. If they can append the entry to their log, they send an acknowledgment (a success response) back to Node 2.
- Leader commits: Once Node 2 receives acknowledgments from a majority of the nodes (itself + at least one follower), it considers the log entry committed. This means the change is now permanent and will eventually be applied by all nodes.
- Leader applies the change: Node 2 applies the `PUT` operation to its local key-value store, changing `value` to `new_value`.
- Followers apply the change: Nodes 1 and 3, upon receiving confirmation that the entry is committed (often piggybacked on subsequent RPCs or via a separate CommitIndex notification), also apply the change to their local stores.
- Leader responds to client: Node 2 sends a success response back to the client.
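This process ensures that the change survives even if Node 2 crashes immediately after committing but before applying. An entry is committed only once a majority has stored it, so at least one of Nodes 1 and 3 holds it in its log. Raft’s election rules only let a node whose log contains every committed entry become the new leader, so the change will still be applied to all replicas.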
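The write path above can be sketched as a small in-memory simulation. This is a toy model, not CockroachDB code: the `Replica` class, the `leader_put` function, and the `up` flag are hypothetical names introduced for illustration, and real Raft details such as terms, retries, and RPC transport are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: an ordered log plus the key-value store it feeds."""
    name: str
    log: list = field(default_factory=list)    # ordered (op, key, value) entries
    store: dict = field(default_factory=dict)  # applied key-value state
    applied: int = 0                           # number of log entries applied so far
    up: bool = True                            # False simulates a crashed or unreachable node

    def append(self, entry) -> bool:
        """Follower side of a proposal: append and acknowledge if reachable."""
        if not self.up:
            return False
        self.log.append(entry)
        return True

    def apply_up_to(self, commit_index: int) -> None:
        """Apply committed entries, in log order, to the local store."""
        while self.applied < min(commit_index, len(self.log)):
            op, key, value = self.log[self.applied]
            if op == "PUT":
                self.store[key] = value
            self.applied += 1


def leader_put(leader, followers, key, value) -> bool:
    """Propose a PUT; commit and apply only if a majority acknowledges."""
    entry = ("PUT", key, value)
    leader.log.append(entry)
    acks = 1 + sum(f.append(entry) for f in followers)  # the leader counts itself
    majority = (1 + len(followers)) // 2 + 1
    if acks < majority:
        return False  # not committed; a real leader would keep retrying
    commit_index = len(leader.log)
    leader.apply_up_to(commit_index)
    for f in followers:
        if f.up:
            f.apply_up_to(commit_index)
    return True


# Three replicas start with {"key": "value"}; Node 3 is unreachable.
n1 = Replica("n1", store={"key": "value"})
n2 = Replica("n2", store={"key": "value"})
n3 = Replica("n3", store={"key": "value"}, up=False)

committed = leader_put(n2, [n1, n3], "key", "new_value")
print(committed, n1.store, n3.store)
```

With Node 3 down, the write still commits because Nodes 1 and 2 form a majority; Node 3 keeps the old value until it catches up.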
CockroachDB relies on Raft to replicate every write, whether it comes from an ordinary transaction or a schema change. The core idea is that each range (a contiguous subset of keys) in CockroachDB is managed by its own Raft group, and each replica of a range is a member of that group. The Raft leader for a range proposes changes to the group and ensures they are agreed upon and committed by a majority of the replicas before success is acknowledged to the client. (Strictly speaking, CockroachDB routes client requests to the range’s leaseholder, which is normally the same replica as the Raft leader.) This mechanism provides strong consistency, meaning all clients see the same data in the same order, regardless of which node they communicate with.
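To make "a contiguous subset of keys" concrete, here is a minimal sketch of mapping a key to the range that owns it. The boundaries and range IDs are made-up values for illustration, and `range_for_key` is a hypothetical helper, not a CockroachDB API.

```python
import bisect

# Hypothetical range boundaries: range i covers [start_key_i, start_key_{i+1}).
RANGE_START_KEYS = ["", "g", "p"]   # sorted; "" stands for the minimum key
RANGE_IDS = ["r1", "r2", "r3"]      # each ID stands for one Raft group


def range_for_key(key: str) -> str:
    """Return the ID of the range (and thus the Raft group) that owns key."""
    idx = bisect.bisect_right(RANGE_START_KEYS, key) - 1
    return RANGE_IDS[idx]


print(range_for_key("apple"))  # r1
print(range_for_key("g"))      # r2 (start keys are inclusive)
print(range_for_key("zebra"))  # r3
```

Each write is then handled by the Raft group of exactly one range, which is why a transaction touching keys in `r1` and `r3` involves two independent groups.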
The crucial part is that Raft doesn’t just replicate data; it replicates log entries. A log entry represents a command to be executed (like a PUT or DELETE). The Raft algorithm guarantees that once a log entry is committed, it is durable: it will be present in the log of every future leader, it will eventually reach every replica that stays in the group, and all replicas will apply committed entries in the same order. This ordered application is what guarantees consistency.
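A short sketch shows why the order of application matters. The `apply_log` function is a hypothetical stand-in for a replica’s state machine; it is not taken from CockroachDB.

```python
def apply_log(log):
    """Replay a list of (op, key, value) commands into a fresh key-value store."""
    store = {}
    for op, key, value in log:
        if op == "PUT":
            store[key] = value
        elif op == "DELETE":
            store.pop(key, None)
    return store


log = [
    ("PUT", "key", "value"),
    ("DELETE", "key", None),
    ("PUT", "key", "new_value"),
]

replica_a = apply_log(log)
replica_b = apply_log(log)
reordered = apply_log([log[0], log[2], log[1]])  # same commands, different order

print(replica_a == replica_b)  # True
print(replica_a, reordered)    # {'key': 'new_value'} {}
```

Two replicas that replay the same log end in the same state, while the same commands in a different order produce a different result. That is why Raft agrees on the sequence of commands and not only on the set of commands.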
Leadership is maintained through heartbeats. If followers stop hearing from the leader for longer than an election timeout, one of them becomes a candidate and requests votes from the others. A replica grants its vote only if the candidate’s log is at least as up-to-date as its own, so a candidate can win a majority only if its log contains every committed entry. This election process, along with the log replication mechanism, is what makes the system fault-tolerant.
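The "up-to-date log" comparison can be sketched as follows, using the rule from the Raft paper: compare the term of the last entry first, then the index. The function names and the `(last_term, last_index)` tuples are illustrative assumptions, and the sketch ignores the election term and the one-vote-per-term bookkeeping that real Raft also requires.

```python
def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


def wins_election(candidate, reachable_peers, cluster_size):
    """Count votes for a candidate; each log is a (last_term, last_index) pair."""
    votes = 1  # a candidate votes for itself
    for peer in reachable_peers:
        if log_is_up_to_date(candidate[0], candidate[1], peer[0], peer[1]):
            votes += 1
    return votes >= cluster_size // 2 + 1


# Node 2 (the old leader) has failed. Node 1 holds 5 entries, Node 3 only 4.
node1, node3 = (2, 5), (2, 4)

print(wins_election(node1, [node3], cluster_size=3))  # True
print(wins_election(node3, [node1], cluster_size=3))  # False
```

Node 3 cannot win because Node 1 refuses to vote for a candidate with a shorter log, so any entry that only Node 1 and the failed leader had stored is not lost.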
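At the protocol level, Raft traffic consists of two main message types: AppendEntries RPCs (the leader sending log entries to followers) and RequestVote RPCs (used during leader election). These are the names from the Raft paper; the etcd-derived Raft library that CockroachDB builds on calls them MsgApp and MsgVote. The AppendEntries RPCs are the workhorses, carrying new commands and also serving as heartbeats to maintain leadership. A follower acknowledges an entry with a success response, and the leader records the highest index known to be stored on each follower as that follower’s matchIndex. The leader’s commitIndex is then the highest index that a majority of the group (the leader included) has stored, with the additional Raft rule that the entry at that index must come from the leader’s current term.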
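The relationship between matchIndex and commitIndex can be written in a few lines. This is a simplified sketch: `commit_index` is a hypothetical helper, and it omits the Raft rule that a leader only commits entries from its current term by counting replicas.

```python
def commit_index(match_index, leader_last_index):
    """Highest log index stored on a majority of the group, leader included.

    match_index maps each follower's name to the highest index it has
    acknowledged.
    """
    indexes = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th highest index is stored on at least `majority` replicas.
    return indexes[majority - 1]


# Three replicas: the leader is at index 8, followers have acknowledged 7 and 4.
print(commit_index({"n1": 7, "n3": 4}, leader_last_index=8))  # 7

# Five replicas: indexes 10, 9, 6, 5, 2, so three replicas hold index 6.
print(commit_index({"a": 9, "b": 6, "c": 5, "d": 2}, leader_last_index=10))  # 6
```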
A common point of confusion is how CockroachDB handles transactions that span multiple ranges. For these distributed transactions, CockroachDB layers an atomic commit protocol, similar in spirit to two-phase commit (2PC), on top of Raft. Each range involved in the transaction has its own Raft group. Writes are first staged on their ranges as provisional values (write intents), and the transaction’s commit-or-abort status is kept in a transaction record that itself lives in a range. Because both the intents and the record are replicated through Raft, the commit decision survives node failures. Raft therefore operates at the range level only: the transaction coordinator (which might be one of the nodes involved in the transaction) is not a separate Raft group, but drives the commit by writing to Raft-replicated ranges.
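The general shape of an atomic commit across ranges can be illustrated with a textbook two-phase commit. This is deliberately not CockroachDB’s actual protocol: `RangeParticipant` and `run_transaction` are hypothetical names, and in a real system every `prepare`, `commit`, and `abort` step would itself be a write replicated through that range’s Raft group.

```python
class RangeParticipant:
    """Stand-in for one range (one Raft group) taking part in a transaction."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare  # False simulates a conflict or failure
        self.store = {}                 # committed key-value data
        self.pending = {}               # txn_id -> staged (provisional) writes

    def prepare(self, txn_id, writes):
        if not self.can_prepare:
            return False
        self.pending[txn_id] = dict(writes)
        return True

    def commit(self, txn_id):
        self.store.update(self.pending.pop(txn_id, {}))

    def abort(self, txn_id):
        self.pending.pop(txn_id, None)


def run_transaction(txn_id, writes_by_range):
    """writes_by_range is a list of (participant, writes) pairs."""
    prepared = []
    for participant, writes in writes_by_range:
        if not participant.prepare(txn_id, writes):
            for p in prepared:          # phase 1 failed: undo staged writes
                p.abort(txn_id)
            return "ABORTED"
        prepared.append(participant)
    for p in prepared:                  # phase 2: make staged writes visible
        p.commit(txn_id)
    return "COMMITTED"


r1, r2 = RangeParticipant("r1"), RangeParticipant("r2")
ok = run_transaction("t1", [(r1, {"a": 1}), (r2, {"z": 26})])

r3 = RangeParticipant("r3", can_prepare=False)
failed = run_transaction("t2", [(r1, {"a": 2}), (r3, {"q": 17})])

print(ok, failed, r1.store, r2.store, r3.store)
```

The second transaction leaves `r1` unchanged at `{"a": 1}`: either every range applies its writes or none does.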
The Raft log is not infinite. Once log entries have been applied to a range’s key-value data, CockroachDB can truncate the older part of the log, because the applied state already reflects those entries. This prevents the Raft logs from growing indefinitely and consuming excessive disk space. A snapshot captures the state of the range (the key-value data itself) at a given log index, so entries at or before that index are no longer needed for state reconstruction. A replica that falls so far behind that the entries it needs have been truncated is caught up with a snapshot instead of a log replay.
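A minimal sketch of folding a log prefix into a snapshot and dropping it. The `RangeLog` class and its fields are hypothetical and keep everything in memory; they only illustrate that state reconstructed from "snapshot plus remaining log" is unchanged by truncation.

```python
class RangeLog:
    """Toy Raft log for one range, with snapshot-based truncation."""

    def __init__(self):
        self.entries = []        # (op, key, value) entries after the snapshot
        self.first_index = 1     # log index of entries[0]
        self.snapshot = {}       # key-value state as of snapshot_index
        self.snapshot_index = 0  # last log index covered by the snapshot

    @staticmethod
    def _apply(store, entry):
        op, key, value = entry
        if op == "PUT":
            store[key] = value
        elif op == "DELETE":
            store.pop(key, None)

    def append(self, op, key, value=None):
        self.entries.append((op, key, value))

    def last_index(self):
        return self.first_index + len(self.entries) - 1

    def compact(self, up_to_index):
        """Fold entries up to up_to_index into the snapshot, then drop them."""
        if up_to_index > self.last_index():
            raise ValueError("cannot compact past the end of the log")
        count = up_to_index - self.first_index + 1
        if count <= 0:
            return  # already covered by the snapshot
        for entry in self.entries[:count]:
            self._apply(self.snapshot, entry)
        self.entries = self.entries[count:]
        self.first_index = up_to_index + 1
        self.snapshot_index = up_to_index

    def state(self):
        """Rebuild the current state from the snapshot plus the remaining log."""
        store = dict(self.snapshot)
        for entry in self.entries:
            self._apply(store, entry)
        return store


rlog = RangeLog()
rlog.append("PUT", "key", "value")
rlog.append("PUT", "other", "x")
rlog.append("PUT", "key", "new_value")

before = rlog.state()
rlog.compact(2)
after = rlog.state()

print(before == after, len(rlog.entries), rlog.first_index)  # True 1 3
```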
The next logical step after understanding Raft’s role in consistency is to explore how CockroachDB optimizes Raft for performance in a distributed environment, such as batching log entries and using optimized network protocols.