Cassandra’s "read repair" is a background process that secretly fixes data inconsistencies, often only when you ask for the data.
Let’s watch it in action. Imagine we have a simple keyspace with a SimpleStrategy and a replication factor of 3.
CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
USE mykeyspace;
CREATE TABLE users (
user_id UUID PRIMARY KEY,
username text,
email text
);
We’ll insert a record and then, to simulate an inconsistency, manually send a write to only two out of the three replicas.
-- Initial insert
INSERT INTO users (user_id, username, email) VALUES (uuid(), 'alice', 'alice@example.com');
Now, let’s assume our cluster has nodes node1, node2, and node3. The coordinator for the write might send it to node1 and node2, but node3 could be temporarily unavailable or experience a network hiccup. After the write, node1 and node2 have the latest version of the data, but node3 has an older version or no version at all.
When a client reads this data, Cassandra’s coordinator receives the read request. By default, if the consistency level is QUORUM or higher (e.g., LOCAL_QUORUM), the coordinator waits for a majority of replicas to respond. Let’s say we read with LOCAL_QUORUM (which, with SimpleStrategy on a single DC, is the same as QUORUM). The coordinator sends the read request to all replicas.
If node1 and node2 respond with the latest data, and node3 responds with stale data or times out, the coordinator will compare the responses. It sees that node1 and node2 agree. If the number of agreeing replicas meets the QUORUM requirement, the coordinator returns the data from node1 (or node2) to the client.
Crucially, at this point, the coordinator also identifies the inconsistency. It knows node3 has different (or no) data. This is where read repair kicks in. The coordinator will asynchronously send a hinted handoff message to node3, containing the latest version of the data it received from node1 and node2. This happens in the background and doesn’t block the client’s read response. node3 will then update its data.
The "lazy" part is that this repair only happens when a read request happens to involve the inconsistent replica and the consistency level is high enough to detect the discrepancy. If no reads ever touch node3 for that particular piece of data, or if reads are always at a low consistency level (like ONE), the inconsistency might persist for a long time.
The problem this solves is maintaining data consistency across replicas without requiring constant, resource-intensive background synchronization for every single piece of data. It balances consistency guarantees with performance by doing the heavy lifting only when data is actively being accessed and the consistency level demands it.
Internal Mechanics:
When a coordinator receives multiple responses for a read, it determines the "latest" version based on timestamps. Each piece of data in Cassandra has a timestamp associated with it, generated when it was written. The coordinator will compare these timestamps across all received replicas. If a replica returns data with an older timestamp than the majority, or no data at all, it’s marked as stale. The coordinator then sends the most recent version (identified by the highest timestamp) to the stale replica(s) as a background repair operation. This is why ensuring accurate clock synchronization across all your Cassandra nodes is absolutely critical; otherwise, timestamp-based conflict resolution can go awry.
The levers you control:
read_request_timeout_in_ms: This is the time the coordinator waits for replicas to respond to a read. If a replica doesn’t respond within this time, it’s considered down for that read operation, and read repair might be triggered if other replicas confirm an inconsistency.write_request_timeout_in_ms: Similarly, if a write to a replica times out, it can lead to inconsistencies that read repair might later fix.- Consistency Level: As discussed,
ONEorTWOwon’t trigger read repair.QUORUM,LOCAL_QUORUM,ALLare the levels that enable it. hinted_handoff_enabled: This server-side setting, usually enabled by default, allows nodes to store temporary hints about writes that failed for other nodes. When the failed node comes back online, these hints are delivered. Read repair leverages this mechanism to deliver the corrected data.max_hint_window_in_ms: This defines how long a node will store hints. If a node is down for longer than this, hints for it are discarded, and it might miss updates that read repair would have otherwise delivered.
One thing most people don’t realize is that read repair is entirely asynchronous from the client’s perspective. The coordinator sends the corrected data to the "stale" replica after it has already sent the successful read response back to the client. This means the client gets its data quickly, but the consistency fix is deferred. If the cluster is heavily loaded or network issues are persistent, these background repairs might not keep up, leading to eventual consistency rather than strong consistency, especially if max_hint_window_in_ms is too short and nodes are frequently offline.
The next thing you’ll likely bump into is understanding how anti-entropy mechanisms like nodetool repair complement lazy read repair for more proactive consistency management.