Cassandra doesn’t actually replicate data across datacenters; it replicates data centers across datacenters, and your data just happens to ride along.
Let’s see this in action. Imagine you have two datacenters, DC1 and DC2. You’ve got a keyspace, my_keyspace, with a replication strategy that looks like this:
CREATE KEYSPACE my_keyspace
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 3
  };
This tells Cassandra to maintain 3 replicas in DC1 and 3 replicas in DC2 for every table in my_keyspace. When you write a row to my_keyspace, say with a partition key of 123, Cassandra hashes that key to a token and uses the token ring and the per-datacenter replication factors to work out which nodes should hold its replicas. The coordinator then sends the write to every one of those replicas, in both datacenters, not only to the ones in the datacenter that received the request.
The magic of NetworkTopologyStrategy is how it chooses those nodes. It consults its internal topology information, which it learns from gossip, to know which nodes belong to which datacenters. For our my_keyspace example, it will pick 3 nodes in DC1 and 3 nodes in DC2 that own the token range for key 123. The write then propagates to those specific nodes.
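The selection described above can be sketched as a clockwise walk around the ring. This is a toy model, not Cassandra's implementation: the ring contents, node names, and the function pick_replicas are all made up for illustration, each node owns a single token, and rack awareness is ignored.

```python
from bisect import bisect_left

# Toy ring of (token, node, datacenter). Tokens and node names are invented.
RING = sorted([
    (0, "dc1-a", "DC1"), (10, "dc2-a", "DC2"),
    (20, "dc1-b", "DC1"), (30, "dc2-b", "DC2"),
    (40, "dc1-c", "DC1"), (50, "dc2-c", "DC2"),
    (60, "dc1-d", "DC1"), (70, "dc2-d", "DC2"),
])


def pick_replicas(key_token, replication):
    """Walk the ring clockwise from the key's token, filling each DC's quota.

    `replication` maps datacenter name -> replication factor, mirroring
    the keyspace's NetworkTopologyStrategy options.
    """
    tokens = [token for token, _, _ in RING]
    # The first node whose token is >= the key's token owns the key;
    # the modulo wraps around to the start of the ring.
    start = bisect_left(tokens, key_token) % len(RING)
    chosen = {dc: [] for dc in replication}
    for step in range(len(RING)):
        _, node, dc = RING[(start + step) % len(RING)]
        # Skip nodes in DCs that are not replicated to or are already full.
        if dc in chosen and len(chosen[dc]) < replication[dc]:
            chosen[dc].append(node)
    return chosen


if __name__ == "__main__":
    print(pick_replicas(25, {"DC1": 3, "DC2": 3}))
```

Each datacenter fills its own quota independently, so the walk keeps going past nodes in a datacenter that is already full until every quota is met or the ring is exhausted.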
The fundamental problem this solves is availability and disaster recovery. If an entire datacenter goes offline, your data remains accessible from the remaining datacenter(s) because copies of the data (and the nodes that serve them) exist elsewhere. It’s not just about having the data; it’s about having the computational capacity to serve that data from a separate physical location.
Internally, Cassandra uses consistent hashing: the partitioner hashes each partition key to a token. The NetworkTopologyStrategy then overlays datacenter awareness onto this. When a write comes in, the coordinator node (whichever node receives the client request) determines the token for the key. It then consults the cluster metadata to identify the nodes that are responsible for that token in each datacenter named in the replication strategy. The replication factor dictates how many replicas each datacenter holds; the consistency level of the request (for example LOCAL_QUORUM or EACH_QUORUM) dictates how many of those replicas must acknowledge the write before the coordinator reports success.
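How many replicas must acknowledge a write is governed by the consistency level of the request, while the replication factor only fixes how many replicas exist. The arithmetic can be sketched as below; the function required_acks is a hypothetical helper written for this example, and only a handful of Cassandra's consistency levels are covered.

```python
def quorum(replication_factor):
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def required_acks(consistency_level, replication, local_dc):
    """Return the acknowledgements a coordinator waits for.

    `replication` maps datacenter name -> replication factor.
    EACH_QUORUM returns a per-datacenter dict; other levels return a total.
    """
    if consistency_level == "ONE":
        return 1
    if consistency_level == "LOCAL_QUORUM":
        # Majority of the replicas in the coordinator's own datacenter.
        return quorum(replication[local_dc])
    if consistency_level == "QUORUM":
        # Majority of all replicas across every datacenter.
        return quorum(sum(replication.values()))
    if consistency_level == "EACH_QUORUM":
        # Majority in every datacenter separately.
        return {dc: quorum(rf) for dc, rf in replication.items()}
    if consistency_level == "ALL":
        return sum(replication.values())
    raise ValueError(f"unsupported consistency level: {consistency_level}")


if __name__ == "__main__":
    rf = {"DC1": 3, "DC2": 3}
    for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"):
        print(level, required_acks(level, rf, local_dc="DC1"))
```

With 3 replicas in each of two datacenters, LOCAL_QUORUM needs 2 acknowledgements from the local datacenter, while QUORUM needs 4 out of the 6 replicas overall, which forces at least one acknowledgement to cross the datacenter link.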
A common misconception is that a write lands in one datacenter first and is copied to the others afterwards by some separate, datacenter-level process. It’s not. The NetworkTopologyStrategy is set when the keyspace is created (and can be changed later with ALTER KEYSPACE), and it dictates the target distribution of replicas for every write. Cassandra’s gossip protocol constantly informs nodes about the location of other nodes and their respective datacenters. This information is crucial for the coordinator to select the correct set of replica nodes across datacenters for any given piece of data.
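The Hinted Handoff mechanism also comes into play here. If a replica node is temporarily unavailable, the coordinator stores a "hint" locally, on its own disk, recording the write that the replica missed. When gossip reports that the node is back online, the coordinator replays the hint to it. Hints are only generated for a limited window after a node goes down (three hours by default), so longer outages have to be fixed by repair. Note that the hint lives on the coordinator, not on a neighbour of the intended replica: if the coordinator is in a different datacenter from the replica that was down, the replay crosses the datacenter link. Datacenter awareness is limited to configuration, where cassandra.yaml lets you disable hinted handoff for specific datacenters via hinted_handoff_disabled_datacenters.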
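A minimal sketch of hinted handoff, assuming hints are kept by the coordinator that handled the write. The Coordinator class and its methods are invented for illustration; the hint window, hint expiry, and on-disk storage are deliberately left out.

```python
class Coordinator:
    """Toy coordinator that keeps hints for replicas that were down."""

    def __init__(self):
        # replica name -> mutations that replica has not yet received
        self.hints = {}

    def write(self, mutation, replicas, is_up):
        """Send `mutation` to live replicas and record a hint for the rest.

        `is_up` is a callable standing in for gossip's view of node liveness.
        Returns the replicas that received the write directly.
        """
        delivered = []
        for replica in replicas:
            if is_up(replica):
                delivered.append(replica)
            else:
                self.hints.setdefault(replica, []).append(mutation)
        return delivered

    def replay(self, replica):
        """Hand over (and forget) every hint stored for `replica`."""
        return self.hints.pop(replica, [])


if __name__ == "__main__":
    coordinator = Coordinator()
    # "dc2-b" is down at write time, so the coordinator keeps a hint for it.
    coordinator.write("row 123", ["dc1-a", "dc2-a", "dc2-b"],
                      is_up=lambda node: node != "dc2-b")
    # Later, gossip reports the node as up and the hint is replayed.
    print(coordinator.replay("dc2-b"))
```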
The placement of your data isn’t determined by the NetworkTopologyStrategy alone; it’s an interplay between the hashing of your partition key, the token distribution across all nodes, and the datacenter configuration you’ve set. A replication factor of 3 in DC1 means Cassandra will place 3 replicas on 3 distinct nodes within DC1, chosen by walking the ring from the key’s token. If DC1 has fewer nodes than its replication factor, Cassandra cannot place the missing replicas: recent versions warn you when you create or alter the keyspace, that datacenter ends up holding fewer copies than you asked for, and requests at a consistency level that needs more replicas than exist will fail.
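The shortfall case is simple arithmetic, sketched below under the assumption that each node holds at most one replica of a given key and that quorum is computed from the configured replication factor. Both helper functions are hypothetical names introduced for this example.

```python
def effective_replicas(replication, nodes_per_dc):
    """Replicas a datacenter can actually hold: one per node, capped at RF."""
    return {dc: min(rf, nodes_per_dc.get(dc, 0))
            for dc, rf in replication.items()}


def local_quorum_reachable(replication_factor, live_nodes):
    """True if enough replicas can exist to form a quorum of the configured RF."""
    needed = replication_factor // 2 + 1
    return min(replication_factor, live_nodes) >= needed


if __name__ == "__main__":
    # DC1 is configured for 3 replicas but only has 2 nodes.
    print(effective_replicas({"DC1": 3, "DC2": 3}, {"DC1": 2, "DC2": 5}))
    print(local_quorum_reachable(3, 2))  # quorum of 2 is still possible
    print(local_quorum_reachable(3, 1))  # quorum of 2 is not possible
```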
The next concept to wrap your head around is how repair operations work across datacenters and the implications for data consistency.