Cassandra’s vnodes (virtual nodes) distribute data across a cluster in a fundamentally different way from the older single-token approach, and understanding that difference is key to scaling and managing your cluster.
Let’s see it in action. Imagine a simple 3-node cluster.
Single Token (Old Way)
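With single tokens, each node is responsible for one contiguous range of the token ring. Real rings are enormous (the default Murmur3Partitioner spans -2^63 to 2^63-1), so to keep the numbers readable, pretend the ring has only 256 tokens, numbered 0-255. The three nodes might then own 0-85, 86-170, and 171-255: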
Node A: 0-85
Node B: 86-170
Node C: 171-255
When data is written, the hash of its partition key determines its token. If that token falls into Node A’s range, Node A is the primary owner.
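The single-token lookup can be sketched in a few lines of Python. This is only an illustration on the toy 0-255 ring from the example; the names SINGLE_TOKEN_RANGES and single_token_owner are made up for this sketch and are not part of Cassandra.

```python
# Toy 0-255 ring from the example above; a real Cassandra ring is far larger.
# SINGLE_TOKEN_RANGES and single_token_owner are hypothetical names.
SINGLE_TOKEN_RANGES = {
    "Node A": range(0, 86),     # tokens 0-85
    "Node B": range(86, 171),   # tokens 86-170
    "Node C": range(171, 256),  # tokens 171-255
}


def single_token_owner(token: int) -> str:
    """Return the node whose contiguous range contains the token."""
    for node, token_range in SINGLE_TOKEN_RANGES.items():
        if token in token_range:
            return node
    raise ValueError(f"token {token} is outside the 0-255 toy ring")


print(single_token_owner(42))   # falls in 0-85
print(single_token_owner(200))  # falls in 171-255
```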
Vnodes (New Way)
With vnodes, each node owns multiple non-contiguous ranges. On the same 256-token toy ring with 3 nodes, setting 10 vnodes per physical node gives 30 vnodes in total, 10 per node:
Node A: [15, 78, 123, 155, 180, 201, 220, 235, 240, 250]
Node B: [5, 30, 90, 140, 160, 190, 205, 225, 230, 245]
Node C: [10, 40, 65, 100, 130, 175, 195, 210, 238, 255]
(These are example token values, not ranges. Each token marks the upper end of a small range that starts just after the previous token on the ring, whichever node owns that previous token.)
When data is written, the hash of its partition key determines its token. If that token falls in the range ending at 15 (here 11-15, since the previous token on the ring is 10), Node A is the primary owner. If another piece of data hashes to 90, Node B owns it.
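Here is a minimal sketch of that lookup, using the example token lists above. It assumes each token is the inclusive upper end of its range and that lookups wrap around the ring; the names VNODE_TOKENS, RING, and vnode_owner are hypothetical and exist only for this illustration.

```python
from bisect import bisect_left

# Example token assignments from the text (toy 0-255 ring).
VNODE_TOKENS = {
    "Node A": [15, 78, 123, 155, 180, 201, 220, 235, 240, 250],
    "Node B": [5, 30, 90, 140, 160, 190, 205, 225, 230, 245],
    "Node C": [10, 40, 65, 100, 130, 175, 195, 210, 238, 255],
}

# Flatten into one sorted ring of (token, node) pairs.
RING = sorted((t, node) for node, tokens in VNODE_TOKENS.items() for t in tokens)
RING_TOKENS = [t for t, _ in RING]


def vnode_owner(token: int) -> str:
    """Return the node owning the first ring token >= the given token.

    Assumption for this sketch: each token is the inclusive upper end
    of its range, and lookups past the last token wrap to the first.
    """
    idx = bisect_left(RING_TOKENS, token) % len(RING)
    return RING[idx][1]


print(vnode_owner(12))  # lands in the range ending at 15
print(vnode_owner(90))  # lands exactly on token 90
```

The binary search over a sorted token list is a design choice for the sketch: it keeps the lookup cheap no matter how many vnodes the ring holds.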
This model is designed to solve the problem of uneven data distribution and to simplify cluster operations. In the single-token world, a new node typically takes over part of one existing node’s contiguous range, so a single neighbor does most of the streaming and the ring is left unbalanced unless you also move tokens on the other nodes or double the cluster size. With vnodes, adding a new node simply means assigning it a new set of tokens scattered around the ring. Each existing node then streams only the small slices corresponding to those tokens, so the work is spread across the cluster instead of concentrated on one node.
The key advantage of vnodes is that they dramatically simplify adding or removing nodes from a cluster. Instead of reassigning huge, contiguous token ranges, you’re shifting ownership of many small, scattered ones. The new node still has to receive its fair share of the data, but that data comes from many nodes in parallel, which means less impact on any single node during rebalancing and a much smoother scaling experience. Vnodes also tend to even out data and load across the cluster, as each node is likely to own tokens from all parts of the token ring.
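To make the difference concrete, here is a small sketch that adds a hypothetical Node D to the toy 0-255 ring and counts how many tokens each existing node gives up. It ignores replication entirely, and the tokens chosen for Node D (42 in the single-token case, ten hand-picked values in the vnode case) are assumptions for illustration, as are the helper names owner_of and tokens_moved.

```python
from bisect import bisect_left
from collections import Counter

RING_SIZE = 256  # toy ring from the example, not a real Cassandra ring


def owner_of(token, ring):
    """ring maps token -> node; the owner is the first ring token >= token."""
    tokens = sorted(ring)
    return ring[tokens[bisect_left(tokens, token) % len(tokens)]]


def tokens_moved(old_ring, new_node, new_tokens):
    """Count, per previous owner, how many tokens change hands."""
    new_ring = dict(old_ring)
    new_ring.update({t: new_node for t in new_tokens})
    moved = Counter()
    for token in range(RING_SIZE):
        before = owner_of(token, old_ring)
        if owner_of(token, new_ring) != before:
            moved[before] += 1
    return dict(moved)


# Single tokens: each node owns the range ending at its one token.
single_ring = {85: "Node A", 170: "Node B", 255: "Node C"}

# Vnodes: the example assignments from earlier in the article.
vnode_assignments = {
    "Node A": [15, 78, 123, 155, 180, 201, 220, 235, 240, 250],
    "Node B": [5, 30, 90, 140, 160, 190, 205, 225, 230, 245],
    "Node C": [10, 40, 65, 100, 130, 175, 195, 210, 238, 255],
}
vnode_ring = {t: node for node, ts in vnode_assignments.items() for t in ts}

# Hypothetical tokens for the new node.
single_moved = tokens_moved(single_ring, "Node D", [42])
vnode_moved = tokens_moved(
    vnode_ring, "Node D", [20, 50, 85, 110, 150, 170, 185, 215, 228, 248]
)

print("single-token donors:", single_moved)
print("vnode donors:       ", vnode_moved)
```

In this toy run, the single-token case takes all 43 moved tokens from Node A, while the vnode case takes 28, 20, and 20 tokens from Nodes A, B, and C respectively. The new node receives data either way; what changes is how many nodes share the streaming work.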
Here’s how you configure vnodes. In your cassandra.yaml file (typically located at /etc/cassandra/cassandra.yaml), you’ll find the num_tokens parameter.
# Number of virtual nodes per physical node.
# num_tokens: 1
To enable vnodes, set num_tokens to a value greater than 1. A common starting point is 16, which is the default shipped with Cassandra 4.0 and later (earlier releases defaulted to 256), but this can be tuned based on your cluster size and data distribution needs. For example, to set 16 vnodes per node:
num_tokens: 16
Important Note: num_tokens only takes effect when a node is first bootstrapped. Changing it on a node that already holds data does not reshuffle that node’s tokens, and Cassandra will generally refuse to start a node whose configured value no longer matches its stored tokens. Getting a different value onto an existing node means rebuilding it or manually resetting its token assignments, which is a complex and generally discouraged operation. The best practice is to decide on your num_tokens value before you start adding nodes to your cluster.
The choice between single tokens and vnodes is largely historical. Vnodes are the default and recommended approach for all modern Cassandra deployments. The primary "pro" of vnodes is operational simplicity and better load balancing. The primary "con" is that it might feel slightly more complex to visualize initially, as each node doesn’t own a single, contiguous block of tokens.
If you’re running an older cluster or have specific reasons for using single tokens, the "pro" is that it’s simpler to reason about the exact token ranges each node is responsible for. The "con" is the pain of rebalancing and the potential for uneven data distribution.
The migration path from single tokens to vnodes is non-trivial. It typically involves bootstrapping new nodes with vnodes enabled, streaming data from the old nodes to the new ones, and then decommissioning the old nodes. This is a careful, phased process.
The most surprising thing about vnodes is that, although they do away with one contiguous range per node, the underlying mechanism is unchanged: data is still hashed to a token. A "vnode" is just a token that a node owns, together with the small range ending at that token. When data is written, the node that receives the request (the coordinator) hashes the partition key to a token and finds which node owns the range containing it. If the coordinator is not itself a replica for that range, it forwards the request to the nodes that are.
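The whole path from partition key to owning node can be sketched as follows. This is a simplification under loud assumptions: MD5 from the standard library stands in for Cassandra’s Murmur3 hash, the ring is the toy 0-255 ring from the examples, replication is ignored so only the primary owner is considered, and token_for_key and route are hypothetical names.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 256  # toy ring; a real ring is vastly larger

VNODE_TOKENS = {
    "Node A": [15, 78, 123, 155, 180, 201, 220, 235, 240, 250],
    "Node B": [5, 30, 90, 140, 160, 190, 205, 225, 230, 245],
    "Node C": [10, 40, 65, 100, 130, 175, 195, 210, 238, 255],
}
RING = sorted((t, node) for node, tokens in VNODE_TOKENS.items() for t in tokens)
RING_TOKENS = [t for t, _ in RING]


def token_for_key(partition_key: str) -> int:
    """Hash a partition key onto the toy ring.

    Assumption: MD5 is only a stand-in here; it is not the hash
    Cassandra's default partitioner uses.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def route(partition_key: str, coordinator: str):
    """Return (token, primary owner, whether the coordinator must forward)."""
    token = token_for_key(partition_key)
    owner = RING[bisect_left(RING_TOKENS, token) % len(RING)][1]
    return token, owner, owner != coordinator


token, owner, forwarded = route("user:42", coordinator="Node A")
print(f"token={token} owner={owner} forwarded={forwarded}")
```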
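When you run nodetool status on a cluster with vnodes, the Tokens column shows how many tokens each node owns rather than listing them. To see the individual token assignments, run nodetool ring. With single tokens, you’d see one entry per node. With vnodes, you’ll see many entries for each node, reflecting its ownership of multiple virtual nodes.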
The next thing you’ll likely encounter is understanding how replication factor and consistency levels interact with vnode distribution for read and write operations.