Cassandra secondary indexes are not a magic bullet for all query needs; in fact, they often introduce more problems than they solve when used without understanding their fundamental limitations.
Let’s see how this plays out in practice. Imagine you have a users table, and you want to query users by their email address.
CREATE TABLE users (
user_id UUID,
username text,
email text,
PRIMARY KEY (user_id)
);
CREATE INDEX ON users (email);
Now, if you try to query by email:
SELECT * FROM users WHERE email = 'alice@example.com';
This query works, but it’s not what you might expect. Cassandra doesn’t magically know which user_id corresponds to that email. Instead, it creates an internal, hidden table (often called a "shadow table" or "index table") that maps the indexed column’s value (alice@example.com) to the primary key of the row (user_id).
When you execute SELECT * FROM users WHERE email = 'alice@example.com', Cassandra first consults this hidden index table to find the user_id associated with that email. Once it has the user_id, it then performs a standard lookup on the users table using that user_id as the primary key.
The problem is, this approach scales poorly and has significant performance implications, especially as your data grows or when the indexed column has low cardinality (meaning many rows share the same value).
Here’s the core issue: Cassandra secondary indexes are essentially a "lookup" mechanism, not a true filtering mechanism at the storage level. Each query against a secondary index triggers two lookups: one on the index table and then another on the main data table. This is computationally more expensive than a direct primary key lookup.
Furthermore, you cannot use secondary indexes on columns that are not part of the primary key unless you are querying for equality (=). You can’t use range queries (>, <, >=, <=) or LIKE clauses on indexed columns. If you try, Cassandra will reject the query, often with an error like InvalidRequest: Unsupported operation: ... or Cannot execute this query as it is not supported by the secondary index..
The most significant limitation, and the one that trips up most users, is performance degradation with low cardinality columns. If you have an index on a column like status where the possible values are 'active', 'inactive', 'pending' (only a few distinct values), and you have millions of rows, a query like SELECT * FROM users WHERE status = 'active' will force Cassandra to:
- Look up 'active' in the index table.
- Retrieve all
user_ids associated with 'active'. - Perform a separate read for each of those
user_ids on the mainuserstable.
This can lead to massive read amplification, where a single logical query results in hundreds or thousands of actual disk reads, bringing your cluster to its knees. This is why you’ll often see recommendations to avoid secondary indexes on columns with fewer than 100 distinct values relative to the total number of rows.
Another critical limitation is scalability with data volume. As the number of rows increases, the index table also grows. If the indexed column has high cardinality (many unique values), the index table can become very large, leading to longer query times as Cassandra has to scan more entries in the index.
"Which columns can I index?" is a common question. You can only index columns that are not part of the primary key. If a column is part of the primary key, you can query it directly as part of your primary key lookup. You can also only query by equality. Trying to query using WHERE indexed_column > 'value' will fail.
"Can I use a composite index?" No, Cassandra does not support composite secondary indexes. You can only create an index on a single column at a time. If you need to query on multiple columns, you’ll likely need to denormalize your data by creating multiple tables, each with a different primary key structure optimized for a specific query pattern.
"What if I need to query by multiple columns?" This is where denormalization is key. Instead of relying on secondary indexes, create separate tables. For example, if you need to query users by email and also by username, you’d create two tables:
CREATE TABLE users_by_email (
email text,
user_id UUID,
username text,
PRIMARY KEY (email, user_id) // Optimized for email lookups
);
CREATE TABLE users_by_username (
username text,
user_id UUID,
email text,
PRIMARY KEY (username, user_id) // Optimized for username lookups
);
You would then write application logic to write to both tables simultaneously. This is the Cassandra way: write for your read patterns.
Finally, secondary indexes have a significant impact on write performance. Every INSERT, UPDATE, or DELETE operation on a table with a secondary index incurs an additional write to the corresponding index table. This overhead can slow down write throughput, especially in high-volume write scenarios.
The next problem you’ll likely encounter after understanding these limitations is how to effectively model your data for complex query patterns without secondary indexes.