CockroachDB’s secondary indexes don’t just speed up queries; they fundamentally change how data is accessed, allowing you to treat your data as if it were sorted by multiple columns simultaneously, even though each full row is physically stored only once, in the primary index.
Let’s see this in action. Imagine a table users with a primary key on user_id:
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    username STRING UNIQUE,
    email STRING UNIQUE,
    signup_date TIMESTAMP
);
Without a secondary index on signup_date, querying for users who signed up recently would require a full table scan, even if we only wanted a few users:
-- Slow, full scan
SELECT user_id, username FROM users WHERE signup_date > '2023-10-26 00:00:00';
Now, let’s add a secondary index:
CREATE INDEX users_signup_date_idx ON users (signup_date);
CockroachDB now maintains a separate data structure, sorted by signup_date. When you run the same query, the database can use this index to locate the relevant rows much faster:
-- Fast, index scan
SELECT user_id, username FROM users WHERE signup_date > '2023-10-26 00:00:00';
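The EXPLAIN plan for the second query will typically show a scan on users@users_signup_date_idx, indicating it’s using the index, followed by an index join that fetches username from the primary index. If the optimizer estimates that the filter matches most of the table, it may still prefer a full scan.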
This index comes at a cost: every INSERT and DELETE on the users table now also has to write to users_signup_date_idx, and so does every UPDATE that changes signup_date. The index also takes up additional storage. This is why you don’t just index every column. The decision hinges on weighing read access patterns against write overhead.
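The write overhead can be pictured with a toy model. The sketch below is plain Python, not CockroachDB code: the Table class and its index_writes counter are hypothetical names invented for illustration, and each index is just a sorted list of (value, primary key) pairs. It only counts how many extra index writes each kind of statement would cause.

```python
import bisect


class Table:
    """Toy table with secondary indexes kept as sorted (value, pk) lists."""

    def __init__(self, indexed_cols):
        self.rows = {}                                # primary index: pk -> row
        self.indexes = {c: [] for c in indexed_cols}  # column -> sorted entries
        self.index_writes = 0                         # extra writes caused by indexes

    def insert(self, pk, row):
        self.rows[pk] = dict(row)
        for col, entries in self.indexes.items():
            bisect.insort(entries, (row[col], pk))    # one new entry per index
            self.index_writes += 1

    def update(self, pk, col, value):
        old = self.rows[pk][col]
        self.rows[pk][col] = value
        # Only an update that changes an indexed column touches the index.
        if col in self.indexes and old != value:
            entries = self.indexes[col]
            entries.remove((old, pk))                 # drop the stale entry
            bisect.insort(entries, (value, pk))       # write the new entry
            self.index_writes += 2

    def delete(self, pk):
        row = self.rows.pop(pk)
        for col, entries in self.indexes.items():
            entries.remove((row[col], pk))            # one removal per index
            self.index_writes += 1


users = Table(indexed_cols=["signup_date"])
users.insert("u1", {"username": "ana", "signup_date": "2023-10-20"})
users.update("u1", "username", "anna")            # not indexed: no index write
users.update("u1", "signup_date", "2023-10-21")   # indexed: remove + insert
users.delete("u1")
print(users.index_writes)
```

Every additional entry in indexed_cols adds one more write to each insert and delete in this model, which is the trade-off to weigh before adding an index.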
The core problem secondary indexes solve is that a table can be physically ordered in only one way: by its primary key. If you frequently filter or sort on other columns, those queries degenerate into full table scans. Secondary indexes provide alternative orderings of the data, sorted by different columns, so the database can seek directly to the matching entries.
Internally, CockroachDB does not keep a separate B-tree per index. Every index, primary or secondary, is encoded as key-value pairs in one sorted, distributed key-value store, backed on each node by an LSM-tree storage engine (Pebble). A secondary index on signup_date in our users table stores entries keyed by (signup_date, user_id). When you query WHERE signup_date > '2023-10-26', CockroachDB seeks to the first index key past that value and scans forward over the contiguous entries that meet the condition. For each matching entry, it reads the associated user_id and then uses that user_id to look up the rest of the row data in the primary index.
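The two-step lookup described above can be sketched in a few lines of Python. This is a simplified in-memory model, not CockroachDB's storage format: the sample rows, the short string IDs standing in for UUIDs, and the function name users_signed_up_after are all made up for illustration.

```python
import bisect

# Primary index: user_id -> full row (short IDs stand in for UUIDs).
primary = {
    "u1": {"username": "ana", "signup_date": "2023-10-20"},
    "u2": {"username": "bo", "signup_date": "2023-10-27"},
    "u3": {"username": "cy", "signup_date": "2023-10-30"},
    "u4": {"username": "di", "signup_date": "2023-10-25"},
}

# Secondary index: (signup_date, user_id) pairs kept in sorted order.
signup_idx = sorted((row["signup_date"], uid) for uid, row in primary.items())
signup_keys = [date for date, _ in signup_idx]


def users_signed_up_after(cutoff):
    """Return (user_id, username) for rows with signup_date > cutoff."""
    # Step 1: seek to the first index entry strictly after the cutoff.
    start = bisect.bisect_right(signup_keys, cutoff)
    results = []
    for _, uid in signup_idx[start:]:
        # Step 2: use the primary key from the index entry to fetch the row.
        results.append((uid, primary[uid]["username"]))
    return results


print(users_signed_up_after("2023-10-26"))
```

Only the entries after the seek position are touched, so the work scales with the number of matching rows rather than with the size of the table.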
The key levers you control are:
- Which columns to index: Based on your WHERE clauses, JOIN conditions, and ORDER BY clauses.
- Index type: STORING clauses can include additional columns that are then "covered" by the index, avoiding a lookup to the primary table for those specific columns. For example, CREATE INDEX users_signup_date_idx ON users (signup_date) STORING (username); would allow fetching username directly from the index.
- Index selectivity: An index is most effective when it significantly narrows down the number of rows to examine. An index on a high-cardinality column (like user_id or email) is generally more selective than one on a low-cardinality column (like a boolean is_active flag).
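To make the effect of STORING concrete, here is a small Python model that counts how many primary-index lookups a query needs with and without stored columns. It is a sketch under simplifying assumptions, not CockroachDB's implementation: CountingPrimary, build_index, and query are hypothetical helpers, and each index entry carries a plain dict of stored column values.

```python
import bisect

rows = {
    "u1": {"username": "ana", "email": "ana@example.com", "signup_date": "2023-10-20"},
    "u2": {"username": "bo", "email": "bo@example.com", "signup_date": "2023-10-27"},
    "u3": {"username": "cy", "email": "cy@example.com", "signup_date": "2023-10-30"},
}


class CountingPrimary:
    """Primary index wrapper that counts row lookups."""

    def __init__(self, rows):
        self.rows = rows
        self.lookups = 0

    def get(self, uid):
        self.lookups += 1
        return self.rows[uid]


def build_index(rows, storing=()):
    """Entries are (signup_date, user_id, stored columns), sorted by the first two."""
    entries = [
        (r["signup_date"], uid, {c: r[c] for c in storing})
        for uid, r in rows.items()
    ]
    return sorted(entries, key=lambda e: e[:2])


def query(entries, primary, cutoff, wanted):
    """Fetch the wanted columns for rows with signup_date > cutoff."""
    keys = [e[0] for e in entries]
    out = []
    for _, uid, stored in entries[bisect.bisect_right(keys, cutoff):]:
        if all(c in stored for c in wanted):
            source = stored              # covered: answer straight from the index
        else:
            source = primary.get(uid)    # not covered: go back to the primary index
        out.append(tuple(source[c] for c in wanted))
    return out


plain, covering = CountingPrimary(rows), CountingPrimary(rows)
a = query(build_index(rows), plain, "2023-10-26", ["username"])
b = query(build_index(rows, storing=["username"]), covering, "2023-10-26", ["username"])
print(a == b, plain.lookups, covering.lookups)
```

Both variants return the same rows; the covering variant simply never touches the primary index, at the price of storing username a second time.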
CockroachDB’s distributed nature adds a layer of complexity. When an index is created, it’s a distributed data structure: its entries are split into ranges and replicated just like table data. Those ranges are placed independently of the ranges that hold the primary rows, so an index entry and the row it points to may live on different nodes, and the lookup back to the primary index can cost an extra network hop. This is one reason covering indexes built with STORING matter for performance at scale.
A common misconception is that a composite index ON (col1, col2) behaves identically to two separate indexes ON (col1) and ON (col2). It does not. A composite index on (col1, col2) is optimized for queries that filter or sort on col1 alone, or on col1 and col2 together. It generally cannot efficiently serve a query that filters only on col2, because the index is sorted first by col1 and then by col2 within each col1 value, so rows matching a given col2 are scattered throughout the index.
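The ordering argument is easy to check with a toy composite index. In the Python sketch below, the index is a sorted list of (col1, col2) tuples, with made-up country and city values standing in for the two columns; each function returns the matches plus the number of index entries it had to examine. The function names are illustrative only.

```python
import bisect

# Composite index on (col1, col2): sorted by col1, then by col2 within each col1.
index = sorted([
    ("DE", "Berlin"), ("DE", "Munich"),
    ("FR", "Lyon"), ("FR", "Paris"),
    ("US", "Austin"), ("US", "Paris"),
])


def filter_col1(value):
    """Seek to the first entry for value, then read its contiguous run."""
    i = bisect.bisect_left(index, (value,))
    matches = []
    while i < len(index) and index[i][0] == value:
        matches.append(index[i])
        i += 1
    return matches, len(matches)      # examined only the matching entries


def filter_col2(value):
    """No usable ordering on col2 alone: every entry has to be checked."""
    matches = [entry for entry in index if entry[1] == value]
    return matches, len(index)        # examined the whole index


print(filter_col1("FR"))
print(filter_col2("Paris"))
```

The two "Paris" entries sit in different col1 groups, so there is no single position to seek to; that is the scattering that makes a col2-only filter expensive on this index.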
The next concept to explore is how to optimize queries using EXPLAIN and understand index selectivity.