ClickHouse’s data skipping indexes don’t just skip data; they fundamentally change how the query planner sees your data, allowing it to avoid reading entire blocks of files based on simple statistical summaries.
Let’s see it in action. Imagine a table events with a timestamp column and a user_id column, and we’re frequently querying for events within a specific time range for a particular user.
CREATE TABLE events (
timestamp DateTime,
user_id UInt64,
event_type String,
event_data String
) ENGINE = MergeTree()
ORDER BY (user_id, toDate(timestamp));
Now, let’s add a data skipping index on timestamp.
ALTER TABLE events
ADD INDEX ts_idx timestamp TYPE minmax GRANULARITY 1;
GRANULARITY 1 means the index is computed for every single row. This is often too much; a common, practical GRANULARITY is 8192.
Let’s insert some data:
INSERT INTO events VALUES
('2023-10-26 10:00:00', 101, 'login', '{}'),
('2023-10-26 10:05:00', 101, 'view', '{"page": "/home"}'),
('2023-10-26 10:10:00', 102, 'click', '{"button": "buy"}'),
('2023-10-27 11:00:00', 101, 'logout', '{}'),
('2023-10-27 11:05:00', 101, 'view', '{"page": "/profile"}'),
('2023-10-28 12:00:00', 103, 'login', '{}');
When you run a query like this:
SELECT * FROM events WHERE user_id = 101 AND timestamp BETWEEN '2023-10-26 10:00:00' AND '2023-10-26 10:30:00';
ClickHouse doesn’t just scan the entire events table. It consults the ts_idx for each data part. If the minmax range stored in the index for a particular data part falls entirely outside the query’s timestamp range, that entire data part is skipped.
The core idea is that ClickHouse stores small summary statistics (like minimum and maximum values) for your indexed columns within each "granule" of data. A granule is a block of rows (defaulting to 8192 rows). When you query a range, ClickHouse checks these summaries. If the entire range of the granule is outside your query’s range, ClickHouse doesn’t even bother to read the actual data for that granule.
The ORDER BY clause is crucial here. ClickHouse physically sorts data on disk based on the ORDER BY key. This means that rows with similar user_id and toDate(timestamp) values are stored together. When you query user_id = 101 and a timestamp range, ClickHouse can leverage both the primary key sorting and the data skipping index. It first finds the data parts relevant to user_id = 101 (due to ORDER BY) and then applies the ts_idx to skip granules within those relevant parts.
The TYPE of the data skipping index is important. minmax is the most basic, storing just the minimum and maximum values. Other types like set (stores a list of unique values) or bloom_filter can be more effective for different query patterns. For range queries on timestamps, minmax is usually a good starting point.
The GRANULARITY setting directly impacts the trade-off between index size/query overhead and the effectiveness of data skipping. A smaller GRANULARITY means more granules, more index entries, and potentially better skipping, but also higher memory usage for the index. A larger GRANULARITY means fewer index entries, less memory, but potentially less granular skipping. For MergeTree tables, GRANULARITY is set per data part, not per table. The default is 8192.
The ORDER BY clause in your CREATE TABLE statement defines the physical sorting of data on disk. This is your primary key. For efficient range queries, you want the columns you filter on in your WHERE clause to be at the beginning of the ORDER BY key. In our events example, ORDER BY (user_id, toDate(timestamp)) is good because we often filter by user_id first, then timestamp. If you frequently query by timestamp without user_id, you might consider ORDER BY (toDate(timestamp), user_id).
The most surprising thing about data skipping indexes is their interaction with the primary key. While an index can skip data based on its own statistics, the primary key sorting dictates which blocks of data are even considered for index lookups. If your query’s primary filter isn’t covered by the ORDER BY key, the index might be less effective because ClickHouse has to scan more data parts before even checking the index.
The next optimization you’ll likely consider is using a set index on user_id if you have a relatively small number of distinct users and frequently query specific user IDs.