Projections are ClickHouse’s answer to indexing, but they’re far more powerful and flexible, allowing you to pre-aggregate and pre-sort data for specific query patterns without sacrificing the flexibility of your base table.

Let’s see a projection in action. Imagine you have a massive events table logging user actions, and you frequently query the count of events for a specific user_id within a date range.

-- Original table structure
CREATE TABLE events (
    event_time DateTime,
    user_id UInt64,
    event_type String,
    payload String
) ENGINE = MergeTree()
ORDER BY (event_time, user_id);

-- Inserting some sample data (imagine millions/billions of rows)
INSERT INTO events VALUES
('2023-10-26 10:00:00', 123, 'login', '{"ip": "192.168.1.1"}'),
('2023-10-26 10:05:00', 456, 'click', '{"button": "buy"}'),
('2023-10-26 10:10:00', 123, 'view', '{"page": "/home"}'),
('2023-10-27 11:00:00', 789, 'login', '{"ip": "10.0.0.5"}'),
('2023-10-27 11:15:00', 123, 'logout', '{}');

-- A common query
SELECT count(*)
FROM events
WHERE user_id = 123 AND event_time BETWEEN '2023-10-26 00:00:00' AND '2023-10-27 23:59:59';

Without a projection, ClickHouse would have to scan through the event_time and user_id columns of the base table, potentially reading a huge amount of data.

Now, let’s add a projection specifically for this query pattern:

-- Define a projection
ALTER TABLE events
ADD PROJECTION events_by_user_time (
    SELECT
        user_id,
        event_time,
        count(*) AS event_count
    FROM events
    GROUP BY user_id, event_time
);

When you run the same query again, ClickHouse will intelligently choose to use the events_by_user_time projection. It’s already pre-aggregated count(*) for each user_id and event_time combination, and it’s likely stored in a more optimized format for this specific access pattern.

The mental model for projections is built around materialized views, but with a key difference: projections are automatic and transparent. When you define a projection, ClickHouse creates a secondary data structure that is populated in the background as data is inserted into the base table. You don’t query the projection directly; ClickHouse’s query planner decides whether to use the projection or the base table based on the query’s WHERE and GROUP BY clauses.

The SELECT part of the projection definition dictates what data is stored in the projection. This can be a subset of columns, aggregated values, or even a different ordering. The GROUP BY clause is crucial for pre-aggregations. The ORDER BY clause within the projection definition (which is implicit in the GROUP BY columns here for simplicity, but can be explicit) determines how data is sorted within that projection’s data files, further optimizing lookups.

The ALTER TABLE events MODIFY ORDER BY ... command is how you would change the primary key of the base table itself, which dictates the physical storage order of data. Projections, on the other hand, are additional structures. Think of your base table’s ORDER BY as the main filing system, and projections as specialized, smaller filing cabinets for specific types of documents that you access frequently.

A common misconception is that projections are just for simple aggregations. You can get quite complex. For example, if you often need the sum of payload lengths for a given user_id and event_type:

ALTER TABLE events
ADD PROJECTION payload_len_by_user_type (
    SELECT
        user_id,
        event_type,
        sum(length(payload)) AS total_payload_length
    FROM events
    GROUP BY user_id, event_type
);

This projection will store pre-calculated total_payload_length values, making queries like SELECT sum(total_payload_length) FROM events WHERE user_id = 123 AND event_type = 'click' incredibly fast. ClickHouse’s query optimizer is smart enough to see that sum(length(payload)) can be satisfied by sum(total_payload_length) from this projection.

The most powerful aspect of projections is that they don’t require you to change your application’s queries. You define the projection, and ClickHouse does the rest. It’s also important to remember that projections are materialized. This means they consume disk space, just like the base table. You need to balance the query speed gains against the storage overhead. If a projection isn’t being used by any common queries, it’s just wasted space and insertion overhead. You can check projection usage by querying system.query_log.

The next step after mastering projections is understanding how to combine them with ClickHouse’s data skipping capabilities for even more granular performance improvements.

Want structured learning?

Take the full Clickhouse course →