ClickHouse schemas are surprisingly rigid: a MergeTree table’s sorting key is hard to change once data has been loaded, and the ORDER BY clause in your table definition is usually the single most important factor in query performance.

Let’s see this in action with a common OLAP scenario: analyzing sales data. Imagine we have a table sales with columns like sale_id, timestamp, product_id, customer_id, and amount.

CREATE TABLE sales (
    sale_id UUID,
    timestamp DateTime,
    product_id UInt32,
    customer_id UInt32,
    amount Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (product_id, timestamp);

In this example, ORDER BY (product_id, timestamp) is crucial. Each insert is written as a data part that is physically sorted on disk by these columns, and background merges combine parts while preserving that order. Within a part, all sales for a particular product_id are stored together, and within each product_id group the rows are sorted by timestamp.
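To see the per-part layout for yourself, one option is to insert a couple of batches and then look at system.parts. This is a minimal sketch: the sample values are made up for illustration, and the number of parts you see depends on whether background merges have already run.

```sql
-- Two separate inserts; each one typically creates its own sorted data part.
INSERT INTO sales VALUES
    (generateUUIDv4(), '2023-01-15 10:00:00', 456, 1, 19.99),
    (generateUUIDv4(), '2023-01-10 09:30:00', 123, 2, 5.00);

INSERT INTO sales VALUES
    (generateUUIDv4(), '2023-01-20 14:00:00', 123, 3, 42.50);

-- List the active parts of the table and how many rows each holds.
SELECT name, rows
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'sales'
  AND active;
```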

This physical sorting is the bedrock of ClickHouse’s OLAP performance. ClickHouse builds a sparse index over the sorted data, with one entry per granule (8192 rows by default), and uses it to skip granules that cannot match your WHERE clause conditions. For instance, if you query sales for product_id = 123 and timestamp BETWEEN '2023-01-01' AND '2023-01-31', ClickHouse doesn’t need to scan the entire table. It can quickly locate the granules containing product_id = 123 and then, within those, narrow down to the relevant time range. The columns in ORDER BY are called the "sorting key" in ClickHouse parlance; by default they also serve as the "primary key" on which the sparse index is built, and this is what enables its lightning-fast analytical queries.
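You can check how much data the index lets ClickHouse skip with EXPLAIN indexes = 1. The query below is a sketch against the sales table defined above; it uses a half-open range with explicit toDateTime() calls to cover all of January, which is a choice made here for clarity rather than something the original query requires.

```sql
-- Ask ClickHouse which parts and granules the primary index selects.
EXPLAIN indexes = 1
SELECT sum(amount) AS total_amount
FROM sales
WHERE product_id = 123
  AND timestamp >= toDateTime('2023-01-01 00:00:00')
  AND timestamp <  toDateTime('2023-02-01 00:00:00');
```

In the plan, the PrimaryKey section reports the selected parts and granules next to the totals. The smaller the selected share, the less data the query has to read.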

The primary goal when designing a schema for OLAP in ClickHouse is to align your ORDER BY clause with your most frequent and performance-critical query patterns. Think about the columns you’ll most commonly filter on in WHERE clauses or group by. These should ideally form the prefix of your ORDER BY key.

For example, if your primary use case is to analyze sales by region and then by date, your ORDER BY clause should reflect that:

CREATE TABLE sales_by_region (
    sale_id UUID,
    timestamp DateTime,
    region String,
    product_id UInt32,
    customer_id UInt32,
    amount Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (region, timestamp);

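Here, region comes first because queries in this scenario filter on it with an equality condition and it has few distinct values, so ClickHouse can narrow the scan to one contiguous range of rows before applying the timestamp range. If you commonly filter by region and product_id, you might consider ORDER BY (region, product_id, timestamp). The order matters significantly: as a general rule, put columns that are filtered by equality and have lower cardinality ahead of high-cardinality or range-filtered columns such as timestamp.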
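A query shaped like the one below is the kind this key is designed for: an equality filter on the first key column followed by a range on the second. It is only a sketch, and 'EMEA' is a made-up region value used for illustration.

```sql
-- Daily revenue for one region over one month.
-- 'EMEA' is a hypothetical region value, not taken from real data.
SELECT
    toDate(timestamp) AS day,
    sum(amount) AS revenue
FROM sales_by_region
WHERE region = 'EMEA'
  AND timestamp >= toDateTime('2023-01-01 00:00:00')
  AND timestamp <  toDateTime('2023-02-01 00:00:00')
GROUP BY day
ORDER BY day;
```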

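You can also include a PRIMARY KEY clause. The primary key must be a prefix of the ORDER BY key. It controls which columns are stored in the sparse index, while ORDER BY still controls how rows are sorted on disk. A shorter primary key keeps the in-memory index smaller; it does not add any skipping power beyond what the full sorting key would provide. If you don’t specify a PRIMARY KEY, ClickHouse defaults to using the entire ORDER BY key.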

CREATE TABLE sales_with_pk (
    sale_id UUID,
    timestamp DateTime,
    region String,
    product_id UInt32,
    customer_id UInt32,
    amount Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (region, product_id, timestamp)
PRIMARY KEY (region, product_id);

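In this case, PRIMARY KEY (region, product_id) tells ClickHouse to index only these two columns. Rows are still sorted by timestamp within each (region, product_id) group, but timestamp values are not stored in the index, so the index uses less memory at the cost of coarser pruning on timestamp.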
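To confirm which keys a table actually ended up with, you can read them back from the system tables. This is a small sketch against sales_with_pk; the reported index size may be zero on an empty table or before the index has been loaded into memory.

```sql
-- Show the sorting key and the primary key side by side.
SELECT sorting_key, primary_key
FROM system.tables
WHERE database = currentDatabase()
  AND name = 'sales_with_pk';

-- Approximate memory used by the primary index across active parts.
SELECT sum(primary_key_bytes_in_memory) AS index_bytes
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'sales_with_pk'
  AND active;
```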

One of the most counterintuitive aspects of ClickHouse schema design is that the index only works well for filters on a prefix of the ORDER BY key. If you define ORDER BY (a, b, c), you can efficiently query on a, a AND b, or a AND b AND c. Querying solely on b or c will usually be significantly less performant, because matching rows are scattered across the table and ClickHouse has to read many more granules. A filter on a AND c without b still prunes on a, but c helps little unless b has few distinct values. The physical data layout is dictated by the full ORDER BY key, and pruning is most effective on prefixes of that key. This is why careful planning of your most common query filters is paramount.
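One way to observe the prefix effect is to compare the plans of two queries on the sales table from earlier, which is ordered by (product_id, timestamp). This is a sketch; the actual granule counts depend on your data.

```sql
-- Filter on the key prefix: the index can narrow the scan to few granules.
EXPLAIN indexes = 1
SELECT count()
FROM sales
WHERE product_id = 123;

-- Filter only on the second key column: expect far more granules selected,
-- because the matching rows are spread across every product_id group.
EXPLAIN indexes = 1
SELECT count()
FROM sales
WHERE timestamp >= toDateTime('2023-01-01 00:00:00')
  AND timestamp <  toDateTime('2023-02-01 00:00:00');
```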

Beyond the ORDER BY clause, consider your data types. Using the smallest data type that fits your values (e.g., UInt16 instead of Int32 if your IDs are never negative and stay below 65,536) can reduce storage footprint and improve cache efficiency. Also, think about denormalization. For OLAP, it’s often beneficial to denormalize your data and join dimension tables into your fact tables before inserting into ClickHouse, as complex joins at query time are expensive.
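As a rough sketch of both ideas, the table below stores dimension attributes directly on each sale and uses compact types. The table name sales_denormalized and the columns product_category and store_id are hypothetical, and the choice of LowCardinality(String) and UInt16 assumes few distinct values and a small ID range.

```sql
-- Hypothetical denormalized fact table with compact column types.
CREATE TABLE sales_denormalized (
    sale_id UUID,
    timestamp DateTime,
    region LowCardinality(String),            -- few distinct values
    product_id UInt32,
    product_category LowCardinality(String),  -- copied from a dimension table
    store_id UInt16,                          -- assumes fewer than 65,536 stores
    customer_id UInt32,
    amount Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (region, product_id, timestamp);

-- After loading data, compare the on-disk size of each column.
SELECT name, type, data_compressed_bytes, data_uncompressed_bytes
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'sales_denormalized';
```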

After optimizing your ORDER BY clause and data types, the next common performance bottleneck you’ll encounter is related to data ingestion rates and the size of your data parts on disk, which leads to understanding MergeTree’s background merge processes.
