The most surprising thing about time-series data in ClickHouse is how little it resembles traditional relational data, even though it lives in the same database.

Let’s see it in action. Imagine we’re tracking sensor readings from IoT devices. We’ve got millions of readings per minute, each with a timestamp, device ID, and a set of metrics.

CREATE TABLE sensor_readings (
    timestamp DateTime64(3),
    device_id UInt32,
    temperature Float32,
    humidity Float32,
    pressure Float32
) ENGINE = MergeTree()
ORDER BY (device_id, timestamp);

This MergeTree table is the workhorse. The ORDER BY (device_id, timestamp) is critical. ClickHouse physically sorts data on disk based on this tuple. When you query for a specific device_id and a time range, ClickHouse can efficiently jump to the relevant data blocks without scanning the entire table.

Here’s a typical query to get temperature readings for device_id = 123 between two timestamps:

SELECT timestamp, temperature
FROM sensor_readings
WHERE device_id = 123
  AND timestamp BETWEEN '2023-10-26 10:00:00' AND '2023-10-26 11:00:00';

Because the data is sorted by device_id first, ClickHouse can quickly locate all data for device_id = 123. Then, within those blocks, it uses the sorted timestamp to find only the data within the specified range. This is orders of magnitude faster than a table not sorted this way.

The MergeTree engine itself is a marvel. It stores data in sorted parts on disk. When new data is inserted, it’s written to a new part. Periodically, ClickHouse merges these smaller parts into larger, more optimized ones in the background. This merge process also re-sorts and de-duplicates data according to the ORDER BY clause.

To optimize further, especially for queries that aggregate data over time, we can use sparse primary indexes. The default primary index in MergeTree is a B-tree index on the ORDER BY columns. However, for time-series, we often want to index granules of data. ClickHouse’s primary index is sparse by default, meaning it only indexes the first row of each data granule. A granule is a block of rows (default 8192). This significantly reduces the index size while still allowing ClickHouse to skip large portions of data.

For time-series, the ORDER BY clause is king. If your queries frequently filter by device_id and then timestamp, ORDER BY (device_id, timestamp) is optimal. If you often query a global time range across all devices, ORDER BY timestamp might seem appealing, but it often leads to slower performance for device-specific queries. The key is to order by the most selective columns first.

The DateTime64(3) type for timestamp is important. The (3) signifies millisecond precision. If your data has microsecond or nanosecond precision, you’d adjust this accordingly. Using the most precise DateTime64 that matches your data’s actual precision is generally best, as it avoids unnecessary precision loss or storage bloat.

Consider the MATERIALIZED columns feature. If you frequently need to extract the hour, day, or month from your timestamp for aggregation, you can materialize it:

CREATE TABLE sensor_readings_optimized (
    timestamp DateTime64(3),
    device_id UInt32,
    temperature Float32,
    humidity Float32,
    pressure Float32,
    hourOfDay ALIAS toHour(timestamp)
) ENGINE = MergeTree()
ORDER BY (device_id, timestamp);

This ALIAS column doesn’t store data directly but computes it on the fly. For better performance on aggregated queries, you could make it MATERIALIZED:

CREATE TABLE sensor_readings_materialized (
    timestamp DateTime64(3),
    device_id UInt32,
    temperature Float32,
    humidity Float32,
    pressure Float32,
    hourOfDay UInt8 MATERIALIZED toHour(timestamp)
) ENGINE = MergeTree()
ORDER BY (device_id, timestamp, hourOfDay); -- Add materialized column to ORDER BY for efficiency

Now, hourOfDay is physically stored and indexed. You can even include it in your ORDER BY clause for faster aggregations by hour. The best ORDER BY clause often includes a mix of your primary filter (like device_id), the time dimension, and any frequently used derived time dimensions.

The real magic of ClickHouse for time-series isn’t just the storage engine, but how it leverages the sorted data. When you query WHERE device_id = 123 AND timestamp BETWEEN ..., ClickHouse uses its primary index to find the granules that contain device_id = 123. Within those granules, it then uses the fact that timestamp is also sorted to perform a binary search or a simple range scan, quickly discarding irrelevant data blocks. This is why the ORDER BY clause is the single most important decision you make for time-series performance.

The next hurdle you’ll face is managing the sheer volume of data, often leading to considerations around data retention and partitioning.

Want structured learning?

Take the full Clickhouse course →