ClickHouse’s MergeTree engine doesn’t just store data; it actively reorganizes it in the background to make queries blazing fast.
Let’s see it in action. Imagine you have a table events with a timestamp and an event_type.
CREATE TABLE events (
event_time DateTime,
event_type String,
value UInt64
) ENGINE = MergeTree()
ORDER BY (event_type, event_time);
Now, we insert some data:
INSERT INTO events VALUES (now(), 'login', 1), (now() - INTERVAL 1 HOUR, 'logout', 1), (now() - INTERVAL 2 HOUR, 'login', 1);
When you query this:
SELECT count() FROM events WHERE event_type = 'login';
ClickHouse doesn’t scan every single row. Because we ordered by event_type and event_time, it can use a "sparse index" to jump directly to the blocks of data relevant to 'login', and then quickly scan just those.
The core idea is that data is written in immutable "parts" on disk. When you INSERT, a new part is created. To keep things efficient, MergeTree periodically merges these smaller parts into larger ones. This process is called "merging."
Here’s what happens internally:
- Data is written in parts: Each
INSERTorALTER UPDATE/DELETEoperation creates a new, immutable part on disk. These parts are stored in a directory structure specific to the table. - Background merges: A background process continuously scans for small parts and merges them into larger ones. This reduces the number of files to manage and improves data locality.
- Sorting within parts: Crucially, data within each part is sorted according to the
ORDER BYkey specified in theCREATE TABLEstatement. This is the foundation of ClickHouse’s query performance. - Primary Index: For each part, ClickHouse creates a primary index (a sparse index by default) based on the
ORDER BYkey. This index allows queries to quickly locate relevant data blocks without scanning the entire part. - Data Deduplication: If you insert the same data multiple times within a short period (controlled by
INSERT_QUOTA_MAX_PERIOD), MergeTree can automatically deduplicate it during merges.
Think of it like this: you have stacks of paper (data parts). When you need to find something, you don’t want to sift through hundreds of tiny stacks. MergeTree periodically takes smaller stacks, combines them into bigger, more organized stacks, and makes sure related papers are together. The ORDER BY key is like deciding how you’ll sort the papers within each stack – by date, by topic, etc.
The ORDER BY key is not just for sorting; it’s the primary key for indexing. If your ORDER BY is (UserID, EventTimestamp), ClickHouse will sort all data within a part by UserID first, then by EventTimestamp. This means a query like SELECT ... FROM table WHERE UserID = 123 will be incredibly fast because it can jump to all data for UserID = 123 and then efficiently scan the relevant time range.
The actual merging process is managed by a background service. You can see ongoing merges in the system.merges table.
SELECT * FROM system.merges WHERE is_active;
You can also manually trigger merges using OPTIMIZE TABLE events FINAL;. The FINAL keyword ensures that all possible merges are performed, including deduplication if applicable. This is useful for cleaning up after many small inserts or ALTER DELETE operations.
OPTIMIZE TABLE events FINAL;
The most surprising thing about MergeTree’s sorting and merging is how it handles deleted data. When you run ALTER TABLE events DELETE WHERE event_type = 'logout', ClickHouse doesn’t immediately remove those rows. Instead, it marks them for deletion. The actual removal happens during the subsequent merges, where the deleted rows are simply omitted from the new, larger merged part. This "lazy deletion" strategy is key to maintaining write performance and avoiding overhead during DELETE operations.
The next step in understanding MergeTree is how it handles different data types and the impact of ORDER BY choices on query patterns.