ClickHouse doesn’t just tell you that a query was slow; it can show you exactly why, down to the microsecond, by letting you peer into the execution plan as it runs.
Let’s watch a slow query get dissected. Imagine this query is grinding your dashboard to a halt:
SELECT
toDate(event_time) AS event_date,
count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59'
GROUP BY event_date
ORDER BY event_date;
This query, at first glance, looks innocent. It’s a simple aggregation over a date range. But if it’s slow, we need to see where the time is being spent.
The first tool is EXPLAIN. Running EXPLAIN on your query gives you the intended execution plan, not the actual one. It’s a blueprint.
EXPLAIN SELECT toDate(event_time) AS event_date, count() AS event_count FROM events WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59' GROUP BY event_date ORDER BY event_date;
The output is a tree of plan steps rather than the query text. For a query like this you would typically see expression, filter, aggregation, and sorting steps sitting on top of a ReadFromMergeTree step for the events table, though the exact steps vary by version. It shows the shape of the plan but no timings. Variants such as EXPLAIN PIPELINE and EXPLAIN SYNTAX reveal more of the structure, but for performance we need to see the query in action.
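Before turning to logs, it can be worth trying two EXPLAIN variants on the same query. This is a sketch that assumes events is a MergeTree-family table; the output layout differs between ClickHouse versions, so treat it as a starting point rather than a reference.

```sql
-- Show which parts and granules the partition key and primary key select.
-- If nearly all granules are selected, the date filter is not pruning anything.
EXPLAIN indexes = 1
SELECT toDate(event_time) AS event_date, count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59'
GROUP BY event_date
ORDER BY event_date;

-- Show the processors that will run and how many parallel streams each uses.
EXPLAIN PIPELINE
SELECT toDate(event_time) AS event_date, count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59'
GROUP BY event_date
ORDER BY event_date;
```

Both statements only plan the query, so they are safe to run against a busy server.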
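To do that, we enable trace logs. This requires a configuration change. In your config.xml or a file in config.d/, you’d add or modify the <logger> section:
<clickhouse>
    <logger>
        <level>trace</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>100M</size> <!-- rotate the log file at roughly 100 MB -->
        <count>10</count> <!-- keep this many rotated files -->
    </logger>
</clickhouse>
Once the server has picked up the change (a restart is the safe way), trace-level messages for every query, not only slow ones, appear in /var/log/clickhouse-server/clickhouse-server.log. Each line carries the query ID, so you can filter the log down to a single query. Note that log_queries and log_query_threads are user-level settings that control what is written to the system.query_log and system.query_thread_log tables; they belong in a settings profile, not in the <logger> section.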
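If editing the server configuration is not an option, the same kind of messages can be streamed to the client for a single session. A minimal sketch, assuming you are connected with clickhouse-client and your user is allowed to change the send_logs_level setting:

```sql
-- Stream server-side log messages for this session back to the client.
SET send_logs_level = 'trace';

-- Run the query under investigation; log lines are printed alongside the result.
SELECT toDate(event_time) AS event_date, count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59'
GROUP BY event_date
ORDER BY event_date;

-- Switch the streaming off again.
SET send_logs_level = 'none';
```

This affects only the current session, which makes it a low-risk way to inspect one query on a shared server.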
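Let’s simulate a slow query and look at what the trace output tells us. Suppose our events table is massive and event_time is not part of the table’s sorting key. The excerpt below is a schematic illustration, not verbatim server output: real trace lines are tagged with the query ID and the emitting component (for example executeQuery or AggregatingTransform), and their wording differs between versions, but they carry the same kind of information: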
2023-10-27 10:30:00.123456 [12345] <Trace> void DB::executeQuery(const DB::String&, const DB::BlockIO&, bool, bool, bool, bool, bool) - Query: SELECT toDate(event_time) AS event_date, count() AS event_count FROM events WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59' GROUP BY event_date ORDER BY event_date
...
2023-10-27 10:30:05.456789 [12345] <Trace> void DB::ProcessThreadPool::workerThread() - Thread 1: Processed 1000000000 rows in 5.333333 seconds. Operation: ReadPart. Path: /var/lib/clickhouse/data/default/events/202310_1/
...
2023-10-27 10:30:08.789012 [12345] <Trace> void DB::ProcessThreadPool::workerThread() - Thread 2: Processed 1000000000 rows in 3.333333 seconds. Operation: Filter. Data: 1000000000 rows filtered, 500000000 kept.
...
2023-10-27 10:30:10.111222 [12345] <Trace> void DB::ProcessThreadPool::workerThread() - Thread 3: Processed 500000000 rows in 1.333333 seconds. Operation: Aggregation.
...
2023-10-27 10:30:11.555666 [12345] <Trace> void DB::ProcessThreadPool::workerThread() - Thread 4: Processed 31 rows in 1.444444 seconds. Operation: Sorting.
The key is to look for the Operation and the associated time. Here, "ReadPart" took over 5 seconds to read 1 billion rows, which indicates that ClickHouse had to scan a huge amount of raw data from disk. "Filter" took 3.3 seconds and discarded half of those rows, so much of that read was wasted work. "Aggregation" was relatively fast at 1.3 seconds, but "Sorting" at the end still took 1.4 seconds, a significant chunk of time for only 31 output rows.
The core problem here is the inefficient data scanning. ClickHouse’s performance hinges on minimizing the amount of data it needs to read and process.
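One quick way to gauge how much data a query would touch, without actually running it, is EXPLAIN ESTIMATE. The sketch below assumes a reasonably recent ClickHouse version and a MergeTree-family events table; it reports the estimated number of parts, rows, and marks that would be read.

```sql
-- Estimate the parts, rows, and marks this query would read from each table.
-- A row estimate close to the table's total row count means the WHERE clause
-- is not helping ClickHouse skip data.
EXPLAIN ESTIMATE
SELECT toDate(event_time) AS event_date, count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59'
GROUP BY event_date
ORDER BY event_date;
```

Comparing this estimate before and after a schema change gives a cheap first signal of whether the change reduced the scan.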
Common Causes and Fixes:
- No Primary Key or Poorly Chosen Primary Key: If event_time isn’t a leading column of your ORDER BY clause (which defines the sorting key and, by default, the primary key for MergeTree engines), ClickHouse cannot narrow the scan to the date range and might read entire data parts.
  - Diagnosis: Check SHOW CREATE TABLE events. (DESCRIBE TABLE lists the columns but not the sorting key.) If event_time isn’t the first column of ORDER BY (or among the first few), this is likely the issue.
  - Fix: Recreate the table with event_time leading the sorting key. For example:

        -- Back up existing data first if necessary
        CREATE TABLE events_new (...) ENGINE = MergeTree()
        ORDER BY (event_time, ...); -- Add other columns as needed for sorting
        INSERT INTO events_new SELECT * FROM events;
        RENAME TABLE events TO events_old, events_new TO events;
        DROP TABLE events_old;

  - Why it works: The ORDER BY clause on a MergeTree table determines how rows are sorted on disk and what the sparse primary index covers. By placing event_time first, ClickHouse can use sparse primary index lookups to quickly find the relevant data blocks for the date range, drastically reducing I/O.
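- Wide Tables (Too Many Columns): If your events table has hundreds of columns, any query that references many of them (or uses SELECT *) reads far more data than it needs. ClickHouse reads only the columns a query references, so the cost grows with the columns you touch, not with the table’s total width.
  - Diagnosis: DESCRIBE TABLE events. Count the columns, and compare that with the columns your slow queries actually reference.
  - Fix: Create a materialized view or a new table with only the necessary columns. For this dashboard query, a pre-aggregated table keyed by date is enough:

        -- Target table holding one aggregate state per day
        CREATE TABLE events_minimal (event_date Date, event_count AggregateFunction(count))
        ENGINE = AggregatingMergeTree() ORDER BY event_date;
        -- The view writes partial counts into events_minimal on every insert into events
        CREATE MATERIALIZED VIEW events_mv TO events_minimal AS
        SELECT toDate(event_time) AS event_date, countState() AS event_count
        FROM events GROUP BY event_date;
        -- Then query the target table, merging the partial states
        SELECT event_date, countMerge(event_count) AS event_count
        FROM events_minimal GROUP BY event_date ORDER BY event_date;

    The view only sees rows inserted after it is created, so existing data has to be backfilled separately.
  - Why it works: ClickHouse stores data in columns. Selecting only a few columns means it only needs to read those specific columns from disk, not all of them, and a pre-aggregated table shrinks the number of rows to read as well.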
- Large Number of Small Data Parts: Frequent small inserts can lead to many data parts. Querying across many parts incurs overhead for opening and merging them.
  - Diagnosis: Count the active parts. If the count is in the thousands, this is a problem.

        SELECT count() FROM system.parts WHERE table = 'events' AND active;

  - Fix: Manually trigger a merge or wait for background merges. You can also adjust background_pool_size and background_merges_mutations_concurrency_ratio in config.xml if you have many tables needing merges.

        OPTIMIZE TABLE events FINAL; -- Use with caution on very large tables

  - Why it works: OPTIMIZE TABLE ... FINAL forces ClickHouse to merge the data parts of each partition into a single, larger part, reducing the overhead of managing and accessing numerous small files.
- Unnecessary ORDER BY at the End: The ORDER BY event_date at the end of the query can be expensive if the number of groups is large.
  - Diagnosis: The trace log shows a significant time spent in the "Sorting" operation after aggregation.
  - Fix: If the order doesn’t matter for the application consuming the data, remove the ORDER BY clause. If it does, ensure the ORDER BY column is already sorted by the primary key or that the aggregation can produce sorted output.

        -- Remove ORDER BY if not strictly necessary
        SELECT toDate(event_time) AS event_date, count() AS event_count
        FROM events
        WHERE event_time BETWEEN '2023-10-01 00:00:00' AND '2023-10-31 23:59:59'
        GROUP BY event_date;

  - Why it works: Sorting is an O(N log N) operation, where N is the number of rows left after aggregation. With one group per day the saving is small; it grows when the query produces many groups. If the results are already in the desired order (e.g., due to the primary key), this step is skipped.
- Inefficient Data Types or Functions: Using functions like toDate() on a large number of rows can be a bottleneck.
  - Diagnosis: The trace log shows significant time in the "Function" or "Transform" phase where toDate is applied.
  - Fix: Store event_time as a Date type if possible, or pre-calculate the date during ingestion.

        -- If event_time is already a DateTime, casting to Date is usually fast.
        -- If event_time is a String, it's much slower. Convert to DateTime or Date.
        -- Best: Store as Date/DateTime from the start.
        -- If not possible, consider a materialized view to pre-calculate dates.

  - Why it works: Applying functions on every row during query execution is costly. Pre-calculating or storing data in a format that doesn’t require runtime transformations is much faster.
- Insufficient Server Resources: While the above are query-specific, a general lack of CPU or I/O capacity will slow down all queries.
  - Diagnosis: Monitor system CPU, RAM, and disk I/O during query execution using tools like htop, iostat, or ClickHouse’s system.metrics table.
  - Fix: Scale up hardware or optimize other resource-intensive queries.
  - Why it works: The query is starved of the resources it needs to execute quickly.
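Several of the diagnoses above can be checked directly from ClickHouse’s system tables. The queries below are a sketch that assumes the events table lives in the current database; adjust the database filter if it does not.

```sql
-- Which expressions make up the partition key, sorting key, and primary key?
-- If event_time does not lead the sorting key, range filters on it cannot prune.
SELECT partition_key, sorting_key, primary_key
FROM system.tables
WHERE database = currentDatabase() AND name = 'events';

-- How many active parts does each partition hold, and how large are they?
-- Many small parts in one partition suggest inserts that are too small or too frequent.
SELECT
    partition,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = currentDatabase() AND table = 'events' AND active
GROUP BY partition
ORDER BY active_parts DESC;
```

Looking at parts per partition, rather than a single total, shows whether the problem is concentrated in a few recent partitions or spread across the table.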
After applying these fixes, you’ll find the trace logs show much shorter times for "ReadPart", "Filter", and "Sorting", and the overall query duration plummets.
The next challenge you’ll encounter is understanding how to use system.query_log for historical analysis of slow queries.
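As a starting point, here is a sketch of such a query. It assumes query logging is enabled (the log_queries setting) and uses a hypothetical threshold of one second to define "slow"; pick a value that fits your workload.

```sql
-- Flush in-memory log buffers so the most recent queries are visible.
SYSTEM FLUSH LOGS;

-- Ten slowest finished queries since yesterday, with how much they read.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read_size,
    formatReadableSize(memory_usage) AS peak_memory,
    substring(query, 1, 80) AS query_start
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= today() - 1
  AND query_duration_ms > 1000 -- hypothetical "slow" threshold, in milliseconds
ORDER BY query_duration_ms DESC
LIMIT 10;
```

A high read_rows value next to a small result is the same signal the trace log gave above: the query is scanning far more than it returns.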