ClickHouse’s query cache can save you a ton of CPU cycles by serving results from memory instead of re-executing identical queries.
Let’s see it in action. Imagine you have a table events with a few million rows, and you run a frequently accessed summary query:
SELECT
toDate(event_time) AS event_date,
count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-26 00:00:00' AND '2023-10-26 23:59:59'
GROUP BY event_date
ORDER BY event_date;
Running this query involves scanning a significant portion of the events table, which can be quite slow.
Now, let’s set up the query cache. Its capacity limits live in the server configuration (e.g., /etc/clickhouse-server/config.xml or a file in /etc/clickhouse-server/config.d/), in the <query_cache> section:
<clickhouse>
    <!-- ... other settings ... -->
    <query_cache>
        <max_size_in_bytes>10737418240</max_size_in_bytes> <!-- 10 GiB in total -->
        <max_entries>100000</max_entries>
        <max_entry_size_in_bytes>1048576</max_entry_size_in_bytes> <!-- 1 MiB per result -->
        <max_entry_size_in_rows>30000000</max_entry_size_in_rows>
    </query_cache>
    <!-- ... other settings ... -->
</clickhouse>
After restarting the ClickHouse server, the new limits are in effect. Note that the cache is opt-in: a SELECT reads from and writes to it only when the use_query_cache setting is enabled, either in a SETTINGS clause on the query or at the session or profile level. The first time you run the summary query with the setting on, ClickHouse executes it as usual, scans the data, and stores the result in the query cache.
The next time you run the same query with the setting on, ClickHouse detects a cache hit. Instead of reading and processing the data again, it returns the result from memory. You can verify this through the QueryCacheHits and QueryCacheMisses counters in system.events. The first execution shows significant CPU and I/O, while the cached execution is almost instantaneous, with minimal CPU usage.
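As a minimal sketch, assuming a ClickHouse release in which the query cache is opt-in through the use_query_cache setting, the summary query can be run twice and the hit counters checked afterwards. The events table is the hypothetical one from the example above. The counters in system.events are server-wide, so other cached traffic on the same server also moves them.

```sql
-- Run this statement twice: the first run should populate the cache,
-- the second run should be answered from it.
SELECT
    toDate(event_time) AS event_date,
    count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-26 00:00:00' AND '2023-10-26 23:59:59'
GROUP BY event_date
ORDER BY event_date
SETTINGS use_query_cache = true;

-- Server-wide hit and miss counters accumulated since startup.
SELECT event, value
FROM system.events
WHERE event IN ('QueryCacheHits', 'QueryCacheMisses');
```

If QueryCacheHits does not grow after the second run, check that the result fits within the per-entry size limits and that the query contains no non-deterministic functions such as now().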
The query cache is keyed on the parsed query (its abstract syntax tree, or AST), not on the raw text. If a new query’s key matches a cache entry that has not yet expired, the cached result is returned. Differences in whitespace or keyword capitalization do not affect the key, but the queries must otherwise be the same: a different literal, column, or function produces a different key and therefore a cache miss.
Be aware that the cache is not transactionally consistent. ClickHouse does not invalidate entries when the underlying table is modified (inserts, updates, deletes). Instead, each entry expires after a time-to-live, controlled by the query_cache_ttl setting (60 seconds by default), and the result is recomputed on the next run. Within that window a cached query can return slightly outdated results, so choose a TTL that matches how much staleness you can tolerate.
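How long a result may be served is set per query. The sketch below assumes the query_cache_ttl setting, expressed in seconds; the value 300 is an arbitrary illustration, not a recommendation.

```sql
-- Cache this result for up to 5 minutes instead of the default TTL.
SELECT
    toDate(event_time) AS event_date,
    count() AS event_count
FROM events
WHERE event_time BETWEEN '2023-10-26 00:00:00' AND '2023-10-26 23:59:59'
GROUP BY event_date
ORDER BY event_date
SETTINGS use_query_cache = true, query_cache_ttl = 300;
```

A longer TTL suits data that no longer changes, such as a closed day in the past. A shorter one suits ranges that are still receiving inserts.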
The key levers you control are:
max_size_in_bytes: The maximum total size of cached query results in bytes. Here, 10737418240 is 10 GiB (the default is 1 GiB). This prevents the cache from consuming all available system memory.
max_entries: The maximum number of distinct query results to store. 100000 means it will store up to 100,000 results (the default is 1024).
max_entry_size_in_bytes: The largest single result, in bytes, that will be cached. 1048576 (1 MiB) is the default. Larger results are still returned to the client but are not stored.
max_entry_size_in_rows: The same per-result limit expressed as a row count. 30000000 is the default.
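To see how much of these limits is actually in use, the cache contents can be inspected from SQL. This is a sketch that assumes the system.query_cache table and its query, result_size, stale, and expires_at columns as found in recent releases; older versions may expose a different set of columns.

```sql
-- Ten largest cached results, with their expiry time.
SELECT
    query,
    formatReadableSize(result_size) AS size,
    stale,
    expires_at
FROM system.query_cache
ORDER BY result_size DESC
LIMIT 10;
```

Comparing count() and sum(result_size) over this table against max_entries and max_size_in_bytes gives a rough idea of whether the configured limits are too tight or too generous.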
The query cache is scoped per replica. If you have multiple ClickHouse servers, each one maintains its own independent query cache. Changes to one replica’s cache do not propagate to others.
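One way to observe the per-server scope is to query every replica’s cache in one statement. The sketch below assumes a cluster named my_cluster (a hypothetical name; substitute one defined in your remote_servers configuration) and uses the clusterAllReplicas table function.

```sql
-- Number and total size of cached results on each server of the cluster.
SELECT
    hostName() AS host,
    count() AS cached_entries,
    formatReadableSize(sum(result_size)) AS cached_bytes
FROM clusterAllReplicas('my_cluster', system.query_cache)
GROUP BY host
ORDER BY host;
```

Servers that have not yet served a given query show no entry for it, which is why the first request after a load balancer switches replicas is typically a miss.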
A common pitfall is assuming the cache will work with queries that are semantically similar but not identical. For instance, if you cache the query with WHERE event_time BETWEEN '2023-10-26 00:00:00' AND '2023-10-26 23:59:59' and then ask for '2023-10-27', the second query is a cache miss because its literals, and therefore its cache key, are different. Each distinct set of literal values gets its own cache entry, so the cache pays off for queries that are repeated verbatim rather than for ad hoc variations.
Freshness is a trade-off you tune yourself. Inserts, mutations, and merges do not evict cached results, so a cached query can lag behind the table by up to its TTL. For high-write scenarios where fresh results are critical, lower query_cache_ttl, leave the cache off for those queries, or clear it explicitly with SYSTEM DROP QUERY CACHE after a load.
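Two ways of getting fresh results are sketched below, assuming the use_query_cache setting and the SYSTEM DROP QUERY CACHE statement. The five-minute window is only an illustration.

```sql
-- Option 1: bypass the cache for a single query that must see the latest rows.
SELECT count() AS recent_events
FROM events
WHERE event_time >= now() - INTERVAL 5 MINUTE
SETTINGS use_query_cache = false;

-- Option 2: discard every cached result on this server, e.g. after a bulk load.
SYSTEM DROP QUERY CACHE;
```

The second option affects all users of the server, so it is better suited to scheduled loads than to routine use.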
The next step is understanding how to selectively disable the query cache for specific queries, perhaps when you know a query will always produce a fresh result or when you want to force a re-computation for benchmarking.