ClickHouse dictionaries are not just for static lookups; they can significantly accelerate joins with large tables by acting as in-memory hash tables.
Let’s see how a dictionary lookup can speed up joining a large events table with a smaller users table.
Imagine our events table has billions of rows, and we want to enrich each event with user details from the users table. A standard JOIN could be prohibitively slow.
-- Standard JOIN (potentially slow)
SELECT
    e.event_id,
    e.event_type,
    u.user_name,
    u.country
FROM events AS e
JOIN users AS u ON e.user_id = u.user_id
WHERE e.event_date = '2023-10-27';
Instead, we can create a ClickHouse dictionary based on the users table. This dictionary will load into memory as a hash map, allowing for O(1) average-case lookups.
One option is to define the dictionary in an XML configuration file (e.g., /etc/clickhouse-server/dictionary_sources.xml; the file name must match the pattern in the server's dictionaries_config setting):
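<dictionaries>
    <dictionary>
        <name>users_dict</name>
        <!-- Load from a ClickHouse table; the connection details mirror the DDL example -->
        <source>
            <clickhouse>
                <host>localhost</host>
                <port>9000</port>
                <user>default</user>
                <password></password>
                <db>your_db</db>
                <table>users</table>
            </clickhouse>
        </source>
        <lifetime>
            <min>300</min>
            <max>3600</max>
        </lifetime>
        <!-- In-memory hash table keyed by a single UInt64 -->
        <layout>
            <hashed/>
        </layout>
        <structure>
            <!-- The lookup key; the hashed layout requires a UInt64 key -->
            <id>
                <name>user_id</name>
            </id>
            <attribute>
                <name>user_name</name>
                <type>String</type>
                <!-- Value returned for missing keys (empty string) -->
                <null_value></null_value>
            </attribute>
            <attribute>
                <name>country</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>
    </dictionary>
</dictionaries>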
Alternatively, create the same dictionary with a DDL statement, in which case no XML file is needed (use one method or the other, not both):
CREATE DICTIONARY users_dict
(
    user_id UInt64,
    user_name String,
    country String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD '' DB 'your_db' TABLE 'users'))
LIFETIME(MIN 300 MAX 3600)
LAYOUT(HASHED());
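Before relying on the dictionary in queries, it can help to confirm that it loads and returns sensible values. The following is a minimal smoke test, assuming the dictionary was created in the current database; the key value 42 is an arbitrary placeholder, not a value taken from the users table.

```sql
-- Force an immediate load instead of waiting for the first lookup
SYSTEM RELOAD DICTIONARY users_dict;

-- A dictionary can be read like a table, which is handy for spot checks
SELECT * FROM users_dict LIMIT 5;

-- Look up a single key (42 is a placeholder; the hashed layout expects a UInt64 key)
SELECT dictGet('users_dict', 'user_name', toUInt64(42)) AS user_name;
```

If the source is unreachable or the structure does not match the table, the reload statement should fail with an error rather than silently returning defaults.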
Now, rewrite the query to use the dictionary for lookup:
-- Query using dictionary lookup
SELECT
    e.event_id,
    e.event_type,
    dictGet('users_dict', 'user_name', e.user_id) AS user_name,
    dictGet('users_dict', 'country', e.user_id) AS country
FROM events AS e
WHERE e.event_date = '2023-10-27';
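The dictGet('dictionary_name', 'attribute_name', key) function performs the lookup. It takes the dictionary name and the attribute to retrieve, both as string literals, plus the key value from the events table. If the key is not in the dictionary, dictGet returns the attribute's default value (null_value in XML, DEFAULT in DDL) instead of failing.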
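One behavioral difference from the original query is worth spelling out: an inner JOIN drops events whose user_id has no matching user, while a dictionary lookup keeps every event. The sketch below shows one way to restore inner-join semantics with dictHas, and to supply a per-query fallback with dictGetOrDefault; the fallback string 'unknown' is an arbitrary choice for illustration.

```sql
SELECT
    e.event_id,
    e.event_type,
    dictGet('users_dict', 'user_name', e.user_id) AS user_name,
    -- Fallback value used only when the key is missing from the dictionary
    dictGetOrDefault('users_dict', 'country', e.user_id, 'unknown') AS country
FROM events AS e
WHERE e.event_date = '2023-10-27'
  -- Keep only events whose user exists, as an inner JOIN would
  AND dictHas('users_dict', e.user_id);
```

Dropping the dictHas condition gives LEFT JOIN-like behavior, where unmatched events are kept with default attribute values.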
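This approach replaces the per-query work of a JOIN (reading the users table and building a hash table from it every time the query runs) with lookups into a hash table that is already resident in memory, which can reduce query latency substantially. The dictionary is refreshed periodically according to the LIFETIME settings, so no manual intervention is needed, but lookups can return data that is stale by up to the configured maximum.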
The layout determines how the dictionary is stored in memory. The hashed layout keeps every key and attribute in an in-memory hash table, and there is no bucket_size setting to tune: ClickHouse sizes the table automatically as the dictionary loads. The practical constraint is memory, since the entire users table (keys plus the selected attributes) must fit in RAM on every server that loads the dictionary. The hashed layout also requires a single UInt64 key, which user_id satisfies here.
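To see how much memory a loaded dictionary actually uses, the system.dictionaries table can be queried. This is a small monitoring sketch that assumes the dictionary name users_dict from the examples above.

```sql
SELECT
    name,
    status,                                        -- e.g. LOADED or FAILED
    element_count,                                 -- number of stored elements
    formatReadableSize(bytes_allocated) AS memory, -- RAM held by the dictionary
    load_factor,                                   -- fill ratio of the hash table
    last_exception                                 -- non-empty if the last load failed
FROM system.dictionaries
WHERE name = 'users_dict';
```

Checking bytes_allocated against available RAM before rolling a dictionary out to every server is a cheap way to avoid surprises.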
The lifetime settings (min and max) control how often ClickHouse refreshes the dictionary from the source. After each load, ClickHouse picks a random delay, uniformly distributed between min and max seconds, before the next update; the randomization spreads the load on the source when many servers refresh the same dictionary. With the values above, the dictionary can lag the users table by up to 3600 seconds.
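Two related tools can make refreshes cheaper or more immediate. The sketch below assumes a hypothetical updated_at column on the users table (it does not appear in the original schema): with an invalidate query configured, ClickHouse runs that query when a refresh is due and reloads the dictionary only if the result has changed.

```sql
-- Variant of the dictionary with an invalidate query.
-- updated_at is a hypothetical column used only for illustration.
CREATE OR REPLACE DICTIONARY users_dict
(
    user_id UInt64,
    user_name String,
    country String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(
    HOST 'localhost' PORT 9000 USER 'default' PASSWORD ''
    DB 'your_db' TABLE 'users'
    INVALIDATE_QUERY 'SELECT max(updated_at) FROM your_db.users'
))
LIFETIME(MIN 300 MAX 3600)
LAYOUT(HASHED());

-- Force a refresh right away, for example after a bulk update of users
SYSTEM RELOAD DICTIONARY users_dict;
```

The manual reload is useful when a known change must be visible immediately rather than at the next scheduled refresh.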
ClickHouse’s dictionary mechanism is particularly effective when joining a fact table (like events) with several dimension tables that fit in memory and are static or change infrequently. Instead of chaining multiple JOINs, you can load the key dimensions as dictionaries and perform lookups, avoiding the cost of rebuilding join hash tables on every query.
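As an illustration of enriching one fact table from several dimensions, the sketch below assumes a second dictionary, products_dict, with a category attribute, and a product_id column on events. Both are hypothetical and are not defined anywhere in this article. It also fetches two attributes from users_dict in a single call by passing a tuple of attribute names, which returns a tuple.

```sql
SELECT
    e.event_id,
    -- One lookup returning both user attributes as a tuple
    dictGet('users_dict', ('user_name', 'country'), e.user_id) AS user_info,
    user_info.1 AS user_name,
    user_info.2 AS country,
    -- products_dict and product_id are hypothetical
    dictGet('products_dict', 'category', e.product_id) AS category
FROM events AS e
WHERE e.event_date = '2023-10-27';
```

Each additional dimension adds one lookup per row instead of another JOIN stage.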
One subtle but powerful aspect is that dictionaries support composite keys, that is, a primary key made up of several columns. This requires a complex-key layout such as complex_key_hashed, and the key is passed to dictGet as a tuple. It is useful when user_id alone does not uniquely identify a record in the source table.
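A composite-key dictionary might look like the following sketch. The table user_profiles and the column tenant_id are hypothetical, chosen to illustrate a case where the same user_id can appear under several tenants.

```sql
CREATE DICTIONARY user_profiles_dict
(
    tenant_id UInt32,
    user_id UInt64,
    user_name String
)
PRIMARY KEY tenant_id, user_id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD '' DB 'your_db' TABLE 'user_profiles'))
LIFETIME(MIN 300 MAX 3600)
-- Composite keys need a complex_key_* layout
LAYOUT(COMPLEX_KEY_HASHED());

-- The key is passed as a tuple, in the same order as PRIMARY KEY
SELECT
    e.event_id,
    dictGet('user_profiles_dict', 'user_name', (e.tenant_id, e.user_id)) AS user_name
FROM events AS e
WHERE e.event_date = '2023-10-27';
```

The tuple's element types should match the key column types declared in the dictionary.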
The next logical step is to explore other dictionary sources. A dictionary can load data from outside ClickHouse itself, for example from local files, an HTTP endpoint, MySQL, PostgreSQL, or Redis, further expanding the use cases for high-speed lookups.
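As a small taste of a non-ClickHouse source, here is a sketch of the same dictionary loaded from a tab-separated file. The dictionary name users_file_dict and the file users.tsv are assumptions for illustration; for dictionaries created with DDL, the file is expected to live under the server's user_files directory.

```sql
CREATE DICTIONARY users_file_dict
(
    user_id UInt64,
    user_name String,
    country String
)
PRIMARY KEY user_id
-- Hypothetical file with columns user_id, user_name, country
SOURCE(FILE(path './user_files/users.tsv' format 'TabSeparated'))
LIFETIME(MIN 300 MAX 3600)
LAYOUT(HASHED());
```

Queries use it exactly like the table-backed version, with 'users_file_dict' as the first argument to dictGet.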