ARRAY JOIN lets you expand array elements into separate rows, essentially "unnesting" them.
Imagine you have a table of user activity logs, where each log entry has an array of tags associated with it.
CREATE TABLE user_activity (
user_id UInt32,
event_time DateTime,
tags Array(String)
) ENGINE = Memory;
INSERT INTO user_activity VALUES
(1, '2023-10-27 10:00:00', ['login', 'page_view', 'recommendation']),
(2, '2023-10-27 10:05:00', ['page_view', 'add_to_cart']),
(3, '2023-10-27 10:10:00', ['login']);
Without ARRAY JOIN, querying for specific tags requires LIKE or array functions that can be less efficient.
SELECT user_id, event_time
FROM user_activity
WHERE 'add_to_cart' IN tags;
This query will return:
| user_id | event_time |
|---|---|
| 2 | 2023-10-27 10:05:00 |
Now, let’s use ARRAY JOIN to see each tag on its own row, paired with the original user_id and event_time.
SELECT
user_id,
event_time,
tag
FROM user_activity
ARRAY JOIN tags AS tag;
This produces:
| user_id | event_time | tag |
|---|---|---|
| 1 | 2023-10-27 10:00:00 | login |
| 1 | 2023-10-27 10:00:00 | page_view |
| 1 | 2023-10-27 10:00:00 | recommendation |
| 2 | 2023-10-27 10:05:00 | page_view |
| 2 | 2023-10-27 10:05:00 | add_to_cart |
| 3 | 2023-10-27 10:10:00 | login |
This transformation is incredibly powerful for analysis. You can now easily filter, group, and aggregate based on individual array elements as if they were separate records. For instance, counting how many times each tag appears across all events:
SELECT
tag,
count() AS tag_count
FROM user_activity
ARRAY JOIN tags AS tag
GROUP BY tag
ORDER BY tag_count DESC;
This yields:
| tag | tag_count |
|---|---|
| login | 2 |
| page_view | 2 |
| recommendation | 1 |
| add_to_cart | 1 |
The core mechanism of ARRAY JOIN is that it effectively performs a CROSS JOIN between the original row and each element of the specified array. If an array has N elements, the original row will be duplicated N times, with each duplicate paired with one element from the array. If an array is empty, the row is discarded by default.
You can also ARRAY JOIN multiple arrays simultaneously. If you have Array(String) arr1 and Array(Int32) arr2, ARRAY JOIN arr1, arr2 will produce rows for every combination of elements from arr1 and arr2 present at the same index. If the arrays have different lengths, the join stops at the length of the shortest array.
The ARRAY JOIN clause is processed after the FROM and WHERE clauses but before GROUP BY, ORDER BY, and SELECT. This means that any filters applied in the WHERE clause will operate on the original rows before they are expanded by ARRAY JOIN. If you want to filter based on the unnested elements, you need to perform that filtering in the SELECT list or in a subquery after the ARRAY JOIN.
Consider the case where you want to count users who performed a specific sequence of actions. If you have actions Array(String), you might think ARRAY JOIN actions and then filtering would work. However, if a user has ['login', 'view_profile', 'logout'] and you filter for actions = 'view_profile', the ARRAY JOIN happens first, and then the WHERE clause filters those expanded rows.
SELECT user_id, actions
FROM user_activity
ARRAY JOIN actions AS action
WHERE action = 'view_profile';
This query might seem like it will give you rows where action is view_profile. But the ARRAY JOIN happens first, expanding all actions, and then the WHERE clause filters these expanded rows. The actions column in the SELECT will still contain the original array for that user_id.
To get the specific unnested action, you need to select the alias:
SELECT
user_id,
action
FROM user_activity
ARRAY JOIN tags AS action
WHERE action = 'add_to_cart';
This correctly returns:
| user_id | action |
|---|---|
| 2 | add_to_cart |
If you need to keep the original row intact while also having access to the unnested elements for filtering, you can use a subquery or a LEFT ARRAY JOIN. A LEFT ARRAY JOIN is crucial when you want to preserve rows that have empty arrays. By default, ARRAY JOIN discards rows where the array to be joined is empty.
SELECT
user_id,
event_time,
tag
FROM user_activity
LEFT ARRAY JOIN tags AS tag;
This query would include a user with an empty tags array, showing NULL for the tag column.
The performance implications of ARRAY JOIN are significant. Expanding large arrays can lead to a massive increase in the number of rows processed, potentially overwhelming memory and CPU. It’s often more efficient to use array functions like indexOf or has for simple checks if you don’t need to process each element individually. However, for aggregations and complex analyses on individual array elements, ARRAY JOIN is indispensable.
The next challenge is often dealing with nested structures within arrays or performing operations that require keeping track of the original array’s index.