BigQuery treats nested and repeated data not as a special case, but as a fundamental building block for its schema design.
Let’s see it in action with a common scenario: tracking user activity on a website. A user might perform multiple actions, and each action can have various properties.
{
"user_id": "user123",
"timestamp": "2023-10-27T10:00:00Z",
"actions": [
{
"type": "page_view",
"details": {
"page_url": "/home",
"referrer": "google.com"
}
},
{
"type": "click",
"details": {
"element_id": "buy_button",
"element_text": "Add to Cart"
}
},
{
"type": "page_view",
"details": {
"page_url": "/products/widget",
"referrer": "/home"
}
}
]
}
In BigQuery, this would be represented by a table with a user_id (string), timestamp (timestamp), and an actions field. The actions field would be of type RECORD (also known as STRUCT) and REPEATED. This means actions is a list of records, where each record represents a single action.
The details field within each action is itself a RECORD, allowing us to group related properties like page_url and referrer together.
Here’s how you’d create such a table in BigQuery:
CREATE TABLE your_dataset.user_activity (
user_id STRING,
timestamp TIMESTAMP,
actions ARRAY<STRUCT<
type STRING,
details STRUCT<
page_url STRING,
referrer STRING,
element_id STRING,
element_text STRING
>
>>
);
When you query this data, BigQuery’s SQL dialect (Standard SQL) uses UNNEST to deconstruct the repeated fields. If you wanted to find all 'click' actions and the user_id associated with them:
SELECT
ua.user_id,
action.type,
action.details.element_id
FROM
your_dataset.user_activity AS ua,
UNNEST(ua.actions) AS action
WHERE
action.type = 'click';
This query first flattens the actions array, creating a new row for each action within a user’s event. Then, it filters those flattened rows to only include 'click' events. The ua.user_id is repeated for each action from the same user, which is why you might see the same user_id multiple times in the result if a user had multiple clicks. If you wanted unique users who clicked, you’d add DISTINCT ua.user_id.
The core problem this solves is representing complex, hierarchical, and variable-length data within a flat, columnar store like BigQuery. Traditional relational databases would often require multiple tables with foreign keys to represent this kind of relationship (e.g., a users table, an actions table, and a action_details table). This leads to complex joins, which can be slow and expensive. BigQuery’s nested and repeated fields allow you to denormalize your data into a single table, drastically simplifying queries and improving performance for analytical workloads. It’s particularly powerful for semi-structured data sources like JSON logs, API responses, or event streams.
The ability to query nested fields directly, without explicit joins, is what makes this so efficient. For instance, action.details.element_id directly accesses a field within a nested structure. BigQuery’s query optimizer is designed to efficiently traverse these structures.
When dealing with repeated fields, it’s crucial to understand that each element in the ARRAY is treated as a separate entity once UNNESTed. This means that if you have a filter condition that applies to one element of a repeated field, it won’t affect other elements of the same repeated field within the same row unless you explicitly link them. For example, if you wanted to find users who had both a 'page_view' and a 'click' action, you’d need to UNNEST the actions array twice, aliasing them differently, and then join them back on user_id and timestamp (or a unique event ID if you had one) to ensure you’re correlating actions from the same event.
The next logical step is to explore how to aggregate data within repeated fields, such as counting the number of actions of a specific type for each user.