BigQuery’s nested data structures are so powerful because they allow you to model complex, hierarchical data directly within a relational database, blurring the lines between NoSQL and SQL.
Let’s see this in action. Imagine you’re tracking user activity on a website. Each user might perform multiple actions, and each action could have associated details. A traditional relational approach might involve separate tables for users and actions, linked by foreign keys. With BigQuery’s nested structures, we can represent this more elegantly.
Here’s a BigQuery schema definition for such a scenario:
[
{
"name": "user_id",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "user_info",
"type": "RECORD",
"mode": "NULLABLE",
"fields": [
{"name": "name", "type": "STRING"},
{"name": "email", "type": "STRING"}
]
},
{
"name": "events",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{"name": "timestamp", "type": "TIMESTAMP"},
{"name": "event_type", "type": "STRING"},
{"name": "details", "type": "RECORD",
"fields": [
{"name": "page_url", "type": "STRING"},
{"name": "button_clicked", "type": "STRING"}
]
}
]
}
]
In this schema:
user_idis a simple string.user_infois aRECORD(struct) containingnameandemail.eventsis aREPEATED RECORD(an array of structs). Each element in this array represents a single event performed by the user.- Within each
event, there’s another nestedRECORDcalleddetails, which can hold specific information about that event, like thepage_urlorbutton_clicked.
Now, let’s populate this with some sample data and query it.
Sample Data Insertion (using bq load or INSERT statements):
INSERT INTO your_dataset.user_activity (user_id, user_info, events) VALUES
(
'user123',
STRUCT('Alice Smith' AS name, 'alice@example.com' AS email),
[
STRUCT(TIMESTAMP '2023-10-27 10:00:00 UTC' AS timestamp, 'page_view' AS event_type, STRUCT('homepage' AS page_url, NULL AS button_clicked) AS details),
STRUCT(TIMESTAMP '2023-10-27 10:05:15 UTC' AS timestamp, 'click' AS event_type, STRUCT('homepage' AS page_url, 'signup_button' AS button_clicked) AS details)
]
),
(
'user456',
STRUCT('Bob Johnson' AS name, 'bob@example.com' AS email),
[
STRUCT(TIMESTAMP '2023-10-27 11:15:30 UTC' AS timestamp, 'page_view' AS event_type, STRUCT('/products' AS page_url, NULL AS button_clicked) AS details)
]
);
Querying Nested Data:
To access fields within a RECORD, you use dot notation. For REPEATED fields, you use UNNEST to flatten the array into rows.
Let’s find all 'click' events and the user who performed them:
SELECT
ua.user_id,
ua.user_info.name,
event.timestamp,
event.event_type,
event.details.button_clicked
FROM
your_dataset.user_activity AS ua,
UNNEST(ua.events) AS event
WHERE
event.event_type = 'click'
AND event.details.button_clicked IS NOT NULL;
This query first selects from user_activity aliased as ua. Then, it UNNESTs the events array, making each event an individual row associated with its parent user_id. We can then access ua.user_info.name and event.details.button_clicked directly.
The result would look something like this:
| user_id | name | timestamp | event_type | button_clicked |
|---|---|---|---|---|
| user123 | Alice Smith | 2023-10-27 10:05:15.000000+00 | click | signup_button |
You can also query specific nested fields within the user_info struct without unnesting:
SELECT
user_info.name,
user_info.email
FROM
your_dataset.user_activity
WHERE
user_id = 'user123';
This would return:
| name | |
|---|---|
| Alice Smith | alice@example.com |
The real power comes when you start aggregating data across these nested structures. For instance, to count how many events each user performed:
SELECT
ua.user_id,
ARRAY_LENGTH(ua.events) AS event_count
FROM
your_dataset.user_activity AS ua;
Or, to count the number of page views per user:
SELECT
ua.user_id,
COUNT(event.event_type) AS page_view_count
FROM
your_dataset.user_activity AS ua,
UNNEST(ua.events) AS event
WHERE
event.event_type = 'page_view'
GROUP BY
ua.user_id;
The mental model to keep in mind is that RECORDs (structs) are like objects or dictionaries, and REPEATED RECORDs are like lists or arrays of those objects. BigQuery handles the storage and retrieval efficiently, but when you need to operate on individual elements of an array, UNNEST is your primary tool for bringing them into the relational context of rows.
When you construct a STRUCT or an array of STRUCTs in your INSERT statements or SELECT queries, you are explicitly defining these nested relationships. The STRUCT(value AS field_name, ...) syntax is key for creating the inner objects, and [STRUCT(...), STRUCT(...)] is for creating the arrays of these objects.
A common pitfall is forgetting to UNNEST when you want to filter or aggregate based on data within a repeated field. Treating a REPEATED field as a single value will lead to unexpected results or errors, as BigQuery expects you to explicitly expand the array into individual records for processing.
The next step in mastering nested data is often handling complex filtering and joins involving these structures, or exploring functions like ARRAY_AGG to re-aggregate data back into nested formats.