BigQuery tables don’t actually store data in partitions or clusters; those are just query-time optimizations that tell BigQuery how to find your data faster.

Let’s see it in action. Imagine a table of website access logs, with a timestamp column.

-- Creating a partitioned table
CREATE TABLE your_dataset.access_logs_partitioned
PARTITION BY DATE(timestamp)
AS SELECT * FROM your_dataset.access_logs_raw;

-- Querying a specific day
SELECT COUNT(*)
FROM your_dataset.access_logs_partitioned
WHERE DATE(timestamp) = '2023-10-27';

When you query access_logs_partitioned filtering by DATE(timestamp), BigQuery only scans data from the partition corresponding to '2023-10-27', ignoring all other dates. This drastically reduces the amount of data scanned, saving you time and money.

Now, consider a table of customer orders, with customer_id and order_date. We can both partition by date and cluster by customer ID.

-- Creating a partitioned and clustered table
CREATE TABLE your_dataset.orders_partitioned_clustered
PARTITION BY DATE(order_date)
CLUSTER BY customer_id
AS SELECT * FROM your_dataset.orders_raw;

-- Querying orders for a specific customer on a specific day
SELECT SUM(order_total)
FROM your_dataset.orders_partitioned_clustered
WHERE DATE(order_date) = '2023-10-27'
  AND customer_id = 'cust_12345';

Here, BigQuery first narrows down to the 2023-10-27 partition. Within that partition, it then uses the clustering information to quickly find all rows associated with cust_12345. This is like having a filing cabinet that’s already organized by date, and then within each date’s folder, the documents are sorted by customer ID.

The core problem both partitioning and clustering solve is reducing the amount of data BigQuery needs to read to answer your query. BigQuery’s cost and performance are directly tied to the bytes scanned. Without these optimizations, a query on a multi-terabyte table could scan the entire table, taking minutes (or longer) and costing a fortune.

Partitioning divides your table into segments based on a specific column’s value (like date, integer range, or ingestion time). When you filter your query using the partitioning column, BigQuery prunes (or "skips") the partitions that don’t match your filter. The most common partitioning is by date, using DATE(timestamp_column). For tables with billions of rows, partitioning by an integer range can also be effective, e.g., PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 1000000000, 10000000)).

Clustering, on the other hand, sorts the data within each partition (or the entire table if not partitioned) based on the values in one or more columns. When your query filters on the clustered columns, BigQuery can efficiently locate the relevant data blocks without scanning unrelated ones. You can cluster by up to four columns. For example, CLUSTER BY country_code, city. This means data for the same country_code and city will be stored together.

The key distinction is how they reduce scans:

  • Partitioning: Divides data into discrete chunks. Filters on the partitioning column allow BigQuery to skip entire chunks.
  • Clustering: Sorts data within a chunk (or the whole table). Filters on clustered columns allow BigQuery to quickly find relevant data blocks within the relevant chunk(s).

Choosing the right partitioning and clustering strategy depends on your query patterns.

  • Partitioning: Best for columns with high cardinality that you frequently filter on in WHERE clauses, especially date/time columns. If you query data from specific date ranges, partition by date. If you query by specific sequential IDs, consider integer range partitioning.
  • Clustering: Best for columns that are frequently used in WHERE clauses (especially alongside partitioning columns) or in JOIN conditions. If you often join tables on customer_id, clustering by customer_id will speed up those joins. If you often filter by country_code and then by user_type, cluster by country_code, user_type.

The actual data isn’t moved into separate physical files when you partition or cluster; BigQuery maintains metadata about how the data is organized. When you run a query, it consults this metadata to determine which data blocks are relevant. This organization is maintained automatically as new data is ingested.

Most people think of partitioning and clustering as ways to select data, but they also significantly impact the performance of aggregations. If you GROUP BY a clustered column, BigQuery can often perform the aggregation much faster because all the values to be grouped are already co-located. For example, if you cluster by product_id and then run SELECT product_id, SUM(sales) FROM ... GROUP BY product_id, BigQuery can do much of the summing locally within data blocks before combining results.

The next thing you’ll grapple with is how to handle schema evolution and the optimal number of partitions.

Want structured learning?

Take the full Bigquery course →