BigQuery costs are a function of data scanned, not data stored, and it’s shockingly easy to burn money scanning entire massive tables when you only need a sliver.

Let’s see BigQuery in action, but not with a toy example. Imagine a real-world scenario: analyzing user activity logs for a popular e-commerce site. We’ve got a table, user_activity, with billions of rows, each representing an event like page_view, add_to_cart, or purchase.

-- This query looks innocent enough...
SELECT
  event_timestamp,
  user_id,
  event_type,
  product_id
FROM
  `my-project.my_dataset.user_activity`
WHERE
  event_type = 'purchase'
  AND DATE(event_timestamp) BETWEEN '2023-10-01' AND '2023-10-31'
ORDER BY
  event_timestamp DESC
LIMIT 100;

If you run this without any optimization, BigQuery might scan trillions of bytes. That’s because, by default, it has to read through the entire user_activity table to find the rows matching your WHERE clause, even though most of those rows are irrelevant to this specific query. The LIMIT 100 only applies after all the data is scanned and processed.

The core problem BigQuery solves is making massive datasets queryable interactively. It achieves this through a distributed, columnar storage and processing architecture. When you query, it doesn’t read whole rows like a traditional database. Instead, it reads only the columns you request (event_timestamp, user_id, etc.) from disk. This is a massive win for performance and cost, but only if the amount of data read from those columns is minimized.

Here are 10 ways to keep those costs down without crippling your query speed:

  1. Partition Your Tables: This is the single biggest lever. Partitioning splits your table into smaller segments based on a column’s value, typically a date or timestamp. When you query a partitioned table with a filter on the partitioning column, BigQuery only scans the relevant partitions.

    • Diagnosis: Check your table schema. If you don’t see a partitioning column defined, you’re not partitioned.
    • Fix: Recreate your table with partitioning. For daily partitioning on a timestamp column named event_timestamp:
      CREATE TABLE `my-project.my_dataset.user_activity_partitioned`
      PARTITION BY DATE(event_timestamp) AS
      SELECT * FROM `my-project.my_dataset.user_activity`;
      
      Then, update your queries to use this new table.
    • Why it works: BigQuery prunes (skips) entire partitions that don’t match the WHERE clause, drastically reducing data scanned.
  2. Cluster Your Tables: Clustering sorts data within partitions (or the whole table if unpartitioned) based on one or more columns. This co-locates similar data, allowing BigQuery to efficiently scan only the necessary blocks of data when filtering on clustered columns.

    • Diagnosis: Again, check your table schema. Look for "Clustering columns."
    • Fix: Add clustering to your partitioned table. If you frequently filter user_id and event_type:
      CREATE TABLE `my-project.my_dataset.user_activity_partitioned_clustered`
      PARTITION BY DATE(event_timestamp)
      CLUSTER BY user_id, event_type AS
      SELECT * FROM `my-project.my_dataset.user_activity_partitioned`;
      
    • Why it works: Clustering improves filter efficiency for columns within a partition, and can also speed up queries that don’t filter on the partitioning column but do filter on clustered columns.
  3. Use SELECT * Sparingly: Even with partitioning and clustering, SELECT * forces BigQuery to read all columns for the matching rows. If you only need a few columns, explicitly list them.

    • Diagnosis: Look for SELECT * in your queries.
    • Fix: Replace * with the specific columns you need.
      SELECT
        event_timestamp,
        user_id,
        event_type,
        product_id
      FROM ... -- (your partitioned/clustered table)
      
    • Why it works: BigQuery is columnar. It only reads the data for the columns you ask for. Less columns = less data scanned.
  4. Leverage WHERE Clauses Aggressively: The more specific your WHERE clause, the less data BigQuery needs to scan. Ensure your filters are applied as early as possible in the query.

    • Diagnosis: Review your query logic. Are there implicit filters or conditions that could be made explicit?
    • Fix: Add WHERE clauses for any relevant criteria. For example, if you only care about mobile users:
      SELECT ...
      FROM ...
      WHERE
        event_type = 'purchase'
        AND DATE(event_timestamp) BETWEEN '2023-10-01' AND '2023-10-31'
        AND device_type = 'mobile' -- Added filter
      
    • Why it works: More granular filters reduce the initial set of rows that BigQuery needs to consider, even before column pruning.
  5. Use DATE() or TIMESTAMP() Functions Wisely: While convenient, applying functions to partitioning or clustering columns in your WHERE clause can prevent BigQuery from using those optimizations.

    • Diagnosis: Notice DATE(event_timestamp) in the WHERE clause in the example. This can be a performance killer on a partitioned table where event_timestamp is the partitioning column.
    • Fix: For daily partitioning on a timestamp column, filter directly on a date literal:
      SELECT ...
      FROM `my-project.my_dataset.user_activity_partitioned`
      WHERE
        event_type = 'purchase'
        AND event_timestamp >= '2023-10-01 00:00:00 UTC' -- Use timestamp literals
        AND event_timestamp < '2023-11-01 00:00:00 UTC' -- Use timestamp literals
      
      If you are daily partitioning, you can even use a date literal if your partition is DATE(timestamp_col):
      SELECT ...
      FROM `my-project.my_dataset.user_activity_partitioned`
      WHERE
        event_type = 'purchase'
        AND DATE(event_timestamp) BETWEEN '2023-10-01' AND '2023-10-31'
      
      (Self-correction: The original example was using DATE(event_timestamp) BETWEEN ... which is fine for daily partitioning. The key is to avoid applying the function if the partition column is not the partitioning column, or if you’re trying to filter on a different precision than the partition. The most efficient way is often to filter directly on the partition column’s type if possible.)
    • Why it works: Applying functions to the column in the WHERE clause can force BigQuery to scan all data and then apply the function, rather than using the partition/cluster metadata directly. Filtering on the raw column type, where possible, allows BigQuery to leverage its indexing and pruning capabilities.
  6. Utilize Table Wildcards for Related Tables: If you have time-series data split into daily tables (e.g., user_activity_20231001, user_activity_20231002), use table wildcards. BigQuery can then prune entire tables from the scan.

    • Diagnosis: Queries that UNION ALL many tables or use explicit OR conditions on table names.
    • Fix: Use a wildcard.
      SELECT
        event_timestamp,
        user_id,
        event_type
      FROM
        `my-project.my_dataset.user_activity_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20231001' AND '20231031' -- Prunes tables
        AND event_type = 'purchase';
      
    • Why it works: BigQuery uses the _TABLE_SUFFIX (or other wildcard conditions) to identify and scan only the tables that match the criteria, not all tables matching the pattern.
  7. Avoid ORDER BY Without LIMIT: ORDER BY without a LIMIT clause forces BigQuery to sort all the results before returning them, which can be very expensive.

    • Diagnosis: Queries with ORDER BY but no LIMIT.
    • Fix: Add a LIMIT clause if you only need a subset of sorted data. If you need all sorted data, consider if it’s truly necessary or if you can process it in a downstream system.
      SELECT ...
      FROM ...
      WHERE ...
      ORDER BY event_timestamp DESC
      LIMIT 100; -- Added LIMIT
      
    • Why it works: LIMIT tells BigQuery it can stop processing once it has found enough rows, rather than needing to scan and sort the entire dataset.
  8. Use Approximate Aggregations When Precision Isn’t Critical: For functions like COUNT(DISTINCT ...), BigQuery offers approximate versions that are significantly faster and cheaper.

    • Diagnosis: Queries using COUNT(DISTINCT col).
    • Fix: Use APPROX_COUNT_DISTINCT(col).
      SELECT
        APPROX_COUNT_DISTINCT(user_id) AS distinct_users
      FROM
        `my-project.my_dataset.user_activity`
      WHERE
        DATE(event_timestamp) = '2023-10-26';
      
    • Why it works: Approximate functions use probabilistic data structures (like HyperLogLog++) that trade a small margin of error for massive performance and cost gains.
  9. Denormalize and Flatten Data: While sometimes counterintuitive, for analytical workloads, denormalizing (embedding related data) and flattening (removing nested structures) can be more cost-effective than joining multiple tables.

    • Diagnosis: Queries that involve frequent, complex joins across many tables.
    • Fix: Design your fact tables to include necessary dimension attributes directly. For example, instead of joining purchases to a products table, include product_name and product_category directly in the purchases table.
    • Why it works: Joins in BigQuery can be expensive as they involve shuffling data. Having data already co-located in a single, wider table reduces scan and shuffle costs.
  10. Query Smaller, Aggregated Tables First: If you need high-level summaries, build and query aggregated tables instead of running complex aggregations on raw data every time.

    • Diagnosis: Frequent queries that aggregate large amounts of raw data for reporting.
    • Fix: Create materialized views or scheduled query jobs that pre-aggregate data into summary tables.
      -- Example of a scheduled query to create a daily summary
      CREATE OR REPLACE TABLE `my-project.my_dataset.daily_sales_summary`
      PARTITION BY sales_date AS
      SELECT
        DATE(event_timestamp) AS sales_date,
        COUNT(DISTINCT user_id) AS distinct_buyers,
        SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS total_purchases,
        SUM(purchase_amount) AS total_revenue
      FROM
        `my-project.my_dataset.user_activity`
      WHERE
        event_type = 'purchase'
      GROUP BY 1;
      
      Then query daily_sales_summary for your reports.
    • Why it works: You’re scanning and processing data once to create the aggregate, and then subsequent queries scan only the much smaller aggregate table.

The next error you’ll hit is likely a "Query exceeding the maximum allowed execution time" or a "Resources exceeded during query execution" if you try to scan too much data without proper partitioning or filtering.

Want structured learning?

Take the full Bigquery course →