BigQuery can query data directly from Google Cloud Storage (GCS) files without needing to load it first, and it’s way more performant than you’d expect.
Let’s see it in action. Imagine you have a GCS bucket with some CSV files containing customer data.
gs://my-customer-data-bucket/customers/part-00000.csv
gs://my-customer-data-bucket/customers/part-00001.csv
Each file looks like this:
customer_id,name,email,signup_date
1001,Alice Smith,alice.smith@example.com,2023-01-15
1002,Bob Johnson,bob.johnson@example.com,2023-02-20
To query this as an external table in BigQuery, you first define the table schema. You can do this via the BigQuery UI, bq command-line tool, or client libraries. Here’s how you’d create it using bq:
bq mk --external_table_definition='gs://my-customer-data-bucket/customers/*.csv' \
--schema='customer_id:INTEGER,name:STRING,email:STRING,signup_date:DATE' \
my_dataset.customer_external_table
This command tells BigQuery:
--external_table_definition: The GCS path to your data files. The*.csvwildcard will include all CSV files in that "folder" (GCS objects with that prefix).--schema: The structure of your data. BigQuery needs to know the column names and their data types.my_dataset.customer_external_table: The name of the BigQuery dataset and the external table you’re creating.
Once created, you can query it just like any other BigQuery table:
SELECT
name,
email
FROM
`my_project.my_dataset.customer_external_table`
WHERE
signup_date >= '2023-02-01';
This query will scan the CSV files in GCS, interpret them according to the schema you provided, and return the results.
The core problem this solves is avoiding the overhead of a separate data loading step when your data is already in GCS, or when you want to query data that’s generated directly into GCS (e.g., from Spark or Dataflow jobs). It decouples storage from compute, allowing BigQuery to access data where it lives.
Internally, when you run a query against an external table, BigQuery doesn’t actually move the data. Instead, it uses its distributed query engine to read directly from the GCS objects. For formats like CSV and JSON, BigQuery parses the data on the fly. For more structured formats like Parquet and ORC, it can read metadata and only scan the necessary row groups, significantly improving performance. BigQuery’s query engine is designed to be highly parallel, so it can read from many GCS files concurrently, distributing the work across its nodes.
The exact levers you control are primarily the GCS path, the schema definition, and the file format. BigQuery supports CSV, JSON (newline-delimited), Avro, Parquet, and ORC. For CSV, you can specify delimiters, quote characters, and whether the first row is a header. For JSON, you can specify whether it’s line-delimited or requires schema inference. The performance heavily depends on the file format, with columnar formats like Parquet offering the best "predicate pushdown" capabilities, meaning BigQuery can sometimes filter data at the GCS level before reading it into memory.
When using CSV or JSON, BigQuery must parse every byte of the file that matches your query’s filter conditions. However, with columnar formats like Parquet or ORC, BigQuery can leverage schema information embedded within the files to skip reading entire blocks of data that don’t contain relevant columns or rows. This means if you’re querying only a few columns from a massive Parquet file, BigQuery might only read a fraction of the total data size, a capability known as "column pruning" and "row group filtering."
The next step is exploring how to optimize performance for large-scale external tables, particularly with different file formats and partitioning strategies.