The BigQuery Storage Read API doesn’t just speed up reading data; it fundamentally changes how data is accessed, moving from row-by-row fetching to bulk, parallelized streams of compressed, columnar data.

Let’s see it in action. Imagine you have a table my_dataset.my_table with millions of rows. The traditional way to read it, say into a Pandas DataFrame, might look like this:

from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT * FROM `my_project.my_dataset.my_table` LIMIT 1000000"
df = client.query(query).to_dataframe()

This works, but it’s slow. The BigQuery API sends data back to the client in a format that’s then serialized and deserialized, often involving many network round trips and CPU-intensive processing on the client side.

Now, with the Storage Read API, the approach is different. We’re going to use the google-cloud-bigquery-storage library.

First, you need to set up a BigQueryReadClient and establish a ReadSession. A ReadSession is a snapshot of the data you want to read, including filters and the specific columns.

from google.cloud.bigquery_storage_v1.types import (
    ReadSession,
    DataFormat,
    TableReadOptions,
)
from google.cloud.bigquery_storage_v1 import BigQueryReadClient

project_id = "your-gcp-project-id"
table_id = "your-gcp-project-id.my_dataset.my_table"

read_client = BigQueryReadClient()

# Define the table and columns you want to read
table_reference = f"projects/{project_id}/datasets/my_dataset/tables/my_table"
request = ReadSession.CreateReadSessionRequest(
    parent=f"projects/{project_id}",
    read_session=ReadSession(
        table_active_version=table_reference,
        data_format=DataFormat.ARROW,  # Or PARQUET
        read_options=TableReadOptions(
            selected_fields=["col1", "col2", "col3"] # Specify columns if not all needed
        ),
    ),
)

session = read_client.create_read_session(request=request)

The session object now contains a streams attribute, which is a list of Stream objects. Each Stream represents a distinct portion of your data that can be read in parallel. The name attribute of each stream is what you’ll use to actually read the data.

from google.cloud.bigquery_storage_v1 import types

# Get the first stream's name
stream_name = session.streams[0].name

# Now, request data from this stream
request = types.ReadRowsRequest(read_stream=stream_name)
rows = read_client.read_rows(request)

# Process the rows. The format is ARROW by default.
for row_batch in rows.rows():
    # row_batch is a pyarrow.RecordBatch
    print(f"Received a batch with {row_batch.num_rows} rows.")
    # You can convert this to pandas:
    # df_chunk = row_batch.to_pandas()
    # Or process the data directly from the Arrow RecordBatch

The magic here is that read_rows returns an iterator yielding Row objects, which are essentially Arrow RecordBatches. These are highly compressed, columnar data structures. Instead of BigQuery serializing rows into JSON or Protobuf for HTTP transfer, it sends raw, compressed Arrow (or Parquet) data directly over gRPC. Your client library then decodes this directly into memory.

This bypasses a lot of overhead:

  1. No HTTP Overhead: gRPC is more efficient for streaming binary data than repeated HTTP requests.
  2. Columnar Format: You’re reading columns, not entire rows. If you only need col1 and col3, only that data is transferred and processed.
  3. Compression: Arrow and Parquet are compressed formats.
  4. Parallelism: You can read from multiple streams concurrently, distributing the load across your client’s cores and network bandwidth.

The ReadSession itself is a resource that has a lifespan. By default, it’s 24 hours, but it’s good practice to manage it. The create_read_session call effectively takes a snapshot of your table at that moment. If the table is updated after the session is created, your read session will reflect the data as it was when the session was initiated.

The data_format parameter is crucial. DataFormat.ARROW is the default and often the fastest for direct integration with Python libraries like Pandas and PyArrow. DataFormat.PARQUET is also supported and can be beneficial if you’re integrating with systems that natively prefer Parquet.

The selected_fields in TableReadOptions is a performance multiplier. If your table has 100 columns but you only need 5, specifying those 5 drastically reduces I/O and network transfer.

The biggest surprise for many is how much of the "slowness" in traditional BigQuery reads was due to the client-side processing and serialization/deserialization layers, not necessarily the BigQuery backend’s ability to serve data. The Storage Read API pushes more of the data processing and format handling to the client, but in a much more efficient, bulk-oriented, and parallelized way.

Once you’ve consumed all the data from all streams in your ReadSession, you’ll eventually want to clean up. While sessions expire, explicitly closing them can be good practice if you’re managing many short-lived sessions. However, the primary mechanism for managing session resources is through their expiration. The next challenge is typically handling the full lifecycle of these sessions in a distributed processing environment, like Spark or Dask, where managing parallel reads across many workers becomes critical.

Want structured learning?

Take the full Bigquery course →