The BigQuery Storage Write API is a game-changer for high-throughput data ingestion, but it’s not just a faster way to dump data. It fundamentally changes how you think about streaming versus batch, turning almost any ingestion into a single, unified API.
Let’s watch it in action. Imagine we have a constant stream of clickstream events from a web application, and we want to land them in BigQuery for analysis.
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1.types import AppendRowsRequest, Table
def write_to_bigquery(project_id: str, dataset_id: str, table_id: str, rows: list[dict]):
client = bigquery_storage_v1.BigQueryWriteClient()
table_ref = f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"
# Create a streaming insert stream
parent = f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"
response = client.create_write_stream(
request=bigquery_storage_v1.CreateWriteStreamRequest(
parent=parent,
write_stream=bigquery_storage_v1.WriteStream(
type_=bigquery_storage_v1.WriteStream.Type.COMMITTED,
# For batch, we'd use TYPE_BUFFERED and commit later.
# COMMITTED means data is immediately available after commit.
),
)
)
write_stream_id = response.name
print(f"Created write stream: {write_stream_id}")
# Convert rows to a format the API expects (e.g., JSON lines)
# For simplicity, assuming schema is known and compatible.
# In a real scenario, you'd serialize based on your table schema.
data = "\n".join([json.dumps(row) for row in rows])
data_bytes = data.encode("utf-8")
# Append rows
request = AppendRowsRequest(
write_stream=write_stream_id,
rows=AppendRowsRequest.Rows(
serialized_rows=data_bytes,
),
)
response = client.append_rows(request)
# The response will contain stream position information.
# For COMMITTED streams, the stream position indicates the commit offset.
# A successful append means data is being written.
print(f"Appended data. Stream position: {response.stream_position}")
# In a real batch scenario, you'd close the stream and commit if buffered.
# For COMMITTED, it's implicitly committed upon successful append.
# To explicitly commit a buffered stream:
# client.finalize_write_stream(FinalizeWriteStreamRequest(write_stream=write_stream_id))
# client.commit_write_stream(CommitWriteStreamRequest(write_stream=write_stream_id))
# Example usage:
# Replace with your actual project, dataset, and table IDs
# project_id = "your-gcp-project-id"
# dataset_id = "your_dataset"
# table_id = "your_table"
#
# sample_rows = [
# {"user_id": "user123", "event_time": "2023-10-27T10:00:00Z", "page": "/home"},
# {"user_id": "user456", "event_time": "2023-10-27T10:01:15Z", "page": "/products"},
# ]
#
# write_to_bigquery(project_id, dataset_id, table_id, sample_rows)
This code snippet illustrates the core mechanics. You create a WriteStream, which is a logical connection to your BigQuery table. Then, you AppendRows to that stream. The StorageWriteAPI handles the heavy lifting of buffering, deduplication (if enabled), and committing data to BigQuery.
The key difference between "streaming" and "batch" with this API lies in the WriteStream.Type and how you manage the stream’s lifecycle.
-
COMMITTEDType (Streaming-like): When you useWriteStream.Type.COMMITTED, eachAppendRowscall that succeeds means the data is immediately available for querying in BigQuery. This is ideal for low-latency, real-time data. You don’t explicitly "commit" in the traditional batch sense; successful appends are committed. The stream is implicitly managed. -
BUFFEREDType (Batch-like): If you setWriteStream.Type.BUFFERED, data appended to the stream is held in a buffer. It’s not queryable until you explicitlyFinalizeWriteStreamand thenCommitWriteStream. This is where the "batch" aspect comes in. You can accumulate data over a period, then finalize and commit it all at once. This is more cost-effective for large volumes of data that don’t require immediate visibility. TheFinalizeWriteStreamcall signals that no more data will be appended to this stream, andCommitWriteStreammakes the accumulated data visible.
The Storage Write API solves the fundamental problem of efficient, high-throughput data ingestion into BigQuery, bridging the gap between traditional streaming (like the older insertAll API) and batch loading. It offers exactly-once semantics, at-least-once semantics, and deduplication, giving you fine-grained control. It operates by establishing a persistent connection (the write stream) and allowing multiple clients to append data concurrently. BigQuery then manages the internal buffering, sorting, and committing of this data into immutable storage.
The real power comes from the fact that the API abstracts away the complexity of managing individual BigQuery shards and partitions. You simply send your data, and the API ensures it lands correctly and efficiently. You control the data availability and throughput by choosing between COMMITTED and BUFFERED streams and by how frequently you call Finalize and Commit for buffered streams.
What most people don’t realize is that even with COMMITTED streams, the API is still performing complex internal batching and commit operations behind the scenes. Your AppendRows calls are being aggregated into larger batches before being written to BigQuery’s storage. The COMMITTED type simply means the commit happens as soon as BigQuery’s internal batching threshold is met for that stream, making it appear real-time to you.
When you use BUFFERED streams and commit them, you’re essentially creating mini-batch loads that are much more efficient than traditional file-based batch loads because they avoid intermediate storage (like GCS) and the overhead of scheduling separate load jobs. The next step is understanding how to manage multiple streams for parallel ingestion and how to handle retries and error conditions gracefully.