The most surprising true thing about streaming events from Pub/Sub into BigQuery is that it’s not actually "real-time" in the way most people imagine. BigQuery’s streaming inserts have a latency that’s measured in seconds, not milliseconds, and Pub/Sub itself adds its own inherent delay.
Let’s see this in action. Imagine we have a Pub/Sub topic named my-data-stream and we want to load its messages into a BigQuery dataset my_dataset and table my_table.
First, ensure your Pub/Sub messages are structured in a way BigQuery can understand. JSON is common. A message might look like this:
{
"user_id": "user123",
"event_type": "page_view",
"timestamp": "2023-10-27T10:00:00Z",
"page_url": "/products/widget-a"
}
To set this up, we’ll use a Dataflow job. Dataflow is Google Cloud’s managed service for Apache Beam pipelines. We can deploy a pre-built template for this common pattern.
gcloud dataflow jobs run pubsub-to-bigquery \
--gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
--region us-central1 \
--staging-location gs://my-bucket/dataflow-staging \
--parameters \
inputTopic=projects/my-gcp-project/topics/my-data-stream,\
outputTable=my-gcp-project:my_dataset.my_table,\
outputDeadletterTable=my-gcp-project:my_dataset.my_table_errors,\
javascriptTextTransformerTemplateFile=gs://dataflow-templates/latest/PubSub_to_BigQuery_NATIVE_FORMAT.js,\
javascriptTextTransformerFunctionName=process
This command kicks off a Dataflow job that listens to my-data-stream. The javascriptTextTransformerTemplateFile and javascriptTextTransformerFunctionName parameters are crucial if your Pub/Sub messages aren’t directly BigQuery-compatible. The default JavaScript function in the template expects JSON messages and attempts to map them to the BigQuery table schema. If your schema is different, you’ll need to provide your own JavaScript file here to transform the data.
The Dataflow job, acting as the bridge, pulls messages from Pub/Sub, transforms them (if needed, via the JavaScript function), and then uses BigQuery’s streaming insert API to push them into my_dataset.my_table.
Here’s the mental model:
- Pub/Sub: This is your ingestion point. Producers publish messages to a topic. It’s a durable, scalable message queue.
- Dataflow (Beam Pipeline): This is the processing engine. It reads messages from Pub/Sub, applies any necessary transformations (like parsing JSON, enriching data, or filtering), and prepares them for BigQuery.
- BigQuery Streaming Inserts: Dataflow uses this API to write individual records directly into a BigQuery table. This is distinct from batch loading. Each insert has an associated cost.
- BigQuery Table Schema: The structure of your BigQuery table must match or be compatible with the data being streamed. Dataflow will attempt to insert data according to this schema. Mismatches will land messages in the
outputDeadletterTable.
The outputDeadletterTable is a lifesaver. Any message that Dataflow tries to stream but fails due to a schema mismatch, malformed data, or other insertion errors will be sent here. This allows you to inspect problematic records without halting the entire pipeline.
When Dataflow streams data, it uses the tabledata.insertAll API. This API is designed for low-latency inserts, but it’s not free and has quotas. Each row inserted via this API incurs a small cost. If you’re dealing with massive volumes, consider if a batch load (e.g., writing to GCS and then loading into BigQuery) might be more cost-effective for certain types of data. The key here is that Dataflow manages the complexities of retries, error handling, and batching inserts on the BigQuery side for you, making it appear "real-time."
The BigQuery streaming buffer means that queries run immediately after an insert might not see the latest data. BigQuery typically makes streaming data available for querying within seconds, but there’s a small window where it might lag behind the actual insert time.
The next concept you’ll likely wrestle with is how to efficiently query this streaming data, especially when dealing with time-series analysis or needing to combine streaming data with historical batch loads.