BigQuery can query Cloud Spanner, but it’s not just a simple JOIN across two databases; it’s a managed data integration that requires explicit configuration for security and performance.

Let’s see it in action. Imagine you have a BigQuery dataset my_dataset in the project my-project and you want to query a Spanner instance named my-spanner-instance in the us-central1 region, with a database my-spanner-db.

First, you need to create a connection in BigQuery. This connection acts as a secure bridge.

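bq mk \
--connection \
--connection_type=CLOUD_SPANNER \
--properties='{"database":"projects/my-project/instances/my-spanner-instance/databases/my-spanner-db"}' \
--project_id=my-project \
--location=us \
my_spanner_connection

This command creates a connection object named my_spanner_connection in the us multi-region for your project my-project. The CLOUD_SPANNER type tells BigQuery what kind of external data source it’s dealing with, and the database property gives the full resource path of the Spanner database the connection targets.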

Next, you need to grant the BigQuery service account the necessary permissions on your Spanner database. The service account ID will look something like bq-PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com. You can find your project number in the Google Cloud Console or by running gcloud projects describe my-project --format='value(projectNumber)'. Note that gcloud config list project --format='value(core.project)' prints only the project ID, not the number.

On your Spanner instance, grant the roles/spanner.databaseReader role to this service account for the specific database you want to query.

gcloud spanner databases add-iam-policy-binding my-spanner-db \
--instance=my-spanner-instance \
--member='serviceAccount:bq-PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com' \
--role='roles/spanner.databaseReader' \
--project=my-project
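
As a minimal sketch of the grant step above, the helper below assembles the --member value from a project number. The function names bq_member and project_number are hypothetical, 123456789012 is a placeholder number, and the service account pattern is taken as given from the text. project_number wraps gcloud projects describe and is defined but not called here, since it needs an authenticated gcloud.

```shell
# Hypothetical helper: build the IAM member string from a project number,
# following the service account pattern quoted in the text.
bq_member() {
  printf 'serviceAccount:bq-%s@bigquery-encryption.iam.gserviceaccount.com\n' "$1"
}

# Hypothetical helper: resolve the numeric project number for a project ID.
# Needs an authenticated gcloud, so it is defined here but not called.
project_number() {
  gcloud projects describe "$1" --format='value(projectNumber)'
}

# Placeholder number for illustration; in practice use
#   bq_member "$(project_number my-project)"
bq_member 123456789012
```

The printed string can be passed as the --member argument of the add-iam-policy-binding command.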

Now, you can create an external table in BigQuery that points to your Spanner table. Let’s say your Spanner table is named customers and has columns customer_id (STRING) and name (STRING).

CREATE EXTERNAL TABLE `my-project.my_dataset.spanner_customers` (
  customer_id STRING,
  name STRING
)
OPTIONS (
  connection = 'my-project.us.my_spanner_connection',
  table = 'customers',
  database = 'my-spanner-db',
  instance = 'my-spanner-instance',
  project = 'my-project'
);

The connection option specifies the BigQuery connection resource you created. The table, database, instance, and project options tell BigQuery exactly where to find the data within Spanner.

With this setup, you can now query my-project.my_dataset.spanner_customers just like any other BigQuery table. Because the project ID contains a hyphen, wrap the full table path in backticks.

SELECT customer_id, name
FROM `my-project.my_dataset.spanner_customers`
WHERE name LIKE 'A%';

BigQuery will translate this SQL query into a Spanner query, execute it against Spanner, and stream the results back to BigQuery for further processing or storage. This is particularly useful for ad-hoc analytics on operational Spanner data without needing to duplicate it.

The performance of these queries is heavily influenced by how well the Spanner schema is designed for the types of queries you’re running. Spanner’s primary keys are crucial; BigQuery will often push down filters to Spanner, and if those filters align with Spanner’s primary key or secondary indexes, the query will be much faster. Without good indexing in Spanner, BigQuery might have to perform full table scans on the Spanner side, which can be slow and expensive.
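
One way to control exactly which filter runs inside Spanner is BigQuery's EXTERNAL_QUERY function, which sends a statement to Spanner through a connection. The sketch below assumes the connection my_spanner_connection carries a database property pointing at my-spanner-db. The helper name federated_sql is hypothetical. The script only builds and prints the statement, and the bq query invocation is left as a comment.

```shell
CONNECTION='my-project.us.my_spanner_connection'

# Hypothetical helper: wrap a Spanner SQL statement in BigQuery's
# EXTERNAL_QUERY table function. Triple quotes keep single-quoted
# literals in the inner statement intact.
federated_sql() {
  printf 'SELECT * FROM EXTERNAL_QUERY("%s", """%s""");\n' "$1" "$2"
}

# The inner statement is written in Spanner SQL and runs inside Spanner,
# so the LIKE filter is applied before any rows reach BigQuery.
QUERY=$(federated_sql "$CONNECTION" "SELECT customer_id, name FROM customers WHERE name LIKE 'A%'")
printf '%s\n' "$QUERY"

# To execute (requires an authenticated bq):
#   bq query --use_legacy_sql=false "$QUERY"
```

Keeping the filter in the inner statement makes the pushdown explicit. A WHERE clause placed outside EXTERNAL_QUERY is evaluated by BigQuery after the rows have been returned.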

When BigQuery queries Spanner, it doesn’t load the entire Spanner table into BigQuery’s storage. Instead, it establishes a connection and fetches data in batches as needed. This means that changes in Spanner are reflected almost immediately in BigQuery queries, providing a near real-time view of your operational data. However, this also means that the latency of your BigQuery query is directly tied to Spanner’s performance and network conditions between the two services.

The connection object itself is a metadata entry within BigQuery. It stores the configuration and the reference to a service account that BigQuery will impersonate when accessing Spanner. This service account must be granted the appropriate roles on the Spanner instance and database. The BigQuery connection doesn’t contain credentials itself; rather, it leverages Google Cloud’s IAM system to authorize the BigQuery service account to access Spanner resources. This separation of concerns enhances security by avoiding the need to manage explicit credentials.

The CREATE EXTERNAL TABLE statement defines a schema in BigQuery that maps to the Spanner table. BigQuery uses this schema to understand how to interpret the data coming from Spanner and to translate BigQuery SQL into Spanner SQL. Any discrepancies between the BigQuery schema definition and the actual Spanner table schema can lead to query errors or unexpected results. For instance, if you define a column as STRING in BigQuery but it’s a BYTES type in Spanner and contains non-UTF8 data, you’ll encounter issues.
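
Before declaring column types on the BigQuery side, it can help to check what Spanner actually stores. The sketch below builds a query against Spanner's INFORMATION_SCHEMA for the customers table. The variable names are illustrative, and the gcloud invocation is left as a comment because it needs an authenticated gcloud.

```shell
INSTANCE='my-spanner-instance'
DATABASE='my-spanner-db'
TABLE='customers'

# Spanner's INFORMATION_SCHEMA lists each column with its declared type.
TYPES_SQL="SELECT column_name, spanner_type FROM information_schema.columns WHERE table_name = '${TABLE}' ORDER BY ordinal_position"
printf '%s\n' "$TYPES_SQL"

# To execute (requires an authenticated gcloud):
#   gcloud spanner databases execute-sql "$DATABASE" \
#     --instance="$INSTANCE" --sql="$TYPES_SQL"
```

A column reported as BYTES here is a candidate for the STRING mismatch described above.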

Understanding the underlying Spanner data model, especially its primary keys and secondary indexes, is paramount for optimizing queries. BigQuery’s query optimizer attempts to push down as much filtering and projection as possible to Spanner. If the Spanner table has no index that supports the WHERE clauses or covers the SELECT lists in your BigQuery queries, Spanner has to scan the full table, which lengthens query execution and raises Spanner costs.
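
As an illustration of the indexing point, the sketch below prepares a secondary index on name to support the WHERE name LIKE 'A%' filter used earlier. The index name customers_by_name is hypothetical, and the sketch assumes customer_id is the table's primary key, so the index also covers the SELECT list. The gcloud invocation is left as a comment because it needs an authenticated gcloud.

```shell
# Hypothetical index name; the indexed column matches the filter
# WHERE name LIKE 'A%' from the example query.
INDEX_DDL='CREATE INDEX customers_by_name ON customers(name)'
printf '%s\n' "$INDEX_DDL"

# To apply (requires an authenticated gcloud; the index backfill runs
# as a long-running operation):
#   gcloud spanner databases ddl update my-spanner-db \
#     --instance=my-spanner-instance --ddl="$INDEX_DDL"
```

Whether Spanner's optimizer actually chooses the index for a given query is worth confirming from the query plan rather than assuming.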

The next step is to explore how to push down aggregate functions and perform complex joins between Spanner tables and BigQuery native tables.
