BigQuery Omni lets you query data stored in AWS S3 and Azure Data Lake Storage (ADLS) Gen2 as if it were in BigQuery, without moving it.

Let’s say you have a marketing dataset in Azure ADLS Gen2 and a sales dataset in AWS S3, and you want to join them. This is where BigQuery Omni shines.

Here’s a BigQuery Omni setup in action, querying data from both AWS S3 and Azure ADLS Gen2.

First, we need to create an external connection for AWS. This involves setting up an IAM role in your AWS account that BigQuery will assume.

CREATE EXTERNAL CONNECTION `your-gcp-project.your_dataset.aws_connection`
OPTIONS (
  connection_type = 'STORAGE',
  cloud_resource_bucket = 'your-s3-bucket-for-aws-logs', -- This bucket is for audit logs, not your data.
  google_cloud_storage_options = (
    service_account = 'your-aws-iam-role-arn@your-gcp-project.iam.gserviceaccount.com' -- This is a GCP service account that will be granted permissions in AWS
  )
);

The google_cloud_storage_options.service_account here is a GCP service account. You’ll need to grant this GCP service account the Storage Object Viewer role (or equivalent) on the S3 buckets containing your data in AWS. The cloud_resource_bucket is where BigQuery will write audit logs related to the connection, not your actual data.

Next, we create a similar external connection for Azure ADLS Gen2. This will involve creating a Service Principal in Azure AD and granting it appropriate permissions.

CREATE EXTERNAL CONNECTION `your-gcp-project.your_dataset.azure_connection`
OPTIONS (
  connection_type = 'STORAGE',
  cloud_resource_bucket = 'your-azure-container-for-azure-logs', -- This container is for audit logs, not your data.
  azure_storage_options = (
    service_principal_id = 'your-azure-service-principal-id',
    service_principal_secret = 'your-azure-service-principal-secret',
    tenant_id = 'your-azure-tenant-id',
    endpoint = 'https://yourstorageaccount.dfs.core.windows.net'
  )
);

For the Azure connection, you’ll grant the Azure Service Principal (identified by service_principal_id) the Storage Blob Data Reader role on your ADLS Gen2 file system. The endpoint points to your ADLS Gen2 storage account.

Now, we can define external tables that point to our data in AWS and Azure.

For AWS S3:

CREATE EXTERNAL TABLE `your-gcp-project.your_dataset.aws_sales_data`
(
  sale_id INT64,
  product_name STRING,
  sale_amount DECIMAL(10, 2),
  sale_date DATE
)
USING CSV -- or PARQUET, JSON, etc.
OPTIONS (
  connection = 'your-gcp-project.your_dataset.aws_connection',
  format = 'CSV',
  uris = ['s3://your-s3-bucket/path/to/sales_data/*.csv']
);

And for Azure ADLS Gen2:

CREATE EXTERNAL TABLE `your-gcp-project.your_dataset.azure_marketing_data`
(
  campaign_id INT64,
  customer_id STRING,
  campaign_name STRING,
  spend DECIMAL(10, 2),
  campaign_date DATE
)
USING PARQUET -- or CSV, JSON, etc.
OPTIONS (
  connection = 'your-gcp-project.your_dataset.azure_connection',
  format = 'PARQUET',
  uris = ['abfs://your-container@yourstorageaccount.dfs.core.windows.net/path/to/marketing_data/*.parquet']
);

Notice how the uris for Azure use the abfs:// scheme, which is specific to ADLS Gen2.

Finally, we can query across these external tables as if they were native BigQuery tables.

SELECT
  s.product_name,
  m.campaign_name,
  s.sale_amount,
  m.spend,
  s.sale_date
FROM
  `your-gcp-project.your_dataset.aws_sales_data` AS s
JOIN
  `your-gcp-project.your_dataset.azure_marketing_data` AS m
ON
  s.sale_date = m.campaign_date
WHERE
  s.sale_date BETWEEN '2023-01-01' AND '2023-03-31';

This query joins sales data from AWS with marketing campaign data from Azure. BigQuery Omni handles the data access and processing transparently.

The most surprising thing about BigQuery Omni is how it leverages a shared data processing engine across clouds. When you run a query, BigQuery doesn’t just pull data into GCP; it orchestrates processing where the data resides. For AWS, this means utilizing AWS compute resources (like EC2 instances managed by Google) to read and process data directly from S3. Similarly, for Azure, it uses Azure compute to access ADLS Gen2. This distributed processing model is key to its performance and cost-effectiveness, as it minimizes data egress and movement.

The uris option for external tables is more flexible than it initially appears. You can specify wildcards (*) to include multiple files, and you can even list multiple URIs separated by commas to pull data from different locations or files within the same storage system. This allows for granular control over which data is included in your external table definition.

The next hurdle you’ll likely encounter is optimizing query performance when dealing with large datasets across multiple clouds.

Want structured learning?

Take the full Bigquery course →