Terraform can manage BigQuery datasets, but its declarative nature often obscures the imperative reality of dataset creation and mutation.
Let’s spin up a BigQuery dataset and see how Terraform handles it. We’ll define a simple google_bigquery_dataset resource in a .tf file:
provider "google" {
  project = "your-gcp-project-id"
  region  = "us-central1"
}

resource "google_bigquery_dataset" "my_dataset" {
  dataset_id  = "example_dataset"
  location    = "US"
  description = "This is an example dataset managed by Terraform."

  labels = {
    environment = "dev"
    managed_by  = "terraform"
  }
}
Now, if we run terraform apply, Terraform will call the Google Cloud API to create this dataset. The dataset_id is the unique identifier within your project. The location dictates where the data physically resides (e.g., US, EU, asia-east1). The description and labels are metadata that help organize and identify your datasets.
When you execute terraform apply, Terraform doesn’t just "create a dataset." It performs a sequence of API calls, driven by what is recorded in its state file. If the resource is not yet in state, Terraform sends a POST request to https://bigquery.googleapis.com/bigquery/v2/projects/your-gcp-project-id/datasets with the dataset configuration. If a dataset with the ID example_dataset already exists but was created outside Terraform, that request fails with an "already exists" error, and the dataset must be imported before Terraform can manage it. If the resource is already in state, Terraform first reads the live dataset with a GET request, compares it with your .tf file, and, if they differ (e.g., a label changed, the description was updated), sends a PUT request to update the dataset in place.
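A dataset that was created outside Terraform is not in the state file, so Terraform has nothing to compare against until the dataset is imported. As a minimal sketch, assuming Terraform 1.5 or later (which supports import blocks) and the project and dataset IDs used in the example above, adopting an existing dataset could look like this:

```hcl
# Sketch only: adopts an existing dataset into state on the next plan/apply.
# Assumes Terraform >= 1.5 and that the dataset already exists in the project.
import {
  to = google_bigquery_dataset.my_dataset
  id = "projects/your-gcp-project-id/datasets/example_dataset"
}
```

After the import, terraform plan shows any differences between the live dataset and the configuration, and the import block can be removed once the resource is in state.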
The problem Terraform solves here is idempotency and state management. Instead of writing scripts to check existence, create, or update, you declare the desired state, and Terraform figures out the imperative steps. It keeps track of the resources it manages in a state file, so it knows what to do on subsequent runs.
The google_bigquery_dataset resource has several key arguments:
- dataset_id: The name of the dataset. This is immutable after creation; changing it forces Terraform to destroy and recreate the dataset.
- project: The GCP project ID. If not provided, it defaults to the project configured in the google provider.
- location: The geographical location for the dataset. This is also immutable.
- description: A human-readable explanation of the dataset.
- default_table_expiration_ms: Sets a default lifetime, in milliseconds, for tables created in this dataset. Tables are automatically deleted once that time has elapsed.
- default_partition_expiration_ms: Sets a default lifetime, in milliseconds, for partitions within partitioned tables.
- access: This is where you define who can access the dataset and what permissions they have. You can grant access to users, service accounts, or entire domains. For example:

      access {
        role          = "READER"
        user_by_email = "user@example.com"
      }

      access {
        role          = "WRITER"
        special_group = "projectWriters"
      }

- labels: Key-value pairs for organizing and filtering resources.
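To see how these arguments fit together, here is a sketch of a fuller dataset definition. The resource name, dataset ID, email address, and expiration values are illustrative assumptions, not recommendations; the expirations are simply 30 and 90 days expressed in milliseconds.

```hcl
# Sketch only: names, emails, and durations below are hypothetical.
resource "google_bigquery_dataset" "analytics" {
  dataset_id  = "analytics_dataset"
  location    = "EU"
  description = "Example dataset with expirations and explicit access."

  # 30 days in milliseconds (30 * 24 * 60 * 60 * 1000).
  default_table_expiration_ms = 2592000000

  # 90 days in milliseconds (90 * 24 * 60 * 60 * 1000).
  default_partition_expiration_ms = 7776000000

  # Declared access entries; keep an OWNER entry so the dataset
  # remains administrable.
  access {
    role          = "OWNER"
    user_by_email = "data-admin@example.com" # hypothetical owner
  }

  access {
    role          = "READER"
    special_group = "projectReaders"
  }

  labels = {
    environment = "dev"
    managed_by  = "terraform"
  }
}
```

Because dataset_id and location are immutable, editing either of them in a configuration like this leads Terraform to plan a destroy-and-recreate rather than an in-place update.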
Let’s look at the access block more closely. When you specify access blocks, Terraform doesn’t add them to whatever entries already exist. It treats the blocks in your configuration as the complete access list and sends that full list to the API, replacing what was there. This is a common point of confusion: if another process or a manual change modifies the access controls outside of Terraform, the next terraform apply will revert those changes, because it enforces the state defined in your configuration.
The most surprising part of managing BigQuery datasets with Terraform is that, although you declare the state of the dataset, the access blocks are authoritative: managing them is fundamentally a replace operation, not an additive one. If you define three access blocks in your Terraform configuration, Terraform will ensure that only those three access entries exist on the dataset. Any pre-existing access entries not declared in your configuration will be removed. To avoid unintended data exposure or lockouts, declare every entry you want to keep explicitly in your Terraform code.
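If replacing the whole access list is not what you want, the Google provider also offers a separate google_bigquery_dataset_access resource that manages one entry at a time. The following is a minimal sketch of that alternative, not a drop-in recipe: the resource names and email address are hypothetical, and it assumes the dataset resource itself declares no access blocks, since mixing both styles on the same dataset is likely to make them overwrite each other.

```hcl
# Sketch only: the dataset declares no access blocks, so this resource
# does not claim ownership of the full access list.
resource "google_bigquery_dataset" "shared" {
  dataset_id = "shared_dataset" # hypothetical dataset ID
  location   = "US"
}

# Manages a single READER entry; entries added elsewhere are left alone.
resource "google_bigquery_dataset_access" "analyst_reader" {
  dataset_id    = google_bigquery_dataset.shared.dataset_id
  role          = "READER"
  user_by_email = "analyst@example.com" # hypothetical user
}
```

The trade-off is visibility: with per-entry resources, Terraform no longer removes entries it does not know about, so changes made outside Terraform persist until someone audits them.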
The next concept you’ll likely encounter is managing BigQuery tables within these datasets using Terraform resources like google_bigquery_table.