BigQuery data models are often treated as mere ETL pipelines, but they’re actually the foundational logic of your entire analytics stack, dictating how every downstream dashboard and report interprets your data.
Let’s see dbt in action, modeling some raw sales data into a clean, usable dim_customers table.
Imagine you have a table like this in BigQuery:
-- raw_sales.orders
CREATE TABLE raw_sales.orders (
order_id INT64,
customer_id INT64,
order_date DATE,
amount NUMERIC
);
And another for customer details:
-- raw_sales.customers
CREATE TABLE raw_sales.customers (
customer_id INT64,
first_name STRING,
last_name STRING,
email STRING,
signup_date DATE
);
We want to build a dim_customers model that joins these, adds some useful flags, and ensures we have a unique customer record. In dbt, this looks like:
-- models/staging/stg_customers.sql
SELECT
customer_id,
first_name,
last_name,
email,
signup_date,
-- Add a flag for early adopters
CASE WHEN signup_date < '2022-01-01' THEN TRUE ELSE FALSE END AS is_early_adopter
FROM {{ source('raw_sales', 'customers') }}
-- models/marts/dim_customers.sql
SELECT
c.customer_id,
c.first_name,
c.last_name,
c.email,
c.signup_date,
c.is_early_adopter,
-- Calculate the number of orders for each customer
COUNT(o.order_id) AS total_orders,
-- Get the date of their first order
MIN(o.order_date) AS first_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ source('raw_sales', 'orders') }} o
ON c.customer_id = o.customer_id
GROUP BY
c.customer_id,
c.first_name,
c.last_name,
c.email,
c.signup_date,
c.is_early_adopter
When you run dbt run, dbt compiles these SQL files into actual BigQuery tables or views. By default, dim_customers will be created as a view. You can change this in your dbt_project.yml or directly in the model file:
-- models/marts/dim_customers.sql
{{ config(materialized='table') }}
SELECT
c.customer_id,
c.first_name,
c.last_name,
c.email,
c.signup_date,
c.is_early_adopter,
COUNT(o.order_id) AS total_orders,
MIN(o.order_date) AS first_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ source('raw_sales', 'orders') }} o
ON c.customer_id = o.customer_id
GROUP BY
c.customer_id,
c.first_name,
c.last_name,
c.email,
c.signup_date,
c.is_early_adopter
This dim_customers table now serves as a single source of truth for customer information. Any dashboard that needs customer names, emails, or order counts can simply query this dbt model, rather than wrestling with raw tables and complex joins. The ref() function ensures that if stg_customers changes, dim_customers will be rebuilt correctly.
The core problem dbt solves is managing complexity and ensuring consistency in your data transformation logic. It enforces a modular, version-controlled, and testable approach to building data models. By using sources and refs, dbt builds a dependency graph of your entire data warehouse, allowing you to visualize and understand how your data flows from raw inputs to curated outputs.
A key best practice is to organize your models into layers: staging for initial cleaning and renaming of raw data, intermediate for complex, reusable transformations that might be used across multiple models, and marts for the final, business-facing tables (like dimensions and facts) that power your BI tools. This layered approach makes your dbt project scalable and maintainable.
When materializing models, consider the trade-offs. Views are great for development and when data freshness is paramount, as they always query the underlying tables. Tables, however, offer better query performance for complex models or those queried very frequently, but require a refresh strategy. Incremental models are the sweet spot for large fact tables, only processing new or updated records since the last run, significantly reducing query costs and run times.
The dbt test command is your best friend for data quality. Beyond simple not_null or unique tests, you can write custom SQL tests to assert complex business rules. For example, ensuring that total_orders in dim_customers is always greater than or equal to zero, or that first_order_date is always on or after signup_date. These tests run automatically after your models are built, catching data quality issues before they propagate downstream.
The most powerful, yet often overlooked, aspect of dbt’s materialization strategy is its handling of incremental models. When you define an incremental model, dbt doesn’t just rebuild the entire table each time. Instead, it intelligently generates SQL that selects only the data that has changed or is new since the last time the model was run, based on a unique_key and a timestamp column you define. This is crucial for managing costs and performance in cloud data warehouses like BigQuery, where processing large volumes of data can be expensive.
The next step in your dbt journey involves mastering incremental model development and implementing robust data quality testing strategies.