Dataform allows you to manage your BigQuery SQL pipelines with source control, automated testing, and scheduled deployments.
Let’s see it in action. Imagine you have a raw data table in BigQuery called raw_events and you want to create a transformed table processed_events that aggregates events by user and date.
-- dataform/definitions/processed_events.sqlx
config {
type: "table",
schema: "analytics"
}
SELECT
user_id,
DATE(event_timestamp) AS event_date,
COUNT(*) AS event_count
FROM
${ref("raw_events")}
GROUP BY
user_id,
event_date
Here, config specifies that processed_events will be a BigQuery table in the analytics schema. ${ref("raw_events")} is a Dataform-specific function that generates the correct BigQuery table reference for raw_events, ensuring your SQL is portable across different environments (like development vs. production).
Dataform compiles this sqlx file into pure SQL and executes it against your BigQuery project. When you run dataform run, it will create or update the analytics.processed_events table.
The core problem Dataform solves is the complexity of managing SQL-based data transformations. Traditionally, this involves:
- Manual SQL execution: Running scripts one by one, often with errors and inconsistencies.
- Lack of version control: Difficult to track changes, revert to previous states, or collaborate effectively.
- No automated testing: Data quality issues go undetected until they impact downstream systems.
- Complex scheduling: Orchestrating dependencies and execution times manually or with separate tools.
Dataform addresses these by providing a unified workflow. You define your data transformations as SQLX files (or SQL files) within a Dataform project. This project is essentially a Git repository.
Here’s how the mental model breaks down:
-
Definitions: You write SQLX files that define your data assets (tables, views, assertions). Each file describes what the asset should look like and how to build it.
table: Creates a physical table in BigQuery. Good for aggregated or materialized data.view: Creates a logical view in BigQuery. Efficient for reusable query logic without materialization.incremental: Creates or updates a table by only processing new or changed data since the last run. This is crucial for performance on large datasets.assertion: Defines SQL queries that check for data quality conditions (e.g., no null primary keys, values within a range).
-
Dependencies: Dataform automatically infers dependencies between your SQLX files. If
processed_events.sqlxreferences${ref("raw_events")}, Dataform knows thatraw_eventsmust be built beforeprocessed_events. This forms a directed acyclic graph (DAG) of your transformations. -
Compilation: When you run
dataform compile, Dataform takes your SQLX files, resolves references, applies configurations, and generates pure BigQuery SQL. This is what actually gets executed. -
Execution: The
dataform runcommand compiles the project and then executes the generated SQL against your BigQuery project, creating or updating your tables and views according to the defined dependencies. -
Testing:
dataform runalso executes your assertions. If any assertion fails (returns rows), the entiredataform runcommand fails, preventing bad data from propagating. -
Versioning & Collaboration: Because your Dataform project is a Git repository, you can use standard Git workflows: branching for new features, pull requests for code reviews, and merging for integration. This provides a robust audit trail and simplifies collaboration.
-
Scheduling: Dataform integrates with workflow orchestrators like Cloud Composer (Airflow) or can be scheduled directly in Google Cloud Workflows or Cloud Scheduler to run your
dataform runcommands on a recurring basis.
Let’s look at an incremental table. Suppose raw_events is a massive, append-only table and you want to update processed_events efficiently.
-- dataform/definitions/processed_events_incremental.sqlx
config {
type: "incremental",
schema: "analytics",
uniqueKey: ["user_id", "event_date"] // For deduplication and updates
}
SELECT
user_id,
DATE(event_timestamp) AS event_date,
COUNT(*) AS event_count
FROM
${ref("raw_events")}
WHERE
${when(incremental(), `event_timestamp > (SELECT MAX(event_timestamp) FROM ${self()})`)}
GROUP BY
user_id,
event_date
The incremental type, combined with the uniqueKey and the event_timestamp > (SELECT MAX(event_timestamp) FROM ${self()}) clause, tells Dataform to only process events from raw_events that occurred after the last time processed_events_incremental was built. ${self()} refers to the current table being defined. This drastically reduces query costs and execution time for large, frequently updated datasets.
The compilation process handles the incremental() condition, generating SQL that either creates the table from scratch (on the first run) or inserts/updates based on the WHERE clause and uniqueKey for subsequent runs.
When you define assertions, you’re essentially creating data quality checks that become part of your pipeline. For example, to ensure user_id is never null in processed_events:
-- dataform/definitions/assertions/processed_events_user_id_not_null.sqlx
config {
type: "assertion",
schema: "analytics",
dependencies: ["processed_events"]
}
SELECT
user_id
FROM
${ref("processed_events")}
WHERE
user_id IS NULL
If this assertion runs and returns any rows (meaning there are indeed null user_ids), the dataform run command will fail. The dependencies config ensures this assertion runs only after processed_events is created.
Dataform’s ability to generate SQL based on your definitions, manage dependencies, and integrate testing into the workflow provides a robust and maintainable way to build complex data pipelines in BigQuery.
The next step is often managing different environments (dev, staging, prod) for your Dataform project, which involves configuring Dataform’s compilation and execution to target different BigQuery datasets or projects and managing secrets for authentication.