BigQuery Stored Procedures are essentially SQL queries that you can save and reuse, but their real power comes from their ability to accept parameters and return values, making them feel more like traditional programming functions than just saved SQL.
Let’s see one in action. Imagine you have a table of sales with product_id, quantity, and sale_date. You want to find the total quantity sold for a specific product on a given date.
CREATE OR REPLACE PROCEDURE sales_analysis.get_product_sales_by_date(
IN product_id_param INT64,
IN sale_date_param DATE,
OUT total_quantity INT64
)
BEGIN
SELECT SUM(quantity) INTO total_quantity
FROM sales_analysis.sales
WHERE product_id = product_id_param
AND sale_date = sale_date_param;
END;
To run this, you’d use the CALL statement:
CALL sales_analysis.get_product_sales_by_date(101, '2023-10-26', @output_quantity);
SELECT @output_quantity;
This procedure encapsulates a common query, making it callable with different parameters. You can think of the IN parameters as inputs to your function and OUT parameters as outputs. BigQuery Stored Procedures can also declare multiple OUT parameters or return a table.
The real magic of stored procedures, however, is their ability to manage state and perform complex logic that goes beyond a single SELECT statement. They are compiled and executed within BigQuery’s infrastructure, which means they benefit from BigQuery’s scalability and performance. This is particularly useful for data transformation pipelines, complex reporting, or when you need to perform a series of SQL operations that are logically grouped together.
Consider a procedure that updates a status flag for orders that are overdue.
CREATE OR REPLACE PROCEDURE order_management.mark_overdue_orders()
BEGIN
-- Update orders that are past their due date and not yet completed.
UPDATE order_management.orders
SET status = 'OVERDUE'
WHERE due_date < CURRENT_DATE()
AND status != 'COMPLETED';
-- Log the update operation.
INSERT INTO order_management.audit_log (operation, timestamp)
VALUES ('MARKED_OVERDUE_ORDERS', CURRENT_TIMESTAMP());
END;
Calling this procedure CALL order_management.mark_overdue_orders(); would execute both the UPDATE and the INSERT statements atomically within the procedure.
The BEGIN...END block is where the procedural logic lives. You can include multiple SQL statements, variable declarations, and even control flow statements like IF and LOOP (though these are less common in pure SQL-based procedures and more prevalent in scripting). BigQuery Stored Procedures are written in a SQL dialect that supports these procedural extensions.
The OPTIONS clause is a subtle but powerful feature. For example, you can set description or labels for your procedure, which helps with organization and discoverability in your BigQuery project.
CREATE OR REPLACE PROCEDURE sales_analysis.get_product_sales_by_date(...)
OPTIONS (
description="Retrieves total quantity sold for a product on a specific date.",
labels=[("team", "sales"), ("use_case", "reporting")]
)
BEGIN
-- ... procedure body ...
END;
When you create a stored procedure, BigQuery compiles it and stores it as a dataset object. This means it’s versioned and managed within BigQuery itself, not as external files. This compilation step also allows BigQuery to optimize the execution plan for the procedure.
The most surprising thing about BigQuery Stored Procedures is their ability to manage transactions implicitly. When you execute a procedure, all the SQL statements within it are treated as a single atomic unit. If any statement fails, the entire procedure is rolled back, ensuring data consistency. This is a critical feature for maintaining data integrity in complex data pipelines.
A common pattern is to use a procedure to encapsulate a data loading or transformation step that involves multiple tables or complex filtering. For instance, you might have a procedure that aggregates daily sales data into a monthly summary table.
CREATE OR REPLACE PROCEDURE data_aggregation.aggregate_monthly_sales()
BEGIN
-- Clear previous month's summary to avoid duplicates
DELETE FROM data_aggregation.monthly_sales_summary
WHERE sale_month = DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH);
-- Insert aggregated data for the previous month
INSERT INTO data_aggregation.monthly_sales_summary (sale_month, total_revenue, distinct_products)
SELECT
DATE_TRUNC(sale_date, MONTH) AS sale_month,
SUM(quantity * price) AS total_revenue,
COUNT(DISTINCT product_id) AS distinct_products
FROM sales_analysis.sales
WHERE sale_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
AND sale_date < DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY 1;
END;
You would then schedule this procedure to run monthly. The DELETE statement ensures that if the procedure is run multiple times for the same month (e.g., due to a retry), it won’t create duplicate aggregate records. The second INSERT then populates the summary for the previous month, which is a common pattern for batch aggregation jobs.
The next step you’ll likely encounter is exploring how to handle errors gracefully within stored procedures using EXCEPTION blocks.