BigQuery’s Remote Functions let you call external HTTP APIs directly from your SQL queries, but they’re not just a fancy curl command; they’re a full-fledged extension of your data processing pipeline, capable of enriching terabytes of data with real-time external context.
Let’s see one in action. Imagine we have a table of product IDs and we want to fetch their current prices from an external pricing service.
-- Assume 'my_dataset.product_ids' has a column 'product_id'
SELECT
  product_id,
  -- A remote function is called by name, like any other user-defined function
  `my_project.my_dataset.get_product_price`(product_id) AS price_info
FROM
  my_dataset.product_ids
LIMIT 100;
Here, get_product_price is the magic. There is no special call wrapper: a remote function is invoked by its fully qualified name like any other user-defined function, with its arguments passed positionally. The function itself is created beforehand with a CREATE FUNCTION ... REMOTE WITH CONNECTION statement that points at your Cloud Function (or Cloud Run service). The output of the remote function is then directly available as a column in your BigQuery result.
How it Works Under the Hood
At its core, a BigQuery remote function is an implementation of the Remote Procedure Call (RPC) pattern, specifically tailored for HTTP. When you execute a query that calls a remote function, BigQuery doesn’t run your SQL against the external API. Instead, it acts as an orchestrator:
- Data Batching: BigQuery collects the input rows for the remote function call. For performance, it batches multiple rows into a single request to your external service. BigQuery picks the batch size itself unless you cap it with the max_batching_rows option when you create the function.
- Request Serialization: BigQuery serializes the batched input rows into a JSON payload whose structure is fixed by the remote function protocol. The rows travel in a calls array: each element is itself an array holding one row’s argument values in declaration order, so arguments are positional rather than keyed by column name.
- HTTP Invocation: BigQuery makes an HTTP POST request to the endpoint you’ve configured for your remote function. This endpoint is typically a Cloud Function or a Cloud Run service. The payload containing your data is sent in the request body.
- External Service Execution: Your external service receives the HTTP request, deserializes the JSON payload, and processes each row. This is where your custom logic lives – calling an external API, performing a calculation, etc.
- Response Serialization: Your external service processes the batch and returns a JSON response. On success, the response must contain a replies array with exactly one result per input row, in the same order as the input. On failure, it returns a non-200 HTTP status with an errorMessage field instead.
- Result Deserialization: BigQuery receives the HTTP response, validates it, and deserializes the results. These results are then integrated back into your query, appearing as the return value of the remote function.
- Error Handling: If your external service returns an errorMessage in its response, or if the HTTP request itself fails (e.g., timeout, invalid response), BigQuery propagates the error back to your SQL query and fails the query execution.
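The round trip described in these steps can be sketched without any web framework. The snippet below is a minimal illustration, assuming the request carries rows in a calls array and the response returns them in a replies array. The PRICES table and the handle_batch helper are hypothetical stand-ins for a real pricing service and a real HTTP handler.

```python
import json

# Hypothetical in-memory stand-in for the external pricing service.
PRICES = {"sku-1": 19.99, "sku-2": 5.49}


def handle_batch(body: dict) -> tuple[dict, int]:
    """Turn one remote-function request body into (response_body, http_status)."""
    try:
        calls = body["calls"]
        # Each call is a list of positional arguments; this function takes one.
        # Unknown products map to None, which serializes to JSON null.
        replies = [PRICES.get(args[0]) for args in calls]
    except Exception as exc:
        # Any malformed request fails the whole batch with an explanation.
        return {"errorMessage": f"bad request: {exc}"}, 400
    return {"replies": replies}, 200


# A request shaped like the one BigQuery would POST for three input rows.
request_body = json.loads(
    '{"requestId": "abc", "calls": [["sku-1"], ["sku-2"], ["sku-404"]]}'
)
response_body, status = handle_batch(request_body)
print(status, json.dumps(response_body))
# prints: 200 {"replies": [19.99, 5.49, null]}
```

In a deployed Cloud Function or Cloud Run service, the same function would sit behind whatever HTTP framework you use: parse the POST body as JSON, pass it to the handler, and send back the returned body and status code.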
The Levers You Control
The power of remote functions comes from your ability to define the external logic. You’re essentially writing a small, stateless service that BigQuery can call.
- Input Schema: The argument list you declare in CREATE FUNCTION defines the input signature of your remote function. Your external service receives those arguments positionally for every row in the batch.
- Output Schema: The return type of your BigQuery remote function (defined when you create the function) dictates the shape of each result your external service must provide. If your function is declared to return FLOAT64, every result must be a JSON number such as 19.99. STRUCT is not among the return types remote functions support at the time of writing, so for a structured result declare the return type as JSON and return an object like {"price": 19.99, "currency": "USD"} for each row.
- Endpoint Configuration: You specify the HTTP endpoint (Cloud Function URL or Cloud Run service URL) when you create the remote function in BigQuery. This is the target BigQuery will POST to.
- Authentication: BigQuery calls your endpoint through a connection (a Cloud resource connection) that you name in CREATE FUNCTION. The connection has its own service account, and granting that account the invoker role on your Cloud Function or Cloud Run service is what lets BigQuery authenticate.
- Batching Behavior: While BigQuery handles the batching, remember that a single request can carry many rows. Design your external service to process a whole batch efficiently, and set max_batching_rows if your service needs a hard upper bound.
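Most of these levers are set in a single CREATE FUNCTION statement. The sketch below assembles that DDL as a Python string so the placeholders are explicit. The function name, connection ID, endpoint URL, and batch cap are all hypothetical values, and the render_create_function helper exists only for this illustration. The endpoint and max_batching_rows option names are the ones BigQuery uses.

```python
def render_create_function(function_name: str, connection: str,
                           endpoint: str, max_batching_rows: int) -> str:
    """Render a remote-function DDL statement (every value is a placeholder)."""
    return (
        f"CREATE OR REPLACE FUNCTION `{function_name}`(product_id STRING)\n"
        f"RETURNS FLOAT64\n"                        # output schema
        f"REMOTE WITH CONNECTION `{connection}`\n"  # authentication
        f"OPTIONS (\n"
        f"  endpoint = '{endpoint}',\n"             # endpoint configuration
        f"  max_batching_rows = {max_batching_rows}\n"  # batching cap
        f")"
    )


ddl = render_create_function(
    function_name="my_project.my_dataset.get_product_price",
    connection="my_project.us.my_connection",          # hypothetical connection ID
    endpoint="https://example.com/get-product-price",  # hypothetical URL
    max_batching_rows=500,                             # hypothetical cap
)
print(ddl)
```

The printed statement can be pasted into the BigQuery console or submitted through any client once the placeholders are replaced with real resources.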
The most impactful lever you have is the design of your external service’s response. It must mirror the input: one result per input row, in the same order, and if anything goes wrong it must return a non-200 status with an errorMessage field that says what failed. If your service crashes or returns malformed JSON, BigQuery can only report a generic error, leaving you to dig through the external service’s logs.
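One way to enforce that contract is to route every batch through a small wrapper that either returns one reply per call or fails with a message naming the offending row. This is a sketch under assumptions: run_batch and half_price are hypothetical helpers, and the 400 status is simply one choice of non-200 code.

```python
from typing import Any, Callable


def run_batch(calls: list[list[Any]],
              process_row: Callable[..., Any]) -> tuple[dict, int]:
    """Apply process_row to every call; return (response_body, http_status)."""
    replies = []
    for index, args in enumerate(calls):
        try:
            replies.append(process_row(*args))
        except Exception as exc:
            # Fail the whole batch, but record which row failed and why.
            return {"errorMessage": f"row {index} {args!r}: {exc}"}, 400
    # One reply per call, in input order.
    return {"replies": replies}, 200


def half_price(price: float) -> float:
    """Hypothetical per-row logic: reject negative prices, halve the rest."""
    if price < 0:
        raise ValueError("negative price")
    return price * 0.5


ok_body, ok_status = run_batch([[1.5], [2.0]], half_price)
err_body, err_status = run_batch([[1.5], [-1.0]], half_price)
print(ok_status, ok_body)    # 200 {'replies': [0.75, 1.0]}
print(err_status, err_body)  # 400 {'errorMessage': 'row 1 [-1.0]: negative price'}
```

Because the error message carries the row index and its arguments, the failure that surfaces in the query points at the offending input instead of sending you straight to the service logs.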
The next step in mastering this is understanding how to handle partial failures within a batch, where some rows succeed and others fail, and how BigQuery’s error reporting surface can help pinpoint the exact row that caused an issue.