BQML lets you train and deploy machine learning models directly within BigQuery using SQL.
Let’s see BQML in action. Imagine you have a table of customer data, your_project.your_dataset.customer_data, with columns like customer_id, age, gender, annual_income, and spending_score (a score from 1 to 100). You want to predict a customer’s spending_score from their other attributes.
CREATE OR REPLACE MODEL your_project.your_dataset.customer_spending_model
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['spending_score']
) AS
SELECT
  age,
  gender,
  annual_income,
  spending_score
FROM
  your_project.your_dataset.customer_data
WHERE
  -- Filter out any rows with missing target values for training
  spending_score IS NOT NULL;
This CREATE OR REPLACE MODEL statement defines a linear regression model named customer_spending_model. It sets model_type to linear_reg and names spending_score as the label through input_label_cols. The AS clause provides the training data, selecting the features and the target variable from your customer_data table. BQML handles the automatic preprocessing (such as standardizing numerical columns and one-hot encoding gender) and the model training, all within BigQuery’s infrastructure.
Once trained, you can use the model for predictions:
SELECT
  customer_id,
  predicted_spending_score
FROM
  ML.PREDICT(MODEL your_project.your_dataset.customer_spending_model,
    (
      SELECT
        customer_id,
        age,
        gender,
        annual_income
      FROM
        your_project.your_dataset.customer_data
      WHERE
        -- For prediction, we only need features, not the label
        spending_score IS NULL -- Example: predicting for customers without a score yet
    )
  );
The ML.PREDICT function takes your trained model and new data (customers for whom you want to predict spending_score) and returns the predictions. Its output contains the columns of the input query plus the prediction, which is named predicted_spending_score by default (predicted_ followed by the label column name).
The core problem BQML solves is the friction between data warehousing and machine learning. Traditionally, you’d extract data from BigQuery, load it into a separate ML environment (like Vertex AI, SageMaker, or a local Python setup), train a model, and then potentially load the model back or use it to score data in BigQuery. BQML eliminates these ETL steps for model training and inference, allowing you to leverage your data where it lives. It supports various model types including linear regression, logistic regression, k-means clustering, boosted trees (XGBoost), deep neural networks, and matrix factorization for recommendations.
Internally, BQML orchestrates the training process by leveraging BigQuery’s distributed query engine. When you run a CREATE MODEL statement, BigQuery converts the SQL into an execution plan that includes data scanning, feature transformation, and distributed model training. For complex models like boosted trees or DNNs, it might leverage Vertex AI’s managed training infrastructure behind the scenes, but you interact with it purely through SQL. The model itself is stored as a BigQuery object inside a dataset, so you manage it alongside your tables with the same access controls. Note that CREATE OR REPLACE MODEL overwrites the existing model, so keeping earlier versions means training under a new name or exporting the model.
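One way to look at what a training run actually did is the ML.TRAINING_INFO function. The query below is a minimal sketch that assumes the customer_spending_model from earlier has already been trained; which rows come back depends on the model type and the optimization strategy BigQuery chose.

```sql
-- Sketch: inspect the iterations of a finished training run.
-- Assumes customer_spending_model was created by the statement above.
SELECT
  training_run,
  iteration,
  loss,        -- loss on the training data after this iteration
  eval_loss,   -- loss on the evaluation split, NULL if none was held out
  duration_ms
FROM
  ML.TRAINING_INFO(MODEL your_project.your_dataset.customer_spending_model)
ORDER BY
  iteration;
```

A widening gap between loss and eval_loss across iterations is a common sign of overfitting.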
You control the model’s behavior through the OPTIONS clause. For example, to train a logistic regression model for binary classification, you’d set model_type='logistic_reg' and make sure the column named in input_label_cols holds a binary target. Hyperparameter tuning can be automated by setting num_trials in OPTIONS for models like boosted trees, where BQML searches for good values of parameters such as max_tree_depth or learn_rate. When tuning, hparam_tuning_objectives selects the metric to optimize (for example r2_score for regression or roc_auc for classification), and ML.TRAINING_INFO reports the training and evaluation loss of each iteration once the run finishes.
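As a rough sketch of the tuning options described above, the statement below trains a boosted tree regressor on the same customer table and lets BigQuery ML search over tree depth and learning rate, which BigQuery ML spells max_tree_depth and learn_rate. The model name customer_spending_bt_model, the trial count, and the search ranges are illustrative assumptions, not recommendations.

```sql
-- Sketch: hyperparameter tuning for a boosted tree regressor.
-- The model name, trial count, and ranges are placeholder assumptions.
CREATE OR REPLACE MODEL your_project.your_dataset.customer_spending_bt_model
OPTIONS(
  model_type = 'boosted_tree_regressor',
  input_label_cols = ['spending_score'],
  num_trials = 10,                          -- number of candidate models to train
  max_parallel_trials = 2,                  -- how many trials run at the same time
  max_tree_depth = HPARAM_RANGE(3, 8),      -- search range for tree depth
  learn_rate = HPARAM_RANGE(0.01, 0.3),     -- search range for the learning rate
  hparam_tuning_objectives = ['r2_score']   -- metric used to rank the trials
) AS
SELECT
  age,
  gender,
  annual_income,
  spending_score
FROM
  your_project.your_dataset.customer_data
WHERE
  spending_score IS NOT NULL;

-- Compare the trials once training has finished.
SELECT
  *
FROM
  ML.TRIAL_INFO(MODEL your_project.your_dataset.customer_spending_bt_model);
```

Each trial trains a full model, so training time and cost grow with num_trials; starting with a small number is usually the cheaper way to check that the setup works.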
When using linear_reg or logistic_reg models, BQML automatically handles categorical features by applying one-hot encoding. For more complex transformations or custom feature engineering, you can pre-process your data using standard SQL or use a TRANSFORM clause within the CREATE MODEL statement. Either approach lets you create polynomial features, interaction terms, or scaled columns before training. For instance, to create an interaction term between age and annual_income, you could include age * annual_income AS age_income_interaction in the SELECT statement of the CREATE MODEL query, and BQML would treat this new column as a feature. The difference shows up at prediction time: a column computed in the SELECT must be recomputed in every query you pass to ML.PREDICT, whereas transformations declared in TRANSFORM are saved with the model and reapplied automatically.
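The feature engineering described above could look roughly like this with a TRANSFORM clause. This is a sketch under assumptions: the model name customer_spending_model_fe is made up for the example, and scaling annual_income with ML.STANDARD_SCALER is just one possible choice.

```sql
-- Sketch: feature engineering stored with the model through TRANSFORM.
-- The model name customer_spending_model_fe is a placeholder assumption.
CREATE OR REPLACE MODEL your_project.your_dataset.customer_spending_model_fe
TRANSFORM(
  age,
  gender,
  ML.STANDARD_SCALER(annual_income) OVER () AS annual_income_scaled,
  age * annual_income AS age_income_interaction,
  spending_score  -- the label has to be passed through as well
)
OPTIONS(
  model_type = 'linear_reg',
  input_label_cols = ['spending_score']
) AS
SELECT
  age,
  gender,
  annual_income,
  spending_score
FROM
  your_project.your_dataset.customer_data
WHERE
  spending_score IS NOT NULL;

-- The prediction input supplies only raw columns; the saved TRANSFORM
-- recreates annual_income_scaled and age_income_interaction.
SELECT
  customer_id,
  predicted_spending_score
FROM
  ML.PREDICT(MODEL your_project.your_dataset.customer_spending_model_fe,
    (
      SELECT
        customer_id,
        age,
        gender,
        annual_income
      FROM
        your_project.your_dataset.customer_data
      WHERE
        spending_score IS NULL
    )
  );
```

Keeping the transformations inside the model avoids the training and prediction queries drifting apart over time.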
The ML.EVALUATE function is crucial for understanding your model’s performance after training. It provides metrics relevant to the model type. For a linear_reg model, this includes Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared. For logistic_reg, you’ll see accuracy, precision, recall, F1-score, and AUC.
SELECT
  *
FROM
  ML.EVALUATE(MODEL your_project.your_dataset.customer_spending_model);
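Called with only the model, ML.EVALUATE returns the metrics computed during training: on the evaluation split that BQML held out (controlled by the data_split_method option) or, if no data was held out, on the training data itself. To measure performance on data of your own choosing, pass a table or query containing the features and the label as a second argument.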
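To evaluate against data the model has not seen, you can hand ML.EVALUATE its own input query. The sketch below assumes a hypothetical table customer_holdout with the same columns as customer_data; that table is not part of the original example.

```sql
-- Sketch: evaluate on an explicit holdout set.
-- customer_holdout is a hypothetical table with the same schema as customer_data.
SELECT
  mean_absolute_error,
  mean_squared_error,
  r2_score
FROM
  ML.EVALUATE(MODEL your_project.your_dataset.customer_spending_model,
    (
      SELECT
        age,
        gender,
        annual_income,
        spending_score  -- the label is required so predictions can be compared
      FROM
        your_project.your_dataset.customer_holdout
      WHERE
        spending_score IS NOT NULL
    )
  );
```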
Many users overlook the importance of explicitly handling missing values before training, even though BQML imputes them automatically. By default, BQML replaces NULLs in numerical columns with the column mean and maps NULLs in one-hot encoded categorical columns to an additional category. Relying solely on this can mask underlying data quality issues or lead to suboptimal model performance. It’s generally best practice to inspect your data for missing values using COUNTIF(column IS NULL) and decide on an explicit strategy (e.g., filling with a specific value, using more advanced imputation techniques, or removing rows or columns) in the SELECT statement that feeds CREATE MODEL. This gives you granular control over exactly what the model is trained on.
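The audit and one possible explicit strategy can be sketched as follows. The choices here (labeling a missing gender as 'unknown' and dropping rows with a missing age or annual_income) and the model name customer_spending_model_clean are assumptions for illustration; the right strategy depends on why the values are missing.

```sql
-- Step 1: count missing values per column before training.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(age IS NULL) AS missing_age,
  COUNTIF(gender IS NULL) AS missing_gender,
  COUNTIF(annual_income IS NULL) AS missing_annual_income,
  COUNTIF(spending_score IS NULL) AS missing_spending_score
FROM
  your_project.your_dataset.customer_data;

-- Step 2: apply an explicit strategy in the training query.
-- The model name and the imputation choices are placeholder assumptions.
CREATE OR REPLACE MODEL your_project.your_dataset.customer_spending_model_clean
OPTIONS(
  model_type = 'linear_reg',
  input_label_cols = ['spending_score']
) AS
SELECT
  age,
  IFNULL(gender, 'unknown') AS gender,  -- make the missing category explicit
  annual_income,
  spending_score
FROM
  your_project.your_dataset.customer_data
WHERE
  spending_score IS NOT NULL
  AND age IS NOT NULL                   -- drop rows with missing numeric features
  AND annual_income IS NOT NULL;
```

Whatever rule you apply in the training query, such as the IFNULL on gender, also has to be applied to the input of ML.PREDICT so that the model sees data prepared the same way.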
The next step is often exploring feature importance to understand which attributes most influence your model’s predictions.
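For a linear model such as customer_spending_model, a simple starting point is to read the learned coefficients with ML.WEIGHTS. This is only a sketch of a first look, not a full feature importance analysis.

```sql
-- Sketch: inspect the coefficients of the linear regression model.
SELECT
  processed_input,   -- feature name
  weight,            -- coefficient for numerical features
  category_weights   -- per-category coefficients for categorical features such as gender
FROM
  ML.WEIGHTS(MODEL your_project.your_dataset.customer_spending_model);
```

Coefficients are only comparable across features that sit on similar scales, so a large weight on age next to a small weight on annual_income does not by itself mean age matters more.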