Cloud Run can now attach GPUs to container instances, which matters for ML inference: you keep serverless deployment and scale-to-zero while getting the compute that real-time models need.

Let’s see it in action. Imagine you have a trained TensorFlow model that does image classification. You want to serve this model via an API endpoint, and you want it to be fast and scalable.

Here’s a conceptual Dockerfile for a simple Flask application that serves a TensorFlow model:

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install TensorFlow with its bundled CUDA libraries (the and-cuda extra needs TensorFlow 2.14 or later)
RUN pip install --no-cache-dir "tensorflow[and-cuda]"

COPY app.py .
COPY model/ /app/model/

ENV PORT=8080
EXPOSE 8080

CMD ["python", "app.py"]

And here’s a basic app.py that loads the model and has a prediction endpoint:

import os
import tensorflow as tf
from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

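# Load the model once at startup so every request reuses it
# Assumes your model is saved in the 'model' directory
model_path = os.path.join(os.path.dirname(__file__), 'model', 'your_model.h5')
model = tf.keras.models.load_model(model_path)

# Use the GPU when TensorFlow can see one, otherwise fall back to the CPU.
# Forcing '/GPU:0' on a machine without a GPU can fail at run time.
device = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True)
    if not data or 'images' not in data:
        return jsonify({'error': "expected a JSON body with an 'images' field"}), 400

    # Assuming input data is a list of images, preprocess as needed
    # Example: convert to numpy array and reshape
    input_data = np.array(data['images'], dtype=np.float32)

    # Run inference on the device selected at startup
    with tf.device(device):
        predictions = model.predict(input_data)

    return jsonify({'predictions': predictions.tolist()})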

if __name__ == '__main__':
    port = int(os.environ.get('PORT', 8080))
    app.run(host='0.0.0.0', port=port)
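
To exercise the endpoint, a client has to POST a JSON body whose images key holds a nested list, since that is what predict() reads. Here’s a minimal standard-library sketch of such a client. The helper names build_predict_request and predict are hypothetical, and the URL you pass in is whatever your own service exposes.

```python
import json
import urllib.request


def build_predict_request(url, images):
    """Build a POST request for the /predict endpoint.

    `images` is a nested list of numbers (a batch of images); the
    endpoint reads it from the "images" key of the JSON body.
    """
    body = json.dumps({"images": images}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, images, timeout=60):
    """Send the request and return the decoded "predictions" list."""
    request = build_predict_request(url, images)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

Keeping request construction separate from sending makes the payload easy to inspect locally before you point it at a deployed service.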

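To deploy this to Cloud Run with a GPU, you’d build the Docker image, push it, and then deploy it using the gcloud CLI. The key flags are --gpu, which sets the number of GPUs per instance, and --gpu-type, which selects the accelerator (nvidia-l4 here). At the time of writing, GPU services also need CPU always allocated (--no-cpu-throttling) and, for an L4, at least 4 CPUs and 16 GiB of memory.

# Build the Docker image
docker build -t gcr.io/YOUR_PROJECT_ID/gpu-inference-app:latest .

# Push the image (Container Registry is deprecated; an Artifact Registry
# path such as REGION-docker.pkg.dev/YOUR_PROJECT_ID/REPO/... works the same way)
docker push gcr.io/YOUR_PROJECT_ID/gpu-inference-app:latest

# Deploy to Cloud Run with one NVIDIA L4 GPU per instance
gcloud run deploy gpu-inference-service \
  --image gcr.io/YOUR_PROJECT_ID/gpu-inference-app:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --memory 16Gi \
  --cpu 4 \
  --no-cpu-throttling \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --min-instances 0 \
  --max-instances 5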

This setup lets Cloud Run provision GPU-enabled instances when your service receives traffic, run your model inference on the attached NVIDIA GPU, and scale down to zero when idle, just like other Cloud Run services.
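
Scaling to zero has one consequence for callers: the first request after an idle period has to wait for an instance to start and the model to load, so it can be slow or time out. Here’s a minimal sketch of a client-side retry with exponential backoff. The helper name call_with_backoff and its default values are assumptions for illustration, not part of Cloud Run or gcloud.

```python
import time


def call_with_backoff(send, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it succeeds or `attempts` is exhausted.

    `send` is any zero-argument callable that performs the request.
    Network failures raised by urllib (URLError, timeouts) are OSError
    subclasses, so that is what gets retried here.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return send()
        except OSError as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

If cold starts are unacceptable for your traffic, setting --min-instances above 0 keeps an instance warm at the cost of paying for it while idle.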

The problem this solves is the operational overhead of managing GPU instances. Historically, running ML inference at scale meant provisioning, configuring, and maintaining dedicated GPU VMs, which is complex and costly, especially for variable or spiky workloads. Cloud Run abstracts that away: you provide a containerized application, specify your GPU needs, and Cloud Run attaches GPUs to your container instances as they start. You get GPU latency and throughput for inference without managing the underlying infrastructure.

Internally, Cloud Run schedules your container on a host that has an available GPU of the requested type. NVIDIA drivers are pre-installed on that host and exposed to the container, so CUDA-based ML frameworks can use the device. The GPU count sets how many GPUs each instance gets, while memory and CPU still govern everything that runs alongside the GPU. For example, if your model requires significant pre-processing or post-processing that is CPU-bound, you’ll need to allocate enough CPU and memory to keep the GPU fed.
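
One way to find out whether a request is CPU-bound or GPU-bound is to time each stage separately. Here’s a small sketch of that idea. StageTimer is a hypothetical helper, not something Cloud Run or TensorFlow provides, and the stage names are whatever you choose.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock seconds per named stage of a request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised
            self.totals[name] += self._clock() - start

    def slowest(self):
        """Return the name of the stage with the largest total time."""
        return max(self.totals, key=self.totals.get)
```

In a request handler you would wrap the array conversion in timer.stage('preprocess') and the model call in timer.stage('inference'). If preprocessing dominates, more CPU is likely to help more than a bigger GPU allocation.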

The mental model here is that Cloud Run treats GPUs as just another resource to be allocated to your container. You don’t SSH into a GPU machine; your container runs, and the GPU you requested is attached to the instance it runs on. This is enabled by a scheduler that tracks GPU availability across the fleet of underlying machines. The platform supplies the NVIDIA driver, so TensorFlow or PyTorch can detect and use the GPU as long as your image ships the CUDA libraries the framework needs.

Here’s a point that is easy to get wrong: although you’re deploying a container that explicitly installs tensorflow[and-cuda], you never install or manage the NVIDIA driver yourself. Cloud Run provides the driver on the host and makes it available inside your container, so your application can use the hardware without manual driver setup. What you still own is compatibility: the CUDA libraries bundled in your image have to be supported by the driver version Cloud Run provides, so check that version in the Cloud Run GPU documentation before pinning a TensorFlow release.
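
A quick way to confirm that the container actually sees the GPU is to log what TensorFlow detects at startup. Here’s a minimal sketch, where gpu_report is a hypothetical helper name. It returns an empty result when TensorFlow is not installed, so it is also safe to run on a laptop.

```python
def gpu_report():
    """Return the GPUs TensorFlow can see and the CUDA version it was built with."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment
        return {"gpus": [], "cuda_version": None}

    gpus = [device.name for device in tf.config.list_physical_devices("GPU")]
    # CPU-only builds may not report a CUDA version, hence .get()
    cuda_version = tf.sysconfig.get_build_info().get("cuda_version")
    return {"gpus": gpus, "cuda_version": cuda_version}


if __name__ == "__main__":
    print(gpu_report())
```

An empty gpus list on a GPU-enabled instance usually points to a mismatch between the CUDA libraries in the image and the driver on the host, which is the first thing to check.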

The next step after successfully deploying your GPU-accelerated inference service is to optimize its performance and cost by tuning the resource allocation, particularly memory, CPU, and GPU count, based on real-world request patterns and observed latencies.
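
Tuning from observed latencies usually starts with percentiles rather than averages, because a few slow requests can hide behind a good mean. Here’s a minimal standard-library sketch. latency_summary is a hypothetical helper, and the latencies are assumed to be per-request times in milliseconds that you have collected yourself.

```python
import statistics


def latency_summary(latencies_ms):
    """Summarize request latencies as p50, p95, and max.

    Needs at least two samples. quantiles(n=100) returns 99 cut
    points, so index 49 is the 50th percentile and index 94 the 95th.
    """
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies_ms)}
```

Comparing these numbers before and after a change to memory, CPU, or GPU settings gives you a concrete basis for deciding whether the change was worth its cost.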
