Cloud Run doesn’t actually "warm" instances in any special way. When you set min-instances, it simply keeps at least that many containers running and ready to serve traffic. The "cold start" you’re trying to avoid is the time it takes for a new container to be provisioned, start up, and become ready for its first request.
Let’s see this in action. Imagine you have a simple Python Flask app deployed to Cloud Run.
```python
import os
import time

from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello_world():
    start_time = time.time()
    # Simulate some work
    time.sleep(2)
    end_time = time.time()
    return f"Hello from Cloud Run! Request processed in {end_time - start_time:.2f} seconds."


if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
```
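If this service has min-instances set to 0 (the default), and no traffic has hit it for a while, the first request can take several seconds longer than the 2 seconds of simulated work. Note that the time reported in the response body will still be about 2 seconds, because the extra delay happens before your handler runs. This is because Cloud Run needs to: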
- Provision a new container instance: Allocate resources from Google’s infrastructure.
- Fetch your container image: Pull the image layers from Artifact Registry (or the legacy Container Registry, which Google has deprecated in favor of Artifact Registry).
- Start the container: Execute your application’s entrypoint.
- Initialize your application: Run any startup code (like loading models, connecting to databases, etc.).
- Receive the request: Finally, route the incoming HTTP request to your running application.
This entire sequence is the "cold start."
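One way to see which requests paid part of this cost is to have each instance remember when its process started and whether it has served anything yet. The sketch below is a minimal, framework-free illustration. The `ColdStartTracker` class and its field names are hypothetical, not part of Flask or Cloud Run, and it can only observe time from process start onward, so provisioning and image fetch are not captured.

```python
import time


class ColdStartTracker:
    """Tracks whether a request is the first one served by this process.

    Hypothetical helper for illustration. Create one instance at module
    import time so its start timestamp approximates container start.
    Not thread-safe; a real service with concurrency would need a lock.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started_at = clock()
        self._requests_served = 0

    def record_request(self):
        """Return metadata describing the request being handled now."""
        self._requests_served += 1
        return {
            # True only for the first request this process handles.
            "first_request": self._requests_served == 1,
            # Seconds elapsed since the tracker was created.
            "instance_age_s": self._clock() - self._started_at,
        }


tracker = ColdStartTracker()
first = tracker.record_request()
second = tracker.record_request()
print(first["first_request"], second["first_request"])  # True False
```

In the Flask app above, you could call `tracker.record_request()` inside `hello_world()` and log the result. A request flagged as `first_request` with a small `instance_age_s` most likely landed on a freshly started instance.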
Now, let’s say you deploy this same service, but this time you configure min-instances to 1.
You can set this during deployment using gcloud:
```shell
gcloud run deploy YOUR_SERVICE_NAME \
  --image gcr.io/YOUR_PROJECT_ID/YOUR_IMAGE_NAME:latest \
  --platform managed \
  --region YOUR_REGION \
  --min-instances 1 \
  --max-instances 5 \
  --allow-unauthenticated
```
Or, you can update an existing service:
```shell
gcloud run services update YOUR_SERVICE_NAME \
  --platform managed \
  --region YOUR_REGION \
  --min-instances 1
```
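If you script your deployments from Python, the same update can be wrapped in a small helper. This is only a sketch: the function names `build_min_instances_command` and `set_min_instances` are made up for illustration, and the helper simply shells out to the `gcloud` command shown above, so it assumes the gcloud CLI is installed and authenticated.

```python
import shlex
import subprocess


def build_min_instances_command(service, region, min_instances):
    """Build the argv list for `gcloud run services update`."""
    if min_instances < 0:
        raise ValueError("min_instances must be zero or greater")
    return [
        "gcloud", "run", "services", "update", service,
        "--platform", "managed",
        "--region", region,
        "--min-instances", str(min_instances),
    ]


def set_min_instances(service, region, min_instances):
    """Run the update. Requires an installed, authenticated gcloud CLI."""
    command = build_min_instances_command(service, region, min_instances)
    subprocess.run(command, check=True)


# Print the command rather than running it, so the sketch is safe to try.
print(shlex.join(build_min_instances_command("YOUR_SERVICE_NAME", "YOUR_REGION", 1)))
```

Keeping command construction separate from execution makes the helper easy to test without touching a real project.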
With min-instances set to 1, Cloud Run keeps at least one container instance running and ready to serve traffic, even when no requests are arriving. A request that arrives while that instance has spare capacity is routed straight to it, bypassing the provisioning and startup phases. The latency the caller sees then reflects only the time your application takes to handle the request, not container startup.
The primary problem this solves is cold-start latency in latency-sensitive applications. If your users expect sub-second responses, or if your service sits in a critical request/response chain where delays are unacceptable, cold starts are a non-starter. This includes APIs for user-facing applications, real-time data processing, or any service where even a few seconds of delay can degrade the user experience or break downstream systems.
The mental model here is that min-instances acts like a reservation system. You’re telling Cloud Run, "I want to keep at least X number of these application servers humming and ready to go at all times." When traffic spikes, Cloud Run will scale up to max-instances by starting new instances, but those new instances will also experience a cold start. The min-instances setting only guarantees that a baseline number of instances are already warm.
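To make the reservation model concrete, here is a rough back-of-the-envelope estimate of how many instances would have to cold start to absorb a burst. It is a deliberately simplified sketch: the function name is hypothetical, it assumes each instance can take a fixed number of simultaneous requests (the `concurrency` argument), and it ignores the heuristics Cloud Run’s real autoscaler uses.

```python
import math


def estimate_cold_instances(concurrent_requests, concurrency,
                            min_instances, max_instances):
    """Estimate how many instances must cold start to absorb a burst.

    Simplified model: every instance handles up to `concurrency`
    simultaneous requests, and scaling stops at `max_instances`.
    """
    needed = math.ceil(concurrent_requests / concurrency)
    needed = min(needed, max_instances)
    # Instances already kept warm do not need to cold start.
    return max(0, needed - min_instances)


# Using min 1 and max 5 as in the deploy command, with an assumed
# concurrency of 80 requests per instance:
print(estimate_cold_instances(50, 80, 1, 5))    # 0: the warm instance absorbs the burst
print(estimate_cold_instances(200, 80, 1, 5))   # 2: three instances needed, one already warm
print(estimate_cold_instances(1000, 80, 1, 5))  # 4: capped at max-instances
```

Under this model, raising min-instances reduces the number of cold starts during a spike but never below what the burst size demands.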
It’s crucial to understand that min-instances incurs costs. You are billed for the CPU and memory allocated to these instances for the entire time they are kept running, whether or not they are serving requests (idle time is typically billed at a lower rate than active time, so check the current Cloud Run pricing for your billing mode). This is the trade-off for avoiding cold starts on baseline traffic. If your traffic is infrequent and not latency-critical, or if you can tolerate occasional cold starts in exchange for cost savings, leaving min-instances at 0 may be more appropriate.
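A quick calculation helps put a number on that trade-off. The sketch below multiplies allocated resources by per-second rates over a 30-day month. The function name is hypothetical and the rates in the example are placeholders chosen for easy arithmetic, not real Cloud Run prices; substitute the figures from the current pricing page for your region and billing mode.

```python
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def monthly_idle_cost(min_instances, vcpus, memory_gib,
                      vcpu_rate_per_second, gib_rate_per_second,
                      seconds=SECONDS_PER_30_DAY_MONTH):
    """Estimate the cost of keeping `min_instances` running for `seconds`."""
    per_instance_per_second = (
        vcpus * vcpu_rate_per_second + memory_gib * gib_rate_per_second
    )
    return min_instances * per_instance_per_second * seconds


# Placeholder rates for illustration only, NOT real Cloud Run prices.
cost = monthly_idle_cost(
    min_instances=1,
    vcpus=1,
    memory_gib=0.5,
    vcpu_rate_per_second=0.000002,
    gib_rate_per_second=0.0000002,
)
print(f"{cost:.2f}")  # 5.44 with these placeholder rates
```

Because the cost scales linearly with min-instances, doubling the warm floor doubles this baseline charge.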
Under the hood, Cloud Run’s control plane continuously compares the number of running instances with the min-instances setting. If the count drops below the minimum, for example because an instance crashed or was recycled, the control plane automatically provisions a replacement. If you configure startup or liveness probes, Cloud Run also uses them to judge whether an instance is healthy, though a probe is a much simpler check than a real application request.
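Conceptually, this behaves like a reconciliation loop that tops the instance count back up to a floor. The toy model below illustrates the idea only. It is not Cloud Run’s actual implementation, and both function names are invented for this sketch.

```python
def instances_to_start(running_instances, min_instances):
    """Return how many instances are needed to restore the minimum."""
    return max(0, min_instances - running_instances)


def reconcile(running_instances, min_instances):
    """One reconciliation pass: top up to the floor, never scale down."""
    return running_instances + instances_to_start(running_instances, min_instances)


# Observed instance counts over time, with min-instances set to 1.
for observed in [1, 0, 3]:
    print(observed, "->", reconcile(observed, 1))
# 1 -> 1  (already at the floor)
# 0 -> 1  (one replacement started)
# 3 -> 3  (above the floor; scale-down is handled separately by autoscaling)
```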
In short, setting min-instances above zero means Cloud Run maintains that many container instances in a ready state, so a request routed to one of them skips provisioning, image fetching, and application startup entirely.
The next thing you’ll want to consider is how to manage the cost implications of keeping instances warm, especially for applications with highly variable traffic patterns.