Cloud Run’s "CPU always allocated" setting can actually save you money, not cost you more.

Let’s see it in action. Imagine you have a web service that gets sporadic traffic. By default, Cloud Run can scale down to zero instances during idle periods. When the next request arrives, a new instance has to start before it can respond. This "cold start" adds latency, which hurts the user experience and, more importantly for this discussion, can make you miss deadlines if your service performs time-sensitive work.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-cpu-intensive-service
spec:
  template:
    metadata:
      annotations:
        # The key setting: CPU always allocated (same as gcloud --no-cpu-throttling)
        run.googleapis.com/cpu-throttling: "false"
        # Keep one instance warm to avoid cold starts (same as --min-instances=1)
        autoscaling.knative.dev/minScale: "1"
    spec:
      containers:
        - image: gcr.io/my-project/my-cpu-app:latest
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
      containerConcurrency: 1 # Or whatever makes sense for your app
# Apply the manifest with:
# gcloud run services replace service.yaml --region=us-central1
# Or set the same options directly on an existing service:
# gcloud run services update my-cpu-intensive-service --no-cpu-throttling --min-instances=1 --region=us-central1
```

The problem this solves is latency and stalled work between requests. When "CPU always allocated" is enabled (CPU throttling disabled in gcloud), each running instance keeps its full CPU even when it has no active requests, so background work keeps running. The setting does not by itself keep an instance alive: to have at least one warm instance ready, and so avoid cold starts, you also need the minimum number of instances set to 1 or more.

Here’s how it works internally. By default, Cloud Run allocates CPU to an instance only while it is starting up or processing requests; between requests the instance may stay alive, but its CPU is throttled, and billing covers roughly that same request-processing time. With "CPU always allocated" enabled, the instance gets its CPU for its entire lifetime and is billed for that entire lifetime. Autoscaling still applies: the minimum number of instances defaults to 0, so idle instances can still be shut down unless you raise that minimum.
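To make the billing difference concrete, here is a simplified Python model of billable time for a single instance. It assumes request-based billing counts only the time during which at least one request is in flight, and always-allocated billing counts the instance's whole lifetime. Real Cloud Run billing also covers startup and shutdown and rounds durations, so treat this as a sketch; the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds) of one request


def busy_seconds(requests: List[Interval]) -> float:
    """Total time covered by at least one request, merging overlaps."""
    total = 0.0
    cur_start: Optional[float] = None
    cur_end: Optional[float] = None
    for start, end in sorted(requests):
        if cur_end is None or start > cur_end:
            # A gap: close the previous busy stretch and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping request: extend the current busy stretch.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def billable_seconds(requests: List[Interval], lifetime_s: float,
                     cpu_always_allocated: bool) -> float:
    """Simplified billable time for one instance under each CPU mode."""
    if cpu_always_allocated:
        return lifetime_s  # billed for the whole time the instance exists
    return busy_seconds(requests)  # billed only while serving requests


reqs = [(0.0, 2.0), (1.0, 3.0), (10.0, 11.0)]
print(billable_seconds(reqs, 60.0, cpu_always_allocated=False))
print(billable_seconds(reqs, 60.0, cpu_always_allocated=True))
```

With three short, partly overlapping requests in a 60-second lifetime, request-based billing counts 4 seconds while always-allocated billing counts all 60; the gap between the two shrinks as the instance gets busier.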

The levers you control live in the Cloud Run console or in gcloud run services update. You can toggle CPU throttling: when throttling is off, CPU is always allocated. You can set the minimum number of instances, which decides whether a warm instance is kept around. You also choose the number of CPU cores and the memory allocated to each instance, which affect both performance and cost.
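If you script deployments, these levers map onto gcloud flags that exist on gcloud run services update: --cpu, --memory, --[no-]cpu-throttling, --min-instances and --region. The helper below is a hypothetical sketch that only assembles the command; its name and defaults are illustrative, not part of any Google tooling.

```python
import shlex
from typing import List


def build_update_cmd(service: str, region: str, cpu: str = "1",
                     memory: str = "512Mi", cpu_always_allocated: bool = True,
                     min_instances: int = 0) -> List[str]:
    """Hypothetical helper: build a `gcloud run services update` argv list."""
    cmd = [
        "gcloud", "run", "services", "update", service,
        f"--region={region}",
        f"--cpu={cpu}",                      # vCPUs per instance
        f"--memory={memory}",                # memory per instance
        f"--min-instances={min_instances}",  # >= 1 keeps a warm instance
    ]
    # --no-cpu-throttling means CPU always allocated;
    # --cpu-throttling means CPU only while handling requests.
    cmd.append("--no-cpu-throttling" if cpu_always_allocated else "--cpu-throttling")
    return cmd


cmd = build_update_cmd("my-cpu-intensive-service", "us-central1", min_instances=1)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Returning an argument list rather than a single string keeps it safe to pass to subprocess.run without shell quoting issues.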

The surprising part is the pricing. Always-allocated CPU is billed for an instance's whole lifetime, but at a lower per-unit rate than request-based billing and without the per-request fee. For services with frequent, short bursts of activity that keep an instance busy most of the time, such as real-time data processing, financial transactions, or dynamic content that must respond instantly, the instance would be billed for most of its lifetime anyway, so the lower rate can make the always-allocated option cheaper overall. Paired with a minimum instance, you are essentially paying a predictable price for an always-ready instance instead of a per-request bill plus the unpredictable delays of cold starts. For genuinely sporadic traffic with long idle gaps, request-based billing usually remains the cheaper choice.
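A rough way to find the crossover is to compute the busy fraction at which both billing modes cost the same. The sketch below assumes one instance kept alive for the whole window (for example by a minimum instance) and ignores startup, rounding and free-tier details. The rates are hypothetical numbers in arbitrary price units, chosen only to mirror the shape of the pricing (lower unit price and no per-request fee when CPU is always allocated); they are not real Cloud Run prices, so substitute current rates for your region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Unit prices: per vCPU-second, per GiB-second, per request."""
    vcpu_s: float
    gib_s: float
    per_request: float = 0.0


def cost(vcpu: float, gib: float, billable_s: float, requests: float, r: Rates) -> float:
    """Compute charge for the billable seconds plus any per-request fees."""
    return billable_s * (vcpu * r.vcpu_s + gib * r.gib_s) + requests * r.per_request


def break_even_utilization(vcpu: float, gib: float, req_per_busy_s: float,
                           request_based: Rates, always: Rates) -> float:
    """Busy fraction of the lifetime above which always-allocated is cheaper."""
    per_alive_s = vcpu * always.vcpu_s + gib * always.gib_s
    per_busy_s = (vcpu * request_based.vcpu_s + gib * request_based.gib_s
                  + req_per_busy_s * (request_based.per_request - always.per_request))
    return per_alive_s / per_busy_s


# HYPOTHETICAL rates in arbitrary units; not real prices.
REQUEST_BASED = Rates(vcpu_s=1.0, gib_s=0.1, per_request=0.5)
ALWAYS_ALLOCATED = Rates(vcpu_s=0.75, gib_s=0.08)

LIFETIME_S = 1000.0  # instance alive for this whole window
for utilization in (0.2, 0.5, 0.8):
    busy_s = utilization * LIFETIME_S
    requests = busy_s  # assume about one request per busy second
    req_cost = cost(1, 0.5, busy_s, requests, REQUEST_BASED)
    always_cost = cost(1, 0.5, LIFETIME_S, requests, ALWAYS_ALLOCATED)
    print(f"{utilization:.0%} busy: request-based={req_cost:.0f}, "
          f"always-allocated={always_cost:.0f}")

print(f"break-even at about {break_even_utilization(1, 0.5, 1.0, REQUEST_BASED, ALWAYS_ALLOCATED):.0%} busy")
```

With these made-up rates the crossover sits near 51% utilization: below it request-based billing wins, above it the always-allocated instance is cheaper. The real crossover depends entirely on current prices and your request rate.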

The next concept you’ll likely encounter is optimizing instance count and scaling behavior for mixed workloads.
