Cloud Run services can balloon in cost faster than you can say "serverless" if you’re not careful about resource allocation and scaling.

Here’s how to reclaim your budget:

Let’s see a typical Cloud Run service in action. Imagine we have an API that processes image uploads. It’s not super CPU-intensive, but it does need a decent chunk of memory for image manipulation.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: image-processor
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "100"
    spec:
      containers:
      - image: gcr.io/my-project/image-processor:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "1000m" # 1 vCPU
            memory: "1Gi" # 1 GiB

This setup might seem reasonable, but it’s likely bleeding money. The resources.limits section is where the magic (and the cost) happens. Cloud Run gives every instance it spins up this much CPU and memory, and bills for the full amount whenever the instance is billable, regardless of how much of it your application actually uses.

The core problem is that Cloud Run sizes each instance from spec.template.spec.containers[].resources.limits (Cloud Run reads limits; it does not use Kubernetes-style requests). If these values are higher than what your application needs during its typical workload, you’re paying for idle capacity. This applies to both CPU and memory.
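To make the idle-capacity point concrete, here is a minimal Python sketch that estimates what share of each billable instance-second pays for capacity the application never touches. The two price constants are placeholders chosen for illustration, not actual Cloud Run rates, and the Allocation class and wasted_share function are hypothetical names; substitute the rates for your region from the pricing page.

```python
from dataclasses import dataclass

# Hypothetical unit prices for illustration only; NOT real Cloud Run rates.
PRICE_PER_VCPU_SECOND = 0.00002
PRICE_PER_GIB_SECOND = 0.000002


@dataclass
class Allocation:
    vcpu: float        # CPU per instance, in vCPU
    memory_gib: float  # memory per instance, in GiB

    def cost_per_second(self) -> float:
        """Cost of one billable instance-second at the placeholder prices."""
        return (self.vcpu * PRICE_PER_VCPU_SECOND
                + self.memory_gib * PRICE_PER_GIB_SECOND)


def wasted_share(configured: Allocation, used: Allocation) -> float:
    """Fraction of the per-second cost that pays for unused capacity."""
    total = configured.cost_per_second()
    return (total - used.cost_per_second()) / total


# The example service: 1 vCPU / 1 GiB configured, roughly 0.2 vCPU / 256 MiB used.
configured = Allocation(vcpu=1.0, memory_gib=1.0)
typical_use = Allocation(vcpu=0.2, memory_gib=0.25)
print(f"Share of spend on unused capacity: {wasted_share(configured, typical_use):.0%}")
```

The exact share depends entirely on the prices you plug in; the point is only that the bill follows the configured allocation, not the measured usage.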

Cause 1: Over-provisioned CPU

Your service might be configured with 1000m (1 vCPU) per instance, but your application only spikes to 200m (0.2 vCPU) during peak load. You’re paying for 800m of idle CPU per instance.

  • Diagnosis: With the gcloud CLI installed, run gcloud run services describe SERVICE_NAME --region=REGION --format='value(spec.template.spec.containers[0].resources.limits.cpu)' to see the configured value. Then monitor your service’s CPU utilization in the Google Cloud Console under Cloud Run -> Your Service -> Metrics. Look for the "CPU utilization" graph.
  • Fix: Reduce resources.limits.cpu to a value that covers your peak sustained usage. For example, if your monitoring shows a consistent peak of 500m, change the limit to "500m". At the time of writing, Cloud Run attaches extra restrictions to values below 1 vCPU (notably a maximum concurrency of 1), so check the CPU limits documentation before going fractional.
  • Why it works: Cloud Run bills CPU by the amount allocated to each instance, not by the amount consumed. A lower limit shrinks the allocation per instance, and thus the cost.

Cause 2: Over-provisioned Memory

Similarly, your service might be configured with 1Gi of memory, but your application typically uses only 256Mi. The extra memory is reserved and billed even if unused.

  • Diagnosis: Use gcloud run services describe SERVICE_NAME --region=REGION --format='value(spec.template.spec.containers[0].resources.limits.memory)'. Monitor "Memory utilization" in the Cloud Console metrics.
  • Fix: Lower resources.limits.memory to your actual peak requirement plus some headroom. If your monitoring shows a peak of 512Mi, set it slightly above that rather than exactly at it, because an instance that exceeds its memory limit is terminated.
  • Why it works: Memory is billed by the amount allocated to each instance. Reducing the limit lowers the baseline memory allocation per instance.
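The right-sizing arithmetic for Causes 1 and 2 can be sketched in a few lines of Python. The helpers below parse Kubernetes-style quantities such as "1000m" and "1Gi" and suggest a memory setting from an observed peak utilization. The function names, the 25% headroom, and the 128 MiB rounding step are assumptions made for this illustration, not Cloud Run defaults.

```python
import math

_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes-style CPU quantity ("500m" or "2") to vCPU."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a binary-suffixed memory quantity ("512Mi", "1Gi") to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def suggest_memory_mib(configured: str, peak_utilization: float,
                       headroom: float = 0.25, step_mib: int = 128) -> int:
    """Observed peak plus headroom, rounded up to a multiple of step_mib."""
    peak_mib = parse_memory(configured) / 2**20 * peak_utilization
    return math.ceil(peak_mib * (1 + headroom) / step_mib) * step_mib


# 1 vCPU configured, 200m observed at peak.
idle_vcpu = parse_cpu("1000m") - parse_cpu("200m")
print(f"Idle CPU per instance: {idle_vcpu:.1f} vCPU")

# 1Gi configured, memory utilization graph peaking at 25%.
print(f"Suggested memory setting: {suggest_memory_mib('1Gi', 0.25)}Mi")
```

Feed it the peak from the utilization graphs over a representative period (at least one full traffic cycle), since a value taken from a quiet hour will undersize the service.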

Cause 3: Too Many Minimum Instances

You’ve set min-instances to 5 to ensure zero cold starts. However, your traffic is low, and typically only 1 or 2 instances are needed. Those 5 instances are running 24/7, consuming resources and incurring costs even when idle.

  • Diagnosis: Run gcloud run services describe SERVICE_NAME --region=REGION --format=export and look for the autoscaling.knative.dev/minScale annotation under spec.template.metadata.annotations. Compare it with the instance count graph in the Cloud Console metrics.
  • Fix: Reduce min-instances to a value that balances cold start tolerance with cost. If you find only 1-2 instances are consistently active, try setting min-instances to 1 or 2. This is done by adding or updating an annotation:
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/minScale: "2" # Example: set to 2
    
  • Why it works: Fewer always-on instances mean less constant resource consumption and billing.
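A rough way to size the savings is to count the instance-seconds that minimum instances spend idle. The Python sketch below assumes you can read an average number of busy instances from the metrics; that average, the 30-day month, and the function name are simplifying assumptions for illustration. Multiply the result by the idle rate for your region to get a currency figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def idle_instance_seconds(min_instances: int, avg_busy_instances: float,
                          days: int = 30) -> float:
    """Instance-seconds that warm instances spend idle over the period."""
    idle_instances = max(0.0, min_instances - avg_busy_instances)
    return idle_instances * days * SECONDS_PER_DAY


# Example from the text: 5 warm instances, but only about 1.5 busy on average.
before = idle_instance_seconds(min_instances=5, avg_busy_instances=1.5)
after = idle_instance_seconds(min_instances=2, avg_busy_instances=1.5)
print(f"Idle instance-seconds per month at min=5: {before:,.0f}")
print(f"Idle instance-seconds per month at min=2: {after:,.0f}")
print(f"Reduction: {before - after:,.0f}")
```

An average hides bursts, so treat the output as an upper bound on savings and check how often traffic actually needs more than the new minimum.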

Cause 4: Inefficient Container Image

A large container image means longer deployment times and potentially more resources needed during startup. While not a direct per-instance cost, it impacts scaling efficiency.

  • Diagnosis: Check the size of your container image on Google Container Registry or Artifact Registry. Look for the IMAGE_SIZE for gcr.io/my-project/image-processor:latest.
  • Fix: Use multi-stage builds in your Dockerfile to include only necessary artifacts in the final image. Use smaller base images (e.g., alpine variants). Remove unnecessary build tools and dependencies.
  • Why it works: Smaller images deploy faster, allowing Cloud Run to scale up and down more efficiently, reducing the time instances might be over-provisioned during deployments.

Cause 5: Unnecessary Concurrency

Cloud Run allows multiple requests per instance (concurrency). If your application isn’t thread-safe or designed for concurrent requests, you might be hitting issues that require more instances than a single-request-per-instance model.

  • Diagnosis: Review your application code to understand its concurrency model. Check your Cloud Run service’s concurrency setting (default is 80).
  • Fix: If your application can only safely handle one request at a time, set concurrency to 1. In the service YAML this is the containerConcurrency field on the revision template, not an annotation:
    spec:
      template:
        spec:
          containerConcurrency: 1 # Example: one request per instance

  • Why it works: With one request per instance, no request waits behind another inside the same container, which avoids the timeouts, retries, and queueing that an overloaded non-concurrent app causes. The trade-off is that Cloud Run scales out more aggressively, so this setting only saves money when contention was the real problem. If your app handles concurrent requests well, a higher concurrency value serves the same traffic with fewer instances.
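To see how strongly the concurrency setting drives instance count, here is a back-of-the-envelope Python estimate based on Little's law (requests in flight = arrival rate × average latency). It is a steady-state approximation under assumed traffic figures, not a model of Cloud Run's actual autoscaler, and the function names are hypothetical.

```python
import math


def in_flight_requests(requests_per_second: float, avg_latency_s: float) -> float:
    """Little's law: average number of requests being processed at once."""
    return requests_per_second * avg_latency_s


def instances_needed(requests_per_second: float, avg_latency_s: float,
                     concurrency: int) -> int:
    """Rough steady-state instance count for a given concurrency setting."""
    in_flight = in_flight_requests(requests_per_second, avg_latency_s)
    return max(1, math.ceil(in_flight / concurrency))


# Hypothetical traffic: 40 requests/s at 0.5 s average latency.
for concurrency in (1, 10, 80):
    n = instances_needed(40, 0.5, concurrency)
    print(f"concurrency={concurrency}: about {n} instance(s)")
```

The estimate assumes latency stays constant as concurrency rises. For a memory-heavy workload like image processing that assumption may not hold, which is why the setting should be validated under load rather than chosen from the formula alone.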

Cause 6: Long Request Timeouts

A default request timeout of 5 minutes (300 seconds) can mean an instance is held up for a long time by a single, slow request. If this happens frequently, it can lead to more instances being spun up to handle the backlog.

  • Diagnosis: Check the timeoutSeconds field in your Cloud Run service configuration (defaults to 300 seconds).
  • Fix: Set a more appropriate, shorter timeout if your requests should not take that long. In the service YAML this is the timeoutSeconds field on the revision template, not an annotation:
    spec:
      template:
        spec:
          timeoutSeconds: 60 # Example: set to 60 seconds
    
  • Why it works: Shorter timeouts mean slow requests fail faster, preventing them from hogging resources and forcing Cloud Run to scale out unnecessarily to compensate for perceived load.

The next hurdle you’ll likely face is understanding the nuances of CPU throttling and how it interacts with your configured CPU limits.
