Cloud Run’s request concurrency setting does not cap how many requests your service as a whole can handle simultaneously; it caps how many requests a single container instance may be processing at any given moment.
Let’s watch this in action. Imagine we have a simple Python Flask app that simulates work by sleeping for 5 seconds:

    import os
    import time

    from flask import Flask

    app = Flask(__name__)


    @app.route('/')
    def hello_world():
        # Simulate slow work; SLEEP_DURATION (seconds) overrides the 5 s default.
        sleep_duration = int(os.environ.get("SLEEP_DURATION", 5))
        print(f"Instance received request, sleeping for {sleep_duration} seconds...")
        time.sleep(sleep_duration)
        print("Instance finished sleeping.")
        return 'Hello, World!'


    if __name__ == '__main__':
        # Cloud Run tells the container which port to listen on via PORT.
        app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

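We deploy this to Cloud Run with the default maximum concurrency of 80. If 80 requests arrive at a single instance in quick succession, Cloud Run hands all 80 to that instance at once; it does not feed them in one at a time. What happens next depends on the server inside the container. Flask’s built-in development server handles each request in its own thread by default, so the 80 sleeps overlap and the whole batch finishes in roughly 5 seconds. A server that processes one request at a time (for example, a single synchronous worker) would instead work through them serially: 80 requests at 5 seconds each is 400 seconds, almost 7 minutes. Either way, if an 81st request arrives while the first 80 are still in flight, Cloud Run will not send it to that instance, because the instance is at its concurrency limit; it will typically start a new instance to handle it.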
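How long a burst takes on one instance depends on how many requests the server in the container really runs in parallel. The following is a minimal local sketch of that effect, not Cloud Run itself: `simulate_instance` and its `workers` parameter are hypothetical names, a thread pool stands in for the server, and the sleep is scaled down from 5 s to 0.05 s so it runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulate_instance(n_requests: int, workers: int, work_seconds: float) -> float:
    """Return the wall-clock time one simulated instance needs for n_requests.

    `workers` stands in for how many requests the server inside the
    container can run at once (threads, processes, ...). Assumption:
    all requests arrive at the same moment.
    """
    def handle(_request_id: int) -> None:
        time.sleep(work_seconds)  # same stand-in for work as the Flask handler

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() makes us wait until every simulated request has finished.
        list(pool.map(handle, range(n_requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"1 worker:  {simulate_instance(8, 1, 0.05):.2f} s")
    print(f"8 workers: {simulate_instance(8, 8, 0.05):.2f} s")
```

With one worker the eight sleeps run back to back; with eight workers they overlap, which is the difference between the 400-second and the 5-second outcome at full scale.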
The problem Cloud Run solves is scaling applications without managing servers. You provide your container, and Cloud Run handles provisioning, scaling, and routing. The key levers you control are CPU/memory allocation, scaling (min/max instances), and request concurrency.
Understanding concurrency is crucial for performance and cost. If your requests are very short-lived (e.g., < 1 second) and light on resources, you can set a high concurrency (up to the maximum of 1000) to maximize the work done by a single instance, potentially reducing the number of instances needed and saving money. If your requests are long-running (e.g., > 10 seconds) or heavy on CPU or memory, you’ll want to set concurrency lower (e.g., 1 to 10) to prevent a single instance from becoming overloaded and unresponsive. Cloud Run will then scale out to more instances to handle the load.
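The cost trade-off can be estimated on the back of an envelope. By Little's law, the average number of requests in flight is the arrival rate times the average latency; dividing by the concurrency setting gives a rough instance count. This is only a sketch: `estimate_instances` is a hypothetical helper, the `max_instances=100` cap is an assumed value, and the real autoscaler also reacts to other signals such as CPU utilization.

```python
import math


def estimate_instances(requests_per_second: float,
                       avg_latency_seconds: float,
                       concurrency: int,
                       max_instances: int = 100) -> int:
    """Rough instance count: in-flight requests / per-instance concurrency."""
    # Little's law: average in-flight requests = arrival rate * latency.
    in_flight = requests_per_second * avg_latency_seconds
    needed = max(1, math.ceil(in_flight / concurrency))
    # Never report more than the configured (here: assumed) ceiling.
    return min(needed, max_instances)


if __name__ == "__main__":
    print(estimate_instances(100, 0.2, 80))  # 1
    print(estimate_instances(100, 5, 80))    # 7
    print(estimate_instances(100, 5, 10))    # 50
```

At 100 requests per second, 200 ms requests fit on a single instance at concurrency 80, while 5-second requests need about 7 instances at concurrency 80 and about 50 at concurrency 10.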
The default concurrency of 80 is a sensible starting point, but it’s a parameter to tune empirically under realistic load. For services with many very short requests, you might push it toward the maximum of 1000. For services with requests that take several seconds, you might pull it down to 5 or 10. The goal is to find the sweet spot where a single instance is doing as much work as possible without becoming a bottleneck. When existing instances reach their concurrency limit, Cloud Run’s autoscaler launches additional instances as long as the maximum instance count hasn’t been reached. The minimum instance count plays no part in that decision; it only determines how many instances are kept warm when traffic is low.
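The scale-out rule can be illustrated with a toy model. This is a deliberately simplified sketch and not Cloud Run's actual routing or autoscaling algorithm: `route_request` is a hypothetical function that fills the first instance with a free slot, starts a new instance when all are full, and gives up when the maximum instance count is reached.

```python
from typing import List, Optional


def route_request(in_flight: List[int], concurrency: int,
                  max_instances: int) -> Optional[int]:
    """Pick an instance index for a new request in a toy model.

    Mutates `in_flight` (requests currently on each instance). Returns
    the chosen index, or None when every instance is full and no further
    instance may be started, so the request has to wait.
    """
    # Prefer an existing instance that still has a free slot.
    for index, active in enumerate(in_flight):
        if active < concurrency:
            in_flight[index] += 1
            return index
    # All instances are at the limit: scale out if the ceiling allows it.
    if len(in_flight) < max_instances:
        in_flight.append(1)
        return len(in_flight) - 1
    return None


if __name__ == "__main__":
    instances: List[int] = []
    results = [route_request(instances, concurrency=80, max_instances=2)
               for _ in range(161)]
    print(instances)     # [80, 80]
    print(results[80])   # 1    (the 81st request lands on a second instance)
    print(results[160])  # None (the 161st request has nowhere to go)
```

Note that the model has no minimum-instance parameter at all: only the concurrency limit and the maximum instance count decide whether a new instance is started.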
The most surprising thing is how much waiting can build up inside a single instance when concurrency is set higher than the server can actually run in parallel, especially when individual requests take a long time. Cloud Run will send requests to the instance up to the concurrency limit, and whatever the server cannot run immediately waits inside the container. This means the actual latency experienced by a user can be significantly higher than the time.sleep() duration. From Cloud Run’s point of view, the concurrency limit is simply the number of requests it allows in flight to one instance; it does not pace them for you.
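To put numbers on that latency, here is a small worked calculation. It assumes all requests reach one instance at the same moment and are served first come, first served; `request_latencies` and its `workers` parameter are hypothetical names for this sketch.

```python
from typing import List


def request_latencies(n_requests: int, workers: int,
                      work_seconds: float) -> List[float]:
    """Latency seen by each request when all arrive at once at one instance.

    Assumes first-come-first-served handling with `workers` requests
    running in parallel; the rest wait inside the container.
    """
    # Requests are served in waves of `workers`; the request at a given
    # position finishes at the end of its wave.
    return [((position // workers) + 1) * work_seconds
            for position in range(n_requests)]


if __name__ == "__main__":
    serial = request_latencies(80, workers=1, work_seconds=5)
    print(serial[0], serial[-1])  # 5 400
    partly_parallel = request_latencies(80, workers=8, work_seconds=5)
    print(partly_parallel[-1])    # 50
```

Under these assumptions, the last user in a burst of 80 waits 400 seconds on a one-at-a-time server and 50 seconds on a server with eight workers, although every request does only 5 seconds of work.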
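The setting and your server’s real capacity can also be mismatched in the other direction. Concurrency counts in-flight requests, not the threads or goroutines your code spawns while serving them. If you set it lower than what the container can actually handle, Cloud Run stops sending requests to an instance once the limit is reached, even if the container could technically handle more. Cloud Run scales out instead, and if it cannot (for example, because the maximum instance count has been reached), the excess requests wait and may be rejected or time out.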