Cloud Run services can stream responses back to clients, which is crucial for building low-latency, real-time APIs.

Here’s a real-time API service written in Python using Flask that streams a response:

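import os
import time

from flask import Flask, Response, stream_with_context

app = Flask(__name__)


@app.route('/stream')
def stream_data():
    def generate():
        # Each yielded string is sent to the client as soon as it is produced.
        for i in range(10):
            yield f"Data chunk {i} at {time.time()}\n"
            time.sleep(1)  # simulate work between chunks
    return Response(stream_with_context(generate()), mimetype='text/plain')


if __name__ == '__main__':
    # Cloud Run passes the listening port in the PORT environment variable.
    # debug=True is left out: the debugger should not run on a public service.
    port = int(os.environ.get('PORT', 8080))
    app.run(host='0.0.0.0', port=port)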

When you deploy this to Cloud Run and make a request to /stream, you won’t get the whole response at once. Instead, you’ll see chunks of data appearing in your client’s output, one per second, for 10 seconds.
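To see the chunks arrive one at a time, the client has to read the body incrementally rather than wait for the full response. The following is a minimal sketch using only the Python standard library; the names timed_lines and read_stream are illustrative and not part of the service above.

```python
import time
import urllib.request
from typing import Iterable, Iterator, Tuple


def timed_lines(byte_lines: Iterable[bytes]) -> Iterator[Tuple[float, str]]:
    """Yield (seconds_since_start, text) for each line as it is received."""
    start = time.monotonic()
    for raw in byte_lines:
        yield time.monotonic() - start, raw.decode("utf-8").rstrip("\r\n")


def read_stream(url: str, timeout: float = 30.0) -> Iterator[Tuple[float, str]]:
    """Open the URL and yield lines as they arrive, without buffering the body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Iterating the response object reads one line at a time.
        yield from timed_lines(resp)
```

Looping over read_stream with the URL of your own service (for example a local http://localhost:8080/stream during development, which is an assumed address) and printing each pair should show the elapsed time growing by roughly one second per line.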

The core problem this solves is the "all or nothing" response. Traditionally, an API waits until it has the entire response ready, then sends it. For large datasets, long-running computations, or interactive experiences (like a live chat interface), this latency is unacceptable. Streaming allows the server to send data as it becomes available, significantly reducing the perceived and actual time to first byte.

Internally, streaming works by sending the HTTP response headers first, followed by a series of data chunks. Over HTTP/1.1, a response whose length is not known in advance carries a Transfer-Encoding: chunked header (web servers and frameworks usually add it for you), which tells the client to keep reading chunks until a zero-length chunk marks the end of the body. Over HTTP/2 there is no chunked encoding; the body arrives as a sequence of DATA frames, but the effect for the client is the same. In our Python example, Response(stream_with_context(generate()), mimetype='text/plain') tells Flask to send each value the generator yields as it is produced instead of building the whole body first. stream_with_context is important because the generator runs after the view function has returned, when the request context would normally be gone; wrapping the generator keeps that context available, which is necessary if your stream generation reads request data or session information.
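To make the chunked framing concrete, here is a small sketch of how HTTP/1.1 wraps each piece of a streamed body. You never write this yourself with Flask, since the server does it; chunked_encode and chunked_decode are hypothetical helper names used only for illustration, and the decoder ignores optional trailers.

```python
from typing import Iterable, Iterator


def chunked_encode(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each part as: <size in hex>CRLF<data>CRLF, then a final 0-size chunk."""
    for part in parts:
        if not part:
            continue  # an empty chunk would signal the end of the body
        yield f"{len(part):x}".encode("ascii") + b"\r\n" + part + b"\r\n"
    yield b"0\r\n\r\n"  # terminating chunk: tells the client the body is complete


def chunked_decode(body: bytes) -> bytes:
    """Reassemble the payload from a complete chunked body (trailers not handled)."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol].split(b";")[0], 16)  # drop chunk extensions
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
```

For example, the 13-byte string "Data chunk 0\n" is framed with the size line "d" (13 in hexadecimal), and the stream ends with "0" followed by an empty line.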

The primary lever you control is the mimetype. While text/plain is used here for simplicity, you’d typically use text/event-stream for Server-Sent Events (SSE), a common pattern for real-time updates. With SSE, the server sends events formatted as data: ...\n\n, and the client parses them. The key is that the connection remains open, allowing the server to push new data. For bidirectional communication, WebSockets are the standard. Cloud Run supports WebSockets without any special configuration, but a WebSocket connection, like an SSE stream, is still a single HTTP request and is closed when the service’s request timeout expires, so raise the timeout for long-lived streams and make sure clients can reconnect.
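The SSE wire format is simple enough to produce by hand. Below is a minimal sketch of a formatter; format_sse and its parameter names are illustrative assumptions, not a Flask API. The field names id, event, retry, and data come from the SSE format itself.

```python
from typing import Optional


def format_sse(
    data: str,
    event: Optional[str] = None,
    event_id: Optional[str] = None,
    retry_ms: Optional[int] = None,
) -> str:
    """Serialize one Server-Sent Event; a blank line terminates the event."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")
    # Multi-line payloads need one "data:" field per line.
    for line in data.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"
```

In the Flask example, the generator would yield format_sse(...) strings and the Response would be created with mimetype='text/event-stream' instead of text/plain.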

The most common misconception is that streaming is only for "big data." In reality, the gains are most pronounced for latency-sensitive applications, not necessarily data-volume-sensitive ones. As an illustration, a small update that reaches the client 50 ms after it is produced feels much faster than the same update arriving 500 ms later at the end of a full request-response cycle. Because the connection stays open, the overhead of initiating and closing HTTP connections is paid once and amortized over the lifetime of the stream, which makes streaming more efficient for frequent, small updates.

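When you move from simple text streaming to Server-Sent Events (SSE), you’ll need to handle event IDs and retries to make the stream robust. The server tags events with id: fields and can suggest a reconnection delay with retry:. When the connection drops, the client reconnects and sends the last ID it saw in a Last-Event-ID header so the server can resume from that point. The browser’s EventSource does this automatically; other clients have to implement it themselves.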
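For clients that do not get this behavior for free, the bookkeeping can be sketched as follows. This is a simplified parser under stated assumptions, not a complete implementation of the SSE specification; SSEEvent, SSEParser, and reconnect_headers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass
class SSEEvent:
    data: str
    event: str = "message"
    id: Optional[str] = None


class SSEParser:
    """Parses SSE lines and remembers the last event ID and retry delay."""

    def __init__(self) -> None:
        self.last_event_id: Optional[str] = None
        self.retry_ms: Optional[int] = None

    def parse(self, lines: Iterable[str]) -> Iterator[SSEEvent]:
        data_lines: List[str] = []
        event_type = "message"
        for raw in lines:
            line = raw.rstrip("\r\n")
            if line == "":
                # A blank line dispatches the event collected so far.
                if data_lines:
                    yield SSEEvent("\n".join(data_lines), event_type, self.last_event_id)
                data_lines, event_type = [], "message"
                continue
            if line.startswith(":"):
                continue  # comment line, often used as a keep-alive
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "data":
                data_lines.append(value)
            elif field == "event":
                event_type = value
            elif field == "id":
                self.last_event_id = value
            elif field == "retry" and value.isdigit():
                self.retry_ms = int(value)

    def reconnect_headers(self) -> Dict[str, str]:
        """Headers to send when reconnecting so the server can resume the stream."""
        if self.last_event_id is None:
            return {}
        return {"Last-Event-ID": self.last_event_id}
```

After a dropped connection, a client would wait for retry_ms (or a default of its own choosing) and then reissue the request with reconnect_headers().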
