Flask’s built-in development server is not meant to handle concurrent requests under real load.

Here’s a Flask app:

from flask import Flask, request
import time

app = Flask(__name__)

@app.route('/slow')
def slow_request():
    # type=int converts the query parameter and falls back to 5 if it is missing or not a number
    sleep_time = request.args.get('time', 5, type=int)
    time.sleep(sleep_time)
    return f"Slept for {sleep_time} seconds."

if __name__ == '__main__':
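    # threaded=False makes the development server handle one request at a time,
    # which this demonstration relies on (Flask 1.0+ uses threads by default).
    app.run(debug=False, host='0.0.0.0', port=5000, threaded=False)

If you run this with python app.py, with threaded=False passed to app.run(), and open two browser tabs, one to http://localhost:5000/slow?time=10 and another to http://localhost:5000/slow?time=5, the second request waits until the first one finishes. In that mode the development server is single-threaded and single-process, so it can only handle one request at a time. Since Flask 1.0 the development server uses threads by default, which hides the problem in casual testing, but it is still a development tool: it is not designed to be particularly efficient, stable, or secure, so it’s a non-starter for production.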
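If juggling browser tabs is awkward, a small client script can fire both requests at once against the app above and report the timings. The following is a minimal sketch using only the standard library; BASE_URL and the helper names timed_get and fetch_concurrently are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumption: the app above is running here


def timed_get(url):
    """Fetch url and return (response body, seconds elapsed)."""
    start = time.perf_counter()
    with urlopen(url) as response:
        body = response.read().decode()
    return body, time.perf_counter() - start


def fetch_concurrently(urls):
    """Send all requests at the same time; return (results, total wall time)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        results = list(pool.map(timed_get, urls))
    return results, time.perf_counter() - start


def main():
    urls = [f"{BASE_URL}/slow?time=10", f"{BASE_URL}/slow?time=5"]
    try:
        results, total = fetch_concurrently(urls)
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        print(f"could not reach {BASE_URL}: {exc}")
        return
    for url, (body, elapsed) in zip(urls, results):
        print(f"{url}: {body!r} after {elapsed:.1f}s")
    print(f"total: {total:.1f}s")


if __name__ == "__main__":
    main()
```

If the server handles requests one at a time, the total should be close to the sum of the two sleep times; if it handles them concurrently, it should be close to the longer of the two.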

To handle multiple requests concurrently, we need a production-ready WSGI server. Gunicorn is a popular choice. But Gunicorn, by default, starts a single synchronous worker. If we ran our Flask app with those defaults, we’d face the same concurrency limitation: one worker process, one request at a time. Adding more synchronous workers helps, but each of them still handles only one request at a time.

This is where gevent comes in. Gevent is a coroutine-based networking library that allows us to write concurrent code that looks synchronous. When a gevent worker encounters an I/O-bound operation (like waiting for a database query or an external API call, or in our example, time.sleep, once gevent has patched it), it yields control back to the gevent event loop, allowing other requests to be processed by the same worker.
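The yielding behaviour can be seen without Flask or Gunicorn at all. The sketch below is a standalone illustration, not part of the app: handle_request is a hypothetical stand-in for a request handler, and gevent.sleep plays the role of a cooperative I/O wait. It assumes gevent is installed.

```python
import gevent

events = []  # records the order in which the handlers make progress


def handle_request(name, wait_seconds):
    """Stand-in for a request handler that waits on I/O."""
    events.append(f"{name} started")
    gevent.sleep(wait_seconds)  # cooperative wait: yields to the event loop
    events.append(f"{name} finished")


# Two "requests" run as greenlets inside one process and one thread.
greenlets = [
    gevent.spawn(handle_request, "request-1", 0.2),
    gevent.spawn(handle_request, "request-2", 0.1),
]
gevent.joinall(greenlets)

for event in events:
    print(event)
```

Both handlers start before either finishes, and the one with the shorter wait finishes first, even though everything runs in a single thread.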

So, the goal is to run Gunicorn with gevent workers. This means Gunicorn will spawn multiple worker processes, and each worker process will be capable of handling many requests concurrently using gevent’s cooperative multitasking.

First, install the necessary libraries:

pip install Flask gunicorn gevent

Now, let’s configure Gunicorn to use gevent workers. A common way to do this is via the command line:

gunicorn --workers 4 --worker-class gevent --bind 0.0.0.0:8000 app:app

Let’s break this down:

  • gunicorn: The command to start Gunicorn.
  • --workers 4: This tells Gunicorn to spawn 4 worker processes. Gunicorn’s documentation suggests (2 * number_of_cpu_cores) + 1 as a starting point, to be tuned under real load. By that rule, 4 is about right for a machine with one or two cores.
  • --worker-class gevent: This is the crucial part. It instructs Gunicorn to use gevent’s worker class, enabling asynchronous I/O handling within each worker.
  • --bind 0.0.0.0:8000: This tells Gunicorn to listen on all network interfaces (0.0.0.0) on port 8000. In a real deployment you’d more typically bind to 127.0.0.1 or a Unix socket and put a reverse proxy such as Nginx in front of Gunicorn.
  • app:app: This specifies the WSGI application. app refers to the Python module (your app.py file), and the second app refers to the Flask application instance within that module.
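The same settings can live in a configuration file instead of on the command line. The sketch below assumes a file named gunicorn.conf.py next to app.py. It mirrors the flags above, except that the worker count is computed from the (2 * cores) + 1 rule of thumb rather than fixed at 4. The worker_connections value shown is Gunicorn’s default and is included only to make it visible.

```python
# gunicorn.conf.py
import multiprocessing

# Listen on all interfaces, port 8000 (same as --bind 0.0.0.0:8000).
bind = "0.0.0.0:8000"

# Rule-of-thumb starting point; tune it under real load.
workers = multiprocessing.cpu_count() * 2 + 1

# Use gevent workers (same as --worker-class gevent).
worker_class = "gevent"

# Maximum number of simultaneous clients per gevent worker.
worker_connections = 1000
```

Start the server with gunicorn -c gunicorn.conf.py app:app.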

To see this in action, run the command above. Then, open two browser tabs:

  1. http://your_server_ip:8000/slow?time=10
  2. http://your_server_ip:8000/slow?time=5

This time, the second request does not wait for the first one to complete. It starts executing almost immediately. With four workers the two requests may land on different processes, but even when they land on the same one, the gevent worker yields at time.sleep, so the second request can be picked up and processed while the first is still sleeping.

The mental model here is that Gunicorn manages the processes, and gevent manages the concurrency within each process. Each Gunicorn worker process runs its own gevent event loop. When a gevent worker receives a request that involves waiting for I/O, it "switches" to handling another request instead of blocking. This switching is very lightweight, allowing a single worker to juggle many requests that are waiting for external resources.

The number of workers still matters for CPU-bound work. Gevent only switches between requests when code waits on I/O, so a handler doing heavy computation holds its worker until it finishes, and only additional worker processes can make use of additional CPU cores. For I/O-bound workloads, though, gevent workers allow a small number of processes to handle a much larger number of concurrent connections efficiently.
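To make the CPU-bound caveat concrete, here is a small standalone sketch, assuming gevent is installed. The two handler functions are hypothetical stand-ins: one does pure computation, the other waits cooperatively.

```python
import gevent

events = []  # records the order in which the handlers make progress


def cpu_bound_handler():
    """Pure computation: it never waits on I/O, so it never yields."""
    events.append("cpu started")
    total = sum(i * i for i in range(200_000))
    events.append("cpu finished")
    return total


def io_bound_handler():
    """Waits cooperatively, as a patched network call would."""
    events.append("io started")
    gevent.sleep(0.01)
    events.append("io finished")


gevent.joinall([gevent.spawn(cpu_bound_handler), gevent.spawn(io_bound_handler)])
print(events)
```

The I/O-bound handler cannot even start until the CPU-bound one has finished, because nothing in the computation gives control back to the event loop.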

A detail that is easy to miss: gevent relies on monkey patching, via gevent.monkey.patch_all(), to make the blocking parts of the standard library (time.sleep, socket, ssl, etc.) cooperative. Without monkey patching, these calls block the whole worker instead of yielding control while they wait. Gunicorn’s gevent worker class applies the patch for you when the worker starts, but it’s good to be aware of it. If you encounter unexpected blocking behavior with gevent, explicitly calling gevent.monkey.patch_all() at the very beginning of your application’s entry point (before importing other libraries) can resolve it.
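As a rough illustration of explicit patching, the standalone sketch below patches the standard library first and then runs three handlers that call the ordinary time.sleep. It assumes gevent is installed; handler is a hypothetical stand-in for a request handler.

```python
# Patch first, before anything else imports socket, ssl, time, and so on.
from gevent import monkey

monkey.patch_all()

import time  # time.sleep is now gevent's cooperative version

import gevent


def handler(seconds):
    time.sleep(seconds)  # patched: yields to the gevent loop instead of blocking


print("time patched:", monkey.is_module_patched("time"))
print("socket patched:", monkey.is_module_patched("socket"))

start = time.perf_counter()
gevent.joinall([gevent.spawn(handler, 0.2) for _ in range(3)])
elapsed = time.perf_counter() - start
print(f"3 handlers sleeping 0.2s each took {elapsed:.2f}s in total")
```

With the patch applied, the three sleeps overlap, so the total should be close to 0.2 seconds rather than 0.6.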

With this setup, your Flask application can now efficiently handle many concurrent I/O-bound requests, making it suitable for production environments.

The next thing you’ll likely encounter is managing application state across these concurrent requests, especially if your Flask app relies on global variables or caches.
