Consul’s rate-limit configuration doesn’t limit traffic at the network level; it controls how many requests a service accepts within a given time window.

Let’s see it in action. Imagine we have two services, frontend and backend. The frontend calls the backend for data. We want to limit the frontend to 10 requests per second to the backend.

Here’s the backend service definition in Consul:

{
  "service": {
    "name": "backend",
    "port": 8080,
    "id": "backend-01",
    "checks": [
      {
        "name": "HTTP API",
        "http": "http://localhost:8080/health",
        "interval": "10s"
      }
    ]
  }
}

And here’s how we’d configure the rate limit on the backend service using Consul’s API:

curl --request PUT --data '{
  "RateLimit": {
    "LocalService": {
      "Policies": [
        {
          "Before": [
            {
              "GenericRateLimiter": {
                "NumTokens": 10,
                "FillRate": 10,
                "FillInterval": "1s"
              }
            }
          ]
        }
      ]
    }
  }
}' http://localhost:8500/v1/config

When a frontend service instance tries to connect to backend, Consul’s client agent (running alongside backend) intercepts the request and checks the rate-limit configuration. If the token bucket is empty, the agent responds with a 429 Too Many Requests status code before the request ever reaches the backend application.

NumTokens is the capacity of the bucket, and FillRate is the number of tokens added per FillInterval. Each admitted request consumes one token. A bucket of 10 tokens refilled at 10 tokens per second can therefore absorb a burst of up to 10 requests and then sustain 10 requests per second.
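To make the bucket arithmetic concrete, here is a minimal Python sketch of a token bucket. It is an illustration only, not Consul’s implementation: the class name, the field names (chosen to mirror NumTokens, FillRate and FillInterval), and the continuous refill are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Toy token bucket; field names mirror the config keys (hypothetical)."""
    num_tokens: float     # bucket capacity (NumTokens)
    fill_rate: float      # tokens added per fill interval (FillRate)
    fill_interval: float  # length of one fill interval in seconds (FillInterval)

    def __post_init__(self):
        self.tokens = self.num_tokens  # the bucket starts full
        self.last = 0.0                # time of the previous request, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is admitted."""
        # Assumption: tokens refill continuously rather than once per interval.
        refill = (now - self.last) * self.fill_rate / self.fill_interval
        self.last = now
        self.tokens = min(self.num_tokens, self.tokens + refill)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # this is where the caller would receive a 429


bucket = TokenBucket(num_tokens=10, fill_rate=10, fill_interval=1.0)

# 15 requests at the same instant: only the burst capacity is admitted.
burst = [bucket.allow(now=0.0) for _ in range(15)]
print(burst.count(True), burst.count(False))  # 10 5

# Half a second later, 5 tokens have been refilled.
later = [bucket.allow(now=0.5) for _ in range(8)]
print(later.count(True), later.count(False))  # 5 3
```

The first line of output shows the burst limit and the second shows the sustained rate: half a second of refill at 10 tokens per second admits exactly 5 more requests.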

The key takeaway is that this isn’t a network-level firewall. It’s a request-acknowledgment limit enforced by the Consul client agent on the destination service. The frontend service doesn’t see a network-level block; it sees its requests being rejected by the backend with a 429. This means your application logic on the frontend needs to be prepared to handle these 429 responses, typically by implementing retry mechanisms with exponential backoff.
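As a rough sketch of what that client-side handling might look like, the Python function below retries on 429 with exponential backoff and jitter. The function name, its parameters, and the stubbed `send` callable are hypothetical; in a real frontend, `send` would wrap the HTTP call to backend and return the response status code.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns something other than 429 or attempts run out."""
    status = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Exponential backoff with full jitter: wait a random time up to
        # base_delay * 2^(attempt - 1), capped at max_delay.
        delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
        sleep(delay)
        status = send()
    return status


# Stand-in for the real HTTP call: rejected twice, then accepted.
responses = iter([429, 429, 200])
delays = []  # record the waits instead of actually sleeping
status = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(status, len(delays))  # 200 2
```

The jitter spreads retries from many frontend instances over time, so they do not all hit the refilled bucket at the same moment.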

The RateLimit configuration can be applied to the LocalService (the service the config is applied to) or DestinationServices (other services this service can talk to). You can also define policies based on the HTTP method, path, and even custom headers, allowing for very granular control. For instance, you could limit POST requests to /api/v1/users to 5 per minute, while allowing GET requests to /api/v1/health without limits.
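The per-route idea can be illustrated with a small Python sketch that keeps one token bucket per (method, path) pair. This is a conceptual model only: the policy table, the `RouteLimiter` class, and the default of 10 requests per second for unlisted routes are assumptions made for this example and do not reflect Consul’s configuration schema.

```python
# Hypothetical policy table: (method, path) -> (capacity, tokens per second).
# A value of None means the route is not rate limited.
POLICIES = {
    ("POST", "/api/v1/users"): (5, 5 / 60),  # 5 requests per minute
    ("GET", "/api/v1/health"): None,         # no limit
}
DEFAULT_POLICY = (10, 10.0)  # assumed fallback: 10 requests per second


class RouteLimiter:
    """Keeps a separate token bucket for every (method, path) pair."""

    def __init__(self):
        self._state = {}  # (method, path) -> (tokens, time last seen)

    def allow(self, method: str, path: str, now: float) -> bool:
        key = (method, path)
        policy = POLICIES.get(key, DEFAULT_POLICY)
        if policy is None:
            return True  # unlimited route
        capacity, rate = policy
        # A route seen for the first time starts with a full bucket.
        tokens, last = self._state.get(key, (capacity, now))
        tokens = min(capacity, tokens + (now - last) * rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self._state[key] = (tokens, now)
        return allowed


limiter = RouteLimiter()
posts = [limiter.allow("POST", "/api/v1/users", now=0.0) for _ in range(7)]
health = [limiter.allow("GET", "/api/v1/health", now=0.0) for _ in range(100)]
print(posts.count(True), all(health))  # 5 True
```

Only the first five POST requests are admitted, while the health check is never throttled.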

This mechanism is also how Consul’s built-in API Gateway functionality leverages rate limiting. When you configure an API Gateway, you’re essentially applying these rate-limiting policies to the ingress traffic destined for your services, ensuring that external clients don’t overwhelm your internal infrastructure.

A common pitfall is assuming this rate limit is enforced at the ingress point of your entire mesh. It isn’t: the rate-limit configuration is applied by the target service’s Consul agent. If you want a global rate limit for all traffic entering your mesh, you’d typically route that traffic through a dedicated ingress gateway service and configure rate limits on that service definition.

Without proper handling of 429 responses on the client side, your frontend service will appear to be intermittently failing, even though the backend is functioning correctly and simply enforcing its defined limits.

The next logical step after implementing basic rate limiting is to explore distributed rate limiting strategies that can coordinate limits across multiple instances of a service, often involving external stores like Redis.
