Rate limiting LLM APIs is less about preventing abuse and more about ensuring equitable resource allocation, so that one user’s massive workload doesn’t starve everyone else of GPU capacity.

Let’s watch a hypothetical LLM API, llm-api.example.com, handle incoming requests. Imagine we have a simple rate limiter in front of it.

{
  "rate_limit": {
    "requests_per_minute": 100,
    "burst_capacity": 200
  }
}

A user, identified by their API key, sends 150 requests in the first 30 seconds.

  1. Assume the bucket starts full at 200 tokens (the burst_capacity). The first 100 requests arrive, and the rate limiter allows them through immediately. Each request consumes one token, leaving roughly 100.
  2. The next 50 requests arrive within the same 30-second window.
    • The bucket still holds about 100 tokens, plus whatever has trickled back in at the refill rate of 100 tokens per minute (requests_per_minute).
    • The limiter allows all 50 of these requests through.
    • No requests are rejected, and at least 50 tokens remain.
  3. From here, tokens keep flowing back at 100 per minute, up to the 200-token cap; nothing resets when the minute ticks over. If the user’s requests continue at a high rate, they will drain the bucket and then be held to the sustained rate of 100 requests per minute, with the excess rejected.

This mechanism is the "token bucket" algorithm, which sits at the core of many rate limiters. Each incoming request "consumes" a token. Tokens are added to the bucket at a steady rate (e.g., 100 per minute). The bucket has a maximum capacity (burst_capacity), which caps how many unused tokens can accumulate and therefore bounds the largest burst a client can send at once. When the bucket is empty, requests are rejected until tokens are replenished.
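The token bucket described above can be sketched in a few lines of Python. This is a minimal, single-process illustration, not any particular gateway's implementation: the class name TokenBucket, the injectable clock parameter, and the choice to start with a full bucket are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Token bucket: refills continuously, holds at most `capacity` tokens."""

    def __init__(self, rate_per_minute, capacity, clock=time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = float(capacity)
        self.clock = clock             # injectable so the sketch can be tested
        self.tokens = float(capacity)  # assumption: the bucket starts full
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now

    def allow(self, cost=1.0):
        """Consume `cost` tokens and return True, or return False if too few remain."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Mirrors the example config: 100 requests per minute, burst capacity of 200.
bucket = TokenBucket(rate_per_minute=100, capacity=200)
```

With these settings, the 150 requests from the walkthrough are all admitted, because the bucket never drops below one token during that burst. A client that keeps sending would see rejections only once the 200 tokens are gone.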

The primary problem rate limiting solves is fairness. Without it, a single user making a massive, unconstrained number of requests could consume all available GPU processing time, leading to high latency or outright unavailability for all other users. This is especially critical for LLMs, which are computationally expensive. Rate limiting ensures that each user gets a predictable slice of the available resources.

The configuration above is quite basic. Real-world LLM APIs often use more sophisticated strategies:

  • Per-User/Per-API Key Limits: The most common approach. Each unique API key gets its own bucket.
  • Global Limits: A hard cap on the total requests the entire API service can handle, regardless of user.
  • Tiered Limits: Different users (e.g., free vs. premium) get different requests_per_minute and burst_capacity values.
  • Feature-Specific Limits: Different limits for different LLM models (e.g., a smaller, faster model might have higher limits than a massive, state-of-the-art one).
  • Distributed Rate Limiting: When an API is served by multiple instances, the rate limiting state needs to be shared and synchronized across all instances, typically using a distributed cache like Redis.
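As a rough sketch of how per-key and tiered limits combine, the snippet below keeps one token bucket per API key and sizes it from the key's tier. The tier names, their numbers, and the PerKeyLimiter class are hypothetical. State lives in a process-local dict here, so a deployment with multiple instances would need to move it into a shared store such as Redis.

```python
import time

# Hypothetical tiers; the names and numbers are illustrative only.
TIERS = {
    "free": {"requests_per_minute": 20, "burst_capacity": 40},
    "premium": {"requests_per_minute": 100, "burst_capacity": 200},
}


class PerKeyLimiter:
    """Keeps one token bucket per API key, sized by that key's tier."""

    def __init__(self, tiers, clock=time.monotonic):
        self.tiers = tiers
        self.clock = clock
        self.buckets = {}  # api_key -> (tokens, last_refill_time)

    def allow(self, api_key, tier):
        limits = self.tiers[tier]
        rate = limits["requests_per_minute"] / 60.0  # tokens per second
        capacity = float(limits["burst_capacity"])
        now = self.clock()

        # A key seen for the first time gets a full bucket (an assumption).
        tokens, last = self.buckets.get(api_key, (capacity, now))
        tokens = min(capacity, tokens + (now - last) * rate)

        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[api_key] = (tokens, now)
        return allowed
```

Because each key has its own bucket, one user exhausting their allowance has no effect on another user's tokens.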

Consider a scenario where you want to limit a specific model, llama-2-70b, to 60 requests per minute per API key, with a burst of 120. This configuration would likely be applied within the API gateway or the service managing requests before they hit the LLM inference engine.

# Example configuration snippet for an API Gateway
routes:
  - path: /v1/models/llama-2-70b/completions
    rate_limit:
      key_extractor: request.header.x-api-key # How to identify the user
      requests_per_minute: 60
      burst_capacity: 120

The key_extractor is crucial. It tells the rate limiter how to group requests. For LLM APIs, this is almost always based on an authentication token or API key found in a request header like Authorization or x-api-key. Without a reliable way to identify the user, you can only implement global limits, which cannot protect one user from another.
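A key extractor along these lines might look like the following sketch. The function name extract_rate_limit_key is hypothetical, as is the decision to hash the credential so that the raw secret never lands in limiter state or logs. It checks x-api-key first and falls back to a Bearer token in Authorization.

```python
import hashlib


def extract_rate_limit_key(headers):
    """Return a stable bucket key for a request, or None if it is unauthenticated."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}

    credential = normalized.get("x-api-key")
    if not credential:
        auth = normalized.get("authorization", "")
        scheme, _, token = auth.partition(" ")
        if scheme.lower() == "bearer" and token.strip():
            credential = token.strip()

    if not credential:
        return None  # caller decides: reject, or fall back to a global limit

    # Hash so the raw secret is not stored alongside the rate-limit counters.
    return hashlib.sha256(credential.encode("utf-8")).hexdigest()
```

Returning None for unauthenticated requests leaves the policy decision to the caller, which could reject the request outright or route it to a shared, stricter bucket.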

The burst_capacity isn’t just a buffer; it’s a mechanism to allow for natural variations in request patterns. For instance, if a user’s application needs to send 10 requests in rapid succession to perform a complex task, a burst capacity allows this without immediately triggering a rate limit violation, as long as their average rate over time stays within the requests_per_minute limit. This makes the API feel more responsive for legitimate, albeit momentarily spiky, usage.
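To put numbers on how much spikiness a burst allowance absorbs, the sketch below estimates how long a client can exceed the sustained rate before the bucket empties. It assumes a token bucket that starts full and treats traffic as a continuous flow, so the result is an approximation; the function name is made up for this example.

```python
def seconds_until_throttled(burst_capacity, requests_per_minute, send_rate_per_second):
    """Approximate time until a full bucket empties at a constant send rate."""
    refill_per_second = requests_per_minute / 60.0
    if send_rate_per_second <= refill_per_second:
        # The bucket refills at least as fast as it drains, so it never empties.
        return float("inf")
    return burst_capacity / (send_rate_per_second - refill_per_second)


# llama-2-70b limits from above: 60 requests per minute with a burst of 120.
print(seconds_until_throttled(120, 60, send_rate_per_second=5))  # 30.0
```

A client sending 5 requests per second against a refill of 1 per second drains the bucket at a net 4 tokens per second, so the 120-token burst lasts about 30 seconds before requests start being rejected.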

Many rate-limiting implementations don’t explicitly "reset" on a minute boundary. Fixed-window counters that do reset let a client spend one full quota just before the boundary and another just after it, briefly doubling the effective rate. A sliding window algorithm avoids this by tracking requests over a rolling time period (e.g., the last 60 seconds). When a request arrives, the system looks back 60 seconds from that precise moment and counts how many requests that user has made. If the count has already reached the limit, the request is rejected. Because there is no boundary to exploit, enforcement is more consistent.
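One straightforward way to implement a sliding window is to keep a log of timestamps per user, as in the sketch below. The class name and the injectable clock are assumptions for illustration. This variant stores one timestamp per accepted request, so memory grows with the limit; production systems often use cheaper approximations.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` accepted requests per rolling window."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.history = {}  # key -> deque of accepted-request timestamps

    def allow(self, key):
        now = self.clock()
        stamps = self.history.setdefault(key, deque())

        # Drop timestamps that have slid out of the rolling window.
        while stamps and stamps[0] <= now - self.window:
            stamps.popleft()

        if len(stamps) >= self.limit:
            return False  # rejected requests are not recorded
        stamps.append(now)
        return True
```

Only accepted requests are logged here, so a client that keeps hammering the API while throttled does not extend its own penalty. Recording rejected requests as well would be a stricter policy.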

The response users most often see when hitting a rate limit is 429 Too Many Requests. This HTTP status code is the standard way to signal that the client has exceeded its allowed request rate. Often, the response will include a Retry-After header indicating how long the client should wait before making another request, expressed either as a number of seconds or as an HTTP date.
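On the client side, the Retry-After value needs a little parsing because it can be either a count of seconds or an HTTP date. The helper below is a sketch using only the Python standard library; the function name and the choice to return None for missing or malformed values are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header value to a non-negative delay in seconds.

    Returns None when the header is missing or cannot be parsed.
    """
    if header_value is None:
        return None
    value = header_value.strip()

    # Form 1: a non-negative integer number of seconds.
    if value.isdigit():
        return float(value)

    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A date that is already in the past yields a delay of zero rather than a negative number, so the caller can pass the result straight to a sleep call.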

The next challenge after implementing basic rate limiting is handling the 429 responses gracefully in client applications, often involving exponential backoff strategies.
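A client-side retry loop with exponential backoff might look like the sketch below. Everything here is illustrative: RateLimited is a hypothetical exception standing in for however your HTTP client surfaces a 429, and the "full jitter" variant (a random delay up to an exponentially growing ceiling) is one common choice among several.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception a client raises when it receives HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, if the server sent Retry-After


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: a random delay below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(send_request, max_attempts=5, sleep=time.sleep):
    """Call `send_request` until it succeeds or the attempts run out."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Prefer the server's hint; otherwise fall back to jittered backoff.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = backoff_delay(attempt)
            sleep(delay)
```

The jitter matters when many clients are throttled at once: without it they would all retry at the same instants and hit the limiter in synchronized waves.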
