Cosmos DB is throttling your requests because you’re exceeding the provisioned throughput, leading to 429 Too Many Requests errors.

Common Causes and Fixes for Cosmos DB 429 Errors

  1. Under-provisioned Request Units (RUs) for your workload: This is the most frequent culprit. Your database operations (reads, writes, queries) consume RUs, and if the total consumption consistently exceeds your provisioned throughput, Cosmos DB will throttle you.

    • Diagnosis: Monitor the RU consumption metrics (for example, Normalized RU Consumption) in the Azure portal for your container. Look for spikes that correlate with your application’s error logs. You can also read the x-ms-request-charge header in your responses during periods of high traffic to see how many RUs a specific operation consumed.
    • Fix: Increase the provisioned RUs for the container or database. If your workload is highly variable, consider using Autoscale.
      • Manual Scale: In the Azure portal, navigate to your Cosmos DB account -> your database -> your container. Under "Scale," change "Manual Throughput" to a higher value (e.g., from 400 RU/s to 1000 RU/s).
      • Autoscale: Select "Autoscale" and set the maximum RU/s (e.g., 4000 RU/s). Cosmos DB will scale between 10% of the maximum and the maximum automatically.
    • Why it works: By increasing the available RUs, you provide more capacity for your operations, preventing them from hitting the provisioned limit and triggering throttling.
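    As a rough way to pick a target before changing the scale setting, you can multiply each operation's rate by its RU charge and add the results. The sketch below does that arithmetic; the workload numbers, the 20% headroom, and the function names are illustrative assumptions, and real per-operation charges should come from the request charge reported in your own responses.

```python
import math


def required_rus_per_second(workload):
    """Sum rate * charge over (ops_per_second, ru_per_op) pairs."""
    return sum(ops * charge for ops, charge in workload)


def autoscale_range(max_rus):
    """Autoscale moves between 10% of the configured maximum and the maximum."""
    return max_rus // 10, max_rus


# Hypothetical workload: 200 reads/s at 1 RU, 50 writes/s at 6 RU,
# and 5 queries/s at 40 RU. Replace these with measured values.
workload = [(200, 1.0), (50, 6.0), (5, 40.0)]
steady = required_rus_per_second(workload)

# Add ~20% headroom (an assumption) and round up to the next 100 RU/s.
provisioned = math.ceil(steady * 1.2 / 100) * 100

print(steady, provisioned, autoscale_range(4000))
```

    If the steady-state figure sits far below the occasional peak, that gap is a hint that Autoscale may fit better than a fixed manual value.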
  2. Inefficient Queries or Operations: A single, complex query or a high volume of small, inefficient operations can consume a disproportionate amount of RUs.

    • Diagnosis: Use the Azure portal’s "Performance" tab for your container to examine "Top Queries" by RU consumption. Analyze queries that are frequently appearing or consuming a high percentage of RUs. Check application logs for frequent, small read/write operations on the same items.
    • Fix: Optimize your queries. This might involve:
      • Adding appropriate indexing (e.g., ensure you’re indexing fields used in WHERE clauses, ORDER BY, and JOIN).
      • Avoiding SELECT * and only projecting the fields you need.
      • Breaking down complex queries into smaller, more manageable ones.
      • For frequently read items, consider using a "read-heavy" pattern where you fetch the item once and cache it in your application, rather than repeatedly querying for it.
      • Example Indexing Fix: If you frequently query by categoryId, ensure your indexing policy includes an index for it. For example, in the indexing policy JSON, you might have:
        {
            "indexingMode": "consistent",
            "automatic": true,
            "includedPaths": [
                { "path": "/*" }
            ],
            "excludedPaths": [
                { "path": "/\"_etag\"/?" }
            ],
            "compositeIndexes": [],
            "spatialIndexes": []
        }
        
        With /* included, categoryId is already indexed unless an excluded path overrides it. To index only the fields you query on (which also lowers the RU cost of writes), include the specific path, ending in /?, and exclude everything else:
        "includedPaths": [
            { "path": "/categoryId/?" }
        ],
        "excludedPaths": [
            { "path": "/*" }
        ]
        
    • Why it works: More efficient queries and operations consume fewer RUs per execution, reducing the overall RU pressure on the system and making your provisioned throughput last longer.
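    The caching idea from the list above can be sketched as a small time-to-live cache in front of whatever function reads an item. This is a minimal illustration, not an SDK feature: TTLCache, fake_fetch, and the 30-second TTL are hypothetical, and fake_fetch stands in for a real point read.

```python
import time


class TTLCache:
    """Caches fetched values for ttl_seconds to avoid repeated reads."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch          # callable that performs the real read
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # still fresh: no database call, no RUs
        value = self._fetch(key)     # miss or expired: read once and store
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def fake_fetch(item_id):
    """Hypothetical stand-in for a database read; records each call."""
    calls.append(item_id)
    return {"id": item_id}


cache = TTLCache(fake_fetch, ttl_seconds=30.0)
first = cache.get("item-1")
second = cache.get("item-1")   # served from the cache
```

    The trade-off is staleness: a cached item can lag behind the stored one by up to the TTL, so this suits data that changes rarely.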
  3. Hot Partitioning: If your data is not evenly distributed across partitions, one or a few partitions can become a bottleneck, consuming all the RUs allocated to that partition and causing throttling, even if overall RU consumption is low.

    • Diagnosis: Monitor the "Storage" tab in the Azure portal for partitions holding significantly more data than others, and check the Normalized RU Consumption metric, split by PartitionKeyRangeId, for partitions that run near 100% while the rest stay mostly idle. You can also use the x-ms-cosmos-partitionkeyrangeid header in your responses to identify which partition an operation hit.
    • Fix: Re-evaluate your partition key strategy. Choose a partition key with high cardinality and that distributes requests evenly. If you’ve identified a hot partition and cannot change the key, you might need to split your data into new containers with a better partition key or re-partition the data if your database supports it.
      • Example Partition Key Change (Conceptual): If your current partition key is userId and you have a few very active users, consider a composite partition key or a different key that distributes load better, like deviceId or a synthetic key that appends a suffix to userId. Hashing userId alone does not help, because every request for the same user still maps to the same value. This typically involves creating a new container with the new partition key and migrating data.
    • Why it works: A well-chosen partition key ensures that requests and data are spread across multiple physical partitions, allowing for better parallelization and higher overall throughput.
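    One way to build such a key is to append a small, deterministic suffix to the hot value so that a single user's items spread over several logical partitions. The sketch below is an assumption-laden illustration: synthetic_partition_key, the bucket count of 10, and the idea of deriving the suffix from the item id are all hypothetical choices, not something the service prescribes.

```python
import hashlib
from collections import Counter


def synthetic_partition_key(user_id, item_id, buckets=10):
    """Return user_id plus a stable suffix in [0, buckets) derived from item_id."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:4], "big") % buckets
    return f"{user_id}-{suffix}"


# One very active (hypothetical) user with 1000 items.
keys = Counter(
    synthetic_partition_key("user-42", f"order-{i}") for i in range(1000)
)

for key, count in sorted(keys.items()):
    print(key, count)
```

    Because the suffix is derived from the item id, a point read can recompute the key. Reading everything for one user, however, now means querying every suffix, so this helps write-heavy hot keys more than read-heavy ones.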
  4. Concurrency Issues in Application Code: Multiple threads or processes in your application making requests simultaneously can overwhelm the provisioned RUs, especially if they are not properly managed.

    • Diagnosis: Use application performance monitoring (APM) tools to identify high concurrency in your database access layer. Look for patterns of many simultaneous requests hitting Cosmos DB.
    • Fix: Implement client-side throttling or backoff strategies. Rely on the SDK’s built-in retry handling for 429s, and cap the number of requests your application has in flight at once (for example, with a bounded worker pool). Consider using a distributed queue or message broker to serialize or batch requests if strict ordering or high concurrency is causing issues.
      • SDK Configuration Example (Python): When initializing the Cosmos DB client, you can configure retry policies. The default policy often includes a retry with backoff for 429s. Ensure it’s enabled and not overridden.
      from azure.cosmos import CosmosClient
      
      client = CosmosClient(
          'YOUR_COSMOS_DB_ENDPOINT',
          'YOUR_COSMOS_DB_KEY',
          # The SDK retries throttled (429) requests by default.
          # These keyword arguments tune that behavior; the values are
          # illustrative, and the names should be checked against the
          # azure-cosmos version you have installed.
          retry_total=9,         # maximum number of retry attempts
          retry_backoff_max=30,  # maximum retry wait time, in seconds
      )
      
    • Why it works: Client-side strategies manage the rate at which your application sends requests, preventing it from overwhelming Cosmos DB and allowing the system to serve requests within its capacity.
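    If you need backoff outside the SDK (for example, around a queue consumer), the pattern looks roughly like the sketch below. It honors a server-suggested wait when one is available, such as the value Cosmos DB returns in the x-ms-retry-after-ms header, and otherwise falls back to exponential backoff with jitter. ThrottledError and call_with_backoff are hypothetical names; in real code you would catch your SDK's own exception type instead.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an SDK's 429 exception."""

    def __init__(self, retry_after_ms=None):
        super().__init__("429 Too Many Requests")
        self.retry_after_ms = retry_after_ms


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Run operation(), retrying on ThrottledError up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            if err.retry_after_ms is not None:
                # Prefer the wait suggested by the server.
                delay = err.retry_after_ms / 1000.0
            else:
                # Exponential backoff, capped, with jitter to avoid
                # many clients retrying at the same instant.
                delay = min(max_delay, base_delay * 2 ** attempt)
                delay *= random.uniform(0.5, 1.0)
            sleep(delay)
```

    The sleep parameter is injectable only so the behavior can be tested without real waiting; production code can leave it at the default.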
  5. Large Batch Operations: Performing very large batch operations (e.g., bulk imports or many Upsert calls issued at once) can spike RU consumption significantly.

    • Diagnosis: Review your application code for any logic that performs bulk inserts, updates, or deletes. Analyze the size of these batches and their impact on RU consumption metrics.
    • Fix: Break down large batch operations into smaller, sequential batches. Introduce delays between batches if necessary.
      • Example Batching Fix: Instead of attempting to upsert 1000 items at once, split it into 10 batches of 100 items each, with a small delay (e.g., 50-100ms) between each batch.
    • Why it works: Smaller, staggered batches smooth out the RU consumption, preventing a single, massive spike that could trigger throttling.
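    The batching fix above can be sketched as a helper that slices the input and pauses between slices. The names chunked and upsert_in_batches are hypothetical, as is the upsert_batch callable, which stands in for whatever code writes one batch to the container.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_in_batches(items, upsert_batch, batch_size=100,
                      delay_seconds=0.05, sleep=time.sleep):
    """Send items in small batches, pausing between batches.

    Returns the number of batches sent.
    """
    batches = 0
    for batch in chunked(items, batch_size):
        if batches:
            sleep(delay_seconds)  # pause before every batch except the first
        upsert_batch(batch)
        batches += 1
    return batches
```

    With 1000 items and the defaults, this sends 10 batches of 100 with 9 pauses of 50 ms. A fixed delay is the simplest option; pacing by the RU charge each batch reports would adapt better to uneven item sizes.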
  6. Background Tasks or Unaccounted-for Traffic: Sometimes background processes, indexing operations, or other applications sharing the same Cosmos DB account consume RUs without your direct knowledge.

    • Diagnosis: Check the "Metrics" section for RU Consumption at the account level, not just the container level. Look for consistent, baseline RU usage that might not align with your primary application’s activity. Review any scheduled jobs or other services that interact with the database.
    • Fix: Identify the source of the background traffic. If it’s necessary, provision additional RUs for it. If it’s unintentional, disable or reconfigure the source. Consider dedicated throughput for different workloads if you have multiple applications accessing the same account.
    • Why it works: Understanding and accounting for all RU consumption ensures that your provisioned throughput is sufficient for all active processes.

After resolving 429s, you may still see 503 Service Unavailable errors. These typically point to connectivity problems or transient service issues between your application and Cosmos DB rather than to provisioned throughput, which surfaces as 429s.
