Cosmos DB SDKs are designed to be resilient, but that doesn’t mean your application code can be passive about network hiccups or transient service issues.

Let’s see this in action. Imagine you’re trying to read an item from Cosmos DB.

from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosHttpResponseError
import time

# Replace with your actual endpoint and key
client = CosmosClient("https://your-cosmosdb-account.documents.azure.com/", "YOUR_COSMOS_DB_KEY")
database = client.get_database_client("your_database")
container = database.get_container_client("your_container")

item_id = "some-item-id"

def read_item_with_retry(container, item_id, max_retries=5, delay_seconds=2):
    for attempt in range(max_retries):
        try:
            item = container.read_item(item=item_id, partition_key=item_id) # Assuming item_id is also the partition key for simplicity
            print(f"Successfully read item {item_id} on attempt {attempt + 1}")
            return item
        except CosmosHttpResponseError as e:
            if e.status_code in [429, 503] and attempt < max_retries - 1:
                print(f"Attempt {attempt + 1} failed with {e.status_code}. Retrying in {delay_seconds} seconds...")
                time.sleep(delay_seconds)
                # Exponential backoff: double the delay before the next attempt
                delay_seconds *= 2
            else:
                print(f"Attempt {attempt + 1} failed with {e.status_code}. No more retries.")
                raise
        except Exception as e:
            print(f"An unexpected error occurred on attempt {attempt + 1}: {e}")
            raise
    return None

# Example usage
try:
    item_data = read_item_with_retry(container, item_id)
    if item_data:
        print("Item data:", item_data)
except Exception as e:
    print("Failed to read item after multiple retries:", e)

This code snippet demonstrates a basic retry mechanism. When a CosmosHttpResponseError occurs with a status code of 429 (Too Many Requests) or 503 (Service Unavailable), it waits for delay_seconds, doubles the delay for the next round, and tries again. If the retries run out or the error is of a different type, it gives up and re-raises the exception.

The core problem Cosmos DB retries solve is handling transient failures. These aren’t bugs in your code or a permanent outage of the database. They are temporary conditions like network blips, momentary resource contention on the Cosmos DB side, or a brief service restart. Without retry logic, your application would incorrectly treat these as permanent failures, leading to service disruptions. The SDK itself has some built-in retry policies, but they are often tuned for general use and might not align with your application’s specific latency tolerance or error handling strategy. You want to manage this at the application level to gain fine-grained control.

The azure.cosmos.CosmosClient and its related container clients are your entry points. When you call methods like read_item, upsert_item, or query_items, the SDK translates these into HTTP requests against the Cosmos DB REST API. These requests can fail for various reasons. The most common transient errors you’ll encounter and want to retry are:

  • 429 Too Many Requests: This is the most frequent. Cosmos DB throttles your requests when you exceed your provisioned Request Units (RUs). The x-ms-retry-after-ms header in the response tells you how long to wait, in milliseconds.
  • 503 Service Unavailable: Indicates that the Cosmos DB service is temporarily unable to handle the request, often due to internal load balancing or brief maintenance.
  • 408 Request Timeout: The server didn’t receive a complete request message within the time that it was prepared to wait. This can be a network issue between your client and Cosmos DB.
  • 500 Internal Server Error: While often indicative of a persistent problem, some 500 errors can be transient. It’s a judgment call whether to retry these, but for critical operations, a few retries are usually warranted.
  • Network-level errors (e.g., ConnectionRefusedError, SSLError): These are often indicative of underlying network instability or configuration issues between your application and the Cosmos DB endpoint.
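The list above can be folded into a small classification helper. This is a minimal sketch rather than anything the SDK provides: the names RETRYABLE_STATUS_CODES and is_retryable are hypothetical, and it assumes the exception exposes a status_code attribute the way CosmosHttpResponseError does. Including 500 reflects the judgment call mentioned above, so drop it if you prefer to fail fast on server errors.

```python
# Hypothetical helper: decides whether a failure is worth retrying.
# The status codes mirror the list above; adjust them to your own policy.
RETRYABLE_STATUS_CODES = frozenset({408, 429, 500, 503})


def is_retryable(error):
    """Return True if `error` looks like a transient failure."""
    # HTTP-level failures: assumes the exception carries a `status_code`
    # attribute, as CosmosHttpResponseError does.
    status_code = getattr(error, "status_code", None)
    if status_code is not None:
        return status_code in RETRYABLE_STATUS_CODES
    # Network-level failures: built-in connection and timeout errors.
    # ConnectionRefusedError is a subclass of ConnectionError.
    return isinstance(error, (ConnectionError, TimeoutError))
```

Keeping the decision in one function means the retry loop itself never needs to know which codes count as transient.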

The CosmosHttpResponseError is the primary exception you’ll catch. It carries the HTTP status code and details about the error. For retryable errors, you’ll want to inspect e.status_code and, for throttled requests, e.headers.get('x-ms-retry-after-ms'), which holds the suggested wait in milliseconds.
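If you want your own loop to honor the server’s hint instead of guessing, a helper along these lines can pull the wait time out of the response headers. It is a sketch under a few assumptions: the function name retry_after_seconds is hypothetical, headers is any dict-like mapping (such as e.headers on the caught exception), and the generic Retry-After header is treated as a fallback expressed in seconds. Cosmos DB itself reports the throttling wait in x-ms-retry-after-ms, in milliseconds.

```python
def retry_after_seconds(headers, default=1.0):
    """Return the server-suggested wait in seconds, or `default` if absent."""
    # Normalize header names so the lookup is case-insensitive.
    normalized = {str(k).lower(): v for k, v in (headers or {}).items()}
    try:
        # Cosmos DB reports the throttling wait in milliseconds.
        if "x-ms-retry-after-ms" in normalized:
            return max(0.0, float(normalized["x-ms-retry-after-ms"]) / 1000.0)
        # Generic HTTP fallback: Retry-After given as a number of seconds.
        if "retry-after" in normalized:
            return max(0.0, float(normalized["retry-after"]))
    except (TypeError, ValueError):
        # Malformed or non-numeric value (e.g. an HTTP date): use the default.
        pass
    return default
```

Inside an except CosmosHttpResponseError as e block, you could then sleep for retry_after_seconds(e.headers) on a 429 rather than for a fixed delay.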

A naive loop that retries immediately does more harm than good: it hammers an already overloaded service and can make the problem worse. Instead, you need backoff strategies.

Exponential backoff is the standard. You start with a small delay (e.g., 1 second) and double it with each subsequent retry, up to a maximum delay. This gives the service time to recover.

import random

def calculate_backoff_delay(base_delay=1, max_delay=30, attempt=0):
    # Simple exponential backoff with jitter
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Add a small random jitter to avoid thundering herd
    jitter = random.uniform(0, delay * 0.1)
    return delay + jitter

This calculate_backoff_delay function adds a random element (jitter) to the calculated delay. Jitter is crucial in distributed systems. If many clients experience a transient error simultaneously and all retry with the exact same backoff, they can all hit the service again at the same moment, causing another spike in load. Jitter spreads out these retries.
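The function above adds up to 10% of jitter on top of the exponential delay. A commonly used alternative, often called "full jitter", picks the whole delay at random between zero and the exponential ceiling, which spreads retries out more aggressively. The sketch below is that alternative, not the function above; the name full_jitter_delay and its defaults are illustrative.

```python
import random


def full_jitter_delay(attempt, base_delay=1.0, max_delay=30.0):
    """Pick a random delay in [0, min(max_delay, base_delay * 2**attempt)]."""
    # The ceiling still grows exponentially and is capped at max_delay.
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    # The whole delay is random, not just a small slice added on top.
    return random.uniform(0, ceiling)
```

The trade-off is that an individual retry may fire almost immediately, so this variant suits cases where many clients are likely to fail at the same moment.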

You’ll also want to set a maximum number of retries. This prevents infinite loops and guarantees that the operation fails within a bounded time if the problem persists. A common range is 3-5 retries for most transient errors.
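Putting the pieces together, a capped retry loop might look like the sketch below. Everything in it is illustrative: call_with_retries is a hypothetical name, is_retryable is whatever predicate you supply to decide which errors are transient, and sleep is injectable only so the loop can be exercised without real waiting.

```python
import random
import time


def call_with_retries(operation, is_retryable, max_attempts=4,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `operation()` until it succeeds or `max_attempts` is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(error):
                raise  # Give up: out of attempts or not a transient error.
            # Capped exponential backoff plus up to 10% jitter.
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))
```

A call such as call_with_retries(lambda: container.read_item(item=item_id, partition_key=item_id), my_predicate) would wrap a single read, where my_predicate is your own function. With the defaults, the worst case is three waits of roughly 1, 2, and 4 seconds before the final failure is raised.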

The default SDK retry policy for azure-cosmos is configured in the CosmosClient instantiation. You can override it:

from azure.cosmos import CosmosClient

# Tune the SDK's built-in retry behavior through client keyword arguments
client = CosmosClient(
    "https://your-cosmosdb-account.documents.azure.com/",
    "YOUR_COSMOS_DB_KEY",
    retry_total=5,         # Maximum number of retry attempts
    retry_backoff_max=30,  # Maximum retry wait time in seconds
)

This configures the SDK’s internal retry mechanism. However, for more complex scenarios or when you need to handle specific error codes differently (e.g., logging a 500 error differently than a 429), implementing your own retry loop around SDK calls gives you maximum flexibility. The code example at the beginning shows this manual approach.
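As an illustration of handling error codes differently, the sketch below logs throttling and server errors at different levels. The logger name, the function name log_failed_attempt, and the choice of levels are all assumptions made for the example, not a recommendation from the SDK.

```python
import logging

logger = logging.getLogger("cosmos_retry")  # Hypothetical logger name.


def log_failed_attempt(status_code, attempt):
    """Log a failed attempt at a level that reflects its severity."""
    if status_code == 429:
        # Throttling is expected under load: a warning is usually enough.
        level = logging.WARNING
    elif status_code >= 500:
        # Server-side errors deserve more attention.
        level = logging.ERROR
    else:
        # Anything else is recorded for context only.
        level = logging.INFO
    logger.log(level, "attempt %d failed with status %d", attempt, status_code)
    return level
```

You would call this from the except branch of your own retry loop, passing e.status_code and the current attempt number.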

Consider what happens when your retry logic itself fails. If the service is down for an extended period, your retry loop will exhaust its attempts. The final exception raised should be handled by your application’s global error handling, perhaps by returning an HTTP 503 to your API consumers or queuing the operation for later.
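One way to picture that last line of defense is a thin handler around the retrying call. The sketch below is purely hypothetical: handle_write_request, the in-memory pending_operations queue, and the (status, body) return shape stand in for whatever your web framework and queueing system actually provide.

```python
from collections import deque

# Hypothetical in-memory queue; a real service would likely use a durable one.
pending_operations = deque()


def handle_write_request(write_operation, item):
    """Return an (http_status, body) pair for a hypothetical API handler."""
    try:
        return 200, write_operation(item)
    except Exception as error:
        # By the time an exception reaches this point, retries are exhausted.
        # A real handler would also distinguish non-transient errors here.
        pending_operations.append(item)  # Remember the work for later.
        return 503, {"error": "temporarily unavailable", "detail": str(error)}
```

Here write_operation would be your retrying wrapper around something like container.upsert_item, so the handler only ever sees the final outcome.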

The next logical step after implementing robust retries is to monitor your Cosmos DB RU consumption and latency metrics closely.
