Cut Cosmos DB Read Costs with the Integrated Cache (2026)

The integrated cache in Azure Cosmos DB can dramatically reduce read costs, but it’s not a magic bullet; it’s a smart optimization for specific access patterns.

Let’s see it in action. Imagine you have a popular product catalog where the same product details are requested repeatedly.

// Example product document
{
  "id": "PROD-12345",
  "name": "Super Widget 5000",
  "description": "The ultimate widget for all your widget needs.",
  "price": 99.99,
  "category": "Widgets",
  "tags": ["popular", "new", "heavy-duty"]
}

Without the cache, every point read of PROD-12345 is charged against your provisioned throughput in Request Units (RUs). If this product is viewed 10,000 times a day and each read costs 1 RU (the charge for a point read of an item up to 1 KB), that’s 10,000 RUs per day for just one product.

With the integrated cache enabled, the first read of PROD-12345 still consumes 1 RU and populates the cache. Subsequent point reads of the same item within the staleness window you allow (the cache’s effective TTL, or Time To Live) are served directly from the cache and carry a request charge of 0 RUs. For frequently accessed, relatively static data, the saving shows up immediately.
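The arithmetic above can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a billing calculator: the function name `daily_read_rus` and the figure of 9,500 cache hits out of 10,000 reads are assumptions chosen for illustration, and the model ignores the cost of running the cache itself.

```python
def daily_read_rus(reads_per_day: int, cache_hits: int, ru_per_read: float = 1.0) -> float:
    """RUs charged per day when `cache_hits` of the reads are served from the cache."""
    if not 0 <= cache_hits <= reads_per_day:
        raise ValueError("cache_hits must be between 0 and reads_per_day")
    # Only cache misses reach the backend and are charged RUs.
    misses = reads_per_day - cache_hits
    return misses * ru_per_read


# 10,000 daily views of PROD-12345 at 1 RU per point read.
without_cache = daily_read_rus(10_000, cache_hits=0)
# Hypothetical: 9,500 of those reads are cache hits.
with_cache = daily_read_rus(10_000, cache_hits=9_500)

print(f"without cache: {without_cache:.0f} RUs/day")
print(f"with cache:    {with_cache:.0f} RUs/day")
```

The hit count is the number that matters: measure it for your own workload rather than assuming it.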

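The integrated cache is an opt-in feature for Azure Cosmos DB for NoSQL accounts. It is an in-memory cache that lives in the dedicated gateway, so using it means provisioning a dedicated gateway and connecting your application to the dedicated gateway endpoint in gateway mode; requests sent in direct mode or through the standard gateway bypass the cache. It sits between your application and the data plane, holds only the data that has passed through it, and serves reads only at session or eventual consistency.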

Here’s how it works internally:

Request Interception: Incoming read requests (point reads and queries) that arrive through the dedicated gateway are first checked against the integrated cache.
Cache Hit: If the requested data is present in the cache and its TTL hasn’t expired, the data is served directly from the cache. This is a cache hit.
Cache Miss: If the data is not in the cache, or the cached entry has expired, the request is forwarded to the data plane. This is a cache miss.
Data Retrieval & Caching: The data plane retrieves the data and charges RUs as usual. The result is then written into the integrated cache; when the cache is full, the least recently used entries are evicted to make room.
Response to Client: The retrieved data is sent back to the client.
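The steps above can be sketched as a toy read-through cache. This is a simplified in-memory model meant only to make the hit and miss paths concrete; it is not how the service is implemented. The class name `ToyReadThroughCache`, the 300-second TTL, and the injected `clock` are all assumptions for illustration, and the sketch caches every miss and has no eviction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class ToyReadThroughCache:
    backend: Dict[str, dict]          # stands in for the data plane
    ttl_seconds: float                # how long an entry may be served
    clock: Callable[[], float]        # injected so the example is deterministic
    ru_per_backend_read: float = 1.0  # assumed charge for one backend read
    rus_charged: float = 0.0
    _entries: Dict[str, Tuple[float, dict]] = field(default_factory=dict)

    def read(self, item_id: str) -> Tuple[dict, str]:
        now = self.clock()
        entry = self._entries.get(item_id)
        # Cache hit: entry exists and has not expired, so no RUs are charged.
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1], "hit"
        # Cache miss: forward to the backend, pay RUs, then populate the cache.
        doc = dict(self.backend[item_id])
        self.rus_charged += self.ru_per_backend_read
        self._entries[item_id] = (now, doc)
        return doc, "miss"


now = [0.0]  # mutable fake clock
cache = ToyReadThroughCache(
    backend={"PROD-12345": {"id": "PROD-12345", "price": 99.99}},
    ttl_seconds=300,
    clock=lambda: now[0],
)

outcomes = []
for t in (0, 10, 200, 301):
    now[0] = float(t)
    outcomes.append(cache.read("PROD-12345")[1])

print(outcomes)           # first read misses, next two hit, last one has expired
print(cache.rus_charged)  # one charge per miss
```

Four reads cost two backend reads here: the first populates the cache and the last arrives after the entry has expired.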

The primary benefit is cost reduction. Reads served from the cache consume no RUs, so you can either provision less throughput (and pay less) or handle more read traffic with your existing provisioned throughput. The dedicated gateway that hosts the cache is billed separately, so the net saving depends on the RUs you avoid being worth more than the gateway costs.

The key levers you control are:

Cache TTL: Set per request through the MaxIntegratedCacheStaleness option, this determines how old a cached entry may be and still be served. A longer window raises the chance of a cache hit but lets reads lag further behind writes. A shorter window returns fresher data but produces more misses. The default is 5 minutes, and the value can range from 0 to 10 years.
Cache Capacity: How much of your container’s data can be cached is governed by the memory of the dedicated gateway you provision (its SKU size and node count) rather than by a per-container percentage setting. Caching all of a massive container is rarely feasible or cost-effective when only a small subset is frequently read, so size the gateway for the hot subset.
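To see how the staleness window trades freshness for hit rate, here is a minimal simulation for a single hot item. It assumes perfectly evenly spaced reads, one cache, and no eviction, none of which holds exactly in production; the function name `simulate_hit_ratio` and the sample windows are illustrative choices.

```python
from typing import Optional


def simulate_hit_ratio(read_interval_s: int, staleness_s: int, duration_s: int) -> float:
    """Fraction of reads served from cache for one item read every `read_interval_s` seconds."""
    cached_at: Optional[int] = None
    reads = 0
    hits = 0
    for t in range(0, duration_s, read_interval_s):
        reads += 1
        if cached_at is not None and t - cached_at < staleness_s:
            hits += 1      # entry is still inside the staleness window
        else:
            cached_at = t  # miss: backend read refreshes the entry
    return hits / reads


# One read every 10 seconds for an hour, under three staleness windows.
ratios = {
    window: simulate_hit_ratio(read_interval_s=10, staleness_s=window, duration_s=3600)
    for window in (60, 300, 3600)
}

for window, ratio in ratios.items():
    print(f"staleness {window:>4}s -> hit ratio {ratio:.3f}")
```

In this model, going from a 1-minute to a 5-minute window cuts misses from 60 to 12 per hour, while going from 5 minutes to an hour removes only 11 more. Past a certain point a longer window buys little extra hit rate and only adds staleness.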

The integrated cache is most effective for workloads with a high read-to-write ratio and a significant amount of frequently accessed, relatively static data. Think product catalogs, user profiles, configuration settings, or leaderboards where updates are less frequent than reads.

The integrated cache has two parts: an item cache for point reads and a query cache for queries. The query cache stores result sets keyed by the query itself (its text and parameters), not the individual documents behind them. This means a cached point read of PROD-12345 does not help a query that happens to return that product, and two queries that differ even slightly are cached as separate entries. Cached query results are not updated when the underlying documents change; they are refreshed only once the staleness window passes or the entry is evicted.

When you enable the integrated cache, you need to be mindful of how your data changes. Writes routed through the dedicated gateway update the item cache, but a write that bypasses it (for example, from a client in direct mode or one using the standard gateway) leaves any cached copy untouched, and cached query results are never invalidated by writes. In those cases the cache relies on the staleness window to age out the stale data and fetch the updated version on the next cache miss. Set the MaxIntegratedCacheStaleness request option no longer than the staleness your application can tolerate.
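The staleness behavior can be demonstrated with a small model. This is a hypothetical in-memory stand-in, not the Cosmos DB SDK: `backend`, `cache`, `cached_read`, and the 300-second window are all assumptions for illustration. It shows a write that bypasses the cache remaining invisible to cached reads until the window elapses.

```python
from typing import Dict, Tuple

MAX_STALENESS_SECONDS = 300.0  # assumed staleness window for this sketch

backend: Dict[str, dict] = {"PROD-12345": {"price": 99.99}}  # stands in for the data plane
cache: Dict[str, Tuple[float, dict]] = {}                    # item_id -> (cached_at, document)


def cached_read(item_id: str, now: float) -> dict:
    """Serve from cache if the entry is inside the staleness window, else read the backend."""
    entry = cache.get(item_id)
    if entry is not None and now - entry[0] < MAX_STALENESS_SECONDS:
        return entry[1]
    doc = dict(backend[item_id])
    cache[item_id] = (now, doc)
    return doc


# t=0: first read misses and caches the original price.
first = cached_read("PROD-12345", now=0.0)["price"]

# A write that bypasses the cache changes the price in the backend only.
backend["PROD-12345"] = {"price": 79.99}

# t=60: still inside the window, so the old price is served.
stale = cached_read("PROD-12345", now=60.0)["price"]

# t=300: the window has elapsed, so the read misses and picks up the new price.
fresh = cached_read("PROD-12345", now=300.0)["price"]

print(first, stale, fresh)
```

If a stale price for up to five minutes is unacceptable, the window has to shrink, and the hit rate shrinks with it.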

The next challenge you’ll encounter is understanding how the cache behaves with different query patterns, especially those that involve projections or large numbers of documents.