Reduce Cosmos DB Latency with the Dedicated Gateway Cache (2026)

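The Dedicated Gateway Cache (the integrated cache, in Azure’s terminology) doesn’t just reduce latency; it changes where your reads are served. Repeated point reads and queries are answered from memory on dedicated gateway nodes that sit in front of your account, instead of from the backend partitions.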

Let’s see this in action. Imagine a typical Cosmos DB request for a document. Without the cache, your application’s SDK sends the request to a Cosmos DB endpoint, which routes it to the physical partition that holds the data. The data is retrieved, processed (if it’s a query), and sent back. Every read therefore pays for network hops, serialization and deserialization, and the request units (RUs) consumed on the backend.

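Now, with the Dedicated Gateway Cache, your application’s SDK connects in gateway mode and points at the account’s dedicated gateway endpoint instead of the standard Cosmos DB endpoint. Requests sent in direct mode, or to the standard endpoint, never reach the cache.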

{
  "cosmosDb": {
    "accountEndpoint": "https://my-cosmos-account.sqlx.cosmos.azure.com:443/",
    "databaseId": "myDatabase",
    "containerId": "myContainer"
  }
}
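The dedicated gateway endpoint normally uses the host suffix sqlx.cosmos.azure.com rather than documents.azure.com. As a minimal sketch, assuming those two suffixes, the helper below rewrites a standard account endpoint into its dedicated gateway form. The function name and constants are hypothetical and not part of any SDK; in practice you would usually copy the dedicated gateway connection string from the Azure portal.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed host suffixes for the standard and dedicated gateway endpoints.
STANDARD_SUFFIX = ".documents.azure.com"
DEDICATED_GATEWAY_SUFFIX = ".sqlx.cosmos.azure.com"


def to_dedicated_gateway_endpoint(account_endpoint: str) -> str:
    """Rewrite a standard account endpoint to the dedicated gateway host."""
    parts = urlsplit(account_endpoint)
    host = parts.hostname or ""
    if host.endswith(DEDICATED_GATEWAY_SUFFIX):
        return account_endpoint  # already points at the dedicated gateway
    if not host.endswith(STANDARD_SUFFIX):
        raise ValueError(f"unrecognized Cosmos DB host: {host!r}")
    # Keep the account name, swap only the suffix, and preserve any port.
    account = host[: -len(STANDARD_SUFFIX)]
    new_host = account + DEDICATED_GATEWAY_SUFFIX
    netloc = f"{new_host}:{parts.port}" if parts.port else new_host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Passing https://my-cosmos-account.documents.azure.com:443/ to this helper yields the sqlx.cosmos.azure.com form of the same URL, which is the value the accountEndpoint setting needs for reads to be eligible for caching.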

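When a read arrives at the gateway, the node first checks its in-memory cache. If the item or query result is present and no older than the staleness the request allows, it’s returned immediately from the gateway node itself. This bypasses the trip to the backend entirely, and the cache hit consumes no RUs.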

If the data isn’t in the cache, the gateway forwards the request to the appropriate partition on the backend, stores the response in its cache, and returns it to the application. This is a read-through pattern rather than cache-aside: the gateway populates the cache itself, so your application contains no caching logic. Writes sent through the gateway also update the item cache (write-through). Subsequent identical reads that land on the same node are then served from the cache.
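The hit and miss flow can be sketched with a toy, in-memory model. This is only an illustration of the read-through and write-through behavior, not how the service is implemented; ToyGateway, its fields, and the (item id, partition key) tuple key are all hypothetical names chosen for the example.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (item id, partition key value)


class ToyGateway:
    """In-memory stand-in for one gateway node with a read-through, write-through cache."""

    def __init__(self, backend: Dict[Key, dict]) -> None:
        self.backend = backend            # stands in for the backend partitions
        self.cache: Dict[Key, dict] = {}  # stands in for the node's cache
        self.backend_reads = 0            # number of reads that reached the backend

    def read(self, item_id: str, partition_key: str) -> Tuple[Optional[dict], bool]:
        """Return (item, served_from_cache)."""
        key = (item_id, partition_key)
        if key in self.cache:
            return self.cache[key], True  # hit: no backend trip
        self.backend_reads += 1           # miss: forward to the backend
        item = self.backend.get(key)
        if item is not None:
            self.cache[key] = item        # populate the cache for later reads
        return item, False

    def write(self, item_id: str, partition_key: str, item: dict) -> None:
        """Write-through: update the backend and the cache together."""
        key = (item_id, partition_key)
        self.backend[key] = item
        self.cache[key] = item
```

In this model, reading the same item twice increments backend_reads only once, and an item written through the gateway is a cache hit on its first read.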

The problem this solves is the latency and cost of repeated backend reads. In read-heavy workloads, especially those that request the same items or run the same queries over and over, each read otherwise incurs backend routing and an RU charge even when the data hasn’t changed. The Dedicated Gateway Cache offloads those repeated reads to dedicated compute that you provision for the account. The gateway runs alongside the account rather than next to your application, so the saving comes mainly from skipping the backend, not from a shorter network path.

Internally, the cache uses a Least Recently Used (LRU) eviction policy. Each gateway node maintains its own independent cache, so an entry cached on one node isn’t necessarily present on another, and the same data can end up cached on several nodes when requests for it land on different ones. The cache has two parts: an item cache for point reads, keyed by item ID and partition key, and a query cache, keyed by the query. Freshness is controlled by the MaxIntegratedCacheStaleness setting, which a read can specify to state the oldest cached response it will accept; it defaults to five minutes.
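To make the interaction between LRU eviction and a per-read staleness limit concrete, here is a small sketch built on collections.OrderedDict. It is a simplified model under stated assumptions, not the service’s implementation: the class name is hypothetical, capacity is counted in entries rather than bytes, and the caller passes the current time explicitly so the behavior is deterministic.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional, Tuple


class LruStalenessCache:
    """Toy LRU cache where each read states the maximum staleness it accepts."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Maps key -> (value, time cached); least recently used entry comes first.
        self._entries: "OrderedDict[Hashable, Tuple[Any, float]]" = OrderedDict()

    def put(self, key: Hashable, value: Any, now: float) -> None:
        self._entries[key] = (value, now)
        self._entries.move_to_end(key)         # mark as most recently used
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used

    def get(self, key: Hashable, now: float, max_staleness: float) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None                        # miss: never cached or evicted
        value, cached_at = entry
        if now - cached_at > max_staleness:
            # Too old for this read. The entry is kept, because another read
            # with a larger staleness limit could still accept it.
            return None
        self._entries.move_to_end(key)         # a hit refreshes recency
        return value

    def __len__(self) -> int:
        return len(self._entries)
```

A caller treats a None result as a miss, fetches from the backend, and calls put with the fresh value. Staleness is checked per read rather than stored per entry, which mirrors the idea that each request declares how old a response it can tolerate.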

A key aspect often overlooked is how the cache interacts with consistency. Only reads made with session or eventual consistency are served from the cache. Reads that request strong, bounded staleness, or consistent prefix consistency bypass it and go to the backend, so they gain no latency or RU benefit. If the account default is one of those stronger levels, relax consistency to session or eventual on the reads you want cached, and accept that a cached response may be as stale as MaxIntegratedCacheStaleness allows.
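As a rough check you could apply in application code or in a review, the eligibility rule can be written as a small predicate. This is a sketch with hypothetical names; it encodes only the consistency and routing conditions discussed in this article and does not call the Cosmos DB SDK.

```python
# Consistency levels whose reads can be served from the cache (compared lowercase).
CACHEABLE_CONSISTENCY = frozenset({"session", "eventual"})


def uses_integrated_cache(consistency_level: str, via_dedicated_gateway: bool = True) -> bool:
    """Return True if a read with this consistency level is eligible for the cache."""
    if not via_dedicated_gateway:
        return False  # the read never reaches the gateway, so it cannot hit the cache
    return consistency_level.strip().lower() in CACHEABLE_CONSISTENCY
```

For example, a read with "Session" consistency through the dedicated gateway is eligible, while one with "Strong" or "BoundedStaleness" is not, regardless of how it is routed.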

The next step after optimizing reads with the gateway cache is understanding how to manage write latency for highly transactional workloads.