Cosmos DB Autoscale vs Manual RU: Choose the Right Throughput Model (2026)

Autoscale throughput in Cosmos DB scales provisioned Request Units per second (RU/s) up and down between a maximum you set and a floor of 10% of that maximum. It reacts instantly to usage, so your workload can burst up to the maximum without you changing any settings.

Let’s see autoscale in action. Imagine a simple web application using Cosmos DB for user profiles.

{
  "id": "user123",
  "name": "Alice Smith",
  "email": "alice.smith@example.com",
  "settings": {
    "theme": "dark",
    "notifications": true
  }
}

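A point read of this document costs about 1 RU, the baseline for reading a 1 KB item by id and partition key. Writes cost more. This walkthrough assumes 2 RUs for updating the theme to keep the arithmetic simple, although a real replace of a 1 KB item typically costs several RUs, depending on the indexing policy. The exact charge is returned with every response.

If 100 users each read their profile within the same second, that’s 100 RU/s. If 10 of them also update their theme in that second, that’s an additional 20 RUs, for a total of 120 RU/s in that burst.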
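The burst arithmetic above can be sketched in a few lines. The per-operation costs are the illustrative figures from this example, not measured values; in practice you would read the request charge that Cosmos DB returns with each response.

```python
# Illustrative per-operation costs from the example above (assumptions, not measurements).
READ_COST_RU = 1   # point read of a small profile document
WRITE_COST_RU = 2  # theme update, as assumed in this walkthrough


def burst_ru(reads: int, writes: int) -> int:
    """Total RUs consumed by a burst of reads and writes."""
    return reads * READ_COST_RU + writes * WRITE_COST_RU


total = burst_ru(reads=100, writes=10)
print(total)  # 120
```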

With autoscale configured for a maximum of 4000 RU/s, Cosmos DB scales between 400 and 4000 RU/s (the floor is 10% of the maximum). A burst of around 120 RU/s sits below the 400 RU/s floor, so nothing needs to scale and the hour is billed at the floor. If the workload climbs to, say, 3500 RU/s during peak hours, those requests are served without any action on your part, because capacity up to 4000 RU/s is available at all times; the hour is then billed at the highest level reached. As traffic drops to 500 RU/s in off-peak hours, the scaled level follows it down to 500 RU/s. This adjustment is managed automatically.
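To make the numbers concrete, here is a minimal sketch of how the scaled throughput relates to consumption, assuming the documented rule that autoscale stays between 10% of the maximum and the maximum. The function name `scaled_ru_per_s` is hypothetical, and the model ignores per-partition effects.

```python
def scaled_ru_per_s(max_ru_per_s: int, consumed_ru_per_s: int) -> int:
    """Throughput level autoscale settles at for a given consumption rate.

    Simplified model: clamp consumption to [10% of max, max].
    """
    floor = max_ru_per_s // 10
    return min(max(consumed_ru_per_s, floor), max_ru_per_s)


for load in (120, 500, 3500, 6000):
    print(load, "->", scaled_ru_per_s(4000, load))
```

In this model a load of 120 RU/s maps to the 400 RU/s floor, 500 and 3500 map to themselves, and 6000 is capped at 4000, which is where requests would start being rate-limited.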

The core problem this solves is the trade-off between over-provisioning (wasting money on unused capacity) and under-provisioning (experiencing request throttling, leading to application errors and poor user experience). Autoscale aims to hit the sweet spot by dynamically matching provisioned throughput to actual demand.

Internally, autoscale is reactive rather than predictive. Microsoft’s documentation describes it as scaling instantly based on usage and does not describe any forecasting from time-of-day or day-of-week patterns. What makes it feel proactive is that capacity up to the maximum is available the whole time: a spike is served immediately rather than after a scale-up delay, and requests are rate-limited (HTTP 429) only when consumption exceeds the maximum, or when a single hot partition exceeds its share of it. This is not a "scale when I’m throttled" system, because a workload that stays under the maximum should not be throttled in the first place.

The key lever you control is the maximum RU/s. The minimum is not set separately: it is always 10% of the maximum, and it is the baseline you pay for during idle periods. The maximum sets the ceiling, dictating the highest throughput your database or container can reach. For example, configuring autoscale with a maximum of 10,000 RU/s means your throughput will fluctuate between 1000 and 10,000 RU/s, and a maximum of 4000 RU/s gives a floor of 400 RU/s.
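A small helper can derive the range from the maximum alone. This sketch assumes the constraints in Microsoft’s documentation at the time of writing (a maximum of at least 1000 RU/s, set in increments of 1000, with the floor at 10% of it); `autoscale_range` is a hypothetical name, not an SDK function.

```python
from typing import Tuple

MIN_AUTOSCALE_MAX = 1000  # assumed lowest allowed maximum, in RU/s
INCREMENT = 1000          # assumed step size for the maximum, in RU/s


def autoscale_range(max_ru_per_s: int) -> Tuple[int, int]:
    """Return (floor, ceiling) in RU/s for a given autoscale maximum."""
    if max_ru_per_s < MIN_AUTOSCALE_MAX or max_ru_per_s % INCREMENT != 0:
        raise ValueError(
            f"max must be a multiple of {INCREMENT} and at least {MIN_AUTOSCALE_MAX}"
        )
    return max_ru_per_s // 10, max_ru_per_s


print(autoscale_range(10_000))  # (1000, 10000)
print(autoscale_range(4_000))   # (400, 4000)
```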

What most people miss is that autoscale is billed per hour on the highest RU/s the system scaled to in that hour, not on the average. So, if your throughput scales from 1000 RU/s to 4000 RU/s and back down within an hour, that entire hour is billed at 4000 RU/s, even if the peak lasted only a few minutes. This is why setting a realistic maximum is crucial: it fixes the floor you pay for in every idle hour (10% of the maximum) and caps how expensive an unexpected spike can become.
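The hourly rule is easy to model. The sketch below compares billed units for autoscale against a manual setting sized for the same peak. It assumes the autoscale rate per RU/s is 1.5 times the manual rate, which is what Microsoft documents for single-write-region accounts at the time of writing; check current pricing before relying on it. The hourly peaks are made-up sample data, and all function names are hypothetical.

```python
from typing import List

AUTOSCALE_RATE_MULTIPLIER = 1.5  # assumption: autoscale price per RU/s relative to manual


def autoscale_billed_units(max_ru_per_s: int, hourly_peaks: List[int]) -> float:
    """Billed units (RU/s-hours, expressed at the manual rate) for autoscale.

    Each hour is billed at the highest RU/s reached in that hour,
    clamped to [10% of max, max].
    """
    floor = max_ru_per_s // 10
    billed = sum(min(max(peak, floor), max_ru_per_s) for peak in hourly_peaks)
    return billed * AUTOSCALE_RATE_MULTIPLIER


def manual_billed_units(provisioned_ru_per_s: int, hours: int) -> float:
    """Billed units for manual throughput held constant for `hours` hours."""
    return float(provisioned_ru_per_s * hours)


# Made-up day: 20 quiet hours peaking at 500 RU/s, 4 busy hours peaking at 4000 RU/s.
peaks = [500] * 20 + [4000] * 4
print(autoscale_billed_units(4000, peaks))    # 39000.0
print(manual_billed_units(4000, len(peaks)))  # 96000.0
```

In this sample, autoscale comes out cheaper because the peak is brief. A workload that sits near its maximum for most of the day would cost more under autoscale than under manual throughput, because of the rate multiplier.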

Measuring the RU consumption of your operations and estimating your peak load are vital for setting the autoscale maximum correctly.