Cosmos DB diagnostic logs are a flexible but often overlooked tool for deep performance analysis, not just general troubleshooting.
Let’s watch Cosmos DB in action, specifically how it logs query and Request Unit (RU) information. Imagine a common scenario: your application is experiencing intermittent slowdowns, and you suspect it’s related to inefficient queries or hitting RU limits.
Here’s a sample of what a diagnostic log entry for a query operation might look like:
    {
      "time": "2023-10-27T10:30:00.123Z",
      "resourceId": "/subscriptions/YOUR_SUB_ID/resourceGroups/YOUR_RG/providers/Microsoft.DocumentDB/databaseAccounts/YOUR_COSMOS_ACCOUNT/databases/YOUR_DATABASE/collections/YOUR_COLLECTION",
      "operationName": "Query",
      "category": "QueryStoreRuntime",
      "properties": {
        "activityId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
        "durationMs": 150,
        "requestCharge": 10.5,
        "documentCount": 5,
        "urls": [
          {
            "url": "/dbs/YOUR_DATABASE/colls/YOUR_COLLECTION/docs",
            "verb": "POST",
            "status": 200
          }
        ],
        "requestContent": {
          "query": "SELECT * FROM c WHERE c.category = 'electronics' AND c.price < 100",
          "parameters": []
        },
        "responseContent": {
          "totalRecords": 5,
          "maxItemCount": 100
        },
        "clientInfo": {
          "userAgent": "Azure-Cosmos-Java-SDK/3.25.0"
        }
      }
    }
And here’s what a Write operation (which also consumes RUs and can be a bottleneck) might log:
    {
      "time": "2023-10-27T10:31:05.456Z",
      "resourceId": "/subscriptions/YOUR_SUB_ID/resourceGroups/YOUR_RG/providers/Microsoft.DocumentDB/databaseAccounts/YOUR_COSMOS_ACCOUNT/databases/YOUR_DATABASE/collections/YOUR_COLLECTION",
      "operationName": "Write",
      "category": "DataPlaneRequests",
      "properties": {
        "activityId": "f0e9d8c7-b6a5-4321-0987-fedcba098765",
        "durationMs": 80,
        "requestCharge": 5.2,
        "documentId": "some-document-id-123",
        "verb": "POST",
        "statusCode": 201,
        "clientInfo": {
          "userAgent": "Azure-Cosmos-Python-SDK/4.3.0"
        }
      }
    }
This gives you the raw data to understand what is happening.
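As a rough illustration of working with this raw data, the Python sketch below parses an entry shaped like the query sample above and pulls out the fields most relevant to performance. The `summarize_entry` helper and the trimmed sample are hypothetical, and the field names are assumed to match the samples shown here; a real export may use different names.

```python
import json

# Trimmed copy of the sample query entry above (illustrative field names).
SAMPLE_ENTRY = """
{
  "time": "2023-10-27T10:30:00.123Z",
  "operationName": "Query",
  "category": "QueryStoreRuntime",
  "properties": {
    "activityId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "durationMs": 150,
    "requestCharge": 10.5,
    "documentCount": 5,
    "requestContent": {
      "query": "SELECT * FROM c WHERE c.category = 'electronics' AND c.price < 100"
    }
  }
}
"""


def summarize_entry(raw: str) -> dict:
    """Extract the performance-related fields from one log entry (hypothetical helper)."""
    entry = json.loads(raw)
    props = entry.get("properties", {})
    return {
        "time": entry.get("time"),
        "operation": entry.get("operationName"),
        "category": entry.get("category"),
        "activity_id": props.get("activityId"),
        "duration_ms": props.get("durationMs"),
        "request_charge": props.get("requestCharge"),
        # The write sample carries no query text, so this may be None.
        "query": props.get("requestContent", {}).get("query"),
    }


summary = summarize_entry(SAMPLE_ENTRY)
print(summary["operation"], summary["duration_ms"], summary["request_charge"])
```

Using `.get` with defaults keeps the helper usable for both entry shapes, since the write sample has no `requestContent` block.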
The core problem Cosmos DB diagnostic logs solve is providing granular visibility into the performance of individual operations, especially queries and data modifications, and their associated Request Unit (RU) consumption. Without these logs, you’re largely reliant on aggregated metrics in the Azure portal, which often lack the detail needed to pinpoint specific inefficient queries or identify the exact operations causing RU throttling.
How it works internally:
Cosmos DB, as a distributed database, needs to track the cost (in RUs) and latency of every operation. Diagnostic logs are the mechanism by which this detailed, per-request information is surfaced externally. When a client makes a request (e.g., a SELECT query, a POST to create a document, or a PUT to update), the Cosmos DB backend processes it, calculates the RUs consumed, measures the duration, and then logs this information.
The QueryStoreRuntime category is crucial for query analysis. It captures details about query execution, including the query text itself, the number of documents scanned, and the actual RUs consumed by that specific query. The DataPlaneRequests category is broader and logs all data plane operations (Create, Read, Update, Delete, Query) and their RU costs.
Your control levers:
- Enabling Diagnostic Settings: This is the first step. You need to configure your Cosmos DB account to send these logs to a destination.
  - Destination: Azure Storage, Log Analytics workspace, or Event Hubs. Log Analytics is often preferred for its querying capabilities.
  - Categories: Select QueryStoreRuntime and DataPlaneRequests. You might also consider Audit for security-related events.
  - Example Azure CLI command to send to Log Analytics:
        az monitor diagnostic-settings create \
          --name "cosmos-diagnostics-settings" \
          --resource "/subscriptions/YOUR_SUB_ID/resourceGroups/YOUR_RG/providers/Microsoft.DocumentDB/databaseAccounts/YOUR_COSMOS_ACCOUNT" \
          --workspace "/subscriptions/YOUR_SUB_ID/resourceGroups/YOUR_RG/providers/Microsoft.OperationalInsights/workspaces/YOUR_WORKSPACE_NAME" \
          --logs '[{"category": "QueryStoreRuntime", "enabled": true}, {"category": "DataPlaneRequests", "enabled": true}]'
- Configuring Retention Policies: In your chosen destination (e.g., Log Analytics), set data retention policies to manage storage costs.
- Querying the Logs (Log Analytics Example): Once logs are flowing, you use Kusto Query Language (KQL) to analyze them.
  - Find slow queries (over 100ms):
        AzureDiagnostics
        | where ResourceProvider == "MICROSOFT.DOCUMENTDB"
        | where Category == "QueryStoreRuntime"
        | where OperationName == "Query"
        | where durationMs_d > 100
        | project TimeGenerated, durationMs_d, requestCharge_d, query_s = tostring(parse_json(properties_s).requestContent.query)
        | order by durationMs_d desc
  - Identify high RU-consuming queries:
        AzureDiagnostics
        | where ResourceProvider == "MICROSOFT.DOCUMENTDB"
        | where Category == "QueryStoreRuntime"
        | where OperationName == "Query"
        | project TimeGenerated, durationMs_d, requestCharge_d, query_s = tostring(parse_json(properties_s).requestContent.query)
        | order by requestCharge_d desc
        | take 20
  - Analyze RU consumption by operation type:
        AzureDiagnostics
        | where ResourceProvider == "MICROSOFT.DOCUMENTDB"
        | where Category == "DataPlaneRequests"
        | summarize sum(requestCharge_d) by OperationName
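For logs sent to Azure Storage or Event Hubs rather than Log Analytics, the same three analyses can be approximated in plain code. The Python sketch below is one way to do that, assuming each entry has already been parsed into a dict shaped like the sample entries earlier in this article. The function names and the inline sample data are made up for illustration.

```python
from collections import defaultdict

# Hypothetical parsed entries, shaped like the sample log entries in this article.
ENTRIES = [
    {"operationName": "Query", "category": "QueryStoreRuntime",
     "properties": {"durationMs": 150, "requestCharge": 10.5,
                    "requestContent": {"query": "SELECT * FROM c WHERE c.price < 100"}}},
    {"operationName": "Query", "category": "QueryStoreRuntime",
     "properties": {"durationMs": 40, "requestCharge": 2.75,
                    "requestContent": {"query": "SELECT * FROM c WHERE c.id = 'a1'"}}},
    {"operationName": "Write", "category": "DataPlaneRequests",
     "properties": {"durationMs": 80, "requestCharge": 5.5}},
    {"operationName": "Write", "category": "DataPlaneRequests",
     "properties": {"durationMs": 60, "requestCharge": 4.5}},
    {"operationName": "Read", "category": "DataPlaneRequests",
     "properties": {"durationMs": 5, "requestCharge": 1.0}},
]


def is_query(entry):
    """True for entries in the query category used by the KQL examples."""
    return entry["category"] == "QueryStoreRuntime" and entry["operationName"] == "Query"


def slow_queries(entries, threshold_ms=100):
    """Queries slower than threshold_ms, slowest first."""
    hits = [e for e in entries if is_query(e) and e["properties"]["durationMs"] > threshold_ms]
    return sorted(hits, key=lambda e: e["properties"]["durationMs"], reverse=True)


def top_ru_queries(entries, limit=20):
    """The `limit` most expensive queries by request charge."""
    hits = [e for e in entries if is_query(e)]
    return sorted(hits, key=lambda e: e["properties"]["requestCharge"], reverse=True)[:limit]


def ru_by_operation(entries):
    """Total request charge per operation name across data plane entries."""
    totals = defaultdict(float)
    for e in entries:
        if e["category"] == "DataPlaneRequests":
            totals[e["operationName"]] += e["properties"]["requestCharge"]
    return dict(totals)


print([e["properties"]["requestContent"]["query"] for e in slow_queries(ENTRIES)])
print(ru_by_operation(ENTRIES))
```

Each function mirrors one of the KQL queries: a filter plus sort, a sort plus `take`, and a `summarize sum(...) by`.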
The most surprising thing is how often QueryStoreRuntime logs reveal queries that look simple but are performing full collection scans due to missing or incorrect indexing, or are simply structured in a way that forces Cosmos DB to read far more data than necessary. The properties_s.requestContent.query field in the logs is your direct window into the client’s request, and comparing it with the requestCharge_d and durationMs_d is key.
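One rough way to put that comparison into practice is to group entries by query text and look at how many RUs each query spends per document it returns. The Python sketch below does this under stated assumptions: the observations are invented sample data, the `CONTAINS` query is a hypothetical example of an expensive query, and the threshold of 50 RUs per document is an arbitrary illustration, not a recommended value.

```python
from collections import defaultdict

# Hypothetical (query text, requestCharge, documentCount) tuples taken from parsed entries.
OBSERVATIONS = [
    ("SELECT * FROM c WHERE c.category = 'electronics' AND c.price < 100", 10.5, 5),
    ("SELECT * FROM c WHERE c.category = 'electronics' AND c.price < 100", 12.5, 5),
    ("SELECT * FROM c WHERE CONTAINS(c.description, 'sale')", 480.0, 2),
]


def ru_per_document_by_query(observations):
    """Total RUs divided by total documents returned, per distinct query text."""
    totals = defaultdict(lambda: [0.0, 0])
    for query, charge, docs in observations:
        totals[query][0] += charge
        totals[query][1] += docs
    # max(docs, 1) avoids dividing by zero for queries that returned nothing.
    return {q: charge / max(docs, 1) for q, (charge, docs) in totals.items()}


def suspicious_queries(observations, max_ru_per_doc=50.0):
    """Query texts whose RU-per-document ratio exceeds the (arbitrary) threshold."""
    ratios = ru_per_document_by_query(observations)
    flagged = [q for q, ratio in ratios.items() if ratio > max_ru_per_doc]
    return sorted(flagged, key=ratios.get, reverse=True)


for query in suspicious_queries(OBSERVATIONS):
    print(query)
```

A high ratio is only a hint that a query reads far more data than it returns; the indexing policy and the query shape still need to be checked by hand.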
The next hurdle is correlating these diagnostic logs with application-level tracing to understand the context of why a specific query was executed at a particular time.
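A minimal sketch of that correlation, in Python, joins log entries to application traces on the activity ID. It assumes the application records the activity ID of each Cosmos DB response next to its own trace data. The trace records, the `traceId` and `endpoint` fields, and the `correlate` function are all hypothetical.

```python
# Hypothetical application-side trace records. This assumes the application
# stores the activity ID of each Cosmos DB response next to its own trace data.
APP_TRACES = [
    {"activityId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
     "traceId": "trace-001", "endpoint": "GET /products"},
    {"activityId": "f0e9d8c7-b6a5-4321-0987-fedcba098765",
     "traceId": "trace-002", "endpoint": "POST /orders"},
]

# Parsed diagnostic log entries, trimmed to the fields used here.
LOG_ENTRIES = [
    {"operationName": "Query",
     "properties": {"activityId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
                    "durationMs": 150, "requestCharge": 10.5}},
    {"operationName": "Write",
     "properties": {"activityId": "f0e9d8c7-b6a5-4321-0987-fedcba098765",
                    "durationMs": 80, "requestCharge": 5.2}},
    {"operationName": "Read",
     "properties": {"activityId": "00000000-0000-0000-0000-000000000000",
                    "durationMs": 4, "requestCharge": 1.0}},
]


def correlate(log_entries, app_traces):
    """Attach application context to each log entry that has a matching activity ID."""
    traces_by_activity = {t["activityId"]: t for t in app_traces}
    joined = []
    for entry in log_entries:
        props = entry["properties"]
        trace = traces_by_activity.get(props["activityId"])
        if trace is None:
            continue  # no application trace was recorded for this request
        joined.append({
            "traceId": trace["traceId"],
            "endpoint": trace["endpoint"],
            "operation": entry["operationName"],
            "durationMs": props["durationMs"],
            "requestCharge": props["requestCharge"],
        })
    return joined


for row in correlate(LOG_ENTRIES, APP_TRACES):
    print(row)
```

Building the lookup dict once keeps the join linear in the number of entries, which matters when a log export holds many requests.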