Cosmos DB’s multi-region write feature is powerful for global availability, but it can produce write conflicts. A conflict occurs when two or more clients update the same document concurrently in different regions, before either write has replicated to the other region. The container’s conflict resolution policy then decides which version survives.

Let’s see this in action. Imagine two users, one in New York and one in London, both trying to update the quantity of the same product in an inventory database.

// Document before concurrent writes
{
  "id": "product-123",
  "name": "Wireless Mouse",
  "quantity": 10
}

User in New York: Reads quantity = 10. Decrements quantity by 1. Writes quantity = 9.
User in London: Reads quantity = 10. Decrements quantity by 1. Writes quantity = 9.

Each region accepts its local write, so both end up with quantity = 9. Two units were sold, so the correct value is 8, but neither write knows about the other. When the writes replicate, Cosmos DB detects two versions of the same document and has to choose between them or merge them. Without a suitable conflict resolution strategy, one decrement is silently lost. This is a write conflict.
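The scenario can be reproduced without a database. The sketch below is a hypothetical in-memory model (the function name and the two plain objects standing in for regional replicas are assumptions, not Cosmos DB APIs); it only shows why two correct-looking local writes add up to a wrong global result.

```javascript
// Hypothetical in-memory model: each region holds its own replica of the document.
function simulateConcurrentDecrements(initialQuantity) {
  const newYork = { id: "product-123", quantity: initialQuantity };
  const london = { id: "product-123", quantity: initialQuantity };

  // Both users read before either write has replicated.
  const readInNewYork = newYork.quantity;
  const readInLondon = london.quantity;

  // Each user decrements the value they read and writes to the local region.
  newYork.quantity = readInNewYork - 1;
  london.quantity = readInLondon - 1;

  return {
    newYork: newYork.quantity,
    london: london.quantity,
    // Two units were sold, so this is the value the business expects.
    expected: initialQuantity - 2,
  };
}

const result = simulateConcurrentDecrements(10);
console.log(result); // { newYork: 9, london: 9, expected: 8 }
```

Whichever of the two versions the conflict policy keeps, the stored quantity is 9, not the expected 8.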

The core problem Cosmos DB solves with multi-region writes is maintaining low-latency data access and high availability for a globally distributed user base. When a write happens in a region, it needs to be propagated to other regions. If two writes targeting the same resource occur in different regions before the first write has been fully replicated, a conflict arises.

Cosmos DB handles these conflicts with a configurable conflict resolution policy, set per container. When a conflict is detected, Cosmos DB applies the chosen policy to determine which version of the document to keep. The default policy is "Last Writer Wins" (LWW); the alternative is a custom policy.

With LWW, the system compares the _ts (timestamp) property of the conflicting versions by default. The version with the higher _ts value wins and is propagated to all regions; the other version is discarded. In the API for NoSQL, you can point the policy at a different numeric property instead of _ts.
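As a rough illustration of the LWW rule, the sketch below picks the version with the highest value at a numeric path. The helper name lastWriterWins, the sample timestamps, and the tie-breaking behavior (the first version seen wins) are assumptions for illustration; the real resolution happens inside the service.

```javascript
// Hypothetical helper: pick the winner among conflicting versions by a numeric path.
// Expects at least one version; on a tie, the earlier entry in the array is kept.
function lastWriterWins(versions, path = "_ts") {
  return versions.reduce((winner, candidate) =>
    candidate[path] > winner[path] ? candidate : winner
  );
}

const fromNewYork = { id: "product-123", quantity: 9, _ts: 1700000005 };
const fromLondon = { id: "product-123", quantity: 7, _ts: 1700000007 };

const winner = lastWriterWins([fromNewYork, fromLondon]);
console.log(winner.quantity, winner._ts); // 7 1700000007
```

The New York write is dropped entirely here; nothing from it is merged into the winning version.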

Diagnosis:

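Cross-region write conflicts do not normally surface as errors on the write itself: each region accepts its local write, and the conflict is detected later, during replication. With LWW, the losing version is discarded automatically, so the usual symptom is a lost update rather than an exception. A 409 Conflict on a write typically means an item with the same id (or a unique key value) already exists, which is a different problem.

Conflicts that Cosmos DB cannot resolve automatically, for example under a custom policy with no stored procedure registered or with a stored procedure that fails, are recorded in the container’s conflicts feed. You read the feed through the SDK; for example, with the JavaScript SDK:

// Read all unresolved conflicts for a container
const { resources: conflicts } = await container.conflicts.readAll().fetchAll();

Each entry describes one conflicting operation, including the operation type and the content of the conflicting version.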

Common Causes and Fixes:

  1. Default Last Writer Wins (LWW) Insufficient for Business Logic:

    • Diagnosis: You observe that the LWW policy is discarding valid changes. For instance, if one write increments a counter and another decrements it, LWW keeps only the write with the higher _ts and loses the intended aggregate.
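    • Fix: Implement a custom conflict resolution policy. This is configured on the container, at container creation time. You register a stored procedure that Cosmos DB executes when a conflict is detected. The stored procedure receives the incoming item, the currently committed item, a flag indicating whether the incoming item is a delete, and any other unresolved conflicting items; it writes the resolved version back to the container itself rather than returning it.
      // Simplified sketch of a custom resolver; Cosmos DB calls it with this signature.
      // conflictingItems (earlier unresolved conflicts) are ignored here for brevity.
      function resolver(incomingItem, existingItem, isTombstone, conflictingItems) {
          var collection = getContext().getCollection();

          // A delete wins, and with no incoming item there is nothing to merge
          if (isTombstone || !incomingItem) {
              return;
          }

          // No committed version yet: accept the incoming item as-is
          if (!existingItem) {
              collection.createDocument(collection.getSelfLink(), incomingItem, throwOnError);
              return;
          }

          // Custom logic: keep the lower stock count so inventory is never overstated.
          // Summing two absolute quantities (9 + 9) would be wrong for this example.
          var merged = existingItem;
          if (typeof incomingItem.quantity === 'number' && typeof existingItem.quantity === 'number') {
              merged.quantity = Math.min(incomingItem.quantity, existingItem.quantity);
          }
          // Add other field-level rules here. Do not set _ts; the service manages it.

          collection.replaceDocument(existingItem._self, merged, throwOnError);

          function throwOnError(err) {
              if (err) throw err;
          }
      }

      You then reference this stored procedure in the container’s conflict resolution policy when you create the container.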
    • Why it works: This gives you explicit control over how conflicting updates are merged, ensuring that your specific business rules are applied even when concurrent writes occur.
  2. High Concurrency on the Same Document:

    • Diagnosis: Many clients update the same items at nearly the same time, so lost updates, conflicts feed entries, or (if you use ETags) 412 Precondition Failed responses are frequent.
    • Fix: Optimize your application logic to minimize concurrent writes to the same document. This could involve:
      • Batching: Grouping related updates into a single document write.
      • Optimistic Concurrency Control (OCC): Using ETags. When reading a document, you get its ETag. When writing, you send that ETag in the If-Match header. If the ETag on the server doesn’t match the one you sent, the document has changed and your write fails with 412 Precondition Failed, allowing your application to re-read and retry. Cosmos DB generates and updates ETags automatically. Note that the ETag check runs in the region that serves the write, so it guards against concurrent writers in the same region; writes in different regions still go through the conflict resolution policy.
      • Partition Key Design: A well-distributed partition key does not stop two writers from touching the same document, but it avoids "hot partitions" where a single partition receives a disproportionate amount of traffic, and the throttling and retries there widen the window in which writes overlap.
    • Why it works: Reduces the probability of two or more clients attempting to modify the same data simultaneously. OCC explicitly signals when a conflict has occurred at the application level, giving you a chance to react before data is lost.
  3. Network Latency and Replication Lag:

    • Diagnosis: Conflicts occur sporadically, even with seemingly low application-level concurrency. This might be due to transient network issues or periods of high replication lag between regions.
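    • Fix: Implement retry logic in your application with exponential backoff. When a conditional write fails because the document changed (412 Precondition Failed), wait for a short, increasing period, re-read the document, and retry. The SDK already retries throttled requests (429) on its own.
      // Example C# retry logic. Product, id and partitionKey are placeholders.
      const int maxRetries = 5;
      for (int retryCount = 0; ; retryCount++)
      {
          try
          {
              // Re-read on every attempt to get the latest version and its ETag
              ItemResponse<Product> read = await container.ReadItemAsync<Product>(id, partitionKey);
              Product item = read.Resource;
              item.Quantity -= 1;
              await container.ReplaceItemAsync(item, id, partitionKey,
                  new ItemRequestOptions { IfMatchEtag = read.ETag });
              break;
          }
          catch (CosmosException ex) when (
              ex.StatusCode == HttpStatusCode.PreconditionFailed && retryCount < maxRetries)
          {
              // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
              await Task.Delay(TimeSpan.FromMilliseconds(Math.Pow(2, retryCount) * 100));
          }
      }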
      
    • Why it works: Gives the system time for replication to catch up and for the conflicting write to be resolved before retrying the original operation.
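To make the backoff schedule concrete, here is a small sketch of the delay calculation, using the same base of 100 ms doubled per retry. The helper name backoffDelayMs, the 5-second cap, and the optional jitter are assumptions added for illustration; they are not part of any Cosmos DB SDK.

```javascript
// Hypothetical helper: delay before retry number `retryCount` (0-based).
// The delay doubles each time and is capped at maxMs. If a `random` function
// is supplied, the delay is spread uniformly below that ceiling (jitter).
function backoffDelayMs(retryCount, { baseMs = 100, maxMs = 5000, random = null } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** retryCount);
  return random ? Math.floor(random() * ceiling) : ceiling;
}

console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n))); // [ 100, 200, 400, 800 ]
```

Passing `{ random: Math.random }` adds jitter, which can help when many clients back off at the same moment and would otherwise all retry together.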
  4. Incorrect Timestamp (_ts) Handling (Rare):

    • Diagnosis: The application sets or misinterprets _ts, leading to unexpected LWW behavior. This is unlikely, because _ts is assigned by Cosmos DB.
    • Fix: Ensure your application code does not attempt to set or modify the _ts field. Reading it is fine; treat it as a read-only, system-generated property. If you configure LWW with a custom numeric property instead of _ts, your application is responsible for keeping that property correct.
    • Why it works: Prevents accidental interference with the core mechanism Cosmos DB uses for LWW conflict resolution.
  5. Stale Reads Leading to Stale Writes:

    • Diagnosis: An application reads data, performs some computation, and then writes it back. By the time the write occurs, another write might have already happened, causing a conflict. This is a classic race condition.
    • Fix: Use Optimistic Concurrency Control (OCC) with the document’s ETag. When you read a document, Cosmos DB returns its current ETag (also exposed as the _etag property). Send that value in the If-Match header of your subsequent update; the SDKs expose this as a request option. If the document has been modified since you read it, the ETag will have changed, and the write will fail with 412 Precondition Failed. Your application can then re-read the latest version and re-apply its changes.
      // When updating, send the ETag in the If-Match request header; an _etag in the body alone is not checked
      If-Match: "00000000-0000-0000-0000-000000000001"   (example ETag)

      {
        "id": "product-123",
        "name": "Wireless Mouse",
        "quantity": 9
      }
      
    • Why it works: OCC explicitly detects if the data you’re trying to update has been changed by another process, preventing accidental overwrites and forcing your application to re-evaluate its changes against the latest data.
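The read, check, retry cycle can be sketched against a toy store. Everything below is a hypothetical in-memory stand-in (the InMemoryContainer class, its methods, and the counter-based ETags are assumptions, not the Cosmos DB SDK), and it models a single region only; across regions the conflict resolution policy still applies.

```javascript
// Hypothetical in-memory store that mimics ETag checks; not the Cosmos DB SDK.
class InMemoryContainer {
  constructor() {
    this.items = new Map();
    this.version = 0;
  }

  // Stores the item and assigns a fresh ETag on every write.
  upsert(item) {
    const stored = { ...item, _etag: `"${++this.version}"` };
    this.items.set(item.id, stored);
    return { ...stored };
  }

  read(id) {
    return { ...this.items.get(id) };
  }

  // Replaces the item only if the caller's ETag matches the stored one.
  replace(item, ifMatchEtag) {
    const current = this.items.get(item.id);
    if (current._etag !== ifMatchEtag) {
      const err = new Error("Precondition Failed");
      err.code = 412;
      throw err;
    }
    return this.upsert(item);
  }
}

// Read-modify-write loop: on a 412, re-read and re-apply the change.
function decrementQuantity(container, id, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const current = container.read(id);
    try {
      return container.replace({ ...current, quantity: current.quantity - 1 }, current._etag);
    } catch (err) {
      if (err.code !== 412 || attempt === maxAttempts) throw err;
    }
  }
}

const container = new InMemoryContainer();
container.upsert({ id: "product-123", name: "Wireless Mouse", quantity: 10 });

// Two clients read the same version.
const seenByA = container.read("product-123");
const seenByB = container.read("product-123");

// A writes first and succeeds.
container.replace({ ...seenByA, quantity: seenByA.quantity - 1 }, seenByA._etag);

// B's write carries a stale ETag and is rejected.
let rejected = false;
try {
  container.replace({ ...seenByB, quantity: seenByB.quantity - 1 }, seenByB._etag);
} catch (err) {
  rejected = err.code === 412;
}

// B retries from a fresh read, so both decrements are kept.
decrementQuantity(container, "product-123");
console.log(rejected, container.read("product-123").quantity); // true 8
```

The stale write is refused instead of overwriting A's change, and the retry starts from quantity 9, so the final value reflects both sales.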
  6. Under-provisioned Request Units (RUs) during Peaks:

    • Diagnosis: While not directly a "write conflict" in terms of data versions, severe RU throttling can manifest as write failures and retries, which can increase the likelihood of actual data conflicts occurring when operations eventually succeed.
    • Fix: Monitor your RU consumption and provision enough RUs for your workload, especially during peak times. Consider using autoscale RU settings.
      # Check the provisioned autoscale maximum (RU/s) for a container
      # Actual consumption is reported in Azure Monitor metrics
      az cosmosdb sql container throughput show --account-name <your_account> --database-name <your_db> --name <your_container> --resource-group <your_rg> --query "resource.autoscaleSettings.maxThroughput"
      
    • Why it works: Ensures that your database can handle the throughput of your operations, reducing throttling and the associated retry storms that can exacerbate concurrency issues.

The next hurdle you’ll likely encounter after resolving write conflicts is understanding how to optimize query performance across multiple regions, especially when dealing with eventual consistency guarantees.
