DynamoDB’s write sharding mechanism failed to distribute write traffic evenly, causing a specific partition to become a bottleneck and reject requests.
The most common culprit is simply a very hot key. This isn’t about the total write throughput of your table, but about the rate at which a single partition is being written to. DynamoDB places each item on a partition by hashing its partition key; the sort key only orders items that share a partition key. If many writes target the same partition key value, or if your key design makes writes cluster on a few values, you’ll hit this.
Diagnosis:
Check your CloudWatch metrics for WriteThrottleEvents. If this number is consistently high, you’re being throttled. The metric is reported per table and per global secondary index, not per key, so on its own it cannot tell you which key is hot. To find that, enable CloudWatch Contributor Insights for DynamoDB on the table and look at its reports of the most-throttled and most-accessed keys. If a single partition key accounts for a disproportionate share of throttled writes, that’s your hot partition.
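Contributor Insights is one way to get a per-key view of throttling. The sketch below shows how it might be switched on from Python. It assumes `client` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`); the helper name and the table and index names in the usage note are made up for illustration.

```python
def enable_contributor_insights(client, table_name, index_name=None):
    """Turn on Contributor Insights for a table, or for one of its GSIs.

    `client` is assumed to be a boto3 DynamoDB client. Passing it in keeps
    this helper free of any AWS setup of its own.
    """
    params = {
        "TableName": table_name,
        "ContributorInsightsAction": "ENABLE",
    }
    if index_name is not None:
        # Optional: target a global secondary index instead of the base table.
        params["IndexName"] = index_name
    return client.update_contributor_insights(**params)
```

A call might look like `enable_contributor_insights(boto3.client("dynamodb"), "YourTableName")`. Once enabled, the key-level reports appear in the DynamoDB console and in CloudWatch.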
Cause 1: Poor Partition Key Choice
You’ve chosen a partition key that has low cardinality or an uneven distribution of values. For example, using a boolean flag like isActive as a partition key means only two possible values, and if most items are true, one partition will be overloaded.
Diagnosis Command:
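aws dynamodb query --table-name YourTableName --select COUNT --key-condition-expression "partitionKeyAttribute = :pkval" --expression-attribute-values '{":pkval": {"S": "the_hot_partition_key_value"}}'
Run this for the suspected hot partition key value. A query reads only the items under that key, whereas a scan with a filter reads the whole table and consumes read capacity for every item it touches. If the count is very high compared to other partition key values, the key is a likely hotspot. Keep in mind that item count is only a proxy: what gets throttled is the write rate, so a key with few items can still be hot if those items are updated constantly.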
Fix:
Re-architect your schema. The best solution is to choose a partition key with high cardinality and even distribution. If you cannot change the underlying attribute, add a shard suffix to the partition key value itself. For example, if your partition key is userId and a few users have millions of items, pick a shard number (e.g., 0 to 15) on each write and store the item under userId#shard. The shard has to be part of the partition key: putting it in the sort key leaves every item under the same partition key value. The cost is on the read side, where fetching all of a user’s items means querying each of the 16 key values and merging the results.
Why it works: Distributes writes across multiple physical partitions, so no single partition has to absorb more than the per-partition limit (roughly 1,000 write capacity units per second, regardless of how much the table has provisioned).
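A minimal sketch of the shard-suffix idea, assuming 16 shards and a `userId#shard` key format. The helper names and the `#` separator are illustrative choices, not anything DynamoDB prescribes.

```python
import random

SHARD_COUNT = 16  # assumed; size it to the write rate of your hottest key


def sharded_key(user_id, shard_count=SHARD_COUNT, rng=random):
    """Return a partition key value such as 'user-42#7' for a single write."""
    shard = rng.randrange(shard_count)
    return f"{user_id}#{shard}"


def all_shard_keys(user_id, shard_count=SHARD_COUNT):
    """Return every partition key value a reader must query for this user."""
    return [f"{user_id}#{shard}" for shard in range(shard_count)]
```

Writers call `sharded_key` once per item. Readers loop over `all_shard_keys`, issue one query per value, and merge the results, which is the read-side price of spreading the writes.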
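Cause 2: Time-Series Data with Sequential Keys
If your partition key is a coarse timestamp (for example, the current date or hour) or is otherwise derived from the current time window, all new writes go to the same partition key value, and therefore the same partition, until the value rolls over. Unique, fine-grained values such as per-item IDs are hashed and spread out on their own; the problem is a key that stays constant for a whole window of writes.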
Diagnosis:
Look at your WriteThrottleEvents metric. If the throttling occurs consistently with new data being written, and your partition key is a timestamp or sequential ID, this is likely it.
Fix:
Introduce a random or hashed prefix to your partition key. For example, if your partition key is the date, change it to shard_id#date, where shard_id is either a random number in a fixed range or a value computed from another attribute of the item. A common approach is a modulo of a hash: for 16 shards, compute shard_id = hash(userId) % 16 and write the item under shard_id#date. A computed shard lets you find a given user’s items for a date with a single query, while a random shard spreads load more evenly but forces reads to check every shard. Use a hash that is stable across processes and deployments, since the same userId must always map to the same shard.
Why it works: The random or hashed prefix spreads writes across many partitions, even if the timestamp is the same.
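The computed-shard variant could be sketched as follows. It assumes a date string as the time component and a `shard#date` key layout, both chosen for illustration. It uses SHA-256 instead of Python's built-in `hash()`, because the built-in is randomized per process for strings and would send the same user to different shards on different hosts.

```python
import hashlib

SHARD_COUNT = 16  # assumed shard count


def shard_for(user_id, shard_count=SHARD_COUNT):
    """Map a user ID to a stable shard number in [0, shard_count)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer, then reduce modulo shards.
    return int.from_bytes(digest[:8], "big") % shard_count


def time_bucket_key(day, user_id, shard_count=SHARD_COUNT):
    """Build a partition key value of the form '<shard>#<day>'."""
    return f"{shard_for(user_id, shard_count)}#{day}"
```

Because the shard is derived from the user ID, reading one user's data for a day needs a single query on `time_bucket_key(day, user_id)`; reading everyone's data for that day still means querying all 16 prefixes.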
Cause 3: High-Frequency Updates to a Single Item
A single item may be updated extremely frequently (e.g., a counter or a status flag). Every one of those writes lands on the same partition, which may also be serving writes for other items.
Diagnosis: Examine your application logic. Are there any items that are updated much more frequently than others? If so, investigate the partition key of that item.
Fix: If it’s a counter, use DynamoDB’s atomic counters, but be aware that frequent updates to the same item can still cause throttling if that item’s partition is hot. For very high-frequency updates on a single logical entity, consider a different data model. Perhaps denormalize the data or use a separate table with a different sharding strategy for high-traffic counters. Another approach is to introduce a small, random delay or a jitter before updating the item.
Why it works: While not directly sharding the item, it can help smooth out bursty write traffic to that specific partition.
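One data-model alternative for hot counters is to split the counter across several items and sum them on read. The sketch below only builds the parameters for a boto3 `update_item` call and sums the results; the table name, the `pk` key attribute, the `total` attribute, and the shard count are all hypothetical.

```python
import random

COUNTER_SHARDS = 10  # assumed; more shards means more write headroom


def increment_request(table_name, counter_id, amount=1,
                      shards=COUNTER_SHARDS, rng=random):
    """Build UpdateItem parameters that add `amount` to one random shard item."""
    shard = rng.randrange(shards)
    return {
        "TableName": table_name,
        "Key": {"pk": {"S": f"{counter_id}#{shard}"}},
        # ADD creates the attribute if it is missing, then increments atomically.
        "UpdateExpression": "ADD #total :inc",
        "ExpressionAttributeNames": {"#total": "total"},
        "ExpressionAttributeValues": {":inc": {"N": str(amount)}},
    }


def counter_total(shard_items):
    """Sum the `total` attribute across shard items read back from the table."""
    return sum(int(item["total"]["N"]) for item in shard_items)
```

With a real client this might be used as `client.update_item(**increment_request("Counters", "page-views"))`. The trade-off is that reading the counter now takes one read per shard, and the total is only as fresh as the slowest of those reads.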
Cause 4: Insufficient Provisioned Throughput
This is the most basic cause, but often overlooked when focusing on sharding. Your table or index might simply not have enough provisioned write capacity units (WCUs) to handle the aggregate write load, let alone a hot partition.
Diagnosis:
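Compare ConsumedWriteCapacityUnits with ProvisionedWriteCapacityUnits in CloudWatch. ConsumedWriteCapacityUnits is reported as a total for each period, so take the Sum statistic and divide it by the number of seconds in the period to get a per-second rate. If WriteThrottleEvents is high and that rate is consistently close to the provisioned value, you need more capacity. If throttling occurs while consumption is well below the provisioned value, suspect a hot partition instead.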
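The comparison is easy to get wrong because the consumed metric is a per-period total while provisioned capacity is a per-second figure. A small worked sketch of the arithmetic follows; the function name and the sample numbers are illustrative only.

```python
def write_utilization(consumed_sum, period_seconds, provisioned_wcu):
    """Return the fraction of provisioned write capacity actually used.

    consumed_sum    -- Sum of ConsumedWriteCapacityUnits over one period
    period_seconds  -- length of that period in seconds (e.g. 60)
    provisioned_wcu -- provisioned write capacity units (a per-second value)
    """
    per_second = consumed_sum / period_seconds
    return per_second / provisioned_wcu
```

For example, a one-minute Sum of 54,000 against 1,000 provisioned WCUs is 900 writes per second, or 90% utilization: throttling at that level points to a capacity shortfall. The same throttling at 30% utilization points to uneven key distribution.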
Fix: Increase your provisioned write capacity units for the table or the specific index experiencing throttling. If using On-Demand mode, check that you have not hit the table-level or account-level throughput quotas, and remember that an on-demand table can still throttle when traffic jumps far above its previous peak faster than DynamoDB can scale.
Why it works: Directly provides more throughput to the system.
Cause 5: Aggressive Retry Logic
Client-side retry logic that is too aggressive or not implemented with exponential backoff can exacerbate throttling. A burst of throttled requests can lead to a flood of retries hitting the same hot partition, making the problem worse.
Diagnosis: Monitor your application logs for retry attempts and their frequency. Observe if throttled requests are immediately retried without a significant delay.
Fix: Implement or tune your client’s retry mechanism to use exponential backoff with jitter. This means retrying after increasing intervals (e.g., 50ms, 100ms, 200ms, etc.) and adding a small random delay to avoid synchronized retries from multiple clients.
Why it works: Reduces the effective write rate by spacing out retries, giving the hot partition time to recover.
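A minimal sketch of exponential backoff with jitter. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and the current exponential ceiling, rather than adding a small random amount to a fixed delay. `ThrottledError` is a stand-in exception defined here for illustration; in real code you would catch whatever throttling error your SDK raises, and most AWS SDKs already apply their own retry policy that you may only need to tune.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error such as ProvisionedThroughputExceededException."""


def backoff_delays(base=0.05, cap=5.0, attempts=5, rng=random):
    """Yield one delay in seconds per retry, uniform in [0, min(cap, base * 2**n)]."""
    for attempt in range(attempts):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_backoff(operation, attempts=5, sleep=time.sleep, **delay_opts):
    """Run operation(); on ThrottledError, sleep and retry up to `attempts` times."""
    for delay in backoff_delays(attempts=attempts, **delay_opts):
        try:
            return operation()
        except ThrottledError:
            sleep(delay)
    # Final try after the last sleep: let any error propagate to the caller.
    return operation()
```

With the defaults, the delay ceilings are 50 ms, 100 ms, 200 ms, 400 ms, and 800 ms. The `sleep` parameter is injectable so the retry loop can be exercised in tests without actually waiting.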
Cause 6: Hot Secondary Indexes
Throttling can also occur on secondary indexes if the access patterns on the index cause a hot partition within the index itself. This is less common than hot partitions on the primary table but is a real possibility.
Diagnosis:
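Check the CloudWatch metrics for your Global Secondary Indexes (GSIs). WriteThrottleEvents is reported per index through the GlobalSecondaryIndexName dimension, and a throttled GSI also pushes back on writes to the base table. Local Secondary Indexes (LSIs) have no capacity or throttle metrics of their own: they share the base table’s partitions and throughput, so LSI writes show up in the table’s metrics. To see which index keys are hot, enable Contributor Insights on the GSI.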
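Fix: Similar to the primary table, re-evaluate the partition key design for your secondary index. Every write to the base table that touches an indexed attribute is also written to the index, so a GSI keyed on a low-cardinality attribute (a status field, for example) can be hot even when the table’s own key is well distributed. If the problem is due to a specific GSI, you might need to provision higher write capacity for that GSI or redesign its key, for instance by sharding it the same way as a table key.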
Why it works: Addresses the bottleneck at the index level, similar to how addressing the primary table’s partition key works.
After fixing hot partitions, you might encounter ProvisionedThroughputExceededException on reads if your read patterns are also concentrated, or you might see a spike in ConsumedWriteCapacityUnits as your previously throttled writes now consume capacity.