DynamoDB metrics are actually a push system, not a pull system: DynamoDB publishes data points to CloudWatch at intervals, and CloudWatch stores them until you query them. CloudWatch does not poll DynamoDB.
Let’s see this in action. Imagine you have a DynamoDB table named my-users-table and you want to monitor its read capacity.
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ConsumedReadCapacityUnits \
--dimensions Name=TableName,Value=my-users-table \
--start-time 2023-10-26T10:00:00Z \
--end-time 2023-10-26T11:00:00Z \
--period 300 \
--statistics Sum
This command fetches the total ConsumedReadCapacityUnits for my-users-table for each 5-minute period within the specified hour. The output will look something like this:
{
"Datapoints": [
{
"Timestamp": "2023-10-26T10:05:00Z",
"Sum": 1234.5,
"Unit": "Count"
},
{
"Timestamp": "2023-10-26T10:10:00Z",
"Sum": 1567.8,
"Unit": "Count"
}
// ... more datapoints
],
"Label": "ConsumedReadCapacityUnits"
}
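Each data point covers one interval of the length set by --period, which controls how the stored data is aggregated when you query it. The Sum statistic here represents the total consumed units within that 5-minute window. Data points are not guaranteed to come back in chronological order, so sort by Timestamp if order matters.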
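To turn a Sum data point into a rate, divide it by the period length in seconds. The sketch below does that for the sample output above. The helper name per_second_rates is hypothetical, and the data points are copied from the illustrative response rather than from a real table.

```python
import json

# Sample response, trimmed to the two data points shown above
# (deliberately listed out of order).
SAMPLE_RESPONSE = """
{
  "Datapoints": [
    {"Timestamp": "2023-10-26T10:10:00Z", "Sum": 1567.8, "Unit": "Count"},
    {"Timestamp": "2023-10-26T10:05:00Z", "Sum": 1234.5, "Unit": "Count"}
  ],
  "Label": "ConsumedReadCapacityUnits"
}
"""


def per_second_rates(response, period_seconds):
    """Return (timestamp, average units per second) pairs, oldest first."""
    # ISO-8601 timestamps in the same format sort correctly as strings.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Sum"] / period_seconds) for p in points]


if __name__ == "__main__":
    for ts, rate in per_second_rates(json.loads(SAMPLE_RESPONSE), 300):
        print(f"{ts}  {rate:.2f} units/s")
```

For a provisioned table, this per-second figure is the one to compare against the table's provisioned read capacity.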
The core problem DynamoDB monitoring with CloudWatch solves is providing visibility into your table’s performance and cost. Without it, you’re flying blind, potentially experiencing performance degradation or unexpected billing spikes without knowing why.
Internally, DynamoDB publishes a set of metrics to CloudWatch as it serves requests. CloudWatch stores these metrics grouped by namespace (AWS/DynamoDB) and then by dimension (e.g., TableName). You can then aggregate, visualize, and set alarms on these metrics.
The key levers you control are:
- Metrics Selection: Which metrics you choose to monitor. Essential ones include ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests, SuccessfulRequestLatency, and SystemErrors.
- Dimensions: How you filter metrics. For DynamoDB, the primary dimension is TableName. Some metrics take further dimensions, such as Operation (for example, on SuccessfulRequestLatency) or GlobalSecondaryIndexName. For global tables, ReplicationLatency is reported per TableName and ReceivingRegion.
- Statistics: How you aggregate data over time. Common statistics are Sum (total over the period), Average (mean value), Maximum (peak value), and Minimum (lowest value). For capacity units, Sum or Average are typical. For latency, Average or Maximum are more informative.
- Periods: The granularity at which data is aggregated when you query it or alarm on it. Shorter periods (e.g., 60 seconds) give finer-grained insights but return more data points per query. Longer periods (e.g., 300 seconds) are better suited to trend analysis.
- Alarms: Thresholds that trigger notifications when a metric breaches a certain level. This is where proactive management happens.
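As an illustration of the alarm lever, the sketch below builds the arguments for an alarm that fires when read consumption stays above 80% of provisioned capacity. It assumes a provisioned-mode table. The helper name, the alarm naming scheme, the 80% ratio, and the SNS topic ARN are all hypothetical. The dictionary keys mirror the parameters of boto3's put_metric_alarm, but nothing here calls AWS.

```python
def read_capacity_alarm(table_name, provisioned_rcu, topic_arn,
                        period_seconds=300, ratio=0.8):
    """Build put_metric_alarm arguments for a consumed-read-capacity alarm."""
    # Sum is a total over the period, so scale the threshold from
    # units per second to units per period.
    threshold = provisioned_rcu * period_seconds * ratio
    return {
        "AlarmName": f"{table_name}-read-capacity-high",  # hypothetical naming
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ConsumedReadCapacityUnits",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 2,  # require two consecutive breaching periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


if __name__ == "__main__":
    params = read_capacity_alarm(
        "my-users-table", 100,
        "arn:aws:sns:us-east-1:123456789012:example-topic",  # placeholder ARN
    )
    print(params["Threshold"])
    # To create the alarm (needs boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes the threshold arithmetic easy to review before anything is sent to CloudWatch.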
A common misconception is that all CloudWatch metrics are real-time. For DynamoDB, the data is typically available within minutes, but there’s an inherent publishing delay. DynamoDB reports most metrics at one-minute granularity, and your query or alarm then aggregates them into periods (e.g., 5 minutes). So, while a spike may hit your table right now, the metric reflecting that spike might not appear in CloudWatch for a few minutes. This delay is why setting alarms with appropriate thresholds and evaluation periods is crucial; you don’t want to react after the problem has already caused significant user impact.
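To make the aggregation step concrete, the sketch below rolls one-minute Sum values up into 5-minute periods, in the same spirit as a query with --period 300. The function name and the sample values are made up for illustration, and aligning buckets to multiples of the period since the epoch is a simplifying assumption, not a description of CloudWatch internals.

```python
from collections import defaultdict
from datetime import datetime, timezone


def roll_up(minute_sums, period_seconds=300):
    """Sum (timestamp, value) pairs into buckets of period_seconds."""
    buckets = defaultdict(float)
    for ts, value in minute_sums:
        epoch = int(ts.timestamp())
        # Align each sample to the start of the period that contains it.
        buckets[epoch - epoch % period_seconds] += value
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): total
        for start, total in sorted(buckets.items())
    }


if __name__ == "__main__":
    base = datetime(2023, 10, 26, 10, 0, tzinfo=timezone.utc)
    # Ten one-minute samples: quiet traffic with a spike at 10:07.
    values = [10, 10, 10, 10, 10, 10, 10, 500, 10, 10]
    samples = [(base.replace(minute=m), v) for m, v in enumerate(values)]
    for start, total in roll_up(samples).items():
        print(start.isoformat(), total)
```

In this made-up series the spike at 10:07 only shows up in the bucket that starts at 10:05, and that bucket is not complete until 10:10. A 5-minute Sum alarm therefore cannot see the spike in full until then, on top of any publishing delay.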
The next concept to explore is how to effectively use these metrics to set up automated scaling policies for provisioned throughput, or to trigger more granular error handling based on specific metric patterns.