CloudWatch alarms and dashboards are your eyes and ears into your AWS environment, but they’re not just for seeing what’s happening; they’re for predicting and reacting to it before it becomes a catastrophe.

Let’s see them in action. Imagine you’re running a web application on EC2 instances behind an Application Load Balancer (ALB). You want to know if your application is struggling under load and if the ALB is healthy.

First, the alarm. We’ll set an alarm on the ALB’s HTTPCode_Target_5XX_Count metric. This metric tells you how many HTTP 5xx errors are being returned by your target instances.

aws cloudwatch put-metric-alarm \
    --alarm-name ALB-5XX-Errors \
    --alarm-description "Alarm when ALB returns 5xx errors" \
    --metric-name HTTPCode_Target_5XX_Count \
    --namespace AWS/ApplicationELB \
    --statistic Sum \
    --period 300 \
    --threshold 10 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions '{"Name":"LoadBalancer","Value":"app/my-alb/1234567890abcdef"}' \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --treat-missing-data missing \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-sns-topic

Here’s what’s happening:

  • --alarm-name: A human-readable name for the alarm.
  • --metric-name HTTPCode_Target_5XX_Count: The specific metric we’re watching.
  • --namespace AWS/ApplicationELB: The service this metric belongs to.
  • --statistic Sum: We care about the total number of 5xx errors over the period.
  • --period 300: We’ll evaluate this metric over 5-minute intervals.
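  • --threshold 10: The value each 5-minute sum is compared against. Combined with the operator below, a period breaches when it records 10 or more 5xx errors.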
  • --comparison-operator GreaterThanOrEqualToThreshold: The condition for triggering.
  • --dimensions: Crucially, we specify which ALB we’re monitoring. The value is the final portion of the load balancer’s ARN, in the form app/<name>/<id>.
  • --evaluation-periods 2: The condition must be met for two consecutive 5-minute periods (10 minutes total) before the alarm triggers.
  • --datapoints-to-alarm 2: Out of the two evaluation periods, both must have breached the threshold.
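  • --treat-missing-data: Controls how periods with no data are evaluated. The accepted values are breaching, notBreaching, ignore, and missing (the default). With missing, empty periods are left out of the evaluation, and if every period in the window is empty the alarm moves to INSUFFICIENT_DATA. Because this metric is only reported when it is nonzero, quiet periods show up as missing data, so notBreaching is a common alternative.
  • --alarm-actions: When the alarm enters the ALARM state, it publishes a message to this SNS topic, which can deliver an email, open a PagerDuty incident, or invoke a Lambda function. To trigger Auto Scaling directly, you would list a scaling policy ARN as the alarm action instead.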
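To make the interplay between the threshold, the evaluation window, and missing data concrete, here is a small Python sketch of a single alarm evaluation. It is a simplified model, not CloudWatch’s actual algorithm: the real evaluator can also look at older data points when some periods are missing. The function name evaluate_alarm and its parameters are hypothetical and chosen only to mirror the CLI flags above.

```python
from typing import Optional, Sequence


def evaluate_alarm(
    sums: Sequence[Optional[float]],
    threshold: float = 10,
    evaluation_periods: int = 2,
    datapoints_to_alarm: int = 2,
    treat_missing_data: str = "missing",
    current_state: str = "OK",
) -> str:
    """Simplified model of one alarm evaluation (illustration only).

    `sums` holds one 5xx Sum per period, oldest first; None marks a period
    with no data. The comparison is GreaterThanOrEqualToThreshold.
    """
    # Only the most recent `evaluation_periods` data points are considered.
    window = sums[-evaluation_periods:]

    breaching = 0
    known = 0
    for value in window:
        if value is None:
            if treat_missing_data == "breaching":
                breaching += 1
                known += 1
            elif treat_missing_data == "notBreaching":
                known += 1
            # "missing" and "ignore": the period is left out of the count.
            continue
        known += 1
        if value >= threshold:
            breaching += 1

    if known == 0:
        # Nothing to evaluate: "missing" reports INSUFFICIENT_DATA,
        # "ignore" keeps whatever state the alarm was already in.
        if treat_missing_data == "missing":
            return "INSUFFICIENT_DATA"
        return current_state

    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

For example, evaluate_alarm([3, 12, 15]) returns "ALARM" because the two most recent sums (12 and 15) both reach the threshold, while evaluate_alarm([12, 4]) returns "OK" because only one of the two periods breaches.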

Now, the dashboard. Dashboards let you visualize multiple metrics in one place, giving you a consolidated view. Let’s create a dashboard showing ALB request counts, healthy host counts, and CPU utilization for our EC2 instances.

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          [ "AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234567890abcdef" ]
        ],
        "view": "timeSeries",
        "stacked": false,
        "region": "us-east-1",
        "title": "ALB Request Count",
        "period": 300,
        "stat": "Sum"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          [ "AWS/ApplicationELB", "HealthyHostCount", "TargetGroup", "targetgroup/my-targets/0123456789abcdef", "LoadBalancer", "app/my-alb/1234567890abcdef" ]
        ],
        "view": "timeSeries",
        "stacked": false,
        "region": "us-east-1",
        "title": "ALB Healthy Hosts",
        "period": 300
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          [ "AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "my-asg-name" ]
        ],
        "view": "timeSeries",
        "stacked": false,
        "region": "us-east-1",
        "title": "EC2 CPU Utilization (ASG)",
        "period": 300
      }
    }
  ]
}

You’d save this JSON content to a file (e.g., dashboard.json) and then use the AWS CLI:

aws cloudwatch put-dashboard --dashboard-name my-web-app-dashboard --dashboard-body file://dashboard.json
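Hand-written dashboard JSON gets repetitive as widgets multiply, so one option is to generate the body with a short script. The following Python sketch builds the same three widgets using only the standard library. The helper names metric_widget and build_dashboard are hypothetical, and the target group identifier is a made-up placeholder: HealthyHostCount is published per target group, so that widget needs a TargetGroup dimension alongside LoadBalancer. RequestCount is charted with the Sum statistic, since that is the meaningful statistic for a request counter.

```python
import json

# Placeholder identifiers: the ALB and ASG values come from the example above;
# TARGET_GROUP is a hypothetical value added for illustration.
ALB = "app/my-alb/1234567890abcdef"
TARGET_GROUP = "targetgroup/my-targets/0123456789abcdef"
ASG = "my-asg-name"


def metric_widget(title, metric, stat="Average", region="us-east-1", period=300):
    """Return one time-series widget for a single metric.

    `metric` is a list of the form [namespace, metric_name, dim_name, dim_value, ...].
    """
    return {
        "type": "metric",
        "properties": {
            "metrics": [metric],
            "view": "timeSeries",
            "stacked": False,
            "region": region,
            "title": title,
            "period": period,
            "stat": stat,
        },
    }


def build_dashboard():
    """Assemble the dashboard body as a plain dict."""
    return {
        "widgets": [
            metric_widget(
                "ALB Request Count",
                ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", ALB],
                stat="Sum",
            ),
            metric_widget(
                "ALB Healthy Hosts",
                ["AWS/ApplicationELB", "HealthyHostCount",
                 "TargetGroup", TARGET_GROUP, "LoadBalancer", ALB],
            ),
            metric_widget(
                "EC2 CPU Utilization (ASG)",
                ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", ASG],
            ),
        ]
    }


# Print the body so it can be redirected into dashboard.json.
print(json.dumps(build_dashboard(), indent=2))
```

If you saved this as build_dashboard.py (a hypothetical filename), running python build_dashboard.py > dashboard.json would produce a file you can pass to the put-dashboard command above.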

This dashboard gives you a live, visual representation of your application’s health. You can see request volume, how many instances are actually serving traffic, and if your servers are getting overloaded. The real power comes when you correlate these metrics. For instance, you might notice that high CPU utilization on EC2 instances often precedes an increase in ALB 5xx errors, prompting you to investigate your application’s performance or scale up your instances before the errors hit.

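A common misconception is that CloudWatch only stores data for a short time. In fact, retention depends on the granularity of the data point: high-resolution data points (periods shorter than 60 seconds) are kept for 3 hours, 1-minute data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (about 15 months). High-resolution metrics give you finer granularity, not longer retention; as data ages, CloudWatch keeps it only at the coarser granularities. To keep fine-grained data longer, or anything beyond 15 months, you’d typically export metrics to a data lake or a time-series database.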
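As a quick way to reason about retention when you query older data, the Python sketch below encodes CloudWatch’s documented retention tiers (3 hours for sub-minute data, 15 days at 1 minute, 63 days at 5 minutes, 455 days at 1 hour) and returns the finest period you could expect to retrieve for data of a given age. The function name finest_period_for_age and the RETENTION_TIERS table are hypothetical helpers for illustration, not part of any AWS SDK.

```python
from datetime import timedelta
from typing import Optional

# (maximum age, period in seconds) pairs, finest granularity first.
RETENTION_TIERS = [
    (timedelta(hours=3), 1),      # high-resolution data points (< 60 s)
    (timedelta(days=15), 60),     # 1-minute data points
    (timedelta(days=63), 300),    # 5-minute data points
    (timedelta(days=455), 3600),  # 1-hour data points
]


def finest_period_for_age(age: timedelta, high_resolution: bool = False) -> Optional[int]:
    """Return the smallest period (seconds) still retained for data of this age.

    Returns None when the data is older than the longest retention tier.
    Sub-minute periods are only considered for high-resolution metrics.
    """
    for max_age, period in RETENTION_TIERS:
        if period < 60 and not high_resolution:
            continue
        if age <= max_age:
            return period
    return None
```

For instance, finest_period_for_age(timedelta(days=30)) returns 300, which means a graph of last month’s traffic is available at 5-minute granularity at best.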

The next step is to explore anomaly detection, which automatically identifies unusual patterns in your metrics without you needing to set static thresholds.
