Datadog composite monitors don’t just let you combine simple alerts; they let you express complex conditions that often mirror real-world system states, enabling you to avoid alert fatigue and pinpoint the actual problem.
Let’s say you’re monitoring a web service and want to be alerted only when both the error rate is high and the latency is also creeping up. A simple monitor on error rate alone might fire too often during legitimate, albeit busy, periods. A composite monitor allows you to weave these conditions together.
Here’s a composite monitor setup in Datadog:
Monitor 1: High Error Rate
- Type: Metric
- Query: sum:http.request.error.count{service:my-web-app} by {host}.as_count()
- Alert when: the sum over the last 5 minutes is > 100
- Message: High error rate on {{host.name}}
Monitor 2: High Latency
- Type: Metric
- Query: avg:http.request.duration{service:my-web-app} by {host}
- Alert when: the average over the last 5 minutes is > 500 ms
- Message: High latency on {{host.name}}
Now, the Composite Monitor:
- Type: Composite
- Query: (A && B), where A is the ID of the "High Error Rate" monitor and B is the ID of the "High Latency" monitor.
- Alert when: the expression (A && B) is true, that is, both monitors are in an alert state
- Message: Critical: {{host.name}} is experiencing both high error rates and high latency!
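The three definitions above can also be expressed as payloads for Datadog's monitor API. The following is a minimal sketch, not a definitive implementation: it only builds the request bodies and sends nothing. The helper names and the monitor IDs 1111 and 2222 are hypothetical placeholders, and the "query alert" type and the 500 threshold (assuming the latency metric is reported in milliseconds) are assumptions to verify against your own account.

```python
import json
from typing import Dict


def metric_monitor(name: str, query: str, message: str) -> Dict[str, object]:
    """Build the body for a metric-based monitor (hypothetical helper)."""
    return {"name": name, "type": "query alert", "query": query, "message": message}


def composite_monitor(name: str, id_a: int, id_b: int, message: str) -> Dict[str, object]:
    """Build the body for a composite monitor that ANDs two existing monitors."""
    return {
        "name": name,
        "type": "composite",
        # In an API definition the query refers to component monitors by ID.
        "query": f"{id_a} && {id_b}",
        "message": message,
    }


error_rate = metric_monitor(
    "High Error Rate",
    "sum(last_5m):sum:http.request.error.count{service:my-web-app} by {host}.as_count() > 100",
    "High error rate on {{host.name}}",
)
latency = metric_monitor(
    "High Latency",
    "avg(last_5m):avg:http.request.duration{service:my-web-app} by {host} > 500",
    "High latency on {{host.name}}",
)

# Placeholder IDs: in practice these come back from creating the two monitors above.
composite = composite_monitor(
    "High Error Rate AND High Latency",
    1111,
    2222,
    "Critical: both high error rates and high latency on {{host.name}}",
)

print(json.dumps(composite, indent=2))
```

Building the bodies separately from sending them keeps the definitions easy to review and version before anything is created in Datadog.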
When you view this in Datadog, you’ll see the individual monitors firing (or not firing), and the composite monitor will only transition to an alert state when both Monitor 1 and Monitor 2 are in an alert state at the same time.
The power here lies in the logical operators. You can use && (AND), || (OR), and ! (NOT) to build intricate alert logic. For instance, you might want an alert if:
- High error rate AND high latency: (A && B), as above. Indicates a genuine problem.
- High error rate OR high latency: (A || B). Alerts if either condition is met, useful for catching potential issues early.
- High error rate BUT NOT high latency: (A && !B). This is more niche, but could signal a specific type of failure where errors are occurring but not impacting overall response times (e.g., background processing failures).
- Low traffic AND high error rate: This is a classic. You might have a monitor for sum:http.request.error.count{service:my-web-app} by {host}.as_count() > 5 (Monitor A) and another for sum:http.request.count{service:my-web-app} by {host}.as_count() < 10 (Monitor B). The composite (A && B) would alert you if errors are spiking relative to very low traffic, suggesting a fundamental issue rather than just load.
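To see how these expressions differ, it can help to enumerate every combination of component states. The sketch below is a simplified model, not Datadog's evaluator: it treats each monitor as either alerting (True) or OK (False) and ignores Warn and No Data states. The names EXPRESSIONS and truth_table are illustrative only.

```python
from itertools import product

# Each composite expression modeled as a plain boolean function of A and B.
EXPRESSIONS = {
    "A && B": lambda a, b: a and b,
    "A || B": lambda a, b: a or b,
    "A && !B": lambda a, b: a and not b,
}


def truth_table():
    """Return one row per (A, B) combination with the result of each expression."""
    rows = []
    for a, b in product([False, True], repeat=2):
        rows.append((a, b, {expr: fn(a, b) for expr, fn in EXPRESSIONS.items()}))
    return rows


for a, b, results in truth_table():
    cells = "  ".join(f"{expr}: {result}" for expr, result in results.items())
    print(f"A={a!s:5} B={b!s:5} {cells}")
```

Only the row where A is alerting and B is OK satisfies (A && !B), which is why that expression isolates errors that do not affect response times.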
Evaluation windows deserve attention here. A composite monitor does not query metrics over a window of its own; it combines the statuses that its component monitors report, so the time windows that matter are the ones configured on Monitor A and Monitor B. If you want the composite to mean that both conditions held over the same stretch of time, give the component monitors evaluation windows of the same length.
The most surprising thing about composite monitors is how they can drastically reduce alert noise by filtering out transient or single-point failures. Instead of getting 10 alerts about high latency and 5 alerts about high error rates, you get one consolidated alert when the combination that truly signifies a problem occurs. This makes your on-call engineers more effective and less prone to "alert fatigue."
When troubleshooting a composite monitor that isn’t firing as expected, always check the state of the individual monitors first. Datadog’s UI makes this easy by showing the status of A, B, C, etc., right alongside the composite status. If A is alerting but B is not, the (A && B) composite will remain resolved.
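The same check can be scripted once you have the component statuses in hand. This is a small sketch under stated assumptions: the status strings ("Alert", "OK") and the function name blocking_monitors are illustrative, and fetching the statuses from Datadog is left out.

```python
from typing import Dict, List


def blocking_monitors(states: Dict[str, str]) -> List[str]:
    """Return the components that are not alerting.

    For an AND composite, any component in this list keeps the
    composite from firing.
    """
    return [name for name, state in states.items() if state != "Alert"]


# Hypothetical snapshot of component statuses.
snapshot = {"A": "Alert", "B": "OK"}
print(blocking_monitors(snapshot))
```

An empty list means every component is alerting, so an (A && B) composite that is still quiet points to a problem in the composite's own configuration.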
A common pitfall is overlooking the impact of different time windows and aggregation methods between your base monitors and the composite. If Monitor A evaluates errors over 5 minutes and Monitor B evaluates latency over 1 minute, your composite (A && B) might trigger based on these disparate timeframes, leading to unexpected behavior. Always ensure your base monitors are aligned in their evaluation periods and aggregation logic when combined with AND operators.
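A small simulation makes the effect of mismatched windows concrete. This is a simplified model rather than Datadog's actual evaluation: both monitors are treated as rolling averages over per-minute samples, the data is made up, and the composite is modeled as a per-minute AND of the two statuses.

```python
from typing import List


def window_status(values: List[float], window: int, threshold: float) -> List[bool]:
    """True at minute i when the mean of the last `window` samples exceeds threshold."""
    statuses = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        statuses.append(sum(recent) / len(recent) > threshold)
    return statuses


def composite_and(a: List[bool], b: List[bool]) -> List[bool]:
    """Per-minute AND of two component status series."""
    return [x and y for x, y in zip(a, b)]


# Hypothetical per-minute samples: sustained errors, one brief latency spike.
errors = [150.0] * 10
latency = [200.0, 200.0, 200.0, 200.0, 900.0, 200.0, 200.0, 200.0, 200.0, 200.0]

a = window_status(errors, window=5, threshold=100)
b_1m = window_status(latency, window=1, threshold=500)  # reacts to the single spike
b_5m = window_status(latency, window=5, threshold=500)  # spike is averaged away

misaligned = composite_and(a, b_1m)
aligned = composite_and(a, b_5m)

print("5m errors AND 1m latency fires at minutes:", [i for i, s in enumerate(misaligned) if s])
print("5m errors AND 5m latency fires at minutes:", [i for i, s in enumerate(aligned) if s])
```

With the 1-minute latency window, the single spike is enough to fire the composite; with matching 5-minute windows the spike is averaged out and the composite stays quiet.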
The next step after mastering composite monitors is exploring scheduled downtimes and notification channels to fine-tune how and when these complex alerts are communicated.