Datadog alert flapping is often a symptom, not the disease, and understanding the underlying configuration is key to true resolution.
Let’s say you have a metric, system.cpu.user, that’s normally around 5%. Suddenly, it spikes to 95%, triggers an alert, then drops back to 6%, and the alert resolves. This cycle repeats every few minutes. This is flapping.
# Example of a flapping alert in Datadog's UI
alert_name: High CPU Usage
metric: system.cpu.user
query: avg(last_5m):avg:system.cpu.user{host:webserver*} > 80
threshold: 80
evaluation_delay: 60
no_data_timeframe: 15
notify_no_data: true
notify_audit: false
The core issue is that the metric’s value is oscillating around the alert threshold frequently enough to trigger and resolve the alert within a short period, often due to noisy data or transient spikes.
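To make "flapping" measurable, the sketch below counts how often a monitor would change state for a given threshold. It is a simplified model, not Datadog's evaluation engine: each value in the hypothetical `cpu` list stands for the already-aggregated result of one evaluation, and `count_flaps` is a helper name invented for this illustration.

```python
def count_flaps(values, threshold):
    """Count state changes (OK <-> ALERT) across consecutive evaluations."""
    states = [v > threshold for v in values]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical evaluated values (percent CPU), one per monitor evaluation.
cpu = [5, 95, 6, 94, 7, 96, 5, 6]

print(count_flaps(cpu, 80))
```

A high count over a short span is the signature of flapping; a monitor that triggers once and stays triggered would produce a count of 1.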
Common Causes and Fixes
- Metric Volatility: The metric itself is inherently noisy and fluctuates rapidly.
  - Diagnosis: Graph the raw metric over a longer period (e.g., 1 hour) to observe its natural variability. Look for quick up-and-down movements.
  - Fix: Use a longer evaluation window in your alert query. Instead of avg(last_5m), try avg(last_15m). This smooths out short-term spikes.
      # Modified query
      query: avg(last_15m):avg:system.cpu.user{host:webserver*} > 80
  - Why it works: A longer evaluation window averages over more data points, so a brief spike is diluted and the metric has to stay elevated for a sustained period before the average crosses the threshold.
- Aggregator Choice: Using avg on a metric that is reported at a very high frequency might still be too sensitive.
  - Diagnosis: Check how frequently the metric is reported. If it’s every 10 seconds, an avg(last_5m) could still be sensitive to rapid changes.
  - Fix: Choose the time aggregator that matches the condition you care about. max fires if any single point in the window exceeds the threshold, which tends to make flapping worse; min fires only when every point in the window exceeds it. Datadog monitors do not offer median as a time aggregator; to damp outliers, a rolling-median smoothing function such as median_5() can be applied to the series instead.
      # Example using min: alert only if CPU stays above 80 for the whole window
      query: min(last_5m):avg:system.cpu.user{host:webserver*} > 80
  - Why it works: With min, a single point at or below the threshold keeps the alert from triggering, so brief spikes are ignored. max, by contrast, reacts to the single highest value in the window.
- evaluation_delay Too Short: This setting (in seconds) shifts the evaluation window back in time so that the monitor does not evaluate the most recent, possibly incomplete, data. If it’s too short, the alert might trigger on data that is still "in-flight."
  - Diagnosis: Review the evaluation_delay setting. If your metric collection interval is, say, 1 minute, or the metric arrives late (as is common with cloud-integration metrics), an evaluation_delay of 60 seconds leaves little margin for late data points.
  - Fix: Increase evaluation_delay. A common practice is to set it to at least two collection intervals.
      evaluation_delay: 120  # Seconds; evaluate data that is at least 2 minutes old
  - Why it works: This ensures that the alert condition is evaluated on a more complete set of data for that time period, reducing sensitivity to transient states.
- no_data_timeframe Misconfiguration (and notify_no_data): If an alert is configured with notify_no_data: true and the no_data_timeframe is too short, it can cause flapping if data temporarily stops arriving or is delayed.
  - Diagnosis: Check no_data_timeframe and notify_no_data. Note that no_data_timeframe is expressed in minutes. If the metric is expected every 5 minutes, but no_data_timeframe is 5 minutes, a 1-minute delay could trigger a "no data" alert, which then resolves when data reappears.
  - Fix: Set no_data_timeframe to a value significantly longer than your metric’s collection interval (e.g., 2x or 3x). If you don’t want "no data" alerts at all, set notify_no_data: false instead.
      no_data_timeframe: 15  # Minutes; 3x a 5-minute collection interval
      # Or, to disable no-data notifications entirely:
      notify_no_data: false
  - Why it works: A longer no_data_timeframe prevents alerts from firing on brief network hiccups or collection delays. Disabling notify_no_data removes this class of flapping entirely.
- Alert Threshold Too Close to Normal: The threshold is set too close to the typical operating range of the metric, making it easy to cross and recross.
  - Diagnosis: Graph the metric’s historical data and visually inspect where the threshold lies relative to the typical distribution.
  - Fix: Increase the threshold to a level that represents a true anomaly. For example, if CPU is normally 5-10% and spikes to 20% briefly, then back to 7%, a 15% threshold is too sensitive. Raise it to 70% or 80%.
      threshold: 80  # Well above the normal 5-10% range
  - Why it works: A higher threshold requires a more significant, sustained deviation from normal operation to trigger an alert, filtering out minor fluctuations.
- Complex Alert Logic: When combining multiple metrics or conditions with AND/OR, the interaction can sometimes lead to flapping if one component of the condition becomes unstable.
  - Diagnosis: Break down complex alerts into simpler ones. Monitor each individual metric’s behavior.
  - Fix: Simplify the alert query or adjust the evaluation windows and thresholds for each component metric independently.
      # Example of a complex condition that might need simplification
      # (illustrative syntax; in Datadog this is built as a composite monitor
      # that combines existing monitors by ID with && and ||)
      query: (avg(last_5m):avg:system.cpu.user{host:webserver*} > 80) AND (avg(last_5m):avg:system.mem.free{host:webserver*} < 1024)
  - Why it works: Isolating and stabilizing each part of a composite alert reduces the likelihood of unintended interactions causing flapping.
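The effect of the window and aggregator fixes above can be checked offline before touching a monitor. The following is a rough simulation under stated assumptions, not a reproduction of Datadog's evaluation logic: the series is synthetic (one sample per minute, a 4-minute burst at 100% every 10 minutes, 5% otherwise), each trailing window is evaluated once per minute, and `rolling`, `mean`, and `count_flaps` are helper names made up for this sketch.

```python
def rolling(values, window, agg):
    """Apply agg to every full trailing window of `window` points."""
    return [agg(values[i - window + 1 : i + 1])
            for i in range(window - 1, len(values))]


def mean(xs):
    return sum(xs) / len(xs)


def count_flaps(values, threshold):
    """Count state changes (OK <-> ALERT) across consecutive evaluations."""
    states = [v > threshold for v in values]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Synthetic CPU series: 60 one-minute samples with a burst every 10 minutes.
samples = [100 if i % 10 < 4 else 5 for i in range(60)]

THRESHOLD = 80
flaps = {
    "avg(last_5m)": count_flaps(rolling(samples, 5, mean), THRESHOLD),
    "avg(last_15m)": count_flaps(rolling(samples, 15, mean), THRESHOLD),
    "min(last_5m)": count_flaps(rolling(samples, 5, min), THRESHOLD),
}

for name, n in flaps.items():
    print(f"{name}: {n} state changes")
```

On this particular series the 5-minute average changes state 11 times in an hour, while the 15-minute average and the 5-minute min never trigger. Real metrics will behave differently, so it is worth replaying a sample of your own data through the same comparison.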
After addressing these common causes, you might find that your alerts are now too quiet. The next problem you’ll encounter is an alert that should fire but doesn’t, due to the same smoothing and delay mechanisms you just implemented.
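The cost of that smoothing can be estimated with simple arithmetic. The sketch below assumes one sample per minute and a clean step from a 5% baseline to a sustained 95%, then computes how many elevated samples a trailing average needs before it exceeds the threshold. `minutes_to_detect` is a hypothetical helper for this illustration, and any evaluation_delay adds to the result.

```python
def minutes_to_detect(window, threshold, baseline, elevated):
    """Return the number of consecutive elevated one-minute samples needed
    before the trailing `window`-point average exceeds `threshold`,
    or None if it never does."""
    for k in range(1, window + 1):
        avg = (elevated * k + baseline * (window - k)) / window
        if avg > threshold:
            return k
    return None


# Step from 5% to a sustained 95% CPU, alerting above 80%.
fast = minutes_to_detect(window=5, threshold=80, baseline=5, elevated=95)
slow = minutes_to_detect(window=15, threshold=80, baseline=5, elevated=95)

print(f"avg(last_5m) needs {fast} elevated minutes")
print(f"avg(last_15m) needs {slow} elevated minutes")
```

Under these assumptions the 5-minute window needs 5 elevated minutes and the 15-minute window needs 13, which is the kind of trade-off to weigh when lengthening a window to stop flapping.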