Datadog alert flapping is often a symptom, not the disease, and understanding the underlying configuration is key to true resolution.

Let’s say you have a metric, system.cpu.user, that’s normally around 5%. Suddenly, it spikes to 95%, triggers an alert, then drops back to 6%, and the alert resolves. This cycle repeats every few minutes. This is flapping.

# Example of a flapping alert in Datadog's UI
alert_name: High CPU Usage
metric: system.cpu.user
query: avg(last_5m):avg:system.cpu.user{host:webserver*} > 80
threshold: 80
evaluation_delay: 60
no_data_timeframe: 15
notify_no_data: true
notify_audit: false
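
The settings above can also be written as a payload for the Datadog Monitors API. The following is a minimal sketch, assuming the API's usual field names (`options.thresholds.critical`, `evaluation_delay` in seconds, `no_data_timeframe` in minutes). The helpers `build_monitor` and `query_threshold` are hypothetical names introduced for illustration, and nothing is sent over the network.

```python
import json
import re


def build_monitor(threshold: float = 80.0) -> dict:
    """Assemble a monitor definition as a plain dict (hypothetical helper)."""
    query = f"avg(last_5m):avg:system.cpu.user{{host:webserver*}} > {threshold:g}"
    return {
        "name": "High CPU Usage",
        "type": "query alert",
        "query": query,
        "message": "CPU usage is high",
        "options": {
            "thresholds": {"critical": threshold},
            "evaluation_delay": 60,   # seconds
            "no_data_timeframe": 15,  # minutes
            "notify_no_data": True,
            "notify_audit": False,
        },
    }


def query_threshold(query: str) -> float:
    """Extract the numeric threshold at the end of a monitor query."""
    match = re.search(r"[<>]=?\s*(-?\d+(?:\.\d+)?)\s*$", query)
    if match is None:
        raise ValueError(f"no threshold found in query: {query!r}")
    return float(match.group(1))


monitor = build_monitor()
print(json.dumps(monitor, indent=2))
```

Keeping the threshold in one place and deriving the query from it avoids the query and `options.thresholds.critical` drifting apart when the threshold is tuned later.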

The core issue is that the metric’s value is oscillating around the alert threshold frequently enough to trigger and resolve the alert within a short period, often due to noisy data or transient spikes.
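
One way to put a number on flapping is to count how often a series crosses the threshold. The sketch below is illustrative only: `count_transitions` is a hypothetical helper, and the sample readings are made up to match the scenario described earlier.

```python
from typing import Sequence


def count_transitions(values: Sequence[float], threshold: float) -> int:
    """Count how many times the alert state (value > threshold) changes."""
    states = [v > threshold for v in values]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical per-minute CPU readings: mostly idle, with repeated spikes.
samples = [5, 95, 6, 5, 95, 6, 5, 95, 6]
steady = [5, 6, 5, 7, 6, 5]

print(count_transitions(samples, threshold=80))
print(count_transitions(steady, threshold=80))
```

Each spike contributes two transitions (trigger, then resolve), so a high count over a short period is a reasonable signal that the alert is flapping rather than reporting a sustained problem.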

Common Causes and Fixes

  1. Metric Volatility: The metric itself is inherently noisy and fluctuates rapidly.

    • Diagnosis: Graph the raw metric over a longer period (e.g., 1 hour) to observe its natural variability. Look for quick up-and-down movements.
    • Fix: Use a longer evaluation window in your alert query. Instead of avg(last_5m), try avg(last_15m), which averages short-term spikes over more data points.
      # Modified query
      query: avg(last_15m):avg:system.cpu.user{host:webserver*} > 80
      
      • Why it works: With a longer evaluation window, a brief spike contributes a smaller share of the average, so the condition is met only when the metric stays elevated for most of the window.
  2. Aggregator Choice: Using avg on a metric that is reported at a very high frequency might still be too sensitive.

    • Diagnosis: Check how frequently the metric is reported. If it’s every 10 seconds, an avg(last_5m) could still be sensitive to rapid changes.
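    • Fix: Choose the time aggregator to match the intent. Use min if the alert should fire only when every point in the window is above the threshold; this is the option least prone to flapping. Use max only if any single high value matters, because it is the most sensitive option. The standard aggregators in metric monitor queries are avg, sum, min, and max; median is not one of them.
      # Example using min as the time aggregator
      query: min(last_5m):avg:system.cpu.user{host:webserver*} > 80
      
      • Why it works: With min, the monitor triggers only if the lowest value in the window exceeds the threshold, so brief dips and isolated spikes do not flip the alert state. With max, a single point above the threshold is enough to trigger.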
  3. evaluation_delay Too Short: This setting, in seconds, shifts the evaluation window back in time so that late-arriving data points are included. If it’s too short, the alert might be evaluated on a window whose most recent data is still "in-flight."

    • Diagnosis: Review the evaluation_delay setting. If your metric collection interval is, say, 1 minute, an evaluation_delay of 60 seconds leaves only one interval of slack, so a single late data point can be missing when the monitor evaluates.
    • Fix: Increase evaluation_delay. A common practice is to set it to at least one or two collection intervals.
      evaluation_delay: 120 # Evaluate only data that is at least 2 minutes old
      
      • Why it works: This ensures that the alert condition is evaluated on a more complete set of data for that time period, reducing sensitivity to transient states.
  4. no_data_timeframe Misconfiguration (and notify_no_data): If an alert is configured with notify_no_data: true and the no_data_timeframe (specified in minutes) is too short, it can flap whenever data temporarily stops arriving or is delayed.

    • Diagnosis: Check no_data_timeframe and notify_no_data. If the metric is expected every 5 minutes, but no_data_timeframe is 5 minutes, a 1-minute delay could trigger a "no data" alert, which then resolves when data reappears.
    • Fix: Set no_data_timeframe to a value significantly longer than your metric’s collection interval (e.g., 2x or 3x). If you don’t want "no data" alerts, set notify_no_data: false.
      no_data_timeframe: 15 # Minutes; 3x a 5-minute collection interval
      notify_no_data: false # Alternatively, disable no-data notifications entirely
      
      • Why it works: A longer no_data_timeframe prevents alerts from firing on brief network hiccups or collection delays. Disabling notify_no_data removes this class of flapping entirely.
  5. Alert Threshold Too Close to Normal: The threshold is set too close to the typical operating range of the metric, making it easy to cross and recross.

    • Diagnosis: Graph the metric’s historical data and visually inspect where the threshold lies relative to the typical distribution.
    • Fix: Increase the threshold to a level that represents a true anomaly. For example, if CPU is normally 5-10% and briefly spikes to 20% before returning to 7%, a threshold just above the normal range (say, 15%) is too sensitive. Raise it to 70% or 80%.
      threshold: 80 # Well above the normal 5-10% range
      
      • Why it works: A higher threshold requires a larger deviation from normal operation to trigger an alert, so minor fluctuations never cross it.
  6. Complex Alert Logic: When combining multiple metrics or conditions with AND/OR, the interaction can sometimes lead to flapping if one component of the condition becomes unstable.

    • Diagnosis: Break down complex alerts into simpler ones. Monitor each individual metric’s behavior.
    • Fix: Simplify the alert query or adjust the evaluation windows and thresholds for each component metric independently.
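      # Example of a complex condition that might need simplification (illustrative only: a Datadog composite monitor combines existing monitors by ID, in the form a && b, rather than embedding metric queries)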
      query: (avg(last_5m):avg:system.cpu.user{host:webserver*} > 80) AND (avg(last_5m):avg:system.mem.free{host:webserver*} < 1024)
      
      • Why it works: Isolating and stabilizing each part of a composite alert reduces the likelihood of unintended interactions causing flapping.
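
The effect of the evaluation window and the aggregator (causes 1 and 2) can be checked offline before changing a monitor. The following is a rough simulation, not Datadog's actual evaluation engine: it assumes one data point per minute, evaluates once per point over a trailing window, and uses two made-up series. The helper names `rolling`, `count_transitions`, and `flaps` are hypothetical.

```python
from statistics import fmean
from typing import Callable, List, Sequence


def rolling(values: Sequence[float], window: int,
            agg: Callable[[Sequence[float]], float]) -> List[float]:
    """Apply agg to every full trailing window of `window` points."""
    return [agg(values[i - window + 1 : i + 1])
            for i in range(window - 1, len(values))]


def count_transitions(values: Sequence[float], threshold: float) -> int:
    """Count how many times the alert state (value > threshold) changes."""
    states = [v > threshold for v in values]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def flaps(values: Sequence[float], window: int,
          agg: Callable[[Sequence[float]], float],
          threshold: float = 80.0) -> int:
    """State changes seen when alerting on agg(last `window` points) > threshold."""
    return count_transitions(rolling(values, window, agg), threshold)


# Hypothetical series 1: idle at 5%, with a 5-minute burst to 95% every 20 minutes.
bursty = ([5] * 10 + [95] * 5 + [5] * 5) * 3
# Hypothetical series 2: a busy host alternating between 70% and 90%.
hovering = [70, 90] * 10

print("bursty,   avg over 5: ", flaps(bursty, 5, fmean))
print("bursty,   avg over 15:", flaps(bursty, 15, fmean))
print("hovering, avg over 5: ", flaps(hovering, 5, fmean))
print("hovering, min over 5: ", flaps(hovering, 5, min))
```

In this toy data, widening the window removes the flapping caused by short bursts, while switching from avg to min removes the flapping caused by a metric hovering around the threshold. Whether the quieter monitor is the right one still depends on whether those bursts or that hovering should page someone.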

After addressing these common causes, you might find that your alerts are now too quiet. The next problem you’ll encounter is an alert that should fire but doesn’t, due to the same smoothing and delay mechanisms you just implemented.
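
The cost of smoothing can be estimated with simple arithmetic. The sketch below assumes one data point per minute and a metric that jumps from a baseline to an elevated level and stays there; `points_until_alert` is a hypothetical helper, and the numbers reuse the 5% and 95% values from the earlier scenario.

```python
from typing import Optional


def points_until_alert(window: int, baseline: float, elevated: float,
                       threshold: float) -> Optional[int]:
    """Return how many consecutive elevated points are needed before the
    trailing-window average exceeds the threshold, or None if it never does."""
    for k in range(1, window + 1):
        avg = (k * elevated + (window - k) * baseline) / window
        if avg > threshold:
            return k
    return None


for window in (5, 15, 30):
    needed = points_until_alert(window, baseline=5, elevated=95, threshold=80)
    print(f"avg over last {window} points: alert after {needed} elevated points")
```

With these inputs, the required number of elevated points grows with the window, and any evaluation_delay is added on top of that. This is the trade-off to weigh when lengthening a window to stop flapping.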
