Detect Anomalies in Datadog Metrics with Machine Learning Monitors (2026)

Datadog’s anomaly detection monitors do more than flag spikes: they model how a metric normally behaves, predict the range its values should fall in, and alert you when the observed values deviate significantly from that prediction.

Imagine you’re monitoring the number of active users on your web application. You’d expect this number to fluctuate throughout the day, with peaks during business hours and dips overnight. A simple static threshold might trigger an alert every night, which is noisy and unhelpful. An anomaly detection monitor, however, learns the pattern of your user traffic. It knows that user count typically drops by 80% between 1 AM and 5 AM. If, on a particular night, the user count only drops by 50%, the monitor will flag this as an anomaly because it’s an unusual deviation from the learned daily pattern, even though it’s still a low absolute number.
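The contrast can be made concrete with a few lines of Python. This is only a toy sketch: the hourly profile, the static threshold, and the tolerance are made-up numbers chosen to match the example above, and the comparison rule is far simpler than anything Datadog actually runs.

```python
# Hypothetical learned profile: typical active users for each hour of the day.
# Daytime sits around 10,000; between 1 AM and 5 AM it drops by 80% to 2,000.
DAYTIME_USERS = 10_000
TYPICAL_BY_HOUR = {h: (2_000 if 1 <= h < 5 else DAYTIME_USERS) for h in range(24)}

STATIC_THRESHOLD = 6_000  # "alert if active users < 6000" (assumed value)
TOLERANCE = 0.5           # allowed relative deviation from the typical value (assumed)


def static_alert(value):
    """Fire whenever the value is below a fixed line."""
    return value < STATIC_THRESHOLD


def pattern_alert(hour, value):
    """Fire when the value is far from what is typical for this hour."""
    expected = TYPICAL_BY_HOUR[hour]
    return abs(value - expected) / expected > TOLERANCE


normal_night = 2_000   # the usual 80% drop
unusual_night = 5_000  # only a 50% drop

for label, value in [("normal night", normal_night), ("unusual night", unusual_night)]:
    print(label, "static:", static_alert(value), "pattern:", pattern_alert(3, value))
```

The static rule fires on both nights, so it cannot tell them apart. The pattern-aware rule stays quiet on the normal night and fires only on the night when traffic fails to drop as far as usual.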

Let’s see this in action with a hypothetical metric: web.requests.count.

# Simulate receiving metric data
import time
import random

def send_metric(metric_name, value):
    print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] Sending metric: {metric_name} = {value}")
    # In a real scenario, this would go to Datadog's API

# Simulate a typical weekday traffic pattern
def simulate_traffic():
    current_hour = time.localtime().tm_hour
    if 9 <= current_hour < 17: # Business hours
        # Simulate higher traffic with some noise
        return random.randint(5000, 10000)
    else: # Off-peak hours
        # Simulate lower traffic with less noise
        return random.randint(500, 2000)

# Simulate an anomaly
def simulate_anomaly():
    current_hour = time.localtime().tm_hour
    if 9 <= current_hour < 17: # Business hours
        # Simulate a sudden drop during peak hours
        return random.randint(500, 1500)
    else: # Off-peak hours
        # Simulate a sudden spike during off-peak hours
        return random.randint(4000, 7000)

# Example execution loop (run this for a few minutes)
if __name__ == "__main__":
    print("Starting metric simulation. Press Ctrl+C to stop.")
    try:
        while True:
            # Randomly decide whether to send normal traffic or an anomaly
            if random.random() < 0.05: # 5% chance of anomaly
                send_metric("web.requests.count", simulate_anomaly())
            else:
                send_metric("web.requests.count", simulate_traffic())
            time.sleep(60) # Send a metric every minute
    except KeyboardInterrupt:
        print("\nStopping simulation.")

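If you were to feed this output into Datadog for long enough, an anomaly detection monitor configured for web.requests.count would learn the daily cycle produced by the simulate_traffic function. A few minutes of data is not enough for that: seasonal algorithms need several full cycles of history, and Datadog’s documentation asks for at least three times the chosen seasonality (roughly three days for a daily pattern). Once the pattern is learned, values produced by simulate_anomaly fall well outside the expected range. Whether a single anomalous point triggers an alert depends on how long the monitor requires the metric to stay out of range.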
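The simulation only prints values. To get them into Datadog you would submit each point through the Agent (DogStatsD) or the metrics API. The sketch below builds the kind of JSON body the v1 series endpoint (POST /api/v1/series, authenticated with a DD-API-KEY header) is understood to accept. Treat the field layout as an assumption to check against the current API reference. The helper name build_series_payload and the env:demo tag are hypothetical, and no network call is made here.

```python
import json
import time


def build_series_payload(metric_name, value, timestamp=None, tags=None):
    """Build a JSON-serializable body holding one gauge point.

    Assumed layout: {"series": [{"metric", "type", "points", "tags"}]},
    where each point is a [unix_seconds, value] pair.
    """
    ts = int(timestamp if timestamp is not None else time.time())
    return {
        "series": [
            {
                "metric": metric_name,
                "type": "gauge",
                "points": [[ts, value]],
                "tags": list(tags or []),
            }
        ]
    }


payload = build_series_payload(
    "web.requests.count", 7421, timestamp=1_700_000_000, tags=["env:demo"]
)
print(json.dumps(payload, indent=2))
```

In the simulation, send_metric could build this payload and hand it to an HTTP client of your choice instead of printing.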

The core problem anomaly detection solves is identifying deviations from normal behavior without needing to pre-define what "normal" looks like in absolute terms. Traditional alerting relies on static thresholds (e.g., "alert if requests < 1000") or simple moving averages. These are brittle; they miss subtle issues that don’t cross a hard line but represent a significant change in trend, and they generate false positives when normal, predictable variations occur.

Anomaly detection monitors in Datadog analyze historical metric data with one of three algorithms. Basic uses a lagging rolling quantile computation and does not account for seasonality. Agile is a robust variant of the seasonal ARIMA (SARIMA) family that adjusts quickly to level shifts. Robust is a seasonal-trend decomposition that is more stable and slower to react to level shifts. The seasonal algorithms build a model that captures seasonality, trends, and other cyclical patterns inherent in the data, and use it to predict the expected value of the metric for the current time period. The monitor then compares the actual observed value to this prediction. Points that fall outside a band around the prediction, whose width you set as a number of deviations (commonly 2 or 3), count as anomalous, and an alert is triggered when the metric stays outside the band over the monitor’s trigger window.
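The predict-then-compare loop can be sketched in a few lines. This is deliberately not one of Datadog’s algorithms: it is a minimal stand-in that uses the per-hour mean as the "prediction" and the per-hour standard deviation as the band width, with made-up history, to show the shape of the idea.

```python
from statistics import mean, stdev


def fit_hourly_baseline(history):
    """Group (hour_of_day, value) samples by hour and return {hour: (mean, std)}."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    # stdev needs at least two samples per hour
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(baseline, hour, value, deviations=3.0):
    """True when value lies more than `deviations` std devs from the hourly mean."""
    expected, spread = baseline[hour]
    return abs(value - expected) > deviations * spread


# Made-up history: three days of samples at 10:00 and at 03:00.
history = [(10, 7400), (10, 7600), (10, 7500), (3, 1150), (3, 1250), (3, 1200)]
baseline = fit_hourly_baseline(history)

print(is_anomalous(baseline, 10, 7600))  # within the band for 10:00
print(is_anomalous(baseline, 10, 1000))  # sudden drop during peak hours
print(is_anomalous(baseline, 3, 5500))   # sudden spike during off-peak hours
```

Lowering the deviations argument narrows the band and flags more points. Raising it widens the band and flags fewer.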

The key levers you control are:

Metric Selection: The specific metric you want to monitor. Choose metrics that indicate the health or performance of a system component.
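Detection Method: Within an anomaly monitor you choose one of three algorithms (Basic, Agile, or Robust). Basic suits metrics with no repeating pattern; Agile and Robust suit seasonal metrics and differ mainly in how quickly they adapt to a shift in level. Datadog also offers related monitor types, such as forecast monitors, which alert when a metric is predicted to cross a threshold, and outlier monitors, which flag a group member that behaves differently from its peers.
Sensitivity: This parameter controls how far the actual value must deviate from the predicted value before it counts as anomalous. In Datadog it is the bounds (deviations) setting, which sets the width of the expected band. A smaller number narrows the band and produces more alerts for smaller deviations; a larger number widens it and produces fewer alerts, requiring larger deviations. Values of 2 or 3 are common starting points.
Alerting Threshold (Optional): You can combine anomaly detection with traditional thresholds. For instance, "alert if the metric is anomalous AND the value is below 500." In Datadog this is typically done with a composite monitor that joins an anomaly monitor and a threshold monitor.
Evaluation Window: How long the metric must stay outside the expected band before the monitor alerts. A short window reacts quickly but is noisier; a longer window suppresses brief blips. This is separate from how much history the model learns from, which follows from the chosen seasonality. More history captures broader seasonality (e.g., weekly cycles) but might be slower to adapt to sudden, permanent shifts in behavior.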
Advanced Options: For more complex scenarios, you can specify seasonality (e.g., daily, weekly) and whether to automatically update the model.
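These levers come together in the monitor’s query string. The sketch below assembles one in the shape shown in Datadog’s anomaly monitor API examples, then wraps it in a monitor definition. The helper build_anomaly_query is hypothetical, and the argument names (direction, interval, alert_window, count_default_zero, seasonality) and the threshold_windows option should be treated as assumptions to verify against the current API reference before use.

```python
import json


def build_anomaly_query(metric, scope="*", algorithm="agile", deviations=2,
                        direction="both", window="last_15m", interval=60,
                        seasonality="daily", history="last_4h"):
    """Assemble an anomaly monitor query string from its parts.

    Note: the seasonality argument is assumed to apply to 'agile' and
    'robust' only; 'basic' does not model seasonality.
    """
    call = (
        f"anomalies(avg:{metric}{{{scope}}}, '{algorithm}', {deviations}, "
        f"direction='{direction}', interval={interval}, "
        f"alert_window='{window}', count_default_zero='true', "
        f"seasonality='{seasonality}')"
    )
    return f"avg({history}):{call} >= 1"


# Hypothetical monitor definition; field names are assumptions (see lead-in).
monitor = {
    "name": "Anomalous request volume (example)",
    "type": "query alert",
    "query": build_anomaly_query("web.requests.count"),
    "message": "web.requests.count is outside its expected range.",
    "options": {
        "thresholds": {"critical": 1.0},
        "threshold_windows": {
            "trigger_window": "last_15m",
            "recovery_window": "last_15m",
        },
    },
}

print(json.dumps(monitor, indent=2))
```

Here deviations is the sensitivity lever, algorithm and seasonality are the detection method and advanced options, and window is how long the metric must stay out of range before the monitor fires.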

A common misunderstanding is that anomaly detection magically knows what "good" is. It doesn’t. It learns what is typical for that specific metric based on the historical data you provide. If your historical data includes periods of performance degradation or unusual behavior, the ML model will learn those as part of the "normal" pattern. This is why it’s crucial to have a clean baseline of healthy operational data for the monitor to learn from. If you’ve been running with a known issue for weeks, the anomaly detector might not flag that issue as anomalous because it’s become the established norm.
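The effect of a contaminated baseline is easy to reproduce. In this sketch the metric is a hypothetical request latency in milliseconds, the samples are made up, and the mean plus or minus three standard deviations band is a simple stand-in for a real model.

```python
from statistics import mean, stdev


def expected_band(training, deviations=3.0):
    """Return (low, high) bounds learned from the training samples."""
    center, spread = mean(training), stdev(training)
    return center - deviations * spread, center + deviations * spread


def flagged(training, value):
    """True when value falls outside the band learned from training."""
    low, high = expected_band(training)
    return not (low <= value <= high)


healthy = [120, 125, 118, 122, 121, 119]   # latency while the service is healthy
degraded = [480, 510, 495, 505, 490, 500]  # latency during a weeks-long known issue

print(flagged(healthy, 500))   # trained on healthy data: 500 ms stands out
print(flagged(degraded, 500))  # trained on degraded data: 500 ms looks normal
print(flagged(degraded, 120))  # ...and a healthy 120 ms now looks anomalous
```

A model trained on the degraded period treats the degradation as normal and may even flag the recovery.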

Once you’ve successfully configured anomaly detection monitors and are receiving alerts for unusual deviations, the next logical step is to integrate these alerts into automated remediation workflows.