Datadog’s Watchdog is designed to surface issues you’d otherwise have to write monitors for. It does this by continuously analyzing your time-series data for unusual patterns. As a result, it often catches subtle, emergent issues before they become critical outages, simply by noticing deviations from normal behavior that humans might miss or dismiss as noise.

Let’s see Watchdog in action. Imagine you have a web service, and you’re tracking its request latency.

# Example: Request Latency for `my-web-service`
# This is a simplified representation of what Datadog might show.

# Normal behavior (e.g., P95 latency over 1 hour)
# Time        | P95 Latency (ms)
# --------------------------------
# 10:00 AM    | 150
# 10:05 AM    | 155
# 10:10 AM    | 148
# ...
# 10:55 AM    | 152

# Then, an anomaly is detected:
# Time        | P95 Latency (ms)
# --------------------------------
# 11:00 AM    | 153
# 11:05 AM    | 156
# 11:10 AM    | 350  <-- Watchdog flags this!
# 11:15 AM    | 365  <-- Watchdog flags this!
# 11:20 AM    | 355  <-- Watchdog flags this!
# 11:25 AM    | 160  <-- Latency returns to normal

Without Watchdog, you’d need to define a monitor that triggers if P95 latency exceeds, say, 300ms for 5 minutes. But what if the problem is the rate of increase, or a slight, sustained rise that never quite reaches a static threshold yet still signals trouble? Watchdog analyzes metrics like request rate, error rate, CPU utilization, memory usage, disk I/O, network traffic, and more, across your hosts, containers, and applications. It learns the "normal" patterns for each metric, including seasonality (e.g., higher traffic on weekdays), trends, and correlations between metrics.
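The contrast between a fixed threshold and a learned baseline can be sketched in a few lines of Python. This is not Watchdog’s actual algorithm; it is a plain z-score check used as a stand-in. The latency values come from the table above, while the `creep` samples, `Z_THRESHOLD`, and the function names are assumptions made for illustration.

```python
from statistics import mean, stdev

# P95 latency samples (ms) at 5-minute intervals, taken from the table above.
baseline = [150, 155, 148, 152]             # 10:00-10:55 (elided rows omitted)
incoming = [153, 156, 350, 365, 355, 160]   # 11:00-11:25

STATIC_THRESHOLD_MS = 300  # what a hand-written monitor might use
Z_THRESHOLD = 3.0          # hypothetical cut-off for "unusual"


def static_alerts(samples, threshold):
    """Flag samples that cross a fixed threshold."""
    return [value > threshold for value in samples]


def baseline_alerts(history, samples, z_threshold):
    """Flag samples that sit far from the mean of the observed history."""
    mu = mean(history)
    sigma = stdev(history)
    return [abs(value - mu) / sigma > z_threshold for value in samples]


static_flags = static_alerts(incoming, STATIC_THRESHOLD_MS)
baseline_flags = baseline_alerts(baseline, incoming, Z_THRESHOLD)

# Hypothetical slight, sustained rise: it never reaches the static threshold,
# but it is far outside the observed baseline.
creep = [178, 181, 180]
creep_static = static_alerts(creep, STATIC_THRESHOLD_MS)
creep_baseline = baseline_alerts(baseline, creep, Z_THRESHOLD)

print(static_flags, baseline_flags, creep_static, creep_baseline)
```

Both checks catch the spike at 11:10, but only the baseline check reacts to the sustained rise to around 180ms. A real system would also have to model seasonality and trend, which this sketch ignores.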

When a deviation exceeds both the learned baseline and Watchdog’s statistical significance thresholds, Watchdog generates an alert. These alerts are not just a generic "metric X is high"; they provide context. For the spike in the table above, a Watchdog alert might say something like: "High P95 latency for my-web-service detected. Latency increased by roughly 135% over baseline (from about 150ms to about 355ms) for the last 15 minutes, correlating with a 30% increase in my-web-service error rate."

The core problem Watchdog solves is alert fatigue and missed critical signals. Manually creating monitors for every possible failure mode is a losing battle. You might set a threshold for latency, but miss an increase in error rates that’s causing the latency. Or you might set an error rate threshold too high, and miss a gradual creep of errors that indicates a slow-burn failure. Watchdog’s anomaly detection works by building a dynamic, multi-dimensional understanding of your system’s health. It’s not just looking at one metric in isolation; it’s looking at how multiple metrics behave together and identifying when that collective behavior becomes anomalous.
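To make "looking at multiple metrics together" concrete, here is a minimal sketch in which neither metric is unusual enough on its own, yet the combination is. It is an illustration, not Watchdog’s method: the history values, both cut-offs, and the function names are hypothetical, and the combined score is a simple Euclidean norm of per-metric z-scores that ignores correlation between the metrics.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical history for two metrics of the same service.
history = {
    "p95_latency_ms": [150, 155, 148, 152, 151, 149, 154, 153],
    "error_rate_pct": [0.50, 0.55, 0.48, 0.52, 0.51, 0.49, 0.54, 0.53],
}
PER_METRIC_Z = 3.0  # hypothetical single-metric cut-off
JOINT_Z = 3.0       # hypothetical cut-off for the combined score


def z_scores(history, sample):
    """Distance of each metric from its own baseline, in standard deviations."""
    return {
        name: (sample[name] - mean(values)) / stdev(values)
        for name, values in history.items()
    }


def joint_score(scores):
    """Combine per-metric z-scores into one distance (Euclidean norm)."""
    return sqrt(sum(z * z for z in scores.values()))


# Both metrics are mildly elevated at the same time.
sample = {"p95_latency_ms": 158, "error_rate_pct": 0.58}
scores = z_scores(history, sample)

single_metric_alert = any(abs(z) > PER_METRIC_Z for z in scores.values())
joint_alert = joint_score(scores) > JOINT_Z

print(scores, single_metric_alert, joint_alert)
```

Each metric sits about 2.7 standard deviations above its baseline, so a per-metric check stays silent, while the combined score crosses the cut-off. The cut-off values are arbitrary and would need tuning in practice.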

You control Watchdog primarily through its configuration and by providing it with high-quality data. Ensure your core metrics (latency, error rates, resource utilization) are being collected and tagged appropriately. You can then fine-tune its behavior:

  • Enabling/Disabling Watchdog: You can turn Watchdog on or off globally or for specific services or hosts.
  • Metric Scope: Watchdog analyzes metrics that have sufficient data points and a discernible pattern. Metrics with very low volume or highly erratic behavior might not be analyzed.
  • Customization: For specific services or applications, you may be able to adjust the sensitivity or focus Watchdog on particular anomaly types (e.g., "performance anomalies" vs. "error anomalies"). Watchdog is built to work without setup, so check the current Datadog documentation to see which of these controls exist for your account; where they do, expect to find them in the Datadog UI under APM or Infrastructure settings. The snippet below only illustrates the shape such a setting could take for a service named frontend-api. Its keys are hypothetical placeholders, not documented Datadog Agent options:
# datadog.yml snippet (hypothetical keys, for illustration only)
dd.trace.service.anomaly_detection.sensitivity: "high"
dd.trace.service.anomaly_detection.services: "frontend-api"

The key levers are the quality and completeness of your metrics, and your ability to interpret the context Watchdog provides. The more granular and well-tagged your data, the better Watchdog can learn your system’s normal behavior and detect deviations.

The true power of Watchdog lies in its ability to detect correlated anomalies across different metric types, as well as shifts that a single summary statistic hides, without explicit configuration. For instance, it might notice that while P95 latency is within normal bounds, the distribution of latencies has shifted significantly: a long tail of very high latencies has appeared beyond the 95th percentile, even though the median has barely moved. Static threshold monitors on a single percentile often miss this.
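The long-tail case is easy to reproduce with synthetic numbers. In the sketch below, the latency samples, the `percentile` helper (a nearest-rank implementation), and the choice of P99 as the tail indicator are all assumptions for illustration; they say nothing about how Watchdog computes or compares distributions.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical window of 100 request latencies (ms): 100, 101, ..., 199.
before = list(range(100, 200))

# Same window, except the three slowest requests now take 2000 ms.
after = before[:97] + [2000, 2000, 2000]

summary = {
    name: {p: percentile(window, p) for p in (50, 95, 99)}
    for name, window in (("before", before), ("after", after))
}
print(summary)
```

The median and P95 are identical in both windows, so a monitor on either would see nothing, while P99 jumps from 198ms to 2000ms.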

You’ll next want to explore how to integrate Watchdog alerts into your incident response workflows, potentially triggering automated actions or routing alerts to specific on-call teams.
