Dynatrace’s anomaly detection is surprisingly good at figuring out what’s not normal, but it’s terrible at telling you what’s actually wrong without a ton of configuration.

Let’s say you have a web service, my-app-frontend, running on Kubernetes. Dynatrace is already ingesting metrics from it. By default, it’s probably flagging every minor spike in request latency or error rates as an anomaly. We want to make this smarter, so it only alerts us when it matters.

Here’s my-app-frontend in action, processing requests:

{
  "timestamp": "2023-10-27T10:30:05Z",
  "request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "method": "GET",
  "path": "/api/v1/users",
  "status": 200,
  "duration_ms": 45,
  "client_ip": "192.168.1.10"
}

And here’s a snapshot of its resource utilization, as seen by Dynatrace:

[
  {
    "timestamp": "2023-10-27T10:30:00Z",
    "metric_id": "k8s.pod.cpu.usage.normalized",
    "value": 0.35,
    "unit": "cores",
    "entity_id": "K8S_POD:my-app-frontend-abcdef12345-ghijk"
  },
  {
    "timestamp": "2023-10-27T10:30:00Z",
    "metric_id": "k8s.pod.memory.usage",
    "value": 150,
    "unit": "Mi",
    "entity_id": "K8S_POD:my-app-frontend-abcdef12345-ghijk"
  }
]

The core problem Dynatrace anomaly detection tries to solve is distinguishing signal from noise. If you get 100 alerts a day, you stop looking at them. We need to tune Dynatrace to tell us about the important deviations, not just any deviation.

This involves two main concepts: Anomaly Detection Rules and Alerting Rules.

Anomaly Detection Rules are where you tell Dynatrace what constitutes an anomaly for a given metric. You’re essentially setting thresholds or patterns that trigger an anomaly event.

Alerting Rules are where you define when to notify someone based on those anomaly events, and how (e.g., PagerDuty, Slack, email).

Let’s focus on tuning the anomaly detection for my-app-frontend. We’ll start with request latency.

Problem: Default anomaly detection flags every slight latency increase.

Goal: Only alert on significant, sustained latency increases that impact user experience.

1. Configure Anomaly Detection for Request Latency:

  • Navigate: In Dynatrace, go to Alerting -> Anomaly detection -> Monitored entities.

  • Select Entity: Choose Kubernetes workload and then find your my-app-frontend workload.

  • Metric: Look for Request latency.

  • Anomaly Detection Type: The default is often "Auto-adaptive". Switching to "Thresholds" would give more explicit control, but here we’ll keep "Auto-adaptive" and tune its parameters.

    • Current Anomaly Event: Dynatrace might be flagging anything above the 95th percentile for 5 minutes.

    • Our Tune: We’ll raise the "Critical" threshold to the 99th percentile and set the "Warning" threshold to the 97th percentile. We’ll also increase the "Anomaly detection time window" to 10 minutes. Dynatrace will then treat a latency increase as an anomaly only if it crosses one of these percentile thresholds and stays there for at least 10 minutes.

    • Why this works: By increasing the percentile and the time window, we’re telling Dynatrace to ignore transient blips. It now needs to see a significant and prolonged increase in latency before flagging it. This drastically reduces noise.
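To make the "percentile plus time window" idea concrete, here is a minimal Python sketch. It is not Dynatrace's algorithm or API; the function names, the one-sample-per-minute assumption, and the sample data are all hypothetical. It only shows why a single slow minute stops counting as an anomaly once every sample in the window has to exceed the threshold.

```python
from statistics import quantiles


def percentile(values, pct):
    """Return the pct-th percentile (1 to 99) of values."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[pct - 1]


def sustained_breach(samples, threshold, window):
    """True only if each of the last `window` samples exceeds `threshold`."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])


# Hypothetical history: one latency sample (ms) per minute, values 40..49.
history = [40 + (i % 10) for i in range(1000)]
p99 = percentile(history, 99)

blip = [45] * 9 + [500]   # a single slow minute
sustained = [500] * 10    # ten slow minutes in a row

print(sustained_breach(blip, p99, window=10))       # False
print(sustained_breach(sustained, p99, window=10))  # True
```

Raising the percentile moves `threshold` up, and lengthening `window` demands more consecutive slow samples; both changes make the check harder to trip.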

2. Configure Anomaly Detection for Error Rates:

  • Problem: Default flags minor error rate fluctuations.

  • Goal: Alert only on a sustained, significant increase in HTTP 5xx errors.

  • Metric: Look for HTTP 5xx request errors.

  • Anomaly Detection Type: Again, we’ll use "Auto-adaptive" but tune it.

    • Current Anomaly Event: Dynatrace might flag any increase in 5xx errors.

    • Our Tune: We’ll set the "Warning" threshold for the rate of 5xx errors to 0.5% (meaning 0.5% of all requests are 5xxs) and the "Critical" threshold to 1.5%. We’ll also set the "Anomaly detection time window" to 5 minutes.

    • Why this works: This ensures we’re only alerted when the error rate becomes a real problem (a warning from 0.5%, critical once more than 1.5% of requests are failing), not just when a single request fails. The 5-minute window prevents alerts for very short, self-correcting glitches.
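The error-rate thresholds above can be sketched the same way. This is an illustration only, assuming per-minute counts of total and 5xx requests; `classify` and `sustained_severity` are hypothetical helpers, not Dynatrace functions. The severity reported is the lowest one seen across the window, so one bad minute cannot raise a critical event on its own.

```python
SEVERITY_ORDER = ["ok", "warning", "critical"]


def classify(total, failed_5xx, warning_pct=0.5, critical_pct=1.5):
    """Map one minute of traffic to a severity based on its 5xx rate."""
    if total == 0:
        return "ok"
    rate = 100.0 * failed_5xx / total
    if rate >= critical_pct:
        return "critical"
    if rate >= warning_pct:
        return "warning"
    return "ok"


def sustained_severity(per_minute, window=5):
    """Severity that held for every one of the last `window` minutes.

    per_minute is a list of (total_requests, failed_5xx) tuples.
    """
    if len(per_minute) < window:
        return "ok"
    levels = [classify(t, f) for t, f in per_minute[-window:]]
    return min(levels, key=SEVERITY_ORDER.index)


print(sustained_severity([(1000, 20)] * 5))                # critical (2% for 5 min)
print(sustained_severity([(1000, 20)] * 4 + [(1000, 1)]))  # ok (recovered in minute 5)
print(sustained_severity([(1000, 7)] * 5))                 # warning (0.7% for 5 min)
```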

3. Define Alerting Rules:

Now that we’ve told Dynatrace what to consider an anomaly, we need to tell it when to send notifications.

  • Navigate: Alerting -> Alerting rules.

  • Create New Rule: Let’s create a rule for our critical latency issues.

    • Name: High Latency - my-app-frontend (Critical)

    • Trigger: Anomaly detected

    • Conditions:

      • Entity type is Kubernetes workload
      • Entity name is my-app-frontend
      • Anomaly type is Request latency
      • Severity is Critical (This maps to the "Critical" threshold we set in anomaly detection).
    • Severity: Set to Critical.

    • Notification: Select your PagerDuty integration, Slack channel, etc.

    • Problem detection: Create new problem (default)

    • Auto-resolution: Resolve problem automatically when anomaly is no longer detected.

    • Why this works: This rule explicitly links the "Critical" latency anomaly event (which we’ve tuned to be significant and sustained) to a notification. When Dynatrace detects a critical latency anomaly for my-app-frontend for 10+ minutes (our tuned window), it will trigger this alert. The auto-resolution ensures the incident is closed when the condition clears.

We’ll create a similar rule for critical error rates:

  • Name: High Error Rate - my-app-frontend (Critical)
  • Trigger: Anomaly detected
  • Conditions:
    • Entity type is Kubernetes workload
    • Entity name is my-app-frontend
    • Anomaly type is HTTP 5xx request errors
    • Severity is Critical
  • Severity: Critical
  • Notification: PagerDuty, Slack, etc.
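Conceptually, an alerting rule is a filter over anomaly events plus a list of notification targets. The sketch below models the two rules above in plain Python. The `AnomalyEvent` and `AlertingRule` classes, their field names, and the channel strings are assumptions made for illustration; they do not reflect Dynatrace's configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnomalyEvent:
    entity_type: str
    entity_name: str
    anomaly_type: str
    severity: str


@dataclass(frozen=True)
class AlertingRule:
    name: str
    entity_type: str
    entity_name: str
    anomaly_type: str
    severity: str
    channels: tuple

    def matches(self, event):
        """A rule fires only when every condition equals the event's value."""
        return (
            event.entity_type == self.entity_type
            and event.entity_name == self.entity_name
            and event.anomaly_type == self.anomaly_type
            and event.severity == self.severity
        )


RULES = [
    AlertingRule("High Latency - my-app-frontend (Critical)",
                 "Kubernetes workload", "my-app-frontend",
                 "Request latency", "Critical", ("pagerduty", "slack")),
    AlertingRule("High Error Rate - my-app-frontend (Critical)",
                 "Kubernetes workload", "my-app-frontend",
                 "HTTP 5xx request errors", "Critical", ("pagerduty", "slack")),
]


def route(event, rules=RULES):
    """Return the notification channels of every rule the event matches."""
    return [ch for rule in rules if rule.matches(event) for ch in rule.channels]


critical = AnomalyEvent("Kubernetes workload", "my-app-frontend",
                        "Request latency", "Critical")
warning = AnomalyEvent("Kubernetes workload", "my-app-frontend",
                       "Request latency", "Warning")
print(route(critical))  # ['pagerduty', 'slack']
print(route(warning))   # []
```

Because neither rule matches "Warning" severity, warning-level events are recorded but nobody is paged for them.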

A Deep Dive into Auto-Adaptive Anomaly Detection Tuning:

When using "Auto-adaptive" for something like k8s.pod.cpu.usage.normalized, you’re not just setting a static threshold. Dynatrace builds a baseline of normal behavior over time. You can influence this baseline and the sensitivity.

  • Baseline Window: The period over which Dynatrace learns normal behavior (e.g., last 7 days, last 30 days). For a stable service, 7 days is often sufficient. For a service with strong weekly patterns, you might use 14 or 28 days.
  • Seasonality: You can enable this if your metrics have predictable daily, weekly, or monthly patterns. Dynatrace will then account for these when detecting anomalies. For my-app-frontend with typical business hours, enabling "Daily" and "Weekly" seasonality is a good idea.
  • Sensitivity: This is where you set the deviation percentages. For CPU usage, you might set:
    • Warning: 20% deviation from the learned baseline.
    • Critical: 40% deviation from the learned baseline.
  • Time Window: The duration the deviation must persist to trigger an event (e.g., 10 minutes).
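The interaction of baseline, sensitivity, and time window can be sketched as follows. This is a simplified stand-in, not Dynatrace's actual baselining: it assumes the baseline is a plain mean over the learning window, one sample per minute, and that only upward deviations matter. All names are hypothetical.

```python
from statistics import fmean


def deviation_pct(value, baseline):
    """Percentage by which value sits above (or below) the baseline."""
    return 100.0 * (value - baseline) / baseline


def cpu_severity(recent, baseline_samples, window=10,
                 warning_pct=20.0, critical_pct=40.0):
    """Severity sustained across the last `window` samples."""
    baseline = fmean(baseline_samples)
    last = recent[-window:]
    if len(last) < window or baseline == 0:
        return "ok"
    # The smallest deviation in the window is what held for the whole window.
    held = min(deviation_pct(v, baseline) for v in last)
    if held >= critical_pct:
        return "critical"
    if held >= warning_pct:
        return "warning"
    return "ok"


learned = [0.35] * 100  # hypothetical baseline window, normalized CPU

print(cpu_severity([0.50] * 10, learned))           # critical (about 43% above)
print(cpu_severity([0.45] * 10, learned))           # warning (about 29% above)
print(cpu_severity([0.50] * 9 + [0.36], learned))   # ok (deviation not sustained)
```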

The key is that "Auto-adaptive" learns. If you change your application’s normal operating parameters (e.g., deploy a new version that uses more CPU), Dynatrace will eventually adapt. During that adaptation period, however, you might get false positives or false negatives. This is why a hybrid approach (tuned auto-adaptive for most metrics, static thresholds for very predictable ones) is sometimes best.

The most common pitfall is leaving anomaly detection on its default "auto" settings without understanding what "auto" means for your specific metric and environment. Dynatrace’s auto-adaptive anomaly detection establishes a baseline using a combination of statistical methods, often involving moving averages, standard deviations, and percentile calculations over dynamic time windows. You then configure how sensitive it is to deviations from that baseline, and over what duration. Because the baseline is recalculated from the historical data Dynatrace ingests, gradual changes in your application’s normal behavior are absorbed automatically. Sudden shifts, or patterns that are normal but infrequent (like a daily batch job), can still trigger false positives unless you account for them with seasonality settings or specific rule configurations.
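A small sketch shows why seasonality matters for something like a daily batch job. It assumes a deliberately simple model, one mean per hour of day, which is not how Dynatrace computes seasonal baselines; the data and function names are hypothetical. The same CPU value is normal at 02:00, when the batch job runs, and anomalous at 14:00.

```python
from collections import defaultdict
from statistics import fmean


def seasonal_baseline(history):
    """Build {hour_of_day: mean value} from (hour, value) pairs."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {hour: fmean(values) for hour, values in buckets.items()}


def is_anomalous(value, expected, tolerance_pct=20.0):
    """True if value deviates from expected by more than tolerance_pct."""
    return abs(value - expected) > expected * tolerance_pct / 100.0


# Hypothetical week of hourly CPU: a batch job pushes usage to 0.9 at 02:00.
history = [(h, 0.9 if h == 2 else 0.3) for _ in range(7) for h in range(24)]

hourly = seasonal_baseline(history)
flat = fmean([value for _, value in history])  # baseline with no seasonality

print(is_anomalous(0.9, flat))        # True: the nightly job looks like a spike
print(is_anomalous(0.9, hourly[2]))   # False: expected at 02:00
print(is_anomalous(0.9, hourly[14]))  # True: unexpected at 14:00
```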

Once you’ve mastered tuning anomaly detection for your core services, the next step is to explore custom metrics and how to apply anomaly detection to them.
