Dynatrace’s "Adaptive AI" for baselining CPU and memory thresholds isn’t magic; it’s a sophisticated system that learns your application’s normal behavior and flags deviations.
Let’s watch it in action. Imagine a web service, my-web-app, running on a Kubernetes cluster. We’ve got Dynatrace OneAgent installed on the nodes and the Kubernetes integration enabled.
Here’s a snippet of what Dynatrace might show for my-web-app’s CPU usage:
Service: my-web-app
Host: k8s-node-123
Process: java (PID 12345)
CPU Usage:
  Current: 75%
  Baseline (last 24h): 30%
Anomaly detected: High CPU Usage
Reason: Sustained CPU usage above baseline for 10 minutes.
And for memory:
Service: my-web-app
Host: k8s-node-123
Process: java (PID 12345)
Memory Usage:
  Current: 85% (Heap)
  Baseline (last 24h): 50%
Anomaly detected: High Memory Usage
Reason: Sustained memory usage above baseline for 15 minutes.
The core problem Dynatrace’s AI addresses is the futility of static, human-defined thresholds. A fixed threshold of 80% CPU generates constant noise if your app legitimately runs that hot during peak hours. Raise the threshold to 95% to quiet the noise, and you may miss a real problem until it’s too late, such as a service that normally idles at 20% suddenly sitting at 70% overnight. Instead of a single fixed number, Dynatrace’s AI builds a dynamic baseline.
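To make the contrast concrete, here is a minimal Python sketch. It is not Dynatrace’s implementation: the hourly profile, the function names, and the 15-point tolerance are all assumptions chosen for illustration.

```python
# Hypothetical "normal" CPU profile (%) by hour of day:
# quiet nights, a busy morning peak, moderate afternoons.
NORMAL_BY_HOUR = [20] * 8 + [85] * 4 + [60] * 6 + [20] * 6  # 24 entries


def static_alert(cpu: float, threshold: float = 80.0) -> bool:
    """Fixed threshold: fires whenever CPU exceeds one global number."""
    return cpu > threshold


def baseline_alert(cpu: float, hour: int, tolerance: float = 15.0) -> bool:
    """Baseline-relative check: fires when CPU exceeds what is normal
    for this hour by more than the (assumed) tolerance."""
    return cpu - NORMAL_BY_HOUR[hour] > tolerance


# A perfectly normal reading at the 09:00 peak.
print(static_alert(85.0), baseline_alert(85.0, hour=9))
# An abnormal reading at 03:00, when the service should be idle.
print(static_alert(70.0), baseline_alert(70.0, hour=3))
```

The static check fires on the normal peak and stays silent on the night-time problem; the baseline-relative check does the opposite.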
Here’s how it works internally:
- Data Collection: Dynatrace OneAgent, running on your hosts or within your containers, constantly collects fine-grained performance metrics. This includes CPU utilization (per core, per process), memory usage (heap, non-heap, system), network I/O, disk I/O, thread counts, garbage collection activity, and more. For Kubernetes, it pulls cluster-level metrics and pod-specific data.
- Time Series Analysis: This raw data is fed into Dynatrace’s backend, where it’s stored as time-series data. The AI then analyzes these series over configurable lookback windows (e.g., last 24 hours, last 7 days, last 30 days).
- Statistical Modeling: For each metric, the AI builds a statistical model. This isn’t just a simple average; it accounts for seasonality (daily, weekly patterns), trends, and noise. It learns, for example, that your app’s CPU usage typically spikes to 60% every weekday at 10 AM and then drops. It also learns what "normal" variation looks like for your specific workload.
- Anomaly Detection: Once a baseline model is established, the AI continuously compares the current metric values against the predicted values from the model. If the current value deviates significantly and persistently from the predicted baseline (beyond a dynamically calculated tolerance), an anomaly is triggered. The system considers factors like the duration of the deviation, the magnitude of the deviation, and the historical variability of the metric.
- Root Cause Analysis (RCA): This is where Dynatrace shines. When an anomaly is detected, the AI doesn’t just tell you "CPU is high." It uses its understanding of the application topology and dependencies (thanks to the Smartscape® topology mapping) to trace the anomaly back to its root cause. It analyzes related metrics and services to pinpoint the specific process, host, or even the exact code method that is causing the problem. For example, it might correlate high CPU on a web server with a specific database query that’s taking too long.
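The modeling and detection steps above can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not Dynatrace’s actual algorithm: it assumes a per-hour mean and standard deviation as the "seasonal model", and the names build_baseline and detect, along with the parameters k, min_std, and min_consecutive, are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """Group historical (hour, value) samples by hour and return
    {hour: (mean, std)}, a crude seasonal model of 'normal'."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {h: (mean(v), pstdev(v)) for h, v in buckets.items()}


def detect(baseline, samples, k=3.0, min_std=1.0, min_consecutive=3):
    """Return the index of the sample at which an anomaly is raised,
    or None. A sample is 'high' when it exceeds mean + k * std for its
    hour; an anomaly needs min_consecutive high samples in a row, so
    both magnitude and duration matter. Every hour in samples must
    already exist in the baseline."""
    run = 0
    for i, (hour, value) in enumerate(samples):
        mu, sigma = baseline[hour]
        upper = mu + k * max(sigma, min_std)  # tolerance scales with variability
        run = run + 1 if value > upper else 0
        if run >= min_consecutive:
            return i
    return None


# Assumed history: about 60% at 10:00, about 20% at 03:00.
history = [(10, 58), (10, 60), (10, 62), (3, 18), (3, 20), (3, 22)]
model = build_baseline(history)

sustained = [(10, 63), (3, 40), (3, 41), (3, 42)]  # persistent night-time deviation
one_spike = [(3, 50), (3, 20), (3, 50)]            # isolated spikes only
print(detect(model, sustained), detect(model, one_spike))
```

Requiring several consecutive out-of-range samples is one simple way to encode "significantly and persistently"; a real system would use richer models for trend and seasonality.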
The levers you control are primarily around the configuration of anomaly detection and the scope of monitoring.
- Anomaly Detection Settings: Within Dynatrace, you can configure sensitivity levels for anomaly detection. You can enable or disable specific anomaly detection rules (e.g., "High CPU usage," "High Memory usage," "Low Disk Space") for specific services, hosts, or environments. You can also define custom event rules that trigger alerts based on specific conditions, often using the Dynatrace Query Language (DQL) to inspect the collected metrics.
  For instance, to adjust the sensitivity of CPU anomaly detection for a specific service named payment-service, you’d navigate to Settings -> Anomaly detection -> Services -> CPU usage and find the settings for payment-service (exact menu names can vary by Dynatrace version). You can adjust the "Sensitivity" slider (e.g., from "Low" to "Medium" or "High") or define custom thresholds based on percentile or absolute values if the AI’s learned baseline isn’t precise enough for a very specific, predictable spike.
- Monitoring Scope: Ensure your OneAgents are deployed correctly to cover all relevant processes and hosts. For Kubernetes, verify that the Dynatrace Operator is configured to monitor your namespaces and pods. The AI can only learn about what it can see. If a critical microservice isn’t being monitored, its behavior won’t be part of the baseline.
- Maintenance Windows: You can define maintenance windows in Dynatrace. During these periods, anomaly detection is suppressed for specific entities. This is crucial for planned deployments or upgrades that you know will cause temporary performance deviations. You can set these up under Settings -> Anomalies -> Maintenance windows (again, the exact path can vary by version).
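As a rough mental model of how two of these levers interact, the sketch below treats sensitivity as a tolerance multiplier and a maintenance window as a hard suppression. It does not use any Dynatrace API: SENSITIVITY_K, should_alert, and the multiplier values are hypothetical.

```python
from datetime import datetime

# Hypothetical mapping: higher sensitivity means a tighter tolerance band.
SENSITIVITY_K = {"low": 4.0, "medium": 3.0, "high": 2.0}


def should_alert(value, mean, std, at, sensitivity="medium", windows=()):
    """Return True when value is anomalously high and 'at' is not inside
    any (start, end) maintenance window."""
    if any(start <= at < end for start, end in windows):
        return False  # planned work: suppress detection entirely
    return value > mean + SENSITIVITY_K[sensitivity] * std


deploy = (datetime(2024, 1, 10, 22, 0), datetime(2024, 1, 10, 23, 0))
during = datetime(2024, 1, 10, 22, 30)
after = datetime(2024, 1, 11, 9, 0)

# Same reading (76% against a 60% +/- 5 baseline) under different settings.
print(should_alert(76, 60, 5, after, "low"))                 # 76 > 80 fails
print(should_alert(76, 60, 5, after, "medium"))              # 76 > 75 holds
print(should_alert(76, 60, 5, during, "high", [deploy]))     # suppressed
```

The same reading is ignored at low sensitivity, flagged at medium, and suppressed during the deployment window regardless of sensitivity.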
The "AI" aspect is really about its ability to learn and adapt without constant manual tuning. It’s not just looking at raw numbers; it’s understanding the context of those numbers within your application’s unique operational rhythm.
What often surprises people is how readily the AI adapts to gradual changes. If your application’s CPU or memory usage slowly creeps up over weeks due to a memory leak or inefficient code, Dynatrace’s AI will absorb the drift into the baseline as the new "normal." It won’t flag an anomaly unless the rate of change, or the deviation from the newly established baseline, crosses the short-term anomaly detection threshold. This is by design, to avoid alert fatigue from slow shifts in workload, but it means you may need to combine AI-driven anomaly detection with longer-term trend analysis or custom alerts to catch slow performance regressions.
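A small Python experiment illustrates the effect. It assumes a 7-day rolling mean as a stand-in for an adaptive baseline and a least-squares slope as the "longer-term trend analysis"; the function names, the window, and the 10-point tolerance are illustrative choices, not Dynatrace behavior.

```python
def rolling_baseline_alerts(values, window=7, tolerance=10.0):
    """Indices where a value exceeds the mean of the previous `window`
    values by more than `tolerance` (an adaptive, short-memory baseline)."""
    alerts = []
    for i in range(window, len(values)):
        baseline = sum(values[i - window:i]) / window
        if values[i] - baseline > tolerance:
            alerts.append(i)
    return alerts


def slope(values):
    """Least-squares slope of values against their index (units per step).
    Requires at least two values."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Assumed data: daily average CPU creeping up by 0.5 points per day for 60 days.
daily_cpu = [30 + 0.5 * day for day in range(60)]

print(rolling_baseline_alerts(daily_cpu))   # adaptive baseline: no alerts
growth = slope(daily_cpu) * (len(daily_cpu) - 1)
print(round(growth, 1))                     # total drift over the period
```

Each day sits only 2 points above its rolling baseline, so the adaptive check never fires, while the trend over the full period shows CPU climbing by nearly 30 points. A custom alert on that long-window slope is one way to catch what the short-term baseline absorbs.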
The next thing you’ll likely encounter is understanding how Dynatrace correlates these CPU/memory anomalies with other potential root causes, like network latency or database performance issues, to build a complete picture of a problem.