Elastic APM’s machine learning capabilities can detect anomalous behavior in your application’s performance without you needing to define specific thresholds.
Let’s see it in action. Imagine you have a web application monitored by Elastic APM. You’ve got your services instrumented, and data is flowing into Elasticsearch.
Here’s a snippet of what that data might look like in Kibana’s Discover tab:
{
  "@timestamp": "2023-10-27T10:00:00.000Z",
  "service.name": "my-web-app",
  "transaction.type": "request",
  "transaction.name": "GET /api/users",
  "event.duration": {
    "us": 150000
  },
  "http.response.status_code": 200,
  "error.count": 0,
  "user_experience.score": 0.95
}
{
  "@timestamp": "2023-10-27T10:00:05.000Z",
  "service.name": "my-web-app",
  "transaction.type": "request",
  "transaction.name": "GET /api/users",
  "event.duration": {
    "us": 180000
  },
  "http.response.status_code": 200,
  "error.count": 0,
  "user_experience.score": 0.92
}
// ... many more similar documents ...
{
  "@timestamp": "2023-10-27T10:35:15.000Z",
  "service.name": "my-web-app",
  "transaction.type": "request",
  "transaction.name": "GET /api/users",
  "event.duration": {
    "us": 950000
  },
  "http.response.status_code": 500,
  "error.count": 1,
  "user_experience.score": 0.20
}
The problem Elastic ML solves is the sheer volume and dynamism of application performance metrics. Manually setting thresholds for every transaction, every service, across different times of day, days of the week, and varying load conditions is practically impossible. You’d either end up with too many false positives or miss critical issues. Elastic ML automates this by learning the "normal" behavior of your system and flagging deviations.
Internally, Elastic ML uses unsupervised learning algorithms. When you configure an anomaly detection job for APM data, the job analyzes historical time-series data for specific metrics such as transaction duration, error rates, or request counts. It builds a model of typical behavior for each metric, accounting for seasonality (e.g., daily or weekly patterns) and trends. As new data arrives, the job compares it against this learned model and flags any value that deviates significantly from the expected pattern as an anomaly.
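To make the idea of a learned, seasonal baseline concrete, here is a deliberately simplified sketch. It is not Elastic ML's algorithm: it swaps in a plain per-hour mean and standard deviation (a z-score check) as a stand-in, and the function names, the threshold of 3.0, and the sample history are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_hourly_baseline(history):
    """Learn a (mean, stdev) pair per hour of day from (hour, duration_us) samples."""
    by_hour = defaultdict(list)
    for hour, duration_us in history:
        by_hour[hour].append(duration_us)
    # stdev needs at least two samples, so sparser hours get no baseline.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(baseline, hour, duration_us, z_threshold=3.0):
    """Flag a value more than z_threshold standard deviations from that hour's mean."""
    if hour not in baseline:
        return False  # nothing learned for this hour yet
    mu, sigma = baseline[hour]
    if sigma == 0:
        return duration_us != mu
    return abs(duration_us - mu) / sigma > z_threshold


# Hypothetical history: quick responses at 03:00, slower ones at the 10:00 peak.
history = [
    (3, 100_000), (3, 110_000), (3, 105_000),
    (10, 150_000), (10, 180_000), (10, 165_000),
]
baseline = build_hourly_baseline(history)

print(is_anomalous(baseline, 10, 170_000))  # False: ordinary for 10:00
print(is_anomalous(baseline, 3, 170_000))   # True: same value, unusual for 03:00
print(is_anomalous(baseline, 10, 950_000))  # True: far outside the 10:00 pattern
```

The same 170 ms duration is normal at one hour and anomalous at another, which is the kind of context a single static threshold cannot express.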
You control what gets analyzed. You can create anomaly detection jobs based on:
- Service Name: service.name : "my-web-app"
- Transaction Type: transaction.type : "request"
- Transaction Name: transaction.name : "GET /api/users"
- Specific Metrics: metric.name : "transaction.duration.us" or metric.name : "error.count"
- Any other relevant field: http.response.status_code : 500 or even custom fields you’ve added.
You can combine these to focus your analysis. For instance, you might want to detect anomalies only in the duration of GET /api/users transactions within your my-web-app service.
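As a rough illustration of combining those fields, the sketch below builds an Elasticsearch query DSL filter as a plain Python dictionary. The helper name build_apm_filter is hypothetical, and the use of term queries assumes the fields are mapped as keywords in your indices.

```python
import json


def build_apm_filter(service_name, transaction_name, transaction_type="request"):
    """Build a bool/filter query that narrows the data to one transaction of one service."""
    return {
        "bool": {
            "filter": [
                # term queries match exact values; this assumes keyword mappings.
                {"term": {"service.name": service_name}},
                {"term": {"transaction.type": transaction_type}},
                {"term": {"transaction.name": transaction_name}},
            ]
        }
    }


query = build_apm_filter("my-web-app", "GET /api/users")
print(json.dumps(query, indent=2))
```

A filter like this could scope the data a job analyzes, so the model only learns from GET /api/users transactions in my-web-app.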
Here’s a simplified view of creating a job in Kibana’s Machine Learning section:
- Navigate to Machine Learning > Anomaly Detection > Create job.
- Select APM as the data type.
- Choose your APM index pattern (e.g., apm-*).
- Select the time field (usually @timestamp).
- For Detector, choose a metric like Transaction Duration.
- For By fields, you might select service.name and transaction.name to analyze each transaction type within each service independently.
- You can then add Partition fields if you want to analyze metrics across different entities, like host.name if you’re looking for unusual behavior on a specific server.
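The same choices can also be written down as a job configuration of the kind the Elasticsearch anomaly detection job API accepts. The sketch below only builds the request body and does not call a cluster. The helper name, the description strings, the 15m bucket span, and the choice of high_mean are assumptions for illustration. Because a single detector takes one by field and one partition field, this sketch partitions on service.name, splits by transaction.name, and lists host.name as an influencer instead.

```python
import json


def build_job_config(bucket_span="15m"):
    """Build a hypothetical anomaly detection job body for transaction duration."""
    return {
        "description": "Unusually slow transactions per service (illustrative)",
        "analysis_config": {
            "bucket_span": bucket_span,  # width of each analysis interval
            "detectors": [
                {
                    "detector_description": "high mean transaction duration",
                    "function": "high_mean",  # only flag unusually high values
                    "field_name": "transaction.duration.us",
                    "by_field_name": "transaction.name",
                    "partition_field_name": "service.name",
                }
            ],
            # Influencers are fields that may help explain an anomaly.
            "influencers": ["service.name", "transaction.name", "host.name"],
        },
        "data_description": {"time_field": "@timestamp"},
    }


job_config = build_job_config()
print(json.dumps(job_config, indent=2))
```

A body like this would still need a datafeed pointing at your APM indices before the job could analyze anything.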
Once the job runs, you’ll see anomalies plotted on a timeline. You can drill down into specific anomalous data points to see the actual metrics and the expected values calculated by the model. This helps you quickly understand the nature of the deviation.
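If you export anomaly records rather than browsing them in Kibana, the drill-down can be sketched in a few lines. The records below are invented for illustration. The field names (record_score, actual, typical, by_field_value) follow the shape of anomaly records returned by the ML results API, where actual and typical are arrays, but treat the exact shape as an assumption to verify against your version. The cutoff of 75 is likewise just a chosen default.

```python
def summarize_anomalies(records, min_score=75.0):
    """Return the most severe records first, with actual and typical values side by side."""
    severe = [r for r in records if r["record_score"] >= min_score]
    severe.sort(key=lambda r: r["record_score"], reverse=True)
    summary = []
    for r in severe:
        actual_us = r["actual"][0]
        typical_us = r["typical"][0]
        summary.append({
            "timestamp": r["timestamp"],
            "transaction": r.get("by_field_value"),
            "actual_ms": actual_us / 1000,
            "typical_ms": typical_us / 1000,
            "ratio": round(actual_us / typical_us, 1),
        })
    return summary


# Made-up records, not real job output.
records = [
    {"timestamp": 1698400800000, "record_score": 12.4,
     "by_field_value": "GET /api/users", "actual": [180000.0], "typical": [160000.0]},
    {"timestamp": 1698402600000, "record_score": 91.3,
     "by_field_value": "GET /api/users", "actual": [950000.0], "typical": [160000.0]},
]

for row in summarize_anomalies(records):
    print(row)
```

Putting the observed value next to the model's expectation (here 950 ms against a typical 160 ms) is what makes the size of the deviation obvious at a glance.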
What surprises many people is how well the models adapt as "normal" behavior changes. If your application’s typical response time gradually increases over weeks, whether from legitimate code changes or from increased load that your system handles well, the anomaly detection job adjusts its baseline. It is not looking for absolute values, but for changes relative to what it has most recently learned to be typical. This continuous learning is what makes it so well suited to dynamic environments.
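The difference between an adapting baseline and a fixed threshold can be shown with a toy example. This is not how Elastic ML models data: it uses a simple exponentially weighted moving average as a stand-in, and the function name, the alpha and tolerance values, and the series itself are assumptions chosen for illustration.

```python
def detect_with_adaptive_baseline(values, alpha=0.1, tolerance=0.5):
    """Return indices whose value differs from the running baseline by more than `tolerance`."""
    baseline = values[0]
    flagged = []
    for i, value in enumerate(values[1:], start=1):
        if abs(value - baseline) / baseline > tolerance:
            flagged.append(i)
        # Update after the check so each value is judged against the past only.
        baseline = alpha * value + (1 - alpha) * baseline
    return flagged


# Response times in ms: a slow drift from 150 to 300, one sudden spike, then back to normal.
series = [150 + 2 * i for i in range(76)] + [950, 300, 302]

adaptive_flags = detect_with_adaptive_baseline(series)
# A fixed threshold set at 1.5x the starting value, for comparison.
static_flags = [i for i, v in enumerate(series) if v > 1.5 * series[0]]

print(adaptive_flags)     # [76]: only the sudden spike
print(len(static_flags))  # 41: the fixed threshold also fires on the drift
```

One limitation of this toy version is that the spike itself pulls the baseline upward for a while afterwards.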
The next step after detecting performance anomalies is often to feed the anomaly detection results into alerting mechanisms, so that you hear about issues early and can act before they escalate.