Elasticsearch’s Machine Learning features can spot unusual patterns in your time-series data without you needing to define what "unusual" means beforehand.
Let’s see this in action. Imagine you’re tracking website traffic, and you want to know when something’s off. This isn’t about knowing that 10,000 users is "good" and 1,000 is "bad." It’s about detecting that 5,000 users is abnormally low for a Tuesday morning at 10 AM, even if you never explicitly told Elasticsearch that.
Here’s a sample of what that data might look like in Elasticsearch:
{
  "timestamp": "2023-10-27T10:00:00Z",
  "user_count": 5500,
  "status": "success"
}
{
  "timestamp": "2023-10-27T10:05:00Z",
  "user_count": 5620,
  "status": "success"
}
{
  "timestamp": "2023-10-27T10:10:00Z",
  "user_count": 4800,
  "status": "success"
}
To set up anomaly detection, you’d first create an "anomaly detection job" within Elasticsearch, together with a datafeed that specifies the data you want to analyze, typically an index pattern like website-traffic-*. You also tell the job which fields are relevant, such as timestamp for the time dimension and user_count for the metric.
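As a rough sketch of what such a job looks like outside Kibana, the snippet below builds request bodies for the create-job (PUT _ml/anomaly_detectors/<job_id>) and create-datafeed (PUT _ml/datafeeds/<datafeed_id>) APIs as plain Python dictionaries. The datafeed ID and the 15-minute bucket span are illustrative assumptions, and the time field is assumed to be timestamp, matching the sample documents above. Nothing is sent to a cluster here.

```python
import json

JOB_ID = "website-traffic-anomaly-job"
DATAFEED_ID = f"datafeed-{JOB_ID}"  # hypothetical datafeed name


def build_job_body(bucket_span="15m", time_field="timestamp"):
    """Request body for creating an anomaly detection job."""
    return {
        "description": "Unusual website traffic",  # free-text, illustrative
        "analysis_config": {
            "bucket_span": bucket_span,
            # One detector: sum of user_count per bucket.
            "detectors": [{"function": "sum", "field_name": "user_count"}],
        },
        "data_description": {"time_field": time_field},
    }


def build_datafeed_body(job_id=JOB_ID, indices=("website-traffic-*",)):
    """Request body for the datafeed that supplies the job with documents."""
    return {"job_id": job_id, "indices": list(indices)}


job_body = build_job_body()
datafeed_body = build_datafeed_body()
print(json.dumps(job_body, indent=2))
print(json.dumps(datafeed_body, indent=2))
```

The two bodies are kept separate because the job describes the analysis while the datafeed describes where the data comes from; either can be changed without touching the other.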
Once the job is running, Elasticsearch builds a model of "normal" behavior. It learns the typical fluctuations, seasonality (like daily or weekly patterns), and trends in your data. When a new data point arrives that deviates significantly from this learned normal, it’s flagged as an anomaly.
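To make the idea of a learned "normal" concrete, here is a deliberately simplified toy in Python. It is not Elastic’s algorithm: it only keeps a mean and standard deviation per hour-of-week slot and flags values that sit more than a chosen number of standard deviations away. The history values and the threshold of 3 are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def slot(ts):
    """Map a timestamp to an hour-of-week slot (0..167)."""
    return ts.weekday() * 24 + ts.hour


def fit_baseline(history):
    """Return {slot: (mean, stdev)} for slots with at least two observations."""
    by_slot = defaultdict(list)
    for ts, value in history:
        by_slot[slot(ts)].append(value)
    return {s: (mean(v), stdev(v)) for s, v in by_slot.items() if len(v) >= 2}


def is_unusual(baseline, ts, value, threshold=3.0):
    """Flag a value that is far from what this slot has seen before."""
    if slot(ts) not in baseline:
        return False  # nothing learned yet for this slot
    mu, sigma = baseline[slot(ts)]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical counts for four previous Fridays at 10:00.
history = [
    (datetime(2023, 9, 29, 10), 7100),
    (datetime(2023, 10, 6, 10), 7300),
    (datetime(2023, 10, 13, 10), 7250),
    (datetime(2023, 10, 20, 10), 7350),
]
baseline = fit_baseline(history)
low = is_unusual(baseline, datetime(2023, 10, 27, 10), 4800)
ordinary = is_unusual(baseline, datetime(2023, 10, 27, 10), 7200)
print(low, ordinary)
```

The point of the per-slot baseline is that the same value can be ordinary at one time of the week and unusual at another, which a single global threshold cannot express.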
The core of this capability lies in unsupervised learning. Elasticsearch ML doesn’t require labeled data: you don’t need to pre-tag instances as "anomalous" or "normal." Rather than a single off-the-shelf model, Elastic’s documentation describes a blend of techniques, including:
- Time series decomposition: Separates trend and periodic components, such as daily and weekly cycles, from the remaining variation.
- Bayesian distribution modeling: Maintains probability distributions over the values observed, so the job can estimate how unlikely a new value is.
- Clustering: Groups similar values or behaviors together.
- Correlation analysis: Looks at how related series move together.
The "job" configuration in Kibana’s Machine Learning section is where you define the analysis settings. You might set a bucket_span, the time interval over which data is aggregated for analysis. A bucket_span of 15m means Elasticsearch will analyze data in 15-minute intervals. You can also specify detectors, each of which applies an analysis function to a field. For instance, a detector could apply the sum function to user_count, commonly written as sum(user_count).
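To illustrate what bucketing does, the sketch below groups the three sample documents from earlier into 15-minute buckets and sums user_count in each, which is the kind of per-bucket value a sum detector would analyze. This is a plain-Python illustration of the concept, not how Elasticsearch implements it, and the helper names are invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SPAN_SECONDS = 15 * 60  # equivalent to a bucket_span of 15m

docs = [
    {"timestamp": "2023-10-27T10:00:00Z", "user_count": 5500},
    {"timestamp": "2023-10-27T10:05:00Z", "user_count": 5620},
    {"timestamp": "2023-10-27T10:10:00Z", "user_count": 4800},
]


def bucket_start(iso_ts, span=BUCKET_SPAN_SECONDS):
    """Return the epoch second at which the bucket containing iso_ts begins."""
    ts = datetime.strptime(iso_ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    epoch = int(ts.timestamp())
    return epoch - epoch % span


def sum_per_bucket(documents, field="user_count"):
    """Sum a numeric field over every document that falls in the same bucket."""
    totals = defaultdict(int)
    for doc in documents:
        totals[bucket_start(doc["timestamp"])] += doc[field]
    return dict(totals)


totals = sum_per_bucket(docs)
for start, total in sorted(totals.items()):
    print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat(), total)
```

All three sample documents fall into the bucket that starts at 10:00 UTC, so the detector would see a single value for that interval rather than three separate readings. A shorter bucket span gives finer-grained detection at the cost of noisier buckets.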
Here’s a glimpse of what an anomaly record looks like when one is detected (abridged, with illustrative values):
{
  "job_id": "website-traffic-anomaly-job",
  "result_type": "record",
  "timestamp": 1698409200000,
  "detector_index": 0,
  "function": "sum",
  "field_name": "user_count",
  "actual": [4800],
  "typical": [7250],
  "record_score": 88.1,
  "is_interim": false
}
The record_score (0-100) indicates how unusual this individual record is relative to the other anomalies the job has seen; 88.1 marks it as highly unusual. The actual and typical fields show that the sum of user_count was 4800, where the model expected it to be around 7250. Each bucket also gets an overall anomaly_score on the same 0-100 scale, reported in a separate bucket result; a value such as 75.3 there means the interval as a whole was significantly unusual.
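A small helper can make these numbers easier to read. The sketch below maps a 0-100 score to the severity labels Kibana shows in its anomaly views (warning, minor, major, critical in 25-point steps; treat the exact cut-offs as an assumption to check against your Kibana version) and computes how far the actual value fell from the expected one. The function names are hypothetical.

```python
def severity(score):
    """Map a 0-100 anomaly score to a coarse severity label."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 75:
        return "critical"
    if score >= 50:
        return "major"
    if score >= 25:
        return "minor"
    return "warning"


def percent_deviation(actual, expected):
    """Signed deviation of actual from expected, as a percentage of expected."""
    return (actual - expected) / expected * 100.0


label = severity(75.3)
deviation = round(percent_deviation(4800, 7250), 1)
print(label, deviation)
```

For the example above, a value of 4800 against an expectation of 7250 is roughly a one-third drop. The score and the deviation answer different questions: the score says how improbable the value was given the model, while the deviation says how large the gap was in absolute terms.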
The true power here is in its adaptability. If your website experiences a sudden surge in traffic due to a marketing campaign, the ML job will learn this new "normal" over time and adjust its expectations. It’s not a static rule-based system; it’s a dynamic model that evolves with your data.
One subtle but critical aspect of anomaly detection jobs is how they handle "influencers." If you’re tracking server logs and want to detect anomalies, you might add server_id as an influencer field. When an anomaly is detected, Elasticsearch will not only tell you that an anomaly occurred but also which specific server contributed most to it, allowing for much faster root cause analysis. Influencers attribute blame; they do not by themselves split the model. If you also want the job to learn typical behavior separately for each server, split the detector on that field with partition_field_name or by_field_name.
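As a sketch of how this might look in configuration, the snippet below builds an analysis_config whose detector analyzes a hypothetical response_time field and names server_id as an influencer. The optional partition_field_name is what asks the job to model each server separately. A second helper ranks influencer results by influencer_score; the sample results and server names are invented for illustration.

```python
def build_analysis_config(split_by_server=True):
    """analysis_config with server_id as an influencer (and optional split)."""
    detector = {"function": "mean", "field_name": "response_time"}  # hypothetical field
    if split_by_server:
        # Model each server's behavior on its own baseline.
        detector["partition_field_name"] = "server_id"
    return {
        "bucket_span": "15m",
        "detectors": [detector],
        "influencers": ["server_id"],
    }


def top_influencers(results, limit=3):
    """Return (value, score) pairs for the highest-scoring influencers."""
    ranked = sorted(results, key=lambda r: r["influencer_score"], reverse=True)
    return [(r["influencer_field_value"], r["influencer_score"]) for r in ranked[:limit]]


# Hypothetical influencer results for one time range.
sample_results = [
    {"influencer_field_name": "server_id", "influencer_field_value": "web-01", "influencer_score": 12.4},
    {"influencer_field_name": "server_id", "influencer_field_value": "web-03", "influencer_score": 91.2},
    {"influencer_field_name": "server_id", "influencer_field_value": "web-02", "influencer_score": 40.0},
]

config = build_analysis_config()
worst = top_influencers(sample_results, limit=1)
print(config)
print(worst)
```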
The next step after detecting anomalies is often to enrich them with context or to trigger automated actions based on the anomaly score.
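One way to act on scores is to filter results by a threshold and hand the survivors to a notifier. The sketch below does this for a list of record-like dictionaries whose field names (record_score, actual, typical) are assumed to follow the results format. The threshold of 75, the notify callback, and the message format are all assumptions; in practice, Kibana alerting rules or Watcher would usually do this job.

```python
def select_alerts(records, min_score=75.0):
    """Keep only records whose record_score meets the threshold."""
    return [r for r in records if r.get("record_score", 0.0) >= min_score]


def format_alert(record):
    """Build a one-line, human-readable summary of an anomaly record."""
    actual = record["actual"][0]
    typical = record["typical"][0]
    return (
        f"[{record['job_id']}] score {record['record_score']:.1f}: "
        f"actual {actual}, typical {typical}"
    )


def dispatch(records, notify, min_score=75.0):
    """Send one message per qualifying record; return how many were sent."""
    alerts = select_alerts(records, min_score)
    for record in alerts:
        notify(format_alert(record))
    return len(alerts)


# Illustrative records; the second one is invented to show filtering.
records = [
    {"job_id": "website-traffic-anomaly-job", "record_score": 88.1,
     "actual": [4800], "typical": [7250]},
    {"job_id": "website-traffic-anomaly-job", "record_score": 12.0,
     "actual": [7100], "typical": [7250]},
]

sent = []
count = dispatch(records, sent.append)  # sent.append stands in for a real notifier
print(count, sent)
```

Passing the notifier in as a callback keeps the filtering logic testable and lets the same code feed email, chat, or a ticketing system.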