Elasticsearch Watcher can alert on APM transaction data when it breaches defined thresholds. Its real strength is that a single watch can combine several dimensions, such as transaction name, cloud region, and time of day, into one alert condition.
Let’s see it in action. Imagine we’re monitoring a critical e-commerce checkout service. We want to know immediately if the average duration of POST /checkout transactions exceeds 500 milliseconds, but only during peak hours (9 AM to 5 PM UTC) and only for requests originating from the eu-west-1 region.
Here’s the Elasticsearch Watcher configuration to achieve this:
PUT _watcher/watch/checkout_performance_alert
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          "apm-*-transaction*"
        ],
        "body": {
          "size": 0,
          "runtime_mappings": {
            "hour_of_day": {
              "type": "long",
              "script": {
                "source": "emit(doc['@timestamp'].value.getHour())"
              }
            }
          },
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-5m/m",
                      "lt": "now/m"
                    }
                  }
                },
                {
                  "term": {
                    "transaction.name": "POST /checkout"
                  }
                },
                {
                  "term": {
                    "cloud.region": "eu-west-1"
                  }
                },
                {
                  "range": {
                    "hour_of_day": {
                      "gte": 9,
                      "lt": 17
                    }
                  }
                }
              ]
            }
          },
          "aggs": {
            "avg_duration_ms": {
              "avg": {
                "field": "transaction.duration.us"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "def avg_us = ctx.payload.aggregations.avg_duration_ms.value; return avg_us != null && avg_us / 1000.0 > params.threshold_ms;",
      "params": {
        "threshold_ms": 500
      }
    }
  },
  "actions": {
    "send_notification": {
      "email": {
        "to": "devops@example.com",
        "subject": "ALERT: High Checkout Transaction Duration in eu-west-1",
        "body": {
          "text": "Average duration for POST /checkout in eu-west-1 was {{ctx.payload.aggregations.avg_duration_ms.value}} us over the last 5 minutes (peak hours, 09:00-17:00 UTC), above the 500 ms threshold."
        }
      }
    }
  }
}
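This watch runs every 5 minutes ("interval": "5m"). It queries APM transaction data ("indices": ["apm-*-transaction*"]) from the last 5 full minutes ("gte": "now-5m/m", "lt": "now/m"). The bool query filters for transactions named POST /checkout, located in eu-west-1, and occurring between 9 AM and 5 PM UTC ("hour_of_day": {"gte": 9, "lt": 17}). Note that hour_of_day is not a field APM populates on its own: it has to be available as a runtime field, or be added at ingest, holding the UTC hour of @timestamp. The index pattern matches the classic APM indices of 7.x; on 8.x, where APM writes to data streams, the pattern would be traces-apm* instead. Despite its name, the avg_duration_ms aggregation returns the average duration in microseconds, because transaction.duration.us is stored in microseconds.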
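The condition divides this average by 1000 to convert microseconds to milliseconds and checks whether the result is greater than our threshold_ms of 500. If it is, the send_notification action fires an email. When no transactions match the filters, the avg aggregation returns null, so the script needs a null check before dividing.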
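Before relying on the alert, it can help to run the watch once by hand. The request below is a minimal sketch, assuming the watch has been stored under the ID checkout_performance_alert as shown earlier. It uses the Watcher execute API with every action set to simulate mode, so the input and condition are evaluated but no email is sent, and record_execution is turned off so the test run does not end up in the watch history.

```json
POST _watcher/watch/checkout_performance_alert/_execute
{
  "action_modes": {
    "_all": "simulate"
  },
  "record_execution": false
}
```

The response includes the search payload and the result of the condition, which makes it easy to see the raw average in microseconds and whether the threshold comparison behaved as expected.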
The power here lies in the ability to combine these granular filters. We’re not just alerting on high latency; we’re alerting on high latency under specific, business-relevant conditions like peak hours and regional deployment. This prevents alert fatigue by only signaling problems that matter most in a given context.
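A complementary option is to stop the watch from running at all outside peak hours. The fragment below is a sketch of an alternative trigger, not a replacement for the hour_of_day filter. It assumes Watcher's cron syntax, which starts with a seconds field, and the default behavior of evaluating schedules in UTC.

```json
{
  "trigger": {
    "schedule": {
      "cron": "0 0/5 9-17 * * ?"
    }
  }
}
```

With this expression the watch fires every 5 minutes from 09:00 to 17:55 UTC. The hour_of_day filter remains the precise boundary: the 09:00 run and the runs after 17:00 look back at minutes outside the window and simply match nothing.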
A less obvious capability is how far you can refine conditions and aggregations. For instance, you could add filters on http.request.method and service.environment, or, if several services handle POST /checkout, aggregate by service.name so that each service is evaluated on its own. This allows for precise, context-aware alerting that traditional monitoring tools struggle to match.
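As a sketch of the per-service idea, the fragment below shows only the parts of the watch that would change. The aggregation names by_service and avg_duration_us are hypothetical, the bucket size of 10 is an arbitrary choice, and the condition assumes that an alert should fire as soon as any single service exceeds the threshold.

```json
{
  "input": {
    "search": {
      "request": {
        "body": {
          "aggs": {
            "by_service": {
              "terms": { "field": "service.name", "size": 10 },
              "aggs": {
                "avg_duration_us": {
                  "avg": { "field": "transaction.duration.us" }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.by_service.buckets.stream().anyMatch(b -> b.avg_duration_us.value != null && b.avg_duration_us.value / 1000.0 > params.threshold_ms);",
      "params": { "threshold_ms": 500 }
    }
  }
}
```

Evaluating each bucket separately keeps one slow service from being hidden by the faster ones, which a single average over all checkout traffic would allow.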
The next step is often to add an error-rate check over the same time window, so that you catch not only slow transactions but also failing ones.
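One possible shape for that check is sketched below, again showing only the parts of the watch that differ. It assumes that failed transactions are marked with event.outcome set to failure, that ctx.payload.hits.total is a plain number in the Watcher payload, and that a 5% error rate is an acceptable limit. The aggregation name failed and the parameter max_error_rate are hypothetical.

```json
{
  "input": {
    "search": {
      "request": {
        "body": {
          "aggs": {
            "failed": {
              "filter": { "term": { "event.outcome": "failure" } }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "def total = ctx.payload.hits.total; if (total == 0) { return false; } return ctx.payload.aggregations.failed.doc_count * 1.0 / total > params.max_error_rate;",
      "params": { "max_error_rate": 0.05 }
    }
  }
}
```

For example, 12 failures out of 200 matching transactions gives a rate of 0.06, which is above 0.05 and would trigger the alert. Multiplying by 1.0 forces floating-point division, since dividing two integer counts would otherwise truncate to 0.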