Elastic APM’s latency percentile dashboards in Kibana are surprisingly effective at surfacing performance issues you wouldn’t see with simple averages.

Let’s see this in action. Imagine you’re monitoring a web service. You’ve got Elastic APM set up, sending traces to Elasticsearch. In Kibana, you navigate to the APM section, then to "Services" and select your service. From there, you go to "Transactions." By default, you’ll see average response times. But that average can be misleading. It compresses the whole latency distribution into one number: a few extremely slow requests can inflate it, while a large volume of fast requests can pull it down far enough to hide a meaningful share of requests that are consistently too slow, even if not catastrophic.

This is where percentiles shine. Instead of an average, we want to know, for example, the response time at the 95th percentile (p95) or 99th percentile (p99) of requests. A p95 of 800 ms means that 95% of requests completed in 800 ms or less and the remaining 5% took longer; p99 reads the same way for 99% and 1%. If your p95 is significantly higher than your average, it indicates a tail of slow requests that needs investigation.
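To make the gap between an average and a percentile concrete, here is a minimal Python sketch. The sample durations are invented for illustration, and the `percentile` helper uses the simple nearest-rank definition; Elasticsearch computes percentiles with an approximate algorithm, so its numbers can differ slightly on real data.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 90 fast requests at 100 ms and 10 slow ones at 2000 ms.
durations_ms = [100] * 90 + [2000] * 10

avg = mean(durations_ms)
p50 = percentile(durations_ms, 50)
p95 = percentile(durations_ms, 95)
p99 = percentile(durations_ms, 99)

print(f"avg={avg} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

In this sample the average lands at 290 ms, a latency that no request actually experienced, while the median shows the typical request at 100 ms and p95 and p99 both expose the 2000 ms tail.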

Here’s how you’d build a useful dashboard.

First, we need to ensure our APM data is being indexed correctly in Elasticsearch. Elastic APM agents typically send transaction data with fields like @timestamp, service.name, transaction.name, transaction.type, and transaction.duration.us. The transaction.duration.us field is crucial here, as it stores the transaction duration in microseconds.
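As a rough illustration of what such a document looks like, the sketch below builds a trimmed transaction document in Python and reads the duration out of it. The service name, transaction name, timestamp, and duration are made-up values, and `get_field` is a hypothetical helper, not part of any Elastic client; real documents carry many more fields.

```python
# Hypothetical transaction document, trimmed to the fields named above.
# Dotted field names such as transaction.duration.us correspond to nested
# objects in the stored JSON.
doc = {
    "@timestamp": "2024-01-01T12:00:00.000Z",
    "service": {"name": "checkout-service"},
    "transaction": {
        "name": "GET /api/cart",
        "type": "request",
        "duration": {"us": 184250},
    },
}


def get_field(document, dotted_path):
    """Resolve a dotted field name against a nested document."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value


duration_us = get_field(doc, "transaction.duration.us")
duration_ms = duration_us / 1000  # microseconds to milliseconds

print(f"{get_field(doc, 'transaction.name')}: {duration_ms} ms")
```

Because the raw values are in microseconds, it helps to convert to milliseconds (or label the axis clearly) before comparing a chart against a latency target.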

To visualize percentiles, we’ll leverage Kibana’s Lens or Visualize Library. Let’s use Lens for its interactive nature.

  1. Navigate to Visualize Library: In Kibana, go to "Visualize Library" and click "Create visualization."
  2. Choose Lens: Select "Lens" as your visualization type.
  3. Select Data View: Choose the data view that covers your APM transaction data (e.g., apm-*, or traces-apm* on newer versions that store APM data in data streams).
  4. Configure the Chart:
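    • Y-axis: Drag the transaction.duration.us field to the Y-axis. Change the function from whatever default Lens selects to "Percentile," and enter 95 in the percentile input for the 95th percentile. To plot p99 as well, add a second Y-axis dimension on the same field with the percentile set to 99. The values are in microseconds, so keep the unit in mind when reading the axis.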
    • X-axis: Drag @timestamp to the X-axis. Kibana will likely default to a "Date Histogram" aggregation, which is what we want for time-series data. Adjust the interval if needed (e.g., auto, 1h, 1d).
    • Break down by: To understand which transactions are slow, drag transaction.name or transaction.type to the "Break down by" section. This will create separate lines on your chart for each distinct transaction or type.
    • Filter: Add a filter for transaction.type if you want to focus on specific types, like request for web endpoints.

This setup will give you a line chart showing the p95 and p99 latency of your transactions over time, broken down by transaction name. You can then save this as a dashboard panel.
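For readers who prefer to see the underlying query, the chart described above corresponds roughly to an Elasticsearch search that nests a percentiles aggregation inside a terms aggregation inside a date histogram. The sketch below only builds and prints the request body as a Python dict; it is not the exact request Lens sends. The service name, the 24-hour range, the 1h interval, and the top-5 limit are assumptions chosen for illustration.

```python
import json


def build_percentile_query(service_name, interval="1h", percents=(95, 99)):
    """Build a search body: p95/p99 of transaction.duration.us over time,
    broken down by transaction.name, for one service's request transactions."""
    return {
        "size": 0,  # only aggregations are needed, not individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service.name": service_name}},
                    {"term": {"transaction.type": "request"}},
                    {"range": {"@timestamp": {"gte": "now-24h"}}},
                ]
            }
        },
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                },
                "aggs": {
                    "by_transaction": {
                        "terms": {"field": "transaction.name", "size": 5},
                        "aggs": {
                            "latency": {
                                "percentiles": {
                                    "field": "transaction.duration.us",
                                    "percents": list(percents),
                                }
                            }
                        },
                    }
                },
            }
        },
    }


body = build_percentile_query("checkout-service")  # hypothetical service name
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into Kibana Dev Tools as the body of a search against your APM indices to sanity-check the numbers the chart shows.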

Key Levers to Control:

  • Percentile Value: The primary lever is the percentile you choose (p50, p75, p90, p95, p99). Higher percentiles reveal more extreme outliers.
  • Transaction Granularity: Breaking down by transaction.name is essential. A single slow endpoint can inflate the overall percentile if not isolated.
  • Time Range: Always adjust the time range to match your investigation period.
  • Filters: Apply filters for specific services, environments, or transaction outcomes (for example, event.outcome) to narrow down the scope.

The true power comes when you combine these percentile visualizations with other metrics. For instance, you might have a dashboard with:

  • Average Latency: For a baseline understanding.
  • p95/p99 Latency: To spot the long tail.
  • Error Rate: To correlate latency spikes with errors.
  • Throughput (Requests per Minute): To see if latency increases under load.
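The four panels above can be sketched as a single summary function over a window of transactions. This is an illustration of the arithmetic only, with invented records and hypothetical field names (`duration_ms`, `failed`); Kibana computes these values from your indexed APM data, and its percentile results are approximate.

```python
from statistics import quantiles


def summarize(transactions, window_minutes):
    """Compute the four dashboard metrics for one time window."""
    durations = [t["duration_ms"] for t in transactions]
    failures = sum(1 for t in transactions if t["failed"])
    # quantiles(n=100) returns 99 cut points; index 94 is p95, index 98 is p99.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {
        "avg_ms": sum(durations) / len(durations),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": failures / len(transactions),
        "throughput_rpm": len(transactions) / window_minutes,
    }


# Hypothetical 5-minute window: 18 healthy requests and 2 slow failures.
sample = (
    [{"duration_ms": 120, "failed": False}] * 18
    + [{"duration_ms": 3000, "failed": True}] * 2
)

summary = summarize(sample, window_minutes=5)
for metric, value in summary.items():
    print(f"{metric}: {value}")
```

Reading the metrics side by side is the point: here the average (408 ms) looks tolerable, while p95 and p99 (3000 ms) together with a 10% error rate show that the slow requests are also the failing ones.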

When the overall p99 latency spikes and the breakdown shows a matching spike in the line for one specific transaction.name, you’ve found your culprit. The next step is to dive into the APM transaction traces for that transaction name and time window to see the detailed breakdown of what took so long: database queries, external HTTP calls, or internal processing.
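The same "break down and look for the outlier" reasoning can be sketched in a few lines of Python. The transaction names and durations below are invented, and `p99_by_transaction` is a hypothetical helper that mirrors what the "Break down by" setting does for a single time bucket.

```python
from collections import defaultdict
from statistics import quantiles


def p99_by_transaction(records):
    """Group (transaction_name, duration_ms) pairs by name and return
    (name, p99) pairs sorted from slowest to fastest.
    Each name needs at least two samples."""
    grouped = defaultdict(list)
    for name, duration_ms in records:
        grouped[name].append(duration_ms)
    result = {
        # quantiles(n=100) returns 99 cut points; index 98 is the p99.
        name: quantiles(values, n=100, method="inclusive")[98]
        for name, values in grouped.items()
    }
    return sorted(result.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples for three endpoints in the same time bucket.
records = (
    [("GET /api/products", 80)] * 50
    + [("POST /api/checkout", 150)] * 50
    + [("GET /api/cart", 100)] * 48
    + [("GET /api/cart", 4000)] * 2
)

ranked = p99_by_transaction(records)
for name, p99_ms in ranked:
    print(f"{name}: p99 = {p99_ms} ms")
```

The endpoint at the top of the ranking is the one whose traces are worth opening first.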

A common pitfall is to look only at average latency and miss the fact that, say, 10% of your requests are significantly slower than the rest. Percentiles address this directly: p90 marks the boundary of that slowest 10%, and p95 and p99 show how bad the experience gets for the slowest 5% and 1%.

If you’re seeing an anomaly in your percentile charts, the next step is to correlate that with specific slow spans within those transactions using the APM trace details.
