The Elastic APM Service Overview in Kibana is not just a dashboard; it’s a diagnostic tool that actively reconstructs the flow of requests through your distributed systems, revealing bottlenecks and failures with surprising clarity.
Let’s watch it in action. Imagine a user requests a product page on your e-commerce site. This request doesn’t just hit one server; it might trigger calls to a product catalog service, an inventory service, a pricing service, and finally, a payment gateway. Elastic APM, through its agents running in your application code, captures each of these steps as a span and groups all the spans belonging to one request into a "trace."
Here’s a simplified representation of what that trace might look like in Kibana’s APM UI:
    [Trace: a1b2c3d4e5f6]
      [Span: GET /products/{id} (Backend)] - 50ms
        [Span: Fetch product details (Service A)] - 20ms
          [Span: Query database (Service A)] - 15ms
        [Span: Check inventory (Service B)] - 25ms
          [Span: Call inventory API (Service B)] - 20ms
        [Span: Get pricing (Service C)] - 5ms
          [Span: Call pricing API (Service C)] - 3ms
This hierarchy shows the original request (the top-level span, which Elastic APM calls a transaction) and all the sub-requests (child spans) it spawned. You can immediately see that, among the direct children, "Check inventory (Service B)" takes the longest: 25ms of the 50ms total.
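To make the parent-child structure concrete, here is a minimal sketch that models the example trace as a nested object, prints it as an indented tree, and picks the slowest direct child. The object shape and the `renderTree` and `slowestChild` helpers are hypothetical names chosen for illustration; they are not part of Elastic APM or Kibana.

```javascript
// Hypothetical in-memory copy of the example trace; durations in milliseconds.
const trace = {
  name: 'GET /products/{id} (Backend)',
  durationMs: 50,
  children: [
    {
      name: 'Fetch product details (Service A)',
      durationMs: 20,
      children: [{ name: 'Query database (Service A)', durationMs: 15, children: [] }],
    },
    {
      name: 'Check inventory (Service B)',
      durationMs: 25,
      children: [{ name: 'Call inventory API (Service B)', durationMs: 20, children: [] }],
    },
    {
      name: 'Get pricing (Service C)',
      durationMs: 5,
      children: [{ name: 'Call pricing API (Service C)', durationMs: 3, children: [] }],
    },
  ],
};

// Render the tree as indented text, one span per line.
function renderTree(span, depth = 0) {
  const line = `${'  '.repeat(depth)}${span.name} - ${span.durationMs}ms`;
  return [line, ...span.children.flatMap((child) => renderTree(child, depth + 1))];
}

// Return the direct child with the largest duration, or null if there are none.
function slowestChild(span) {
  return span.children.reduce(
    (slowest, child) =>
      slowest === null || child.durationMs > slowest.durationMs ? child : slowest,
    null
  );
}

console.log(renderTree(trace).join('\n'));
console.log('Slowest direct child:', slowestChild(trace).name);
```

Comparing only direct children mirrors how you read the waterfall in the UI: start at the top-level request, find the widest child, then repeat one level down.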
The problem this solves is the "black box" nature of modern microservice architectures. When a user reports slow performance, tracing allows you to pinpoint exactly which service or operation is the culprit, rather than guessing or sifting through logs across dozens of machines.
Internally, APM agents instrument your code. For a Java application, the agent is attached to the JVM (typically with the -javaagent flag) and uses bytecode manipulation to hook into method calls. For Node.js, instrumentation is usually done by hooking require, so that supported modules are patched as they load. The agents record the start and end times of specific operations (spans) and their parent-child relationships. This data is then sent to the APM Server, which processes it and indexes it in Elasticsearch. Kibana then queries Elasticsearch to visualize the traces.
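The bookkeeping described above (start time, end time, parent-child link) can be sketched with a toy tracer. This is only an illustration of the idea, not the Elastic agent's implementation: `ToyTracer` and `withSpan` are hypothetical names, the clock is injected so the example is deterministic, and only synchronous code is handled, whereas a real agent also has to track context across asynchronous calls.

```javascript
// Toy tracer: illustrates span bookkeeping only, not how the Elastic agent works.
class ToyTracer {
  constructor(now = () => Date.now()) {
    this.now = now;      // injectable clock, so the sketch is easy to test
    this.stack = [];     // currently open spans, innermost last
    this.finished = [];  // completed spans, in order of completion
    this.nextId = 1;
  }

  // Run fn inside a new span whose parent is the innermost open span.
  withSpan(name, fn) {
    const parent = this.stack[this.stack.length - 1];
    const span = {
      id: this.nextId++,
      parentId: parent ? parent.id : null,
      name,
      start: this.now(),
    };
    this.stack.push(span);
    try {
      return fn();
    } finally {
      // Record the end time even if fn throws.
      span.end = this.now();
      span.durationMs = span.end - span.start;
      this.stack.pop();
      this.finished.push(span);
    }
  }
}

// Usage with a fake clock that we advance by hand (milliseconds).
let clock = 0;
const tracer = new ToyTracer(() => clock);
tracer.withSpan('GET /products/{id}', () => {
  clock += 5;                                                // work in the handler
  tracer.withSpan('Query database', () => { clock += 15; }); // child span
  clock += 2;                                                // more work in the handler
});
console.log(tracer.finished);
```

The child span finishes first, so it appears first in `finished`; the parent link is what lets a UI rebuild the hierarchy later.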
The key levers you control are:
- Sampling Rate: APM agents do not have to capture 100% of traces. Under heavy load, sampling a fraction keeps agent overhead and storage manageable, and you configure what fraction of traces to sample. For a Java agent, setting it as a JVM system property might look like this:

      -Delastic.apm.transaction_sample_rate=0.1

  This captures 10% of transactions. The same setting can also be managed centrally from the agent configuration page in Kibana's APM app.
- Agent Configuration: You can enable or disable specific types of instrumentation, set service names, and configure the APM Server endpoint. For a Node.js agent, this is typically done at the top of your application's entry point, before any other module is required:

      const apm = require('elastic-apm-node').start({
        serviceName: 'my-frontend-app',
        serverUrl: 'http://localhost:8200',
        environment: 'production',
        transactionSampleRate: 0.5 // Capture 50% of transactions
      });

- Kibana Filtering and Aggregation: Within Kibana, you can filter traces by service, endpoint, HTTP status code, duration, and more. You can also aggregate data to see average response times and error rates per service, and to identify outliers.
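To build intuition for what a sample rate means, here is a minimal sketch of a random sampling decision. It assumes a simple independent draw per transaction; `makeSampler` is a hypothetical helper written for this illustration and is not the agent's actual code, which also has to carry the decision along with the trace so that downstream services agree on it.

```javascript
// Hypothetical helper: build a function that decides whether to sample a transaction.
// `random` is injectable so the behavior can be checked deterministically.
function makeSampler(sampleRate, random = Math.random) {
  if (!(sampleRate >= 0 && sampleRate <= 1)) {
    throw new RangeError('sampleRate must be between 0 and 1');
  }
  // random() returns a value in [0, 1), so a rate of 1 always samples
  // and a rate of 0 never does.
  return () => random() < sampleRate;
}

// Simulate many transactions at a 10% sample rate.
const shouldSample = makeSampler(0.1);
const total = 100000;
let sampled = 0;
for (let i = 0; i < total; i++) {
  if (shouldSample()) sampled++;
}
console.log(`sampled ${sampled} of ${total} transactions`);
```

The sampled count should land near 10% of the total, but it varies from run to run. The trade-off is the usual one: a lower rate reduces overhead and storage, at the cost of fewer full traces to inspect when something rare goes wrong.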
One detail that is easy to miss: the "duration" shown for a span is not just the time spent in that specific piece of code. It also includes the time spent waiting for any child spans to complete. This is crucial, because a span might appear long not because it’s doing heavy work, but because it’s waiting on a slow downstream service. In the example trace, "Check inventory (Service B)" lasts 25ms, but 20ms of that is its child "Call inventory API (Service B)". This is why the hierarchical view is so powerful: it immediately reveals where the waiting is happening.
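The difference between a span's total duration and the time it spends in its own code can be worked out directly. The sketch below uses a hypothetical flat list of spans (with `id` and `parentId` fields invented for this example) and assumes that children of the same parent run one after another without overlapping; concurrent children would need interval merging rather than a plain sum.

```javascript
// Hypothetical flat span list for the example trace; durations in milliseconds.
const spans = [
  { id: 1, parentId: null, name: 'GET /products/{id}', durationMs: 50 },
  { id: 2, parentId: 1, name: 'Fetch product details', durationMs: 20 },
  { id: 3, parentId: 2, name: 'Query database', durationMs: 15 },
  { id: 4, parentId: 1, name: 'Check inventory', durationMs: 25 },
  { id: 5, parentId: 4, name: 'Call inventory API', durationMs: 20 },
  { id: 6, parentId: 1, name: 'Get pricing', durationMs: 5 },
  { id: 7, parentId: 6, name: 'Call pricing API', durationMs: 3 },
];

// Self time = own duration minus the summed duration of direct children.
// Assumes sibling spans do not overlap in time.
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const span of allSpans) {
    if (span.parentId !== null) {
      childTotal.set(span.parentId, (childTotal.get(span.parentId) || 0) + span.durationMs);
    }
  }
  return allSpans.map((span) => ({
    name: span.name,
    durationMs: span.durationMs,
    selfMs: Math.max(0, span.durationMs - (childTotal.get(span.id) || 0)),
  }));
}

console.table(selfTimes(spans));
```

With these numbers, "Check inventory" has a duration of 25ms but a self time of only 5ms, while the top-level request has a self time of 0ms: all of its 50ms is spent waiting on its three children.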
The next concept to explore is how to configure distributed tracing correlation, linking APM traces to logs and metrics.