Elastic APM traces can be correlated with infrastructure metrics by using the service.name and host.name fields to join trace data with metrics collected by Elastic Agent’s infrastructure integration.
Here’s how it works in practice.
Imagine you’re looking at an APM trace in Kibana. You see a specific transaction, say an HTTP request to /api/v1/users, that’s taking way too long. You want to know why. Was it a problem with the application code itself, or was the underlying server struggling?
To find out, you’d navigate to the "Infrastructure" app in Kibana and filter by the same host.name (e.g., webserver-01) that you saw in your APM trace, adding service.name (e.g., my-backend-service) if your metrics documents carry that field. Set the time range around the slow transaction, and you’re looking at the metrics for that specific host at the moment it occurred.
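Here’s a simplified sample trace document from the APM app (field names are shortened for readability; real APM documents use @timestamp and transaction.duration.us):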
{
  "trace.id": "a1b2c3d4e5f67890",
  "transaction.id": "f0e9d8c7b6a54321",
  "transaction.name": "GET /api/v1/users",
  "service.name": "my-backend-service",
  "host.name": "webserver-01",
  "timestamp": "2023-10-27T10:30:00Z",
  "duration_ms": 5500
}
And here’s a corresponding, condensed infrastructure metrics document for webserver-01 at roughly the same time (in practice, CPU, memory, network, and disk metrics arrive as separate documents):
{
  "metricset.name": "system",
  "host.name": "webserver-01",
  "service.name": "my-backend-service",
  "timestamp": "2023-10-27T10:30:05Z",
  "system.cpu.total.pct": 0.95,
  "system.memory.actual.used.pct": 0.88,
  "system.network.in.bytes_per_sec": 150000,
  "system.diskio.total.bytes_per_sec": 20000
}
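To make the correlation concrete, here is a minimal Python sketch that matches the sample trace to metrics documents by host.name and a time window. The helper names (parse_ts, metrics_near_trace) and the 30-second window are assumptions chosen for illustration; this is not how Kibana implements the lookup, only the same idea applied to in-memory documents.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2023-10-27T10:30:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def metrics_near_trace(trace, metrics, window_seconds=30):
    """Return metrics docs from the trace's host within +/- window_seconds."""
    trace_time = parse_ts(trace["timestamp"])
    window = timedelta(seconds=window_seconds)
    return [
        doc
        for doc in metrics
        if doc["host.name"] == trace["host.name"]
        and abs(parse_ts(doc["timestamp"]) - trace_time) <= window
    ]


# The sample trace from above, plus three hypothetical metrics documents.
trace = {
    "service.name": "my-backend-service",
    "host.name": "webserver-01",
    "timestamp": "2023-10-27T10:30:00Z",
    "duration_ms": 5500,
}
metrics = [
    {"host.name": "webserver-01", "timestamp": "2023-10-27T10:30:05Z",
     "system.cpu.total.pct": 0.95},
    {"host.name": "webserver-02", "timestamp": "2023-10-27T10:30:05Z",
     "system.cpu.total.pct": 0.20},  # different host: excluded
    {"host.name": "webserver-01", "timestamp": "2023-10-27T10:45:00Z",
     "system.cpu.total.pct": 0.30},  # same host, outside the window: excluded
]

matches = metrics_near_trace(trace, metrics)
```

Only the first metrics document survives the filter, which is the same narrowing you do by hand when you filter the Infrastructure app by host and time range.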
APM agents stamp every trace event with service.name and host.name, and Elastic Agent adds host.name to the infrastructure metrics it collects. A service.name field appears on metrics only if it is added to them, which is usually configured within the Elastic Agent policy. When you’re viewing a trace in APM, you’ll often see a link or a button that says "View in Infrastructure" or similar. Clicking it uses these common fields to filter the Infrastructure app to the metrics for the host associated with that trace.
The primary problem this solves is attribution. When an application is slow, it’s rarely just one thing. Is it the code? The database? The network? The CPU? By correlating APM traces with infrastructure metrics, you can see if a spike in CPU utilization (system.cpu.total.pct), memory pressure (system.memory.actual.used.pct), or network I/O (system.network.in.bytes_per_sec) coincides with slow transactions. This allows you to pinpoint whether the bottleneck is within the application’s code or its underlying environment.
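As a rough sketch of that attribution step, the function below checks whether a slow transaction coincides with host saturation. The function name and every threshold (1000 ms, 90% CPU, 90% memory) are illustrative assumptions, not Elastic defaults, and the code assumes utilization values are normalized to the 0..1 range as in the sample document.

```python
def likely_bottleneck(trace, host_metrics, slow_ms=1000, cpu_limit=0.90, mem_limit=0.90):
    """Give a first guess at where to look for a slow transaction."""
    if trace["duration_ms"] < slow_ms:
        return "not slow"

    saturated = []
    if host_metrics.get("system.cpu.total.pct", 0.0) >= cpu_limit:
        saturated.append("cpu")
    if host_metrics.get("system.memory.actual.used.pct", 0.0) >= mem_limit:
        saturated.append("memory")

    if saturated:
        return "host pressure: " + ", ".join(saturated)
    # The host looks healthy, so suspect the code path or something it calls.
    return "application or downstream dependency"


# Values taken from the sample trace and metrics documents.
verdict = likely_bottleneck(
    {"duration_ms": 5500},
    {"system.cpu.total.pct": 0.95, "system.memory.actual.used.pct": 0.88},
)
```

A single coincidence in time is a hint rather than proof, so treat the result as a pointer to the next thing to inspect.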
Internally, Elastic APM collects transaction and span data. Elastic Agent, when configured with the infrastructure integration, collects system-level metrics (CPU, memory, network, disk) and process-level metrics. Both data types are sent to Elasticsearch and indexed separately. Kibana’s APM and Infrastructure apps are built to query these indices and, crucially, to relate them by filtering each one on the same identifying field values; nothing is joined at index time. The service.name field is critical because it identifies the specific application being monitored by APM, and host.name (or host.id) identifies the physical or virtual machine.
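If you wanted to run the same lookup yourself, it comes down to a filtered search on the metrics data. The sketch below only builds an Elasticsearch query body as a Python dictionary; the function name and the 60-second window are assumptions, and the index pattern you would send it to (for example metrics-*) depends on your setup.

```python
from datetime import datetime, timedelta, timezone


def host_metrics_query(host_name, around, window_seconds=60):
    """Build a query body for one host's metrics around a point in time."""
    start = around - timedelta(seconds=window_seconds)
    end = around + timedelta(seconds=window_seconds)
    return {
        "query": {
            "bool": {
                # Filter context: exact matches, no relevance scoring needed.
                "filter": [
                    {"term": {"host.name": host_name}},
                    {"range": {"@timestamp": {
                        "gte": start.isoformat(),
                        "lte": end.isoformat(),
                    }}},
                ]
            }
        },
        "sort": [{"@timestamp": "asc"}],
    }


# Host and timestamp taken from the sample trace.
query = host_metrics_query(
    "webserver-01",
    datetime(2023, 10, 27, 10, 30, 0, tzinfo=timezone.utc),
)
```

The host name and the timestamp both come straight from the trace document, which is why consistent values in those fields matter so much.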
The levers you control are in your APM agent and Elastic Agent configuration. On the APM side, you ensure service.name is set consistently across all instances of your application. For the infrastructure integration, you ensure the agent is running on the relevant hosts and that the collected metrics are being sent. If you want service.name on infrastructure metrics as well, you can set it explicitly in the agent policy, using the same naming conventions as the APM service name.
A common pitfall is assuming host.name will always be identical in both data streams. Different environments or host provisioning methods can produce variations. For instance, a containerized application might report a host.name that’s a container ID, while the host metrics report the underlying VM’s name. In such cases, you might need to enrich your data with a common identifier during ingestion, perhaps by using agent environment variables or log forwarding to add a consistent kubernetes.node.name or a custom environment.host_id field to both data streams.
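The enrichment idea can be sketched in a few lines of Python. Everything here is hypothetical: the add_common_host_id helper, the container-to-node mapping, and the sample host names are made up for illustration, and in a real deployment this logic would live in your ingestion path rather than in a script.

```python
def add_common_host_id(doc, container_to_node):
    """Return a copy of doc with an environment.host_id field to correlate on."""
    enriched = dict(doc)
    host = doc.get("host.name")
    # Prefer an explicit node name if the document already carries one;
    # otherwise translate a container ID, falling back to host.name itself.
    node = doc.get("kubernetes.node.name") or container_to_node.get(host, host)
    enriched["environment.host_id"] = node
    return enriched


# Hypothetical mapping from container IDs to the nodes that run them.
container_to_node = {"3f2a9c1b7d4e": "k8s-node-07"}

trace_doc = add_common_host_id(
    {"service.name": "my-backend-service", "host.name": "3f2a9c1b7d4e"},
    container_to_node,
)
metrics_doc = add_common_host_id(
    {"host.name": "k8s-node-07", "system.cpu.total.pct": 0.95},
    container_to_node,
)
```

After enrichment, both documents share the same environment.host_id even though their host.name values differ, so you can correlate on the new field instead.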
The next logical step after correlating infrastructure metrics with APM traces is to dive into distributed tracing, examining how requests flow across multiple services and identifying inter-service communication bottlenecks.