Elastic APM Server is choking on memory and dropping requests under load.
The APM Server is failing to keep up with the volume of trace data being sent by your applications, leading to excessive memory consumption and eventual request rejections. This typically happens when the server’s internal queues fill up because the processing rate can’t match the ingestion rate.
Here are the common culprits and how to fix them:
1. Insufficient Heap Size: The Java Virtual Machine (JVM) heap is where APM Server stores its active data and caches. If it’s too small, the garbage collector will run constantly, or the server will OOM (Out Of Memory).
- Diagnosis: Check the APM Server logs for `OutOfMemoryError` or excessive garbage collection pauses. You can also check the JVM heap usage via `http://localhost:8200/metrics` (look for `jvm.mem.heap.used` vs. `jvm.mem.heap.max`).
- Fix: Increase the `ES_JAVA_OPTS` environment variable. For example, to set the heap to 4GB:

      export ES_JAVA_OPTS="-Xms4g -Xmx4g"

  Restart the APM Server for this to take effect. This gives the JVM more contiguous memory to work with, reducing GC pressure and allowing more data to be held in memory before being processed.
- Why it works: A larger heap allows APM Server to buffer more incoming data and perform its internal processing tasks without immediately needing to reclaim memory.
2. Overloaded Ingestion Pipelines: APM Server uses ingest pipelines to process and enrich incoming data before sending it to Elasticsearch. If these pipelines are too complex or inefficient, they become a bottleneck.
- Diagnosis: Examine your APM Server’s ingest pipelines in Kibana under `Stack Management -> Ingest Pipelines`. Look for pipelines with many processors, especially expensive ones like `geoip` or `user_agent` on high-volume data. Check APM Server logs for slow processing times or warnings about pipeline execution.
- Fix: Simplify or disable unnecessary processors in your APM pipelines. For instance, if you don’t need geographic information for every single transaction, remove the `geoip` processor.

      // Example of a simplified pipeline (remove geoip if not needed)
      {
        "processors": [
          { "set": { "field": "agent.name", "value": "{{agent.name}}" } }
          // ... other essential processors
        ]
      }

  Apply the modified pipeline to your APM data streams. This reduces the CPU and memory overhead per document, allowing APM Server to process more documents.
- Why it works: Each processor in a pipeline adds overhead. By removing or optimizing slow processors, you decrease the work APM Server needs to do for each incoming event.
3. High `max_concurrent_outbound_connections`: APM Server sends processed data to Elasticsearch. If this value is set too high, it can overwhelm Elasticsearch or APM Server’s own network buffers.
- Diagnosis: Check your `apm-server.yml` configuration file for `output.elasticsearch.max_concurrent_outgoings_connections`. If it’s not set, it defaults to a reasonable value, but if it’s been manually increased, that could be the issue.
- Fix: Reduce `output.elasticsearch.max_concurrent_outgoings_connections` in `apm-server.yml`. Start by lowering it to `4` or `8` and observing performance.

      output.elasticsearch:
        hosts: ["http://localhost:9200"]
        max_concurrent_outgoings_connections: 8

  Restart APM Server. This limits the number of simultaneous requests APM Server makes to Elasticsearch, preventing it from overwhelming the Elasticsearch cluster or its own connection pool.
- Why it works: This parameter directly controls the concurrency of APM Server’s requests to Elasticsearch. Lowering it reduces the load APM Server places on Elasticsearch and its own network stack.
4. Inadequate `queue_size` for Inbound Requests: APM Server uses an in-memory queue to buffer incoming requests before they are processed. If this queue is too small, it will fill up quickly under high load.
- Diagnosis: Monitor the `apm.server.request.queue.size` metric in APM Server’s metrics endpoint (`http://localhost:8200/metrics`). If this metric is consistently near its maximum capacity, the queue is too small.
- Fix: Increase the `queue_size` in `apm-server.yml`. A common starting point for high-volume systems is `4096`.

      queue_size: 4096

  Restart APM Server. This allows APM Server to buffer more incoming requests before it starts rejecting them, giving the processing threads more time to catch up.
- Why it works: A larger queue provides a buffer, smoothing out bursts of incoming traffic and preventing APM Server from immediately dropping requests when the processing rate momentarily lags behind the ingestion rate.
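
To make the buffering argument concrete, here is a toy simulation of a bounded queue absorbing a burst. It is only a sketch of the general idea, not a model of APM Server's internals: the tick-based loop, the `simulate` function, and the burst and processing numbers are all illustrative assumptions.

```python
def simulate(arrivals, queue_size, process_rate):
    """Return how many requests a bounded queue drops.

    arrivals: number of requests arriving in each tick
    queue_size: maximum number of requests the queue can hold
    process_rate: requests drained from the queue per tick
    """
    queued = 0
    dropped = 0
    for incoming in arrivals:
        # Accept as many requests as still fit; the rest are rejected.
        accepted = min(incoming, queue_size - queued)
        dropped += incoming - accepted
        queued += accepted
        # Workers drain up to process_rate requests during this tick.
        queued -= min(queued, process_rate)
    return dropped


# Hypothetical load: one burst of 3000 requests, then quiet ticks.
burst = [3000, 0, 0, 0, 0, 0]
small = simulate(burst, queue_size=1024, process_rate=500)
large = simulate(burst, queue_size=4096, process_rate=500)
print(small, large)  # 1976 0
```

With the smaller queue most of the burst is rejected immediately, while the larger queue holds all of it until the workers catch up. A larger queue only helps with bursts: if arrivals stay above the processing rate, any queue eventually fills.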
5. Network Latency or Bandwidth Issues: High latency or insufficient bandwidth between APM Server and your agents/clients, or between APM Server and Elasticsearch, can cause requests to pile up.
- Diagnosis: Use `ping` and `traceroute` from the APM Server to your client IPs and Elasticsearch IPs. Check network interface utilization (`ifconfig` or `ip a`) on the APM Server.
- Fix: Address underlying network infrastructure problems. This might involve optimizing routing, increasing bandwidth, or moving APM Server and Elasticsearch closer in the network topology. Ensure there are no firewalls or network devices introducing excessive latency or packet loss.
- Why it works: Reliable and fast network communication is crucial. Slowdowns here directly translate to longer processing times for requests and responses, leading to backlogs.
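
Where ICMP is blocked and `ping` is not an option, timing a plain TCP connection gives a rough view of the same latency. The sketch below uses only Python's standard library; the helper names `tcp_connect_ms` and `median_connect_ms` are made up for this example, and it measures connection setup time only, not Elasticsearch request latency.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=2.0):
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # Connection established; close it right away.
    return (time.perf_counter() - start) * 1000.0


def median_connect_ms(host, port, samples=5):
    """Return the median connect time over several attempts."""
    times = sorted(tcp_connect_ms(host, port) for _ in range(samples))
    return times[len(times) // 2]
```

Run from the APM Server host, a call such as `median_connect_ms("localhost", 9200)` (the Elasticsearch address from the example configuration above) should return a small, stable value on a healthy network. Values that are large or vary widely between runs point toward the network rather than APM Server itself.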
6. Too Many Agents Sending Data: While not a configuration issue, if you have an unexpectedly high number of agents (e.g., due to a misconfiguration causing agents to restart rapidly) sending data, the aggregate load might exceed the APM Server’s capacity.
- Diagnosis: Check the number of active agents reporting to APM Server. You can often see this in Kibana’s APM UI or by querying APM Server metrics for agent counts.
- Fix: Investigate why so many agents are active. If it’s a misconfiguration, correct it on the client side. If it’s legitimate, you may need to scale out your APM Server horizontally by running multiple instances behind a load balancer.
- Why it works: Horizontal scaling distributes the load across multiple APM Server instances, each handling a subset of the incoming traffic.
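
As a rough sizing aid for the scale-out case, the instance count can be estimated from the aggregate event rate and the throughput one instance sustains. This is back-of-the-envelope arithmetic under assumptions: `instances_needed`, the 25% headroom default, and the example figures are illustrative, and per-instance capacity has to come from your own load tests.

```python
import math


def instances_needed(events_per_second, per_instance_capacity, headroom=0.25):
    """Estimate how many instances are needed for a given event rate.

    headroom is the fraction of each instance's capacity kept in reserve
    for bursts.
    """
    if per_instance_capacity <= 0 or not 0 <= headroom < 1:
        raise ValueError("capacity must be positive and headroom in [0, 1)")
    usable = per_instance_capacity * (1 - headroom)
    return max(1, math.ceil(events_per_second / usable))


# Hypothetical figures: 12,000 events/s in total, 5,000 events/s per instance.
print(instances_needed(12000, 5000))  # 4
```

Keeping some headroom means the loss of one instance or a short burst does not immediately push the remaining instances back into rejecting requests.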
After applying these fixes, you might encounter the next common issue: `Request timeout: context deadline exceeded`.