Long-running distributed services don’t just "use memory"; they manage it, and when that management fails, it’s often a slow, insidious creep, not a sudden crash.

Let’s say you have a service that’s supposed to process messages from a queue and store them in a database. After weeks of uptime, you notice it’s consuming gigabytes of RAM it shouldn’t be. This isn’t a bug in the message processing itself, but in how the service remembers things it no longer needs.

Here’s a common scenario: a cache that’s supposed to evict old entries but never does.

{
  "serviceName": "message-processor",
  "version": "1.2.3",
  "instanceId": "msg-proc-abc123",
  "metrics": {
    "memory": {
      "heapUsageBytes": 5368709120,
      "maxHeapSize": 8589934592,
      "nonHeapUsageBytes": 1073741824
    },
    "cpu": {
      "usagePercent": 75.2
    },
    "queueDepth": 15000
  },
  "logLevel": "INFO"
}

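This metrics object is what you’re looking at. The heapUsageBytes value is the key: here it is 5 GiB out of an 8 GiB maxHeapSize. A single snapshot proves nothing, though. What matters is the trend. If this number steadily climbs over hours or days (ideally measured right after a garbage collection, so normal allocation sawtooth doesn’t mislead you), and your service isn’t expected to grow its working set indefinitely, you’ve got a leak.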
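One way to turn "steadily climbs" into a number is to fit a line through periodic heapUsageBytes samples and look at the slope. The sketch below is only an illustration: the HeapTrend class and its method name are made up for this example, and it assumes the samples are evenly spaced in time.

```java
import java.util.List;

// Hypothetical helper (not part of any library): estimates how fast heap
// usage grows across evenly spaced samples of heapUsageBytes.
final class HeapTrend {
    private HeapTrend() {}

    /** Least-squares slope, in bytes per sample interval. */
    static double slopeBytesPerSample(List<Long> heapUsageBytes) {
        int n = heapUsageBytes.size();
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0; // mean of the indices 0..n-1
        double meanY = 0;
        for (long y : heapUsageBytes) {
            meanY += y;
        }
        meanY /= n;

        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            double dx = i - meanX;
            num += dx * (heapUsageBytes.get(i) - meanY);
            den += dx * dx;
        }
        return num / den;
    }
}
```

A slope that stays clearly positive across days of samples is a reason to take a heap dump; a slope near zero suggests the working set is simply large.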

The Usual Suspects (and How to Pin Them Down)

1. Unbounded Caches: This is the classic. You’re caching results, but the eviction policy is either missing or broken.

  • Diagnosis: If your language has a way to inspect the heap, look for a large number of objects of a specific type (e.g., HashMap$Node in Java, or dict entries in Python) that shouldn’t be there. Tools like jmap (Java), memory_profiler (Python), or pprof (Go) can dump heap information.
  • Fix: Implement or fix your cache’s eviction strategy. For example, in Java with Guava, use CacheBuilder.newBuilder().maximumSize(1000).build(). This caps the cache at roughly 1000 entries.
  • Why it works: The cache now has a size bound. As it approaches the limit, Guava evicts entries that haven’t been used recently or very often (not strictly the oldest one), preventing indefinite growth.
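If you can’t pull in a caching library, the same idea can be sketched with the JDK alone. The example below is not Guava; it swaps in a LinkedHashMap in access order with removeEldestEntry overridden, which gives a strict least-recently-used policy. The class name BoundedLruCache is hypothetical, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for a library cache: a size-bounded LRU map.
// Not thread-safe; guard it externally if several threads share it.
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: iteration order is LRU first
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true drops
    // the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a limit of 2, inserting "a" and "b", reading "a", and then inserting "c" evicts "b", because "a" was touched more recently.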

2. Lingering References (Garbage Collection Hoarding): Even if you’re not using an explicit cache, you might be holding onto objects that are no longer needed, preventing the garbage collector (GC) from reclaiming their memory. This often happens with event listeners, callbacks, or static collections.

  • Diagnosis: Use a heap dump analyzer. Look for objects that are unexpectedly reachable. For instance, you might see many ThreadLocal objects that are no longer being cleared, or event listeners that are still registered to an object that should have been de-referenced.
  • Fix: In your code, ensure you explicitly remove listeners when they are no longer needed. For ThreadLocal variables, call threadLocalVariable.remove() in a finally block or when the thread’s work is complete.
  • Why it works: By removing the references, you make the objects eligible for garbage collection, allowing the GC to free up the associated memory.
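The ThreadLocal case can be sketched as follows. RequestContext and handle are hypothetical names for this illustration; the point is only the remove() call in the finally block, which matters most on pooled worker threads that outlive any single request.

```java
// Hypothetical request handler showing ThreadLocal cleanup.
final class RequestContext {
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(StringBuilder::new);

    private RequestContext() {}

    static String handle(String message) {
        try {
            StringBuilder sb = BUFFER.get();
            sb.append("processed:").append(message);
            return sb.toString();
        } finally {
            // Without this, a pooled thread keeps the buffer and everything
            // appended to it alive for as long as the thread itself lives.
            BUFFER.remove();
        }
    }
}
```

Because the value is removed after every call, a second call on the same thread starts from a fresh StringBuilder rather than appending to the previous request’s data.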

3. Resource Handles Not Being Closed: Things like database connections, file handles, or network sockets, if not properly closed, can consume memory and other system resources. While often categorized as resource leaks, they can manifest as memory growth if the underlying system or libraries hold onto associated buffers.

  • Diagnosis: Monitor the number of open file descriptors (lsof -p <pid> | wc -l) or database connections. If these numbers grow unboundedly, you have a leak.
  • Fix: Acquire every resource inside a try-with-resources block (Java), a with statement (Python), or with a defer (Go) so that it is closed on every code path, including when an exception is thrown. For connections, use connection pooling and ensure connections are returned to the pool when done.
  • Why it works: Explicitly closing resources releases the underlying operating system handles and associated memory buffers, preventing accumulation.
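A minimal Java sketch of the pattern, using a made-up TrackedHandle class in place of a real connection or file so the open count can be observed directly:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical resource that counts how many handles are currently open.
final class TrackedHandle implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();

    TrackedHandle() {
        OPEN.incrementAndGet();
    }

    void write(String payload) {
        if (payload == null) {
            throw new IllegalArgumentException("null payload");
        }
    }

    @Override
    public void close() {
        OPEN.decrementAndGet();
    }
}

final class HandleDemo {
    private HandleDemo() {}

    // Returns true on success. The handle is closed on both the normal
    // and the exceptional path, before the catch block runs.
    static boolean store(String payload) {
        try (TrackedHandle h = new TrackedHandle()) {
            h.write(payload);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

After any mix of successful and failing store calls, TrackedHandle.OPEN is back at zero, which is the property you want your file descriptor and connection counts to have.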

4. Serialization/Deserialization Buffers: If your service serializes and deserializes large amounts of data (e.g., JSON, Protobuf) and doesn’t properly clear or reuse the buffers, these can accumulate.

  • Diagnosis: Heap dumps might show large collections of byte arrays or string builders that are no longer referenced by your active application logic but are still held in memory.
  • Fix: Review your serialization/deserialization code. Ensure that temporary buffers are released immediately after use. Some libraries offer ways to manage buffer pools for reuse.
  • Why it works: Releasing or reusing buffers prevents them from staying in memory longer than necessary, especially if they are large.
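As a rough sketch of buffer reuse, the hypothetical encoder below builds length-prefixed frames in a single ByteArrayOutputStream that it resets between messages. The frame layout and the 64 KiB retention cap are assumptions made for this example; the cap exists because reset() keeps the backing array, so one oversized message would otherwise pin a large buffer for good.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical frame encoder that reuses one scratch buffer per instance.
// Not thread-safe: use one encoder per thread or per connection.
final class ReusableEncoder {
    private static final int MAX_RETAINED_BYTES = 64 * 1024; // assumed cap
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024);

    /** Returns a frame: 4-byte big-endian length followed by the payload. */
    byte[] encodeFrame(byte[] payload) {
        buffer.reset(); // drop previous contents but keep the backing array
        int n = payload.length;
        buffer.write((n >>> 24) & 0xFF);
        buffer.write((n >>> 16) & 0xFF);
        buffer.write((n >>> 8) & 0xFF);
        buffer.write(n & 0xFF);
        buffer.write(payload, 0, n);
        byte[] frame = buffer.toByteArray();
        if (buffer.size() > MAX_RETAINED_BYTES) {
            // A rare oversized message should not pin a large array forever.
            buffer = new ByteArrayOutputStream(1024);
        }
        return frame;
    }
}
```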

5. Weakly-Referenced Data Structures: Sometimes, developers use WeakHashMap or similar structures expecting them to automatically clean up. However, a WeakHashMap holds only its keys weakly; the values are held strongly. If a key is still strongly referenced elsewhere, or if the entry’s own value refers back to the key, the entry won’t be collected.

  • Diagnosis: Examine your heap dump for WeakHashMap instances and check the references to their keys. If the keys are still held strongly by other parts of your application, the WeakHashMap entry will persist.
  • Fix: Identify the strong reference that’s keeping the key alive and remove it if it’s no longer needed. If the entries need to expire regardless of who else references the key, a WeakHashMap is the wrong tool; use a cache with size- or time-based eviction instead.
  • Why it works: By breaking the unintended strong reference, you allow the GC to collect the key and, consequently, the entry in the WeakHashMap.
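The value-refers-to-key trap is easy to write by accident. The sketch below contrasts the two shapes; WeakMapDemo and Session are hypothetical names, and the code deliberately does not try to force a collection, since GC timing is not deterministic.

```java
import java.util.Map;
import java.util.WeakHashMap;

final class WeakMapDemo {
    private WeakMapDemo() {}

    // Hypothetical key type; it relies on identity-based equals/hashCode.
    static final class Session {
        final String id;

        Session(String id) {
            this.id = id;
        }
    }

    // Leaky shape: the value holds a strong reference back to its own key,
    // and the map holds the value strongly, so the key can never become
    // weakly reachable while the map is alive.
    static Map<Session, Object[]> leaky(Session s) {
        Map<Session, Object[]> m = new WeakHashMap<>();
        m.put(s, new Object[] { s, "metadata" });
        return m;
    }

    // Safer shape: the value carries only data derived from the key.
    // Once callers drop the Session, the entry becomes collectable.
    static Map<Session, String> safer(Session s) {
        Map<Session, String> m = new WeakHashMap<>();
        m.put(s, "metadata for " + s.id);
        return m;
    }
}
```

In a heap dump, the leaky shape shows up as a path from the WeakHashMap entry’s value to the key, which is exactly the unintended strong reference described above.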

6. Incorrect JVM/Runtime Configuration: Especially in Java, incorrect GC tuning or heap sizing can look like a memory leak. If the heap is too small for the real working set, or the collector can’t keep up with the allocation rate, memory usage will climb until the JVM throws OutOfMemoryError or the process is killed by the OS or container runtime.

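  • Diagnosis: Monitor GC activity. High GC pause times and very frequent GCs can indicate a problem. To tell a leak from undersizing, watch heap occupancy right after full collections: if that floor keeps rising, it’s a leak; if it stays flat but close to the maximum, the heap is too small. To log GC events, use -XX:+PrintGCDetails -XX:+PrintGCTimeStamps on JDK 8; on JDK 9 and later these were replaced by unified logging, e.g. -Xlog:gc*.
  • Fix: Adjust JVM heap size (-Xmx, -Xms) and choose an appropriate collector (e.g., G1 via -XX:+UseG1GC, or Shenandoah via -XX:+UseShenandoahGC). For example, increasing -Xmx to 12g might be necessary if the working set genuinely requires it. If it’s a leak, a bigger heap or different GC settings only delay the inevitable while you fix the root cause.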
  • Why it works: Proper GC configuration ensures that memory is reclaimed efficiently. If the memory is being leaked, better GC might just buy you time. If it’s just high usage, adequate sizing and efficient collection are key.
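If you’d rather export GC numbers as metrics than parse logs, the JDK’s management beans expose them in-process. The sketch below is one possible shape; the GcSnapshot class and its method names are made up for this example, while the java.lang.management calls are standard.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper that reads GC and heap figures from the running JVM.
final class GcSnapshot {
    private GcSnapshot() {}

    // Sum of collection counts across all collectors. A collector reports
    // -1 when the count is undefined; those are skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds.
    static long totalCollectionTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Current heap usage in bytes, comparable to heapUsageBytes above.
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }
}
```

Sampling these on a timer and charting them next to request latency makes it easier to see whether pauses, collection frequency, or the post-GC floor is the thing that is moving.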

After fixing these, the next thing you may encounter is increased latency: a bounded cache misses more often, and objects that used to linger are now short-lived, so the GC has to keep up with the (now correctly managed) memory churn.
