A java.lang.OutOfMemoryError: Metaspace in Flink means the Java Virtual Machine (JVM) can no longer allocate native memory for class metadata, which it needs to load and manage classes. This typically happens when your Flink jobs load a very large number of classes, or when classes cannot be unloaded because their classloaders are still reachable, so the garbage collector cannot reclaim space in the Metaspace.

  1. Excessive Class Loading Due to Frequent Job Restarts or Dynamic Code Loading:

    • Diagnosis: Monitor Metaspace usage over time using Flink’s Web UI or JDK command-line tools such as jcmd <pid> GC.heap_info or jstat -gc <pid>. Look for a steady upward trend in Metaspace usage, especially one that steps up after each job restart and never decreases significantly.
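    • Fix: Increase the Metaspace limit. In Flink 1.10 and later, set taskmanager.memory.jvm-metaspace.size: 512m in flink-conf.yaml (and jobmanager.memory.jvm-metaspace.size: 512m for the JobManager, available since Flink 1.11); Flink derives -XX:MaxMetaspaceSize from these options. On versions without them, pass the JVM flag directly with env.java.opts: "-XX:MaxMetaspaceSize=512m" in flink-conf.yaml, or set FLINK_ENV_JAVA_OPTS="-XX:MaxMetaspaceSize=512m" in the environment of the Flink startup scripts.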
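    • Why it works: This raises the cap on Metaspace, allowing it to hold more class metadata before hitting the limit. The value 512m is a starting point; adjust based on observed usage. If usage keeps growing without bound, a larger limit only delays the error, and a classloader leak (cause 4) is the more likely culprit.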
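To make the monitoring step concrete, here is a minimal sketch that reads Metaspace usage from inside a JVM using the standard java.lang.management API. It assumes a HotSpot JVM, where the memory pool is named "Metaspace"; the class name MetaspaceProbe is made up for illustration and is not part of Flink.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink: reports Metaspace usage of the current JVM.
class MetaspaceProbe {

    /** Returns usage of the pool named "Metaspace", or null if this JVM exposes no such pool. */
    static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot names the pool exactly "Metaspace" (Java 8+).
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MemoryUsage usage = metaspaceUsage();
        if (usage == null) {
            System.out.println("No Metaspace pool found (pre-Java 8 or non-HotSpot JVM?)");
            return;
        }
        // getMax() returns -1 when no limit is defined, i.e. -XX:MaxMetaspaceSize is unset.
        String max = usage.getMax() < 0 ? "unlimited" : (usage.getMax() / 1024) + " KiB";
        System.out.printf("Metaspace used=%d KiB committed=%d KiB max=%s%n",
                usage.getUsed() / 1024, usage.getCommitted() / 1024, max);
    }
}
```

Logging this value periodically, for example from a long-lived operator, gives the trend line the diagnosis step asks for without attaching an external tool.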
  2. Large Number of User-Defined Functions (UDFs) or External Libraries:

    • Diagnosis: If your job uses many custom UDFs or includes numerous external JARs, this is a prime suspect. You can inspect your job’s dependencies or review the code for extensive use of libraries that might load many classes.
    • Fix: Review your UDFs and dependencies. Consolidate common logic into fewer classes if possible. Remove unused libraries. If unavoidable, increase MaxMetaspaceSize as described above.
    • Why it works: Reducing the sheer number of distinct classes that need to be loaded by the JVM alleviates the pressure on Metaspace.
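One rough way to check whether the class count is the problem is to read the JVM's class-loading counters. The sketch below uses the standard ClassLoadingMXBean; the class name ClassLoadStats is hypothetical. A loaded count that is very large, or an unloaded count that stays at zero across job restarts, both point at class metadata pressure.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink: summarizes class-loading counters of the current JVM.
class ClassLoadStats {

    /** Formats currently loaded, total loaded since JVM start, and unloaded class counts. */
    static String summary() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return String.format("loaded=%d totalLoaded=%d unloaded=%d",
                bean.getLoadedClassCount(),        // classes currently in the JVM
                bean.getTotalLoadedClassCount(),   // all classes loaded since startup
                bean.getUnloadedClassCount());     // classes unloaded since startup
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Comparing the summary before and after removing a dependency gives a rough measure of how many classes that dependency contributes.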
  3. PermGen Space (Older JVMs) Misconfiguration:

    • Diagnosis: If you are using a very old JVM (pre-Java 8), you will see java.lang.OutOfMemoryError: PermGen space instead of a Metaspace error. Check your JVM version (java -version). Current Flink releases require Java 8 or later, so this applies only to legacy deployments.
    • Fix: For Java 7 and earlier, increase MaxPermSize (and, optionally, the initial PermSize). In flink-conf.yaml, set env.java.opts: "-XX:MaxPermSize=256m".
    • Why it works: Similar to Metaspace, this increases the memory available for class metadata in older JVMs.
  4. Classloader Leaks in Long-Running Jobs:

    • Diagnosis: This is subtler. Metaspace usage climbs, typically in steps after each job restart or resubmission, and never returns to a baseline even after a full garbage collection. This indicates that classloaders are not being garbage collected, so the classes they loaded remain in Metaspace. A heap dump taken with jmap and opened in a heap analyzer such as Eclipse MAT can reveal what is keeping old classloaders alive, though the analysis is complex (jhat, the older option, was removed in JDK 9).
    • Fix: Identify the source of the leak. Often it is a static reference, held by a class loaded outside the job's classloader, that points at a user object and is never released; threads started by user code and never stopped have the same effect. Either one prevents the classloader from being GC’d. Refactor the code to release such references when the job shuts down. For specific frameworks or libraries used within Flink, consult their documentation for known classloader leak issues.
    • Why it works: By removing references that prevent classloaders from being collected, the JVM can reclaim the associated Metaspace.
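The static-reference pattern can be sketched in plain Java. Everything here is hypothetical: SharedRegistry stands in for any JVM-wide static cache loaded by a parent classloader, and LeakSafeFunction stands in for a user function that, in a real job, would be loaded by the job's classloader and would do this cleanup in its close() method.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical JVM-wide registry. If it is loaded by a parent classloader, every value
// it holds keeps that value's class, and therefore that class's classloader, reachable.
class SharedRegistry {
    private static final Map<String, Object> HANDLERS = new ConcurrentHashMap<>();

    static void register(String key, Object handler) { HANDLERS.put(key, handler); }
    static void unregister(String key) { HANDLERS.remove(key); }
    static int size() { return HANDLERS.size(); }
}

// Hypothetical user function that registers itself and, crucially, unregisters on close.
class LeakSafeFunction implements AutoCloseable {
    private final String key;

    LeakSafeFunction(String key) {
        this.key = key;
        SharedRegistry.register(key, this); // static reference to a user object: the leak risk
    }

    @Override
    public void close() {
        SharedRegistry.unregister(key);     // releasing it lets the classloader be collected
    }
}

class ClassLoaderLeakDemo {
    public static void main(String[] args) {
        try (LeakSafeFunction fn = new LeakSafeFunction("demo-job")) {
            System.out.println("entries while running: " + SharedRegistry.size());
        }
        System.out.println("entries after close: " + SharedRegistry.size());
    }
}
```

Without the unregister call, each resubmitted job would leave one more entry in the map, and one more classloader's worth of class metadata in Metaspace.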
  5. Flink Internal Class Loading Issues (Less Common):

    • Diagnosis: If you’ve ruled out user code and external libraries, it’s possible Flink itself or its internal components are contributing to high class loading. This is rare in stable Flink versions but could occur with custom Flink builds or specific plugin interactions.
    • Fix: Ensure you are running a stable, recommended Flink version. If using plugins or custom extensions, test with them disabled to isolate the problem. If a specific Flink version is implicated, consider upgrading or downgrading.
    • Why it works: Using a well-tested Flink version minimizes the risk of bugs in Flink’s own class loading mechanisms.
  6. Insufficient MaxMetaspaceSize for Normal Operation:

    • Diagnosis: Even with well-behaved code, complex Flink jobs with many operators, state backends, and connectors can legitimately require a significant amount of Metaspace. Monitor Metaspace usage under normal load. If it consistently hovers near the configured limit (256 MB by default in recent Flink versions; a bare JVM sets no Metaspace cap by default) and then fails with an OOM, the limit is too low.
    • Fix: Increase the limit to a more generous value, such as 512m or 1024m, based on your monitoring. For example, taskmanager.memory.jvm-metaspace.size: 1024m in flink-conf.yaml, or env.java.opts: "-XX:MaxMetaspaceSize=1024m" on Flink versions that predate that option.
    • Why it works: Provides ample headroom for Flink’s operational needs without forcing unnecessary class reloading or garbage collection cycles.

After resolving the Metaspace OOM, you might encounter java.lang.OutOfMemoryError: Java heap space if your job’s data processing or state management is consuming too much heap memory.
