Flink fails with a Metaspace out-of-memory error when the Java Virtual Machine (JVM) can no longer allocate native memory for class metadata, which it needs to load and manage classes. This typically happens once Metaspace reaches its configured limit (`-XX:MaxMetaspaceSize`), either because your Flink jobs load an excessive number of classes or because the garbage collector can’t unload classes and reclaim their Metaspace.
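To make "Metaspace usage" concrete, here is a minimal sketch that reads it from inside a JVM through the standard `java.lang.management` API. It assumes a HotSpot JVM on Java 8 or later, where the memory pool is named `Metaspace`. The class name `MetaspaceProbe` and its helper methods are hypothetical and not part of Flink. Run as-is, it reports on its own JVM, so to observe a TaskManager you would need to call similar code from within your job or read the same beans over JMX.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

public class MetaspaceProbe {

    /** Finds the memory pool named "Metaspace" (assumption: HotSpot, Java 8+). */
    public static Optional<MemoryPoolMXBean> metaspacePool() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(pool -> "Metaspace".equals(pool.getName()))
                .findFirst();
    }

    /** Returns used/max, or NaN when no limit is set (MemoryUsage reports max as -1). */
    public static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return Double.NaN;
        }
        return (double) usedBytes / maxBytes;
    }

    public static void main(String[] args) throws InterruptedException {
        Optional<MemoryPoolMXBean> pool = metaspacePool();
        if (!pool.isPresent()) {
            System.out.println("No memory pool named Metaspace on this JVM");
            return;
        }
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();

        // Sample a few times; a real check would run for the lifetime of the job.
        for (int i = 0; i < 5; i++) {
            MemoryUsage usage = pool.get().getUsage();
            System.out.printf(
                    "metaspace used=%d KB, max=%d, fraction=%.2f, loaded=%d, unloaded=%d%n",
                    usage.getUsed() / 1024,
                    usage.getMax(),
                    usedFraction(usage.getUsed(), usage.getMax()),
                    classes.getLoadedClassCount(),
                    classes.getUnloadedClassCount());
            Thread.sleep(1000);
        }
    }
}
```

A used fraction that keeps rising while the unloaded-class count stays flat is the pattern to look for. A fraction of `NaN` means the JVM was started without a Metaspace limit.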
Common Causes and Fixes for Flink JVM Metaspace OOM Errors
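- Excessive Class Loading Due to Frequent Job Restarts or Dynamic Code Loading:
  - Diagnosis: Monitor Metaspace usage over time using Flink’s Web UI or JDK tools (for example `jcmd <pid> GC.heap_info` or `jstat -gc <pid>`). Look for a steady upward trend in Metaspace usage that never decreases significantly.
  - Fix: Increase the Metaspace size. In your Flink job submission or `flink-conf.yaml`, set `env.java.opts: "-XX:MaxMetaspaceSize=512m"`. For a cluster, edit `conf/flink-env.sh` and add `FLINK_ENV_JAVA_OPTS="-XX:MaxMetaspaceSize=512m"`. Flink 1.10 and later derive the TaskManager’s `-XX:MaxMetaspaceSize` from `taskmanager.memory.jvm-metaspace.size`, so on those versions set that option instead (for example `taskmanager.memory.jvm-metaspace.size: 512m`).
  - Why it works: This raises the ceiling on Metaspace, allowing it to hold more class metadata before hitting the limit. The value `512m` is a starting point; adjust based on observed usage. If usage keeps climbing after the increase, a larger limit only delays the error, and a classloader leak (covered below) is the more likely cause.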
- Large Number of User-Defined Functions (UDFs) or External Libraries:
  - Diagnosis: If your job uses many custom UDFs or includes numerous external JARs, this is a prime suspect. You can inspect your job’s dependencies or review the code for extensive use of libraries that might load many classes.
  - Fix: Review your UDFs and dependencies. Consolidate common logic into fewer classes if possible. Remove unused libraries. If unavoidable, increase `MaxMetaspaceSize` as described above.
  - Why it works: Reducing the sheer number of distinct classes that the JVM needs to load alleviates the pressure on Metaspace.
- PermGen Space (Older JVMs) Misconfiguration:
  - Diagnosis: If you are using a very old JVM (pre-Java 8), you might be hitting `java.lang.OutOfMemoryError: PermGen space` instead of a Metaspace error. Check your JVM version (`java -version`).
  - Fix: For Java 7 and earlier, you need to increase `PermSize` and `MaxPermSize`. In `flink-conf.yaml`, set `env.java.opts: "-XX:MaxPermSize=256m"`, or set the equivalent in `flink-env.sh`. Java 8 and later ignore these flags, so they have no effect on Metaspace.
  - Why it works: Similar to Metaspace, this increases the memory available for class metadata in older JVMs.
- Classloader Leaks in Long-Running Jobs:
  - Diagnosis: This is subtler. Metaspace usage might climb in steps (for example after each job restart) and plateau in between, but never return to a baseline even after garbage collection. This can indicate that classloaders are not being garbage collected, so the classes they loaded remain in Metaspace. Tools like `jmap` and `jhat` can help analyze heap dumps for classloader leaks, though the analysis is complex (note that `jhat` was removed in JDK 9).
  - Fix: Identify the source of the leak. Often, a static field or a long-lived thread still references an object whose class was loaded by the job’s classloader, which prevents that classloader from being GC’d. Refactor code to release such references when the job stops. For specific frameworks or libraries used within Flink, consult their documentation for known classloader leak issues.
  - Why it works: By removing references that prevent classloaders from being collected, the JVM can unload their classes and reclaim the associated Metaspace.
- Flink Internal Class Loading Issues (Less Common):
  - Diagnosis: If you’ve ruled out user code and external libraries, it’s possible Flink itself or its internal components are contributing to high class loading. This is rare in stable Flink versions but could occur with custom Flink builds or specific plugin interactions.
  - Fix: Ensure you are running a stable, recommended Flink version. If using plugins or custom extensions, test with them disabled to isolate the problem. If a specific Flink version is implicated, consider upgrading or downgrading.
  - Why it works: Using a well-tested Flink version minimizes the risk of bugs in Flink’s own class loading mechanisms.
- Insufficient `MaxMetaspaceSize` for Normal Operation:
  - Diagnosis: Even with well-behaved code, complex Flink jobs with many operators, state backends, and connectors can legitimately require a significant amount of Metaspace. Monitor Metaspace usage under normal load. If it consistently hovers near the configured limit and then spikes to OOM, the limit is too low. Note that the JVM itself sets no default cap on Metaspace; the limit comes from your configuration (recent Flink versions set 256 MB by default).
  - Fix: Increase `MaxMetaspaceSize` to a more generous value, such as `512m` or `1024m`, based on your monitoring. For example, `env.java.opts: "-XX:MaxMetaspaceSize=1024m"`.
  - Why it works: This provides ample headroom for Flink’s operational needs, so normal class loading no longer hits the limit or triggers repeated garbage collection cycles near it.
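The classloader leak described in the list above can be reproduced in a few lines of plain Java. This is a sketch under stated assumptions, not Flink code: `ClassLoaderLeakSketch`, `STATIC_CACHE`, and `loadAndCache` are hypothetical names, a `URLClassLoader` stands in for a job’s user-code classloader, and a dynamic proxy stands in for a UDF instance whose class was defined by that loader.

```java
import java.lang.ref.WeakReference;
import java.lang.reflect.Proxy;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class ClassLoaderLeakSketch {

    // Hypothetical static cache owned by a class from the parent classloader.
    // Anything stored here outlives the "job" that created it.
    static final List<Object> STATIC_CACHE = new ArrayList<>();

    /** Creates a throwaway loader, defines a proxy class in it, and caches an instance. */
    public static WeakReference<ClassLoader> loadAndCache() {
        ClassLoader userLoader =
                new URLClassLoader(new URL[0], ClassLoaderLeakSketch.class.getClassLoader());
        // The proxy class is defined by userLoader, so instance -> class -> loader.
        Object udf = Proxy.newProxyInstance(
                userLoader,
                new Class<?>[] {Runnable.class},
                (proxy, method, methodArgs) -> null);
        STATIC_CACHE.add(udf);
        // Only a weak reference escapes, so the cache is the sole strong path.
        return new WeakReference<>(userLoader);
    }

    /** While the cached instance is reachable, the loader cannot be collected. */
    public static boolean pinnedLoaderSurvivesGc() {
        WeakReference<ClassLoader> ref = loadAndCache();
        System.gc();
        return ref.get() != null;
    }

    public static void main(String[] args) {
        WeakReference<ClassLoader> ref = loadAndCache();
        System.gc();
        System.out.println("while cached, loader alive: " + (ref.get() != null));

        // Dropping the static reference makes the loader eligible for collection.
        // Whether it is actually collected by this call depends on the JVM.
        STATIC_CACHE.clear();
        System.gc();
        System.out.println("after clearing, loader alive: " + (ref.get() != null));
    }
}
```

The first check always reports the loader as alive, because the static cache keeps it strongly reachable. The second check usually reports it as collected on HotSpot, but `System.gc()` is only a hint, so treat that result as indicative rather than guaranteed.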
After resolving the Metaspace OOM, you might encounter `java.lang.OutOfMemoryError: Java heap space` if your job’s data processing or state management is consuming too much heap memory.