The Flink JobManager is failing to load user-defined code from submitted JARs, preventing jobs from starting.

Here are the common reasons this happens and how to fix them:

1. JARs Not Included in the lib Directory

Diagnosis: Check the Flink cluster’s lib directory on the JobManager and TaskManagers. If your user code JAR is not present, this is likely the issue.

Fix: Copy your user code JAR into the lib directory of your Flink installation on every node (JobManager and TaskManagers), then restart the Flink processes, because lib is only read when they start. For example, if your Flink installation is at /opt/flink, you’d do:

cp /path/to/your-user-code.jar /opt/flink/lib/

Why it works: Flink adds every JAR in the lib directory to the classpath of the JobManager and TaskManager JVMs at startup. Classes placed there are loaded by the JVM’s own classloader and are visible to every job on the cluster.
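
To confirm the diagnosis on a given node, you can check which JARs in lib actually contain the class named in the error. The following is a minimal sketch, not part of Flink: jars_containing is a hypothetical helper, and it relies only on the fact that a JAR is a ZIP archive in which com.example.Foo is stored as com/example/Foo.class.

```python
import zipfile
from pathlib import Path
from typing import List


def jars_containing(class_name: str, lib_dir: str) -> List[str]:
    """Return the names of the JARs in lib_dir that contain class_name."""
    # A class com.example.Foo is stored in the archive as com/example/Foo.class
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    matches.append(jar.name)
        except zipfile.BadZipFile:
            # Unreadable archives are skipped here; they are a separate problem
            continue
    return matches
```

Calling jars_containing("com.yourcompany.YourJob", "/opt/flink/lib") on each node returns an empty list when the class is missing there, and more than one name when several JARs ship the same class.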

2. Classpath Conflicts within the User Code JAR

Diagnosis: If your user code JAR bundles libraries that are already present in Flink’s own dependencies (e.g., Guava, SLF4j), two versions of the same class can end up on the classpath. This typically surfaces as NoSuchMethodError, ClassCastException, NoClassDefFoundError, or ClassNotFoundException, even if the JAR is in the lib directory. The JobManager and TaskManager logs in the Flink UI will show these errors.

Fix: Shade your user code JAR to exclude conflicting dependencies. Use the maven-shade-plugin in your pom.xml:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <manifestEntries>
                                    <!-- Set this to your job's entry-point class (placeholder shown), not a Flink class -->
                                    <Main-Class>com.yourcompany.YourJobMainClass</Main-Class>
                                </manifestEntries>
                            </transformer>
                        </transformers>
                        <artifactSet>
                            <includes>
                                <!-- Include your project's artifact, plus each dependency that Flink does not provide -->
                                <include>com.yourcompany:your-artifact-id</include>
                            </includes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- Exclude SLF4j to avoid conflicts -->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>org/slf4j/**</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

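Replace com.yourcompany:your-artifact-id with your actual project coordinates. An includes list bundles only the artifacts it names, so add an include for each dependency that Flink does not already provide. The exclusion of org/slf4j/** is a common example; check your dependency tree for other libraries that overlap with Flink’s.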

Why it works: Shading merges your code and its dependencies into a single JAR, while specifically removing libraries that are already provided by Flink. This ensures Flink’s versions are used, preventing conflicts.
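
To find out which libraries to exclude, you can compare the classes in your JAR with the classes in Flink’s lib directory. The sketch below assumes that an overlap in class entries is a good hint of a conflict; class_entries and overlapping_classes are hypothetical helper names, not Flink or Maven tooling.

```python
import zipfile
from pathlib import Path
from typing import Dict, List, Set


def class_entries(jar_path) -> Set[str]:
    """Return the .class entries of a JAR, ignoring module descriptors."""
    with zipfile.ZipFile(jar_path) as zf:
        return {
            name for name in zf.namelist()
            if name.endswith(".class") and not name.endswith("module-info.class")
        }


def overlapping_classes(user_jar: str, lib_dir: str) -> Dict[str, List[str]]:
    """Map each JAR in lib_dir to the class entries it shares with user_jar."""
    user_classes = class_entries(user_jar)
    overlaps = {}
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        shared = user_classes & class_entries(jar)
        if shared:
            overlaps[jar.name] = sorted(shared)
    return overlaps
```

Any package that shows up in the result is a candidate for an exclude entry. Run it against a copy of your JAR that lives outside lib, otherwise the JAR will report a full overlap with itself.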

3. Incorrect JAR Submission Method

Diagnosis: You submit the job with flink run or the Flink REST API, but the job JAR or its dependency JARs are not specified correctly, so the classes never reach the cluster.

Fix: Pass the job JAR as the positional argument and add any other required dependency JARs with the -C or --classpath option. Each -C value is a URL that must be reachable from every node in the cluster, not only from the client.

./bin/flink run \
  -C file:///path/to/first-dependency.jar \
  -C file:///path/to/another-dependency.jar \
  /path/to/your-job-jar.jar \
  --job-argument1 value1

Alternatively, if submitting a fat JAR (shaded JAR with all dependencies), you can just provide the JAR path:

./bin/flink run /path/to/your-shaded-user-code.jar --job-argument1 value1

Why it works: The -C option adds each URL to the user code classloader on every node in the cluster, making the listed JARs discoverable alongside your job JAR. This is also why the URLs must resolve on the JobManager and TaskManagers as well as on the client.
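
When flink run is called without the -c (--class) option, Flink takes the entry point from the JAR manifest, checking the program-class attribute and then Main-Class. A missing or wrong manifest entry is therefore worth ruling out before submission. This is a minimal sketch with a hypothetical helper name, manifest_entry_class, which reads only the main section of the manifest.

```python
import zipfile
from typing import Optional


def manifest_entry_class(jar_path: str) -> Optional[str]:
    """Return program-class or Main-Class from the JAR manifest, or None."""
    with zipfile.ZipFile(jar_path) as zf:
        try:
            raw = zf.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            return None  # the JAR has no manifest at all
    # Long manifest values wrap onto continuation lines that start with one space
    unfolded = raw.replace("\r\n", "\n").replace("\n ", "")
    attrs = {}
    for line in unfolded.split("\n"):
        if not line.strip():
            break  # the main section ends at the first blank line
        key, sep, value = line.partition(": ")
        if sep:
            # Manifest attribute names are case-insensitive
            attrs[key.lower()] = value.strip()
    return attrs.get("program-class") or attrs.get("main-class")
```

If this returns None for your job JAR, either add the manifest entry at build time or name the class explicitly with -c when submitting.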

4. Incorrect pipeline.jars Configuration

Diagnosis: You have configured Flink to load user code through the pipeline.jars property in flink-conf.yaml, but the property is missing or points to the wrong location.

Fix: Edit your flink-conf.yaml and ensure the pipeline.jars property points to the correct location of each JAR. Multiple entries are separated by semicolons:

pipeline.jars: "file:///opt/flink/lib/your-user-code.jar;file:///opt/flink/lib/another-dependency.jar"

If you are using a fat JAR, it can be just:

pipeline.jars: "file:///opt/flink/lib/your-shaded-user-code.jar"

Restart the Flink cluster after modifying flink-conf.yaml.

Why it works: This configuration directly tells Flink which JARs to load into the user code classloader when a job starts.
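
A quick way to catch a wrong path is to check every entry of the configured value before restarting. The sketch below assumes the semicolon-separated format Flink uses for list-valued options and only checks file: URLs on the local machine; missing_pipeline_jars is a hypothetical helper name.

```python
import os
from typing import List
from urllib.parse import unquote, urlparse


def missing_pipeline_jars(value: str) -> List[str]:
    """Return the file: entries of a pipeline.jars value that do not exist locally."""
    missing = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        parsed = urlparse(entry)
        if parsed.scheme != "file":
            continue  # other schemes are out of scope for this sketch
        if not os.path.isfile(unquote(parsed.path)):
            missing.append(entry)
    return missing
```

An entry that was written with commas instead of semicolons shows up here as a single, non-existent path, which makes that mistake easy to spot.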

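5. Issues with Custom Classloaders or Class Resolution Order

Diagnosis: In advanced scenarios, you might have custom classloader configurations, or you might be relying on Flink’s child-first class resolution (classloader.resolve-order: child-first, which is the default). In child-first mode, classes in the user code JAR take precedence over the copies on Flink’s own classpath. This can lead to unexpected loading behavior, such as the same class being loaded twice by different classloaders.

Fix: If you suspect the resolution order, temporarily switch to parent-first mode in flink-conf.yaml and check whether the error changes:

# Resolve classes from Flink's classpath before the user code JAR
classloader.resolve-order: parent-first

If only a few packages need to come from Flink’s classpath, you can keep child-first and list those packages in classloader.parent-first-patterns.additional instead. If you are using custom classloader setups, ensure your JARs are correctly placed and your custom logic correctly handles the resolution of dependencies. For child-first loading, ensure the dependencies bundled in your JAR are self-contained.

Why it works: Parent-first is the standard Java classloading delegation model, so whenever both sides provide a class, the copy on Flink’s classpath wins and there is only one version in play. Child-first isolates your job’s dependencies from Flink’s, but it breaks when the same class has to be shared between Flink and your code.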
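
The delegation rule behind Flink’s classloader.resolve-order option can be illustrated with a toy model. This is not Flink code: first_loader_consulted is a hypothetical function, and ASSUMED_PARENT_FIRST_PATTERNS is a small illustrative subset of the prefixes that Flink always resolves parent-first, since the real default list is longer and depends on the Flink version.

```python
from typing import Iterable

# Illustrative subset only; the real default list is longer and version-dependent
ASSUMED_PARENT_FIRST_PATTERNS = ("java.", "org.apache.flink.", "org.slf4j")


def first_loader_consulted(
    class_name: str,
    resolve_order: str = "child-first",
    parent_first_patterns: Iterable[str] = ASSUMED_PARENT_FIRST_PATTERNS,
) -> str:
    """Return which side is asked first for class_name: 'parent' or 'child'."""
    if resolve_order == "parent-first":
        # Standard Java delegation: Flink's classpath is always asked first
        return "parent"
    # Child-first: the user JAR is asked first, except for protected prefixes
    if any(class_name.startswith(prefix) for prefix in parent_first_patterns):
        return "parent"
    return "child"
```

In this model, a Guava class bundled in your JAR is served from the JAR under child-first, while org.slf4j.Logger still comes from Flink’s classpath. Adding a prefix to the pattern list moves that package to the parent side without changing the overall resolve order.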

6. Corrupted or Incomplete JAR File

Diagnosis: The JAR file itself might be corrupted during transfer or creation. Verify the file’s integrity (e.g., using md5sum or sha256sum) on the submission client and on the Flink cluster nodes.

Fix: Re-download or re-build your user code JAR file. Ensure the transfer to the Flink cluster nodes is complete and without errors.

Why it works: A corrupted JAR cannot be correctly read or parsed by the Java Virtual Machine, leading to various errors, including class loading failures.
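
If md5sum or sha256sum is not available on a node, the same checks can be scripted. The following sketch uses two hypothetical helpers: sha256_of computes a digest to compare between the client and the cluster nodes, and jar_is_readable verifies that the archive opens and that every entry passes its CRC check.

```python
import hashlib
import zipfile


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_is_readable(path: str) -> bool:
    """True if the file is a valid ZIP archive and all entries pass their CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the first bad entry name, or None if all are fine
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

Matching digests on every node rule out a transfer problem. A JAR that fails jar_is_readable on the build machine as well points to the build itself.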

Once all classloading issues are resolved, the next failures you are likely to encounter are runtime errors in your user code, which in severe cases can crash the TaskManager process itself.
