The Flink TaskManager lost its connection to the JobManager because the TaskManager’s heartbeats to the JobManager were not being acknowledged within the configured timeout period. This indicates a network issue or a resource starvation problem on either the TaskManager or JobManager side preventing timely communication.

Common Causes and Fixes

  1. Network Partition/Firewall Issues:

    • Diagnosis: On the TaskManager, run ping <jobmanager_ip> and telnet <jobmanager_ip> 6123. If ping fails or telnet times out, there’s a network connectivity problem. Check firewall rules on both TaskManager and JobManager hosts, as well as any network devices in between.
    • Fix: Ensure that UDP/TCP port 6123 (default Flink RPC port) is open bidirectionally between TaskManagers and the JobManager. On most Linux systems, you can use sudo ufw allow 6123/tcp and sudo ufw allow 6123/udp if using UFW, or equivalent commands for other firewalls.
    • Why it works: This directly addresses network blockages, allowing the essential RPC communication for heartbeats and task management to flow freely.
  2. TaskManager Resource Starvation (CPU/Memory):

    • Diagnosis: On the affected TaskManager, use top or htop to check CPU utilization. If it’s consistently at or near 100%, or if free -h shows very low free memory and high swap usage, the TaskManager is struggling. Look for Flink logs on the TaskManager for OutOfMemoryError or excessive garbage collection pauses.
    • Fix:
      • Increase TaskManager Memory: In flink-conf.yaml on the TaskManager, increase taskmanager.memory.process.size (e.g., from 1024m to 2048m or higher). Restart the TaskManager.
      • Reduce Task Load: If the TaskManager is running too many tasks, reduce the parallelism of your Flink job or increase the number of TaskManagers.
    • Why it works: Sufficient CPU and memory allow the TaskManager’s JVM to run its threads, including the network I/O threads responsible for sending heartbeats and processing JobManager responses, without being blocked by resource contention or excessive GC.
  3. JobManager Resource Starvation (CPU/Memory):

    • Diagnosis: Similar to TaskManager starvation, check top/htop and free -h on the JobManager host. A busy JobManager can’t process incoming heartbeats from TaskManagers quickly enough. Examine JobManager logs for OutOfMemoryError or GC issues.
    • Fix:
      • Increase JobManager Memory: In flink-conf.yaml on the JobManager, increase jobmanager.memory.process.size. Restart the JobManager.
      • Reduce JobManager Load: If the JobManager is overwhelmed by too many running jobs or very frequent job submissions/cancellations, consider scaling up the JobManager instance or distributing the load if using HA.
    • Why it works: A healthy JobManager is crucial for acknowledging heartbeats. If it’s bogged down, it can’t respond to TaskManagers, leading to perceived disconnections.
  4. Incorrect taskmanager.rpc.port or jobmanager.rpc.port Configuration:

    • Diagnosis: Verify that the taskmanager.rpc.port setting in the TaskManager’s flink-conf.yaml matches the jobmanager.rpc.port setting (or the default 6123 if not explicitly set) in the JobManager’s flink-conf.yaml. Also, ensure the JobManager is configured to listen on the correct network interface if jobmanager.bind-host is set.
    • Fix: Ensure consistency. If JobManager is on port 6123 and TaskManagers are configured to connect to it on port 6123, this is correct. If you’ve changed jobmanager.rpc.port to something else (e.g., 6124), update taskmanager.rpc.port on all TaskManagers to match. Restart affected components.
    • Why it works: TaskManagers need to know the correct address and port to send their RPC requests (including heartbeats) to the JobManager. Mismatched ports prevent them from establishing the necessary communication channel.
  5. Network Latency or Packet Loss:

    • Diagnosis: Use ping -c 100 <jobmanager_ip> from a TaskManager and ping -c 100 <taskmanager_ip> from the JobManager. Look for high average latency (consistently > 50ms can be problematic) and packet loss (any loss is bad). Tools like mtr can provide more detailed path analysis.
    • Fix: Investigate network infrastructure between the nodes. This might involve upgrading network hardware, optimizing routing, or moving nodes to a more proximate network segment.
    • Why it works: Flink’s RPC communication, including heartbeats, has a timeout. High latency or dropped packets mean heartbeats take too long to arrive or are lost entirely, triggering the timeout and disconnection.
  6. cluster.taskmanager.heartbeat.interval and cluster.taskmanager.heartbeat.timeout Misconfiguration:

    • Diagnosis: Check the values for cluster.taskmanager.heartbeat.interval (default 10000ms) and cluster.taskmanager.heartbeat.timeout (default 60000ms) in flink-conf.yaml. If the interval is too high or the timeout too low relative to expected network conditions or TaskManager processing delays, disconnections can occur.
    • Fix: Increase cluster.taskmanager.heartbeat.timeout or decrease cluster.taskmanager.heartbeat.interval. For example, setting cluster.taskmanager.heartbeat.timeout: 120000 (2 minutes) provides more leeway. Restart Flink components.
    • Why it works: These parameters define how often TaskManagers send heartbeats and how long the JobManager waits before considering a TaskManager dead. Adjusting them creates a wider window for communication to occur, especially in environments with transient network issues or slow processing.

After fixing these, the next error you’ll likely encounter is a java.lang.OutOfMemoryError if the underlying resource issue was severe enough to corrupt internal state, or a org.apache.flink.runtime.operators.coordination.OperatorCoordinator.failJob if the disconnection led to a critical loss of coordination state for a specific operator.

Want structured learning?

Take the full Flink course →