The Flink JobManager is unavailable because it failed to register with the Zookeeper ensemble, which it relies on for high-availability leader election and service discovery.

Here are the common reasons for this failure and how to fix them:

1. Zookeeper Connection Issues

  • Diagnosis: Check the Flink JobManager logs for KeeperException.ConnectionLossException, Connection refused, or similar messages indicating Zookeeper connection failures. On the Zookeeper server, check its logs for connection attempts from the JobManager’s IP address.
    # On Flink JobManager node
    grep -i "zookeeper" /var/log/flink/jobmanager.log | grep -i "error"
    
    # On Zookeeper server node
    tail -f /var/log/zookeeper/zookeeper.log | grep "connection from"
    
  • Cause: The JobManager cannot reach the Zookeeper ensemble because of firewall rules, an incorrect Zookeeper host/port configuration in Flink, or an unhealthy Zookeeper ensemble.
  • Fix:
    • Network: Ensure the JobManager nodes can reach the Zookeeper nodes on the configured Zookeeper client port (default is 2181).
      # From JobManager node, replace <zookeeper_host> and <zookeeper_port>
      nc -vz <zookeeper_host> <zookeeper_port>
      
      If this fails, update firewall rules to allow traffic.
    • Configuration: Verify that flink-conf.yaml on the JobManager enables Zookeeper high availability and lists the correct Zookeeper quorum.
      # Example: flink-conf.yaml
      high-availability: zookeeper
      high-availability.zookeeper.quorum: zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
      
      Restart the JobManager after any configuration changes.
    • Zookeeper Health: Ensure the Zookeeper ensemble is running and healthy. Check Zookeeper status on each server.
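      # On each Zookeeper server; srvr is the only four-letter command enabled by default since Zookeeper 3.5.3
      echo "srvr" | nc localhost 2181
      
      Look for Mode: leader on exactly one server and Mode: follower on the others. If a server in a multi-node ensemble reports Mode: standalone or does not respond, troubleshoot the Zookeeper cluster.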
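
The network check above can be repeated for every server in the quorum string. The following is a minimal sketch, assuming POSIX sh and an nc build that supports -z and -w; the function names parse_quorum and check_quorum are hypothetical helpers, not part of Flink or Zookeeper.

```shell
# Split "host1:2181,host2:2181" into one "host port" pair per line.
# An entry without a port falls back to 2181, the default client port.
parse_quorum() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    case "$entry" in
      *:*) echo "${entry%%:*} ${entry##*:}" ;;
      *)   echo "$entry 2181" ;;
    esac
  done
}

# Probe each server in the quorum string with nc.
# -z only tests the TCP connection; -w 3 gives up after 3 seconds.
# Returns non-zero if at least one server is unreachable.
check_quorum() {
  failed=0
  # The here-document keeps the loop in the current shell,
  # so the "failed" flag is still set after the loop ends.
  while read -r host port; do
    [ -n "$host" ] || continue
    if nc -z -w 3 "$host" "$port" </dev/null >/dev/null 2>&1; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
      failed=1
    fi
  done <<EOF
$(parse_quorum "$1")
EOF
  return "$failed"
}

# Example (run from the JobManager node):
#   check_quorum "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"
```

Running it from the JobManager node rather than from a workstation matters, because firewall rules usually depend on the source address.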

2. Incorrect Zookeeper Root Path

  • Diagnosis: JobManager logs might show errors related to creating or accessing a Zookeeper path, often mentioning NoNodeException or KeeperException.
    grep -i "zookeeper" /var/log/flink/jobmanager.log | grep -E "path|node"
    
  • Cause: Flink uses a root path in Zookeeper to store its cluster metadata. If this path is not correctly configured or if Zookeeper has been reinitialized and the path was lost, the JobManager won’t be able to register.
  • Fix: Ensure high-availability.zookeeper.path.root in flink-conf.yaml is set and that this path exists in Zookeeper (or that Flink has permission to create it). The default is /flink.
    # Example: flink-conf.yaml
    high-availability.zookeeper.path.root: /flink-cluster-prod
    
    If you are changing this path, ensure no other Flink cluster is using the new path, and that the old path is cleaned up if it’s no longer needed.
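
To confirm which znode Flink will actually use, it can help to read the values straight from the configuration file. The sketch below assumes a flat "key: value" flink-conf.yaml and POSIX sh with awk; flink_conf_get and flink_ha_znode are hypothetical helper names. Flink nests each cluster under the root path using high-availability.cluster-id, which defaults to /default on standalone clusters.

```shell
# Print the value of key $2 from the flat "key: value" file $1,
# or the default $3 when the key is absent or commented out.
flink_conf_get() {
  awk -v key="$2" -v def="$3" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)          # ignore leading indentation
      prefix = key ":"
      if (index(line, prefix) == 1) {   # commented lines start with "#"
        val = substr(line, length(prefix) + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        found = val                     # the last occurrence wins
      }
    }
    END { print (found != "" ? found : def) }
  ' "$1"
}

# Print the znode under which Flink stores this cluster's HA metadata:
# <high-availability.zookeeper.path.root>/<high-availability.cluster-id>
flink_ha_znode() {
  root=$(flink_conf_get "$1" high-availability.zookeeper.path.root /flink)
  cluster=$(flink_conf_get "$1" high-availability.cluster-id /default)
  echo "${root%/}/${cluster#/}"
}

# Example (the config location is an assumption; adjust to your install):
#   zkCli.sh -server zk1.example.com:2181 ls "$(flink_ha_znode /opt/flink/conf/flink-conf.yaml)"
```

If the ls command reports that the node does not exist on a cluster that has run before, the path was probably changed or Zookeeper was reinitialized.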

3. Zookeeper Session Expiration/Timeout

  • Diagnosis: Look for log messages mentioning an expired session or connection loss in the JobManager logs, followed by attempts to re-register.
    grep -iE "session.*expired|connection ?loss" /var/log/flink/jobmanager.log
    
  • Cause: The JobManager’s Zookeeper session expired. This can happen if the JobManager stalls for longer than the session timeout (e.g., due to a long garbage-collection pause, heavy load, or a network partition), or if the session timeout is too short for the JobManager’s typical heartbeat interval.
  • Fix:
    • JobManager Responsiveness: Ensure the JobManager is not overloaded or stuck. If it is, investigate the cause of the overload (e.g., insufficient resources, problematic user jobs).
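    • Zookeeper Session Timeout: Increase the session timeout if it’s too aggressive. The client requests a timeout (Flink’s high-availability.zookeeper.client.session-timeout in flink-conf.yaml, default 60000 ms), and the server clamps that request to the range set by minSessionTimeout and maxSessionTimeout in zoo.cfg, which default to 2 × tickTime and 20 × tickTime. initLimit and syncLimit govern how followers connect and sync with the leader, not client sessions. To allow a 60-second session with tickTime=2000, raise maxSessionTimeout.
      # Example: zoo.cfg on Zookeeper server
      tickTime=2000
      initLimit=10
      syncLimit=5
      maxSessionTimeout=60000
      
      Restart Zookeeper and then the Flink JobManager after changing these values.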
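
The timeout a session actually gets is the client's request clamped by the server's limits, so a worked calculation is useful before changing anything. This is a small sketch in POSIX sh; negotiated_timeout is a hypothetical helper that mirrors the clamping rule (minSessionTimeout defaults to 2 × tickTime, maxSessionTimeout to 20 × tickTime), not a Zookeeper tool.

```shell
# Print the session timeout (ms) a Zookeeper server would grant.
# $1 = tickTime, $2 = timeout requested by the client,
# $3 / $4 = optional minSessionTimeout / maxSessionTimeout overrides.
negotiated_timeout() {
  requested=$2
  min=${3:-$(($1 * 2))}
  max=${4:-$(($1 * 20))}
  if [ "$requested" -lt "$min" ]; then
    echo "$min"
  elif [ "$requested" -gt "$max" ]; then
    echo "$max"
  else
    echo "$requested"
  fi
}

negotiated_timeout 2000 60000              # prints 40000
negotiated_timeout 2000 60000 4000 60000   # prints 60000
```

The first call shows why a 60-second request against a server with tickTime=2000 and default limits ends up with a 40-second session; the second shows the effect of setting maxSessionTimeout=60000.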

4. Insufficient Zookeeper Permissions

  • Diagnosis: JobManager logs might show KeeperException.NoAuthException (insufficient permissions on a node) or KeeperException.AuthFailedException (authentication itself failed) when Flink tries to create Zookeeper nodes.
    grep -i "zookeeper" /var/log/flink/jobmanager.log | grep -iE "noauth|authfailed"
    
  • Cause: The nodes under Flink’s Zookeeper root path are protected by ACLs (Access Control Lists), and the identity the Flink JobManager authenticates as does not have the permissions needed to create or modify nodes there.
  • Fix: Grant the Flink identity the required permissions (e.g., read, write, create, delete) on the Flink root path. Zookeeper ACLs are stored per node, so set them with Zookeeper client commands rather than in a server configuration file.
    # Example Zookeeper client commands to set ACLs (run inside a zkCli.sh session)
    # This is a simplified example, actual ACLs depend on your security setup
    addauth digest flinkuser:flinkpassword
    setAcl /flink-cluster-prod auth:flinkuser:cdrwa
    
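    Ensure the Flink JobManager is configured to authenticate with Zookeeper if authentication is enabled. Flink’s built-in Zookeeper authentication uses SASL (typically Kerberos); with it enabled, setting high-availability.zookeeper.client.acl: creator in flink-conf.yaml makes Flink restrict the nodes it creates to its own identity.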
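
Before changing ACLs, it is worth checking what the current ones grant. The sketch below is a hypothetical POSIX sh helper, missing_perms, that compares a permission string such as cdrwa with the letters you expect. It assumes zkCli.sh getAcl prints each permission set on a line beginning with ": ", which you should verify against your Zookeeper version.

```shell
# Print the permission letters from $2 that are absent from $1.
# Letters: c=create, d=delete, r=read, w=write, a=admin.
missing_perms() {
  granted=$1
  required=$2
  missing=""
  while [ -n "$required" ]; do
    rest=${required#?}              # everything after the first letter
    letter=${required%"$rest"}      # the first letter itself
    case "$granted" in
      *"$letter"*) ;;
      *) missing="$missing$letter" ;;
    esac
    required=$rest
  done
  echo "$missing"
}

# Example (assumed getAcl output format, see the note above):
#   zkCli.sh -server zk1.example.com:2181 getAcl /flink-cluster-prod 2>/dev/null \
#     | sed -n 's/^: //p' \
#     | while read -r perms; do echo "missing: $(missing_perms "$perms" cdrw)"; done
```

An empty result after "missing:" means that ACL entry grants every letter you asked about; it does not tell you whether the entry applies to the identity Flink uses.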

5. Zookeeper Ensemble Not Fully Started or Undergoing Leader Election

  • Diagnosis: During startup, Zookeeper servers might be in a state of startup or re-election, making them temporarily unavailable for registration. Check the Zookeeper server logs for starting up or election messages.
    # On Zookeeper server
    tail -f /var/log/zookeeper/zookeeper.log | grep -E "election|starting up"
    
  • Cause: The Zookeeper ensemble might not have reached a stable quorum, or a leader election is in progress, meaning no server is ready to accept client connections for writes.
  • Fix: Wait for the Zookeeper ensemble to fully start and elect a leader. Ensure the initLimit and syncLimit in zoo.cfg are set appropriately for your network latency and cluster size to allow for proper leader election and synchronization. Restart the Flink JobManager only after Zookeeper is confirmed to be healthy and stable.
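
Waiting for the ensemble can be scripted rather than done by watching logs. The following is a minimal sketch, assuming POSIX sh, nc, and the srvr four-letter command being enabled (the default since Zookeeper 3.5.3); zk_mode and wait_for_zk are hypothetical helper names. A server that has not joined a quorum does not report a Mode line, so the loop keeps polling until one appears.

```shell
# Extract the Mode value (leader, follower, observer, standalone)
# from srvr output read on stdin. Prints nothing if there is no Mode line.
zk_mode() {
  sed -n 's/^Mode:[[:space:]]*//p'
}

# Poll one server until it reports a serving mode.
# $1 = host, $2 = port, $3 = max attempts (default 30), 2 seconds apart.
wait_for_zk() {
  attempts=${3:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    mode=$(echo srvr | nc -w 3 "$1" "$2" 2>/dev/null | zk_mode)
    case "$mode" in
      leader|follower|observer|standalone)
        echo "$1:$2 is serving as $mode"
        return 0
        ;;
    esac
    i=$((i + 1))
    sleep 2
  done
  echo "$1:$2 did not report a serving mode" >&2
  return 1
}

# Example: start the JobManager only once Zookeeper answers.
#   wait_for_zk zk1.example.com 2181 && /opt/flink/bin/jobmanager.sh start
```

The Flink install path in the example is an assumption. Checking every server in the ensemble, not just one, gives more confidence that a stable quorum exists.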

6. Zookeeper Data Directory Issues

  • Diagnosis: Zookeeper server logs might show errors related to disk I/O, file system corruption, or inability to write to its data directory.
    # On Zookeeper server
    tail -f /var/log/zookeeper/zookeeper.log | grep -iE "data directory|no space left|ioexception"
    
  • Cause: The Zookeeper data directory is full, corrupted, or has permission issues, preventing Zookeeper from persisting its state or processing requests.
  • Fix:
    • Disk Space: Ensure the disk where Zookeeper stores its data is not full.
    • Permissions: Verify the user running the Zookeeper process has full read/write permissions to the data directory.
    • Corruption: In severe cases of corruption, you might need to reinitialize the affected server or, as a last resort, the whole ensemble. Reinitializing the ensemble discards all stored state, including Flink’s high-availability metadata, unless you restore it from a backup. Afterwards, confirm that the ensemble is healthy and that its data directory is accessible and writable.
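
The disk space and permission checks above can be combined into one quick test. This is a sketch in POSIX sh; zk_data_dir, disk_used_pct, and check_data_dir are hypothetical helpers, and the 90% threshold is an arbitrary assumption. The writability test applies to the user running the script, so run it as the account that owns the Zookeeper process.

```shell
# Print the dataDir configured in the zoo.cfg file $1.
zk_data_dir() {
  sed -n 's/^[[:space:]]*dataDir[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Print the used-space percentage (integer, no % sign)
# of the filesystem that holds path $1.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Check that directory $1 exists, is writable by the current user,
# and sits on a filesystem below $2 percent usage (default 90).
check_data_dir() {
  dir=$1
  limit=${2:-90}
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable by $(id -un): $dir"; return 1; }
  used=$(disk_used_pct "$dir")
  [ "$used" -lt "$limit" ] || { echo "disk ${used}% full: $dir"; return 1; }
  echo "ok: $dir (${used}% used)"
}

# Example (the zoo.cfg location is an assumption):
#   check_data_dir "$(zk_data_dir /etc/zookeeper/conf/zoo.cfg)"
```

If zoo.cfg also sets dataLogDir, run the same check against that directory, since transaction logs are written there instead.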

If all these are addressed, the next error you’ll likely encounter relates to the TaskManagers failing to register with the JobManager, indicating that the JobManager is now running but the communication channel to TaskManagers is broken.
