The Flink JobManager pod is failing because the Kubernetes API server is rejecting its requests to manage TaskManager pods, citing insufficient permissions.

This usually happens when the flink-jobmanager service account, which the JobManager pod runs as, doesn’t have the necessary RBAC permissions to interact with the Kubernetes API for creating, deleting, or listing pods.
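
Before working through the causes below, it can help to confirm that the failure really is an authorization error. The following is a minimal sketch: find_forbidden is a hypothetical helper name, and it assumes the rejection appears in the JobManager logs with the word "forbidden", which is how the API server usually words RBAC denials.

```shell
# Hypothetical helper: print authorization failures from a pod's logs.
# Usage: find_forbidden <pod-name> [namespace]
find_forbidden() {
  local pod="$1" ns="${2:-flink}"
  # RBAC denials from the API server normally contain the word "forbidden".
  kubectl logs "$pod" -n "$ns" | grep -i "forbidden"
}
```

Call it as find_forbidden <flink-jobmanager-pod-name> flink. A matching line normally names the service account, verb, and resource that were denied, which tells you which of the causes below to check first.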

  1. Missing cluster-admin role or equivalent:

    • Diagnosis: Check the service account’s role bindings. If it is bound neither to cluster-admin nor to a custom role that grants create, delete, get, list, and watch on pods (and get on pods/log) in the Flink namespace, this is likely the issue.
      kubectl auth can-i create pods -n flink --as=system:serviceaccount:flink:flink-jobmanager
      # This should return 'yes'
      
    • Fix: Grant the necessary permissions. The easiest (though least secure) way is to bind the cluster-admin role to the service account.
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: flink-jobmanager-cluster-admin-binding
      subjects:
      - kind: ServiceAccount
        name: flink-jobmanager
        namespace: flink # Replace with your Flink namespace
      roleRef:
        kind: ClusterRole
        name: cluster-admin
        apiGroup: rbac.authorization.k8s.io
      
      Apply this with kubectl apply -f <filename>.yaml. This allows the JobManager to perform any action on any resource in the cluster, which is far more than Flink needs, so treat it as a quick test rather than a permanent setup.
    • Why it works: The cluster-admin ClusterRole grants all possible permissions across the cluster. By binding this role to the flink-jobmanager service account, the JobManager pod is authorized to perform any Kubernetes API operation, including managing TaskManager pods.
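
The single can-i check above covers only the create verb. A small loop can test every verb the JobManager needs; this is a sketch, and check_pod_verbs is a hypothetical helper name rather than anything shipped with kubectl or Flink.

```shell
# Hypothetical helper: report whether a service account may perform each pod verb.
# Usage: check_pod_verbs <namespace> <service-account>
check_pod_verbs() {
  local ns="$1" sa="$2" verb answer
  for verb in create delete get list watch; do
    # 'kubectl auth can-i' prints yes/no and exits non-zero on "no",
    # so the exit status is ignored and only the printed answer is kept.
    answer="$(kubectl auth can-i "$verb" pods -n "$ns" \
      --as="system:serviceaccount:${ns}:${sa}")" || true
    printf '%-6s pods: %s\n' "$verb" "$answer"
  done
}
```

For example, check_pod_verbs flink flink-jobmanager prints one line per verb. Any "no" points at a permission that is still missing.
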
  2. Insufficient custom ClusterRole permissions:

    • Diagnosis: If you’ve created a custom ClusterRole for Flink, verify it grants the required permissions. The JobManager needs to manage pods (create, delete, get, list, watch) and potentially get logs from pods.
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: flink-manager-role
      rules:
      - apiGroups: [""]
        resources: ["pods", "pods/log"]
        verbs: ["create", "delete", "get", "list", "watch", "patch", "update"]
      - apiGroups: ["apps"]
        resources: ["deployments", "statefulsets"]
        verbs: ["get", "list", "watch"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: flink-manager-binding
      subjects:
      - kind: ServiceAccount
        name: flink-jobmanager
        namespace: flink # Replace with your Flink namespace
      roleRef:
        kind: ClusterRole
        name: flink-manager-role
        apiGroup: rbac.authorization.k8s.io
      
      Check if the flink-manager-role (or whatever you named it) is correctly applied and if the verbs and resources cover what the JobManager needs.
    • Fix: Add the missing permissions to your custom ClusterRole. Ensure pods is covered with create, delete, get, list, and watch (plus patch and update if your setup needs them), and that pods/log is covered with get.
      # ... (previous rules)
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "watch", "patch", "update"]
      - apiGroups: [""]
        resources: ["pods/log"]
        verbs: ["get"] # the log subresource is read-only
      # Add other resources your Flink setup needs, e.g. services, or configmaps when using Kubernetes HA
      
      Apply the updated ClusterRole and ClusterRoleBinding.
    • Why it works: This grants the Flink JobManager only the necessary permissions to manage its own pods and related resources, following the principle of least privilege.
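
After applying the role, you can list what the service account is effectively allowed to do, instead of reading the manifests by eye. The sketch below assumes the tabular output of kubectl auth can-i --list, where each row starts with the resource name; show_pod_rules is a hypothetical helper name.

```shell
# Hypothetical helper: show the effective rules for pods and pods/log.
# Usage: show_pod_rules <namespace> <service-account>
show_pod_rules() {
  local ns="$1" sa="$2"
  # Keep the header row plus every row whose resource name starts with "pods".
  kubectl auth can-i --list -n "$ns" \
    --as="system:serviceaccount:${ns}:${sa}" | grep -E '^(Resources|pods)'
}
```

Running show_pod_rules flink flink-jobmanager should show a pods row whose verbs include create, delete, get, list, and watch. If no pods row appears, the role or its binding is not taking effect.
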
  3. Incorrect namespace in ClusterRoleBinding or RoleBinding:

    • Diagnosis: A RoleBinding grants permissions only inside its own namespace, so it works only if it lives in the namespace where the TaskManager pods are created (this is sufficient when the JobManager manages TaskManagers in its own namespace only). A ClusterRoleBinding applies cluster-wide, but its subjects entry must still name the namespace where the service account actually lives. If either namespace is wrong, the permissions never reach the JobManager.
      kubectl get clusterrolebinding flink-manager-binding -o yaml
      # Check the 'subjects' section for the correct namespace of the service account.
      
    • Fix: Ensure the namespace field within the subjects of the ClusterRoleBinding accurately reflects where the flink-jobmanager service account resides. If you intend to use a Role and RoleBinding (for a JobManager that only manages TaskManagers within its own namespace), ensure the RoleBinding is in the same namespace as the JobManager and TaskManager pods.
      # Example for RoleBinding (if JobManager and TaskManagers are in the same namespace)
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: flink-manager-role-binding
        namespace: flink # Namespace where Flink is deployed
      subjects:
      - kind: ServiceAccount
        name: flink-jobmanager
        namespace: flink # Namespace of the service account
      roleRef:
        kind: Role
        name: flink-manager-role # A Role defined in the 'flink' namespace
        apiGroup: rbac.authorization.k8s.io
      
    • Why it works: RBAC bindings connect a role (what actions are allowed) to subjects (who can perform them) within a specific scope (namespace for RoleBinding, cluster for ClusterRoleBinding). An incorrect namespace in the binding means the service account isn’t associated with the intended permissions.
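
For the namespaced variant, the Role and RoleBinding can also be created imperatively, which avoids mistyping the namespace in two separate manifests. This is a sketch under the same naming assumptions as the example above (flink-manager-role, flink-manager-role-binding); create_flink_role is a hypothetical helper name, and the verb list is the minimum discussed earlier.

```shell
# Hypothetical helper: create a namespaced Role and bind it to the service account.
# Usage: create_flink_role [namespace] [service-account]
create_flink_role() {
  local ns="${1:-flink}" sa="${2:-flink-jobmanager}"
  # Role: pod management inside the given namespace only.
  kubectl create role flink-manager-role -n "$ns" \
    --verb=create,delete,get,list,watch --resource=pods
  # RoleBinding: same namespace; the subject is written as <namespace>:<name>.
  kubectl create rolebinding flink-manager-role-binding -n "$ns" \
    --role=flink-manager-role --serviceaccount="${ns}:${sa}"
}
```

Because the namespace is passed once and reused for the Role, the RoleBinding, and the subject, the three cannot drift apart.
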
  4. Service Account not specified or incorrect in JobManager deployment:

    • Diagnosis: The Flink JobManager pod might not be configured to use the flink-jobmanager service account at all, or it’s configured to use a different, unprivileged one.
      kubectl get pod <flink-jobmanager-pod-name> -n flink -o yaml | grep serviceAccountName
      
      The output should be serviceAccountName: flink-jobmanager.
    • Fix: Explicitly set the serviceAccountName in your Flink JobManager deployment (e.g., in the Kubernetes Deployment or StatefulSet manifest).
      apiVersion: apps/v1
      kind: Deployment # or StatefulSet
      metadata:
        name: flink-jobmanager
        namespace: flink
      spec:
        template:
          spec:
            serviceAccountName: flink-jobmanager # Ensure this matches your service account
            containers:
            - name: flink-jobmanager
              image: flink:latest # your Flink image
              # ... other container config
      
      Reapply the deployment manifest.
    • Why it works: A pod authenticates to the API server as the service account named in serviceAccountName, so the RBAC rules bound to that account apply to it. If none is specified, the pod runs as the namespace’s default service account, which usually has no permission to manage pods.
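
Reading the field with jsonpath avoids grep matching unrelated lines, and kubectl set serviceaccount can switch a Deployment without editing the manifest. Both function names below are hypothetical, and the sketch assumes the JobManager is managed by a Deployment.

```shell
# Hypothetical helper: print the service account a pod actually runs as.
# Usage: pod_service_account <pod-name> [namespace]
pod_service_account() {
  kubectl get pod "$1" -n "${2:-flink}" -o jsonpath='{.spec.serviceAccountName}'
}

# Hypothetical helper: point a Deployment at a service account.
# Usage: use_service_account <deployment> <service-account> [namespace]
use_service_account() {
  kubectl set serviceaccount deployment "$1" "$2" -n "${3:-flink}"
}
```

If pod_service_account prints default, the pod is not using the Flink service account. Changing the Deployment's pod template triggers a rollout, so the JobManager pod is recreated with the new account.
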
  5. Network Policies blocking API Server access:

    • Diagnosis: This is uncommon, since the JobManager normally reaches the API server over the cluster network, but strict NetworkPolicies can block egress from the JobManager pods to the Kubernetes API server. A blocked connection typically shows up as a timeout rather than a permissions error.
      # This is hard to diagnose directly with a simple command.
      # You'd look at NetworkPolicy resources in the flink namespace and any other relevant namespaces.
      kubectl get networkpolicy -n flink
      
    • Fix: Ensure there’s a NetworkPolicy that allows egress traffic from the flink-jobmanager pods to the Kubernetes API server (typically kubernetes.default.svc.cluster.local on port 443).
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-egress-to-kubernetes-api
        namespace: flink # Namespace where Flink JobManager runs
      spec:
        podSelector:
          matchLabels:
            app: flink-jobmanager # Label for your JobManager pods
        policyTypes:
        - Egress
        egress:
        - to:
          - ipBlock:
              cidr: 10.96.0.0/12 # Service CIDR (kubeadm default); adjust to your cluster
          ports:
          - protocol: TCP
            port: 443 # Kubernetes API server port
        - to:
          - namespaceSelector: {} # Match pods in any namespace...
            podSelector:
              matchLabels:
                # ...that carry the API server's label (kubeadm uses component: kube-apiserver).
                # Pod selectors may not match API servers that run with hostNetwork.
                component: kube-apiserver
      
    • Why it works: NetworkPolicies act as firewalls at the IP and port level (OSI layer 3/4). This policy explicitly permits the JobManager pods to communicate with the API server.
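
Some network plugins evaluate egress rules against the API server's real address rather than the Service ClusterIP, in which case a rule on the Service CIDR alone may not match. As a sketch, the real addresses can be read from the kubernetes Endpoints object in the default namespace and turned into ipBlock entries. The function names are hypothetical, and IPv4 addresses are assumed.

```shell
# Hypothetical helper: list the API server's backing IP addresses, one per line.
api_server_endpoints() {
  kubectl get endpoints kubernetes -n default \
    -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'
}

# Hypothetical helper: render one ipBlock entry per API server address (IPv4 /32).
api_server_ipblocks() {
  api_server_endpoints | while read -r ip; do
    printf -- '- ipBlock:\n    cidr: %s/32\n' "$ip"
  done
}
```

The entries printed by api_server_ipblocks can be pasted under the to: list of the egress rule. The port to allow is the one reported by the same Endpoints object, which is not always 443.
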
  6. Flink Configuration: kubernetes.cluster-id mismatch or missing:

    • Diagnosis: Flink uses the kubernetes.cluster-id to identify itself within the Kubernetes cluster. If this is not set correctly or is missing, Flink might not be able to properly register or manage its resources, leading to unexpected behavior that can manifest as pod failures.
      # Check Flink configuration via Flink UI or logs
      # Or check the JobManager pod's environment variables or config file (named config.yaml on newer Flink releases)
      kubectl exec <flink-jobmanager-pod-name> -n flink -- cat /opt/flink/conf/flink-conf.yaml
      
      Look for kubernetes.cluster-id. It should be set to a unique identifier for your Flink deployment.
    • Fix: Set a unique kubernetes.cluster-id in your Flink configuration. This is often derived from the Flink deployment name or a specific identifier you choose.
      # In flink-conf.yaml or via Kubernetes ConfigMap/environment variables
      kubernetes.cluster-id: my-flink-cluster-12345
      
      Restart the JobManager pod.
    • Why it works: This ID helps Flink distinguish its own resources from others in the Kubernetes cluster, ensuring it only attempts to manage its own TaskManager pods and doesn’t interfere with other applications or Flink instances.
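
To pull just the ID out of the configuration rather than scanning the whole file, the output of the cat command above can be piped through a small filter. This sketch assumes the flat "kubernetes.cluster-id: value" form shown above; flink_cluster_id is a hypothetical helper name.

```shell
# Hypothetical helper: read Flink config text on stdin and print the cluster ID.
# Only the flat "kubernetes.cluster-id: <value>" form is handled.
flink_cluster_id() {
  sed -n 's/^[[:space:]]*kubernetes\.cluster-id:[[:space:]]*//p' | head -n 1
}
```

For example: kubectl exec <flink-jobmanager-pod-name> -n flink -- cat /opt/flink/conf/flink-conf.yaml | flink_cluster_id. Empty output means the key is not set in that file.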

Once the permissions are fixed, the next failures you are likely to see are TaskManager pods that cannot be scheduled or cannot connect to the JobManager, or errors in the Flink job itself.
