The most surprising thing about Kubernetes scheduling is that it’s not just about where a pod can go, but where it should go, and how to enforce that with increasing specificity.

Let’s watch a workflow pod land on a specific node. Imagine we have a Workflow custom resource, and its controller creates a Pod for each step. We want to ensure a particular step, say data-processing, always runs on nodes that have GPUs.

Here’s a simplified Workflow manifest:

apiVersion: example.com/v1
kind: Workflow
metadata:
  name: gpu-workflow
spec:
  steps:
    - name: data-ingestion
      image: ubuntu:latest
      command: ["echo", "Ingesting data..."]
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
      nodeSelector:
        accelerator: gpu

When the Workflow controller creates the Pod for data-processing, it copies the step’s nodeSelector into the Pod spec. This is the simplest way to target nodes: the scheduler considers only nodes that carry the label accelerator=gpu.
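As a rough sketch, the Pod generated for this step might look like the following. The Workflow resource here is hypothetical, so the Pod name, the container name, and the workflow.example.com/name label are assumptions for illustration, not the output of any real controller.

```yaml
# Hypothetical Pod generated by the Workflow controller for the data-processing step.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workflow-data-processing          # assumed naming scheme
  labels:
    workflow.example.com/name: gpu-workflow   # assumed label applied by the controller
spec:
  restartPolicy: Never
  nodeSelector:
    accelerator: gpu    # every label listed here must be present on the node
  containers:
    - name: main        # assumed container name
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
```

If nodeSelector lists several labels, a node has to carry all of them to be eligible.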

If you run kubectl get nodes --show-labels, you might see output like this:

NAME       STATUS   ROLES    AGE   VERSION   LABELS
node-1     Ready    <none>   5d    v1.28.2   kubernetes.io/arch=amd64,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,accelerator=cpu
node-2     Ready    <none>   5d    v1.28.2   kubernetes.io/arch=amd64,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,accelerator=gpu
node-3     Ready    <none>   5d    v1.28.2   kubernetes.io/arch=amd64,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,accelerator=gpu

The data-processing pod will only be scheduled onto node-2 or node-3, because those are the only nodes labeled accelerator=gpu.

But what if you need more control? What if you want the data-processing pod to run on the same node as the data-ingestion pod, or at least in the same zone? That’s where pod affinity and anti-affinity come in.

Let’s modify our Workflow to use podAffinity:

apiVersion: example.com/v1
kind: Workflow
metadata:
  name: gpu-workflow
spec:
  steps:
    - name: data-ingestion
      image: ubuntu:latest
      command: ["echo", "Ingesting data..."]
      # Let's assume this pod might get scheduled on any node for now
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: workflow.example.com/name
                    operator: In
                    values:
                      - gpu-workflow
              topologyKey: kubernetes.io/hostname

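Here, requiredDuringSchedulingIgnoredDuringExecution means the rule is a hard constraint at scheduling time: the pod is placed only on a node that satisfies it. Once the pod is running, the rule is no longer enforced, so the pod isn’t evicted if the matching pods are later relabeled or removed. The labelSelector picks out the pods to stay close to, here any pod labeled workflow.example.com/name=gpu-workflow, which assumes the controller applies that label to every pod in the workflow. The topologyKey defines what "close" means: with kubernetes.io/hostname, each node is its own topology domain, so the rule reads "schedule this pod on a node that is already running a matching pod." By default, only pods in the same namespace as the incoming pod are considered.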

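So, if data-ingestion is running on node-2, data-processing must land on node-2 as well; if it is on node-3, data-processing goes to node-3. Because the rule is required rather than preferred, data-processing stays Pending if that node can’t accommodate it. Note that this version of the manifest no longer has a nodeSelector, so nothing here keeps the pair off node-1, which has no GPU label.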
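For the selector to match anything, the pods themselves have to carry the workflow.example.com/name label. Here is a minimal sketch of the two Pods the controller might create, assuming it labels every pod with the workflow name; the Pod names and container names are hypothetical.

```yaml
# Hypothetical Pods created by the Workflow controller; names and labels are assumed.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workflow-data-ingestion
  labels:
    workflow.example.com/name: gpu-workflow   # the label the selector below matches
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ubuntu:latest
      command: ["echo", "Ingesting data..."]
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workflow-data-processing
  labels:
    workflow.example.com/name: gpu-workflow
spec:
  restartPolicy: Never
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: workflow.example.com/name
                operator: In
                values:
                  - gpu-workflow
          topologyKey: kubernetes.io/hostname   # each node is its own domain
  containers:
    - name: main
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
```

In this sketch the selector matches every pod of the workflow, including data-processing itself. Targeting one particular step would need an additional per-step label, which is not part of the original manifest.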

You can also use preferredDuringSchedulingIgnoredDuringExecution for softer preferences: the scheduler favors nodes that satisfy the rule but can still place the pod elsewhere if none do.
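A soft version of the co-location rule could be sketched as the following Pod spec fragment. It reuses the workflow label from the example above; the weight of 100 is an arbitrary choice for illustration.

```yaml
# Fragment of a Pod spec: prefer, but do not require, co-location.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100              # allowed range is 1-100; higher is a stronger preference
        podAffinityTerm:         # preferred pod terms are wrapped in podAffinityTerm
          labelSelector:
            matchLabels:
              workflow.example.com/name: gpu-workflow
          topologyKey: kubernetes.io/hostname
```

Note the extra nesting compared with the required form: each preferred entry pairs a weight with a podAffinityTerm.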

The key levers you control are:

  • nodeSelector: Simple key-value label matching for nodes.
  • affinity.nodeAffinity: More expressive rules for node selection (e.g., In, NotIn, Exists, DoesNotExist).
  • affinity.podAffinity: Rules that place a pod in the same topology domain (node, zone, and so on) as other pods already running in the cluster.
  • affinity.podAntiAffinity: The inverse of pod affinity, keeping a pod away from topology domains that already run matching pods, for example to spread pods across nodes or availability zones.
  • topologyKey: Defines the scope for pod affinity/anti-affinity (e.g., kubernetes.io/hostname for the same node, topology.kubernetes.io/zone for the same zone).
  • labelSelector: Within affinity rules, this specifies which other pods you’re interested in.
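Several of these levers can be combined in one Pod spec. The fragment below is a sketch only: it uses the accelerator label and workflow label from the examples above, and the choice to exclude arm64 nodes is an assumption made purely to show a second operator.

```yaml
# Fragment of a Pod spec combining node affinity and pod anti-affinity.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:           # multiple terms are ORed
        - matchExpressions:        # expressions inside one term are ANDed
            - key: accelerator
              operator: In
              values: ["gpu"]
            - key: kubernetes.io/arch
              operator: NotIn      # illustrative exclusion, not from the original
              values: ["arm64"]
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            workflow.example.com/name: gpu-workflow
        # Avoid any zone that already runs a pod with the label above.
        topologyKey: topology.kubernetes.io/zone
```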

The scheduler uses a two-phase approach: filtering and scoring. nodeSelector and the requiredDuringSchedulingIgnoredDuringExecution rules (node affinity, pod affinity, and pod anti-affinity) act as filters: a node that fails any of them is removed from consideration. The remaining nodes are then scored. Each preferredDuringSchedulingIgnoredDuringExecution term carries a weight from 1 to 100, and the weights of the terms a node satisfies feed into its score, alongside the scheduler’s other scoring criteria such as resource availability. The node with the highest total score wins.
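To make the split between filtering and scoring concrete, here is a sketch of a node affinity block that uses both kinds of rule. The example.com/gpu-model and example.com/local-ssd label keys are hypothetical and do not appear on the nodes listed earlier.

```yaml
# Fragment of a Pod spec. Label keys other than "accelerator" are hypothetical.
affinity:
  nodeAffinity:
    # Filtering: nodes without accelerator=gpu are removed (node-1 in the earlier listing).
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: In
              values: ["gpu"]
    # Scoring: each surviving node collects the weights of the terms it matches.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
            - key: example.com/gpu-model    # hypothetical label
              operator: In
              values: ["a100"]
      - weight: 20
        preference:
          matchExpressions:
            - key: example.com/local-ssd    # hypothetical label
              operator: Exists
```

If node-2 carried both hypothetical labels and node-3 only the second, node-2 would collect 80 + 20 = 100 from these terms and node-3 would collect 20, so node-2 would be favored, other scoring factors being equal.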

A common pitfall is over-constraining your scheduling rules. If you require a node with a specific GPU label and also require that node to host another pod from the same workflow, and no node satisfies both conditions, your pod will remain Pending indefinitely. The FailedScheduling event shown by kubectl describe pod then reports something along the lines of 0/N nodes are available: X node(s) didn't match Pod's node affinity/selector, Y node(s) had untolerated taint, Z node(s) didn't match pod affinity rules. The exact wording varies between Kubernetes versions.

The next step in mastering scheduling is understanding taints and tolerations, which allow nodes to repel pods unless the pods explicitly tolerate the taint.
