The most surprising thing about Kubernetes scheduling is that it’s not just about where a pod can go, but where it should go, and how to enforce that with increasing specificity.
Let’s watch a workflow pod land on a specific node. Imagine we have a Workflow custom resource, and its controller creates a Pod for each step. We want to ensure a particular step, say data-processing, always runs on nodes that have GPUs.
Here’s a simplified Workflow manifest:
apiVersion: example.com/v1
kind: Workflow
metadata:
  name: gpu-workflow
spec:
  steps:
    - name: data-ingestion
      image: ubuntu:latest
      command: ["echo", "Ingesting data..."]
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
      nodeSelector:
        accelerator: gpu
When the Workflow controller creates the Pod for data-processing, it includes a nodeSelector. This is the simplest way to target nodes. The scheduler will look for any node with a label accelerator=gpu.
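As a rough sketch, the Pod that the controller generates for data-processing might look like the following. The Workflow resource is an example CRD, so the pod name and the workflow.example.com/name label are assumptions made for illustration; only the nodeSelector, image, and command come from the manifest above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  # Hypothetical name; a real controller would pick its own naming scheme.
  name: gpu-workflow-data-processing
  labels:
    # Assumed label tying the pod back to its workflow.
    workflow.example.com/name: gpu-workflow
spec:
  restartPolicy: Never
  # Copied from the step definition: only nodes labeled accelerator=gpu qualify.
  nodeSelector:
    accelerator: gpu
  containers:
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
```

Every key under nodeSelector must match a node label exactly; there are no operators or fallbacks at this level.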
If you run kubectl get nodes --show-labels, you might see output like this:
NAME     STATUS   ROLES    AGE   VERSION   LABELS
node-1   Ready    worker   5d    v1.28.2   accelerator=cpu,kubernetes.io/arch=amd64,kubernetes.io/os=linux,node-role.kubernetes.io/worker=
node-2   Ready    worker   5d    v1.28.2   accelerator=gpu,kubernetes.io/arch=amd64,kubernetes.io/os=linux,node-role.kubernetes.io/worker=
node-3   Ready    worker   5d    v1.28.2   accelerator=gpu,kubernetes.io/arch=amd64,kubernetes.io/os=linux,node-role.kubernetes.io/worker=
The data-processing pod will only be scheduled onto node-2 or node-3.
But what if you need more control? What if you want to make sure the data-processing pod runs on the same node as the data-ingestion pod, or at least on nodes that are similar? That’s where affinity and anti-affinity come in.
Let’s modify our Workflow to use podAffinity:
apiVersion: example.com/v1
kind: Workflow
metadata:
  name: gpu-workflow
spec:
  steps:
    - name: data-ingestion
      image: ubuntu:latest
      command: ["echo", "Ingesting data..."]
      # Let's assume this pod might get scheduled on any node for now
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: workflow.example.com/name
                    operator: In
                    values:
                      - gpu-workflow
              topologyKey: kubernetes.io/hostname
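Here, requiredDuringSchedulingIgnoredDuringExecution means the pod must be scheduled on a node that satisfies the rule. The rule is only checked at scheduling time, so if the matching pods later move, terminate, or lose their labels, the running pod isn’t evicted. The labelSelector picks out the pods to co-locate with, in this case the pods belonging to our specific workflow (this assumes the controller stamps each pod it creates with a workflow.example.com/name label). The topologyKey: kubernetes.io/hostname defines what "co-located" means: "find a node where a pod with the specified labels is already running, and schedule this pod on the same node."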
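So, assuming data-ingestion is the only other pod carrying the label, if it lands on node-2, data-processing must also land on node-2. If data-ingestion is on node-3, data-processing goes to node-3. Because this is a hard requirement, data-processing stays Pending if that node has no room for it.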
You can also use preferredDuringSchedulingIgnoredDuringExecution for softer preferences, giving the scheduler some flexibility.
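A minimal sketch of the softer form, written as a plain Pod rather than a Workflow step. The pod name and the workflow.example.com/name label are assumptions for illustration. The GPU requirement stays hard via nodeSelector, while co-location with the rest of the workflow becomes a preference.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-processing-preferred   # hypothetical name
  labels:
    workflow.example.com/name: gpu-workflow   # assumed workflow label
spec:
  restartPolicy: Never
  # Hard requirement: GPU nodes only.
  nodeSelector:
    accelerator: gpu
  affinity:
    podAffinity:
      # Soft rule: favor nodes already running pods from this workflow,
      # but still schedule elsewhere if none qualify.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100               # allowed range is 1-100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                workflow.example.com/name: gpu-workflow
            topologyKey: kubernetes.io/hostname
  containers:
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
```

Note the shape difference from the required form: each preferred entry wraps the term in podAffinityTerm and adds a weight.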
The key levers you control are:
nodeSelector: Simple key-value label matching for nodes.
affinity.nodeAffinity: More expressive rules for node selection (e.g., In, NotIn, Exists, DoesNotExist).
affinity.podAffinity: Rules about where pods can or cannot be scheduled relative to other pods already running on the cluster.
affinity.podAntiAffinity: The inverse of pod affinity, ensuring pods are scheduled on different nodes or in different availability zones.
topologyKey: Defines the scope for pod affinity/anti-affinity (e.g., kubernetes.io/hostname for the same node, topology.kubernetes.io/zone for the same zone).
labelSelector: Within affinity rules, this specifies which other pods you’re interested in.
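To make podAntiAffinity and a zone-level topologyKey concrete, here is a sketch of a Pod that refuses to share an availability zone with another copy of itself. The pod name and the workflow.example.com/step label are hypothetical, and the example assumes nodes carry the standard topology.kubernetes.io/zone label.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-processing-spread      # hypothetical name
  labels:
    # Hypothetical label; the anti-affinity rule below selects on it.
    workflow.example.com/step: data-processing
spec:
  restartPolicy: Never
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              workflow.example.com/step: data-processing
          # Scope is the zone, so at most one matching pod per zone.
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
```

Swapping the topologyKey to kubernetes.io/hostname would narrow the rule to one matching pod per node instead of one per zone.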
The scheduler uses a two-phase approach: filtering and scoring. nodeSelector and every requiredDuringSchedulingIgnoredDuringExecution rule (node affinity, pod affinity, and pod anti-affinity) act as filters. If a node fails any of them, it’s removed from consideration. The scheduler then scores the remaining nodes. Each preferredDuringSchedulingIgnoredDuringExecution rule carries a weight from 1 to 100, and a node earns the weight of every preference it satisfies. These affinity scores are combined with the scheduler’s other scoring functions, and the node with the highest total wins.
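The two phases map directly onto the two halves of a nodeAffinity block. The sketch below is illustrative only: the pod name, the zone value zone-a, and the example.com/local-ssd label are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-processing-scored      # hypothetical name
spec:
  restartPolicy: Never
  affinity:
    nodeAffinity:
      # Filtering: nodes without accelerator=gpu are dropped outright.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator
                operator: In
                values: ["gpu"]
      # Scoring: surviving nodes collect the weight of each preference they match.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-a"]          # hypothetical zone name
        - weight: 20
          preference:
            matchExpressions:
              - key: example.com/local-ssd  # hypothetical node label
                operator: Exists
  containers:
    - name: data-processing
      image: tensorflow/tensorflow:latest-gpu
      command: ["python", "process.py"]
```

With these rules, a GPU node in zone-a that also has the local-ssd label collects 100 from node affinity, a GPU node in zone-a alone collects 80, and a GPU node matching neither preference collects 0 but remains eligible. Other scoring functions still contribute to the final ranking, so the weights bias the outcome rather than dictate it.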
A common pitfall is over-constraining your scheduling rules. If you require a pod to run on a GPU node and also require that same node to host another pod from the same workflow, and no node satisfies both conditions, your pod will remain Pending until one does. Running kubectl describe pod on it will show a FailedScheduling event with a message like 0/N nodes are available: N node(s) didn't match node selector, N node(s) had Taints, N node(s) didn't match pod affinity/anti-affinity. The exact wording varies between Kubernetes versions, and each reason carries its own node count.
The next step in mastering scheduling is understanding taints and tolerations, which allow nodes to repel pods unless the pods explicitly tolerate the taint.