The most surprising thing about upgrading AKS clusters with zero downtime is that you don’t actually upgrade the worker nodes in place.
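Let’s see this in action. Imagine we have a running AKS cluster, my-aks-cluster, in the eastus region, with a node pool named nodepool-1 running Kubernetes version 1.26.3. We want to upgrade to 1.27.5. (The hyphenated pool names in this walkthrough are for readability only. AKS requires Linux node pool names to be lowercase alphanumeric and at most 12 characters, so in practice use something like nodepool1.)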
Here’s a snapshot of our current state (conceptually, not actual output):
# Get current node pool version
az aks nodepool show --resource-group my-resource-group --cluster-name my-aks-cluster --name nodepool-1 --query 'orchestratorVersion'
"1.26.3"
# Get current Kubernetes version of the control plane
az aks show --resource-group my-resource-group --name my-aks-cluster --query 'kubernetesVersion'
"1.26.3"
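To upgrade with zero downtime, we’ll create a new node pool with the target Kubernetes version and then migrate our workloads. One constraint shapes the order of operations: AKS does not allow a node pool to run a newer Kubernetes version than the control plane, so the control plane must already be on 1.27.5 before the new pool can be added.
With the control plane on 1.27.5, create the new node pool: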
az aks nodepool add \
--resource-group my-resource-group \
--cluster-name my-aks-cluster \
--name nodepool-2 \
--node-count 3 \
--kubernetes-version 1.27.5 \
--mode User \
--os-type Linux \
--vnet-subnet-id "/subscriptions/YOUR_SUB_ID/resourceGroups/my-resource-group/providers/Microsoft.Network/virtualNetworks/my-vnet/subnets/my-aks-subnet" \
--tags environment=production
This creates a new set of nodes running 1.27.5. Notice we’re adding a new node pool, not modifying the existing one. The control plane is upgraded in a separate operation, and it must be at or ahead of every node pool’s version. The node pools are where the actual compute for your pods lives.
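The version relationship between control plane and node pool can be checked before adding a pool. The following is a minimal sketch, not an official AKS tool: version_le and check_pool_version are hypothetical helper names, and the comparison assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Succeeds (exit 0) when version $1 is less than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: checks that the control plane is at or ahead of the
# Kubernetes version you intend to give a new node pool.
check_pool_version() {
  resource_group=$1
  cluster=$2
  target=$3
  control_plane=$(az aks show \
    --resource-group "$resource_group" \
    --name "$cluster" \
    --query kubernetesVersion \
    --output tsv) || return 1
  if version_le "$target" "$control_plane"; then
    echo "ok: control plane $control_plane can host a $target node pool"
  else
    echo "blocked: upgrade the control plane from $control_plane to $target first" >&2
    return 1
  fi
}
```

With the names from this walkthrough, the call would be check_pool_version my-resource-group my-aks-cluster 1.27.5. It returns non-zero while the control plane is still on 1.26.3.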
Once nodepool-2 is ready and healthy, we can start migrating our applications. This is the critical part. We leverage Kubernetes’ built-in mechanisms like Pod Disruption Budgets (PDBs) and readiness/liveness probes.
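"Ready and healthy" can be checked rather than eyeballed. A minimal sketch, assuming the nodes carry the agentpool=<pool name> label that AKS applies; check_pool_ready is a hypothetical helper name and the 10-minute timeout is an arbitrary choice.

```shell
# Hypothetical helper: fails unless the pool finished provisioning and all
# of its nodes report the Ready condition.
check_pool_ready() {
  resource_group=$1
  cluster=$2
  pool=$3
  state=$(az aks nodepool show \
    --resource-group "$resource_group" \
    --cluster-name "$cluster" \
    --name "$pool" \
    --query provisioningState \
    --output tsv) || return 1
  if [ "$state" != "Succeeded" ]; then
    echo "pool $pool is in state '$state', not Succeeded" >&2
    return 1
  fi
  # Block until every node in the pool is Ready, or give up after 10 minutes.
  kubectl wait --for=condition=Ready node \
    --selector "agentpool=$pool" \
    --timeout=10m
}
```

For this walkthrough the call would be check_pool_ready my-resource-group my-aks-cluster nodepool-2.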
The strategy is to cordon and drain the old node pool (nodepool-1). Cordoning marks the nodes as unschedulable, preventing new pods from landing there. Draining evicts existing pods gracefully.
# Cordon every node in the old node pool (AKS labels each node with agentpool=<pool name>)
kubectl cordon -l agentpool=nodepool-1
# Confirm the old nodes now report SchedulingDisabled
kubectl get nodes -l agentpool=nodepool-1
# Drain the old node pool
kubectl drain -l agentpool=nodepool-1 --ignore-daemonsets --delete-emptydir-data
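For larger pools it can help to drain one node at a time, so a stuck eviction is easy to attribute to a specific node. A minimal sketch of the cordon-then-drain sequence, assuming the agentpool=<pool name> node label that AKS applies; drain_pool is a hypothetical helper name and the per-node timeout is an arbitrary choice.

```shell
# Hypothetical helper: cordons every node in a pool, then drains the nodes
# one at a time, stopping at the first failure.
drain_pool() {
  pool=$1
  # Cordon the whole pool first so evicted pods cannot land on another old node.
  kubectl cordon --selector "agentpool=$pool" || return 1
  for node in $(kubectl get nodes --selector "agentpool=$pool" --output name); do
    # `--output name` prints node/<name>; strip the prefix for readability.
    node=${node#node/}
    echo "draining $node"
    kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=10m || return 1
  done
}
```

Calling drain_pool nodepool-1 stops at the first node that fails to drain within the timeout, which leaves the remaining nodes cordoned but still running their pods.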
As pods are evicted from nodepool-1, Kubernetes reschedules them onto schedulable nodes, which now means the new, healthy nodes in nodepool-2. If your applications have PDBs configured, the drain respects them: an eviction that would drop an application below its minimum number of available replicas is refused and retried, which prevents a complete outage. Readiness probes ensure that new pods receive traffic only once they report ready.
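If an application has no PDB yet, one can be created from the command line before the drain starts. A minimal sketch, assuming the application’s pods are labelled app=<name>; create_pdb and the <name>-pdb naming scheme are hypothetical choices, not a convention from AKS or Kubernetes.

```shell
# Hypothetical helper: creates a PDB that keeps at least $3 pods of an
# application available during voluntary disruptions such as a drain.
create_pdb() {
  app=$1
  namespace=$2
  min_available=$3
  kubectl create poddisruptionbudget "${app}-pdb" \
    --namespace "$namespace" \
    --selector "app=$app" \
    --min-available "$min_available"
}
```

For example, create_pdb my-app default 2 asks Kubernetes to refuse any eviction that would leave fewer than two ready my-app pods. The budget only helps if the Deployment runs more replicas than the minimum; with equal numbers the drain cannot evict anything and will block.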
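After all application pods have been successfully migrated to nodepool-2, you can delete the old nodepool-1. AKS requires at least one System-mode node pool, so if nodepool-1 is the cluster’s only system pool, the delete is rejected until another System-mode pool exists.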
az aks nodepool delete \
--resource-group my-resource-group \
--cluster-name my-aks-cluster \
--name nodepool-1 \
--yes
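Before running the delete, it may be worth confirming that nothing but DaemonSet pods is left on the old nodes. A minimal sketch, assuming the agentpool=<pool name> node label; pods_left_on_pool is a hypothetical helper name.

```shell
# Hypothetical helper: prints namespace, name and owner kind of every pod
# still running on the pool's nodes, ignoring DaemonSet-managed pods.
pods_left_on_pool() {
  pool=$1
  for node in $(kubectl get nodes --selector "agentpool=$pool" --output name); do
    kubectl get pods --all-namespaces \
      --field-selector "spec.nodeName=${node#node/}" \
      --no-headers \
      --output custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
  done | awk '$3 != "DaemonSet"'
}
```

An empty result from pods_left_on_pool nodepool-1 suggests the drain is complete. Any line it prints names a pod that would be terminated abruptly by the delete.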
The control plane upgrade is a separate operation from the node pool migration, and it has to come first, not last: AKS rejects a node pool whose Kubernetes version is newer than the control plane’s. Run it before adding nodepool-2, with --control-plane-only so that AKS does not also upgrade nodepool-1 in place:
az aks upgrade \
--resource-group my-resource-group \
--name my-aks-cluster \
--kubernetes-version 1.27.5 \
--control-plane-only \
--yes
The control plane upgrade has minimal impact on running workloads, because the actual pod execution happens on the nodes, although API server requests may be briefly interrupted while it runs. Once the control plane is upgraded, you can add new node pools at that version or migrate existing ones using the same add-new-pool-and-migrate strategy.
The mental model here is one of blue-green deployment applied to your cluster infrastructure. The control plane upgrade comes first, because the cluster’s management layer must be at or ahead of the version its nodes run. You then create a parallel set of nodes (nodepool-2 with the target version) and carefully shift your workloads onto it (kubectl drain), verifying health at each step.
The key to making this seamless is having robust application-level resilience. If your applications don’t have correctly configured Pod Disruption Budgets, or if their readiness probes are too aggressive or not configured at all, the "graceful" eviction will feel much more like a disruptive failure, even if the underlying AKS upgrade mechanism is sound.
Once your control plane and all node pools are on 1.27.5, you’ll naturally start looking at upgrading to the next patch or minor version, repeating this process.
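Whether every node pool has reached the target version can be checked in one call. A minimal sketch; pools_behind is a hypothetical helper name, and the awk filter simply compares each pool’s reported version string against the target.

```shell
# Hypothetical helper: prints one line per node pool that is not yet on the
# target Kubernetes version; prints nothing when all pools match.
pools_behind() {
  resource_group=$1
  cluster=$2
  target=$3
  az aks nodepool list \
    --resource-group "$resource_group" \
    --cluster-name "$cluster" \
    --query "[].[name, orchestratorVersion]" \
    --output tsv | awk -v target="$target" '$2 != target { print $1 " is on " $2 }'
}
```

For this walkthrough the call would be pools_behind my-resource-group my-aks-cluster 1.27.5, and empty output means no pool is left on an older version.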