With EKS, AWS runs the Kubernetes control plane for you, but the worker nodes that actually run your pods are yours to manage, and that split shapes how an upgrade works.
Here’s how a safe upgrade looks, focusing on the control plane first, then the nodes. The control plane upgrade is a simple click or API call in AWS, but it’s the worker node upgrade that requires careful orchestration.
Let’s walk through upgrading an EKS cluster from 1.27 to 1.28.
The Control Plane Upgrade
This is the easy part. You initiate this via the AWS console or the AWS CLI.
aws eks update-cluster-version \
--name my-eks-cluster \
--kubernetes-version 1.28 \
--region us-east-1
AWS handles the control plane upgrade: the API server, etcd, controller manager, and scheduler are all updated for you. The update typically takes several minutes or more, and workloads already running on your nodes keep running throughout. The key here is that AWS manages this upgrade with high availability. It doesn’t upgrade all instances of each control plane component simultaneously; it rolls out the change so that at least one instance of each component is always available. You may still see brief API interruptions while instances are swapped, so anything that talks to the API server should be prepared to retry.
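Since the update runs asynchronously, it can help to poll its status before touching the nodes. The following is a minimal sketch, not a definitive script: the helper name wait_for_cluster_update and the 30 second poll interval are assumptions made for illustration, and the region is taken from your AWS CLI configuration.

```shell
#!/bin/sh
# Hypothetical helper: poll an EKS update until it leaves the InProgress state.
# Usage: wait_for_cluster_update <cluster-name> <update-id> [poll-seconds]
wait_for_cluster_update() {
  cluster="$1"
  update_id="$2"
  interval="${3:-30}"   # assumed poll interval, adjust as needed
  while :; do
    status=$(aws eks describe-update \
      --name "$cluster" \
      --update-id "$update_id" \
      --query 'update.status' \
      --output text) || return 2
    case "$status" in
      Successful) echo "update $update_id finished"; return 0 ;;
      Failed|Cancelled) echo "update $update_id ended as $status" >&2; return 1 ;;
      *) sleep "$interval" ;;   # still InProgress
    esac
  done
}

# Example (not run here): capture the update ID when starting the upgrade.
# update_id=$(aws eks update-cluster-version --name my-eks-cluster \
#   --kubernetes-version 1.28 --query 'update.id' --output text)
# wait_for_cluster_update my-eks-cluster "$update_id"
```

If you only need to block until the cluster is usable again, aws eks wait cluster-active --name my-eks-cluster is a simpler alternative.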
The Worker Node Upgrade
This is where the real work is. You can’t just upgrade the nodes; you need to replace them with new nodes running the newer Kubernetes version. EKS doesn’t magically update the kubelet and container runtime on your existing instances.
1. Prepare Your Node Groups
You’ll likely have one or more managed or self-managed node groups. For a safe rollout, you’ll want to create a new node group configured for the target Kubernetes version (1.28 in this case) and then drain the old nodes.
Create a New Managed Node Group:
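aws eks create-nodegroup \
--cluster-name my-eks-cluster \
--nodegroup-name ng-1-28-workers \
--node-role arn:aws:iam::<account-id>:role/<node-instance-role> \
--subnets subnet-xxxxxxxxxxxxxxxxx subnet-yyyyyyyyyyyyyyyyy \
--instance-types t3.medium \
--ami-type AL2_x86_64 \
--release-version <latest-ami-version-for-1.28> \
--version 1.28 \
--scaling-config minSize=2,maxSize=5,desiredSize=2 \
--disk-size 100 \
--region us-east-1
--release-version <latest-ami-version-for-1.28>: This pins the EKS optimized AMI release that corresponds to your target Kubernetes version. If you omit it, EKS uses the latest release available for that version. You can find release versions in the AWS documentation or read the recommended one from the public SSM parameter, for example aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.28/amazon-linux-2/recommended/release_version --query Parameter.Value --output text --region us-east-1.
--version 1.28: The Kubernetes version for the node group (for create-nodegroup the flag is --version, not --kubernetes-version). It can’t be newer than the control plane, which is why the control plane is upgraded first.
--node-role: The IAM role ARN for the node instances. It is required; the value shown is a placeholder.
--ami-type AL2_x86_64: Ensure you’re using the correct AMI type (Amazon Linux 2, Bottlerocket, etc.).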
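Before moving any workloads, it is worth confirming that the new nodes have joined the cluster and are Ready. The sketch below relies on the eks.amazonaws.com/nodegroup label that managed node groups put on their nodes; the helper name count_ready_nodes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count schedulable, Ready nodes in a managed node group.
# Nodes that are NotReady or cordoned (Ready,SchedulingDisabled) are not counted.
count_ready_nodes() {
  nodegroup="$1"
  kubectl get nodes \
    -l "eks.amazonaws.com/nodegroup=$nodegroup" \
    --no-headers 2>/dev/null |
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Example (not run here): wait for the node group, then check the count
# against the desiredSize used when creating it.
# aws eks wait nodegroup-active --cluster-name my-eks-cluster \
#   --nodegroup-name ng-1-28-workers
# count_ready_nodes ng-1-28-workers
```

The VERSION column of kubectl get nodes should also show a v1.28 kubelet on the new nodes.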
2. Gradual Rollout and Draining
Once the new node group is up and running, you can start migrating your workloads. The safest way is to cordon and drain the nodes in your old node group one by one.
Cordoning a Node:
kubectl cordon <old-node-name>
This prevents new pods from being scheduled onto that node.
Draining a Node:
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data
--ignore-daemonsets: DaemonSet pods are tied to their node and would be recreated right after eviction, so kubectl drain refuses to proceed when it finds them unless this flag is set. With the flag, it leaves DaemonSet pods in place and evicts everything else.
--delete-emptydir-data: Without this flag, kubectl drain refuses to evict pods that use emptyDir volumes, because that data is deleted when the pod leaves the node. Setting the flag acknowledges the loss, so make sure nothing important lives only in an emptyDir. (Older kubectl releases called this flag --delete-local-data.)
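The cordon and drain steps can be combined into a small loop over the old node group. This is a sketch under a few assumptions: the helper name drain_nodegroup is hypothetical, the nodes belong to a managed node group (so they carry the eks.amazonaws.com/nodegroup label), the 10 minute timeout is an arbitrary choice, and the drain uses --delete-emptydir-data, the current name of the emptyDir flag.

```shell
#!/bin/sh
# Hypothetical helper: cordon every node in the old group, then drain them
# one at a time, stopping at the first failure.
drain_nodegroup() {
  nodegroup="$1"
  nodes=$(kubectl get nodes \
    -l "eks.amazonaws.com/nodegroup=$nodegroup" \
    --no-headers \
    -o custom-columns=NAME:.metadata.name) || return 1

  # Cordon everything first so evicted pods cannot land on another old node.
  for node in $nodes; do
    kubectl cordon "$node" || return 1
  done

  # Drain sequentially; --timeout makes a stuck drain fail instead of waiting.
  for node in $nodes; do
    kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=10m || return 1
  done
}

# Example (not run here):
# drain_nodegroup ng-1-27-workers
```

Cordoning the whole group up front is a design choice: it avoids pods bouncing from one old node to the next as each is drained.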
3. Scaling Down Old Node Groups
After you’ve drained all the nodes in an old node group and confirmed your workloads are running on the new nodes, you can scale down and eventually delete the old node group.
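One way to confirm a drained node is really empty is to list whatever is still scheduled on it and ignore DaemonSet pods. A rough sketch, with a hypothetical helper name (leftover_pods); it reads only the first owner reference of each pod, which is an assumption that holds for typical controller-managed pods.

```shell
#!/bin/sh
# Hypothetical helper: print pods still on a node that are not owned by a
# DaemonSet. Empty output means only DaemonSet pods remain on the node.
leftover_pods() {
  node="$1"
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$node" \
    --no-headers \
    -o 'custom-columns=NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' |
    awk '$3 != "DaemonSet" { print $1 "/" $2 }'
}

# Example (not run here):
# leftover_pods <old-node-name>
```

Completed pods that have not been garbage collected yet can still show up in this list; those are safe to ignore.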
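If it’s a managed node group, scale it to zero first. EKS requires maxSize to be at least 1, so only minSize and desiredSize drop to 0:
aws eks update-nodegroup-config \
--cluster-name my-eks-cluster \
--nodegroup-name ng-1-27-workers \
--scaling-config minSize=0,maxSize=1,desiredSize=0 \
--region us-east-1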
Then, delete the node group:
aws eks delete-nodegroup \
--cluster-name my-eks-cluster \
--nodegroup-name ng-1-27-workers \
--region us-east-1
If it’s a self-managed node group (e.g., using Auto Scaling Groups), scale the ASG down to zero, which terminates its instances, and then delete the ASG (or the CloudFormation stack that created it).
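For the self-managed case, the scale-down might look like the sketch below. The helper name scale_asg_to_zero and the ASG name in the example are hypothetical, and the region comes from your AWS CLI configuration.

```shell
#!/bin/sh
# Hypothetical helper: scale a self-managed node group's Auto Scaling Group
# to zero so that its instances are terminated.
scale_asg_to_zero() {
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$1" \
    --min-size 0 \
    --max-size 0 \
    --desired-capacity 0
}

# Example (not run here), only after every node in the group is drained:
# scale_asg_to_zero my-old-workers-asg
```

Unlike a managed node group, an ASG accepts a maximum size of 0.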
4. Repeat for Other Node Groups
Perform this process for each of your node groups, creating new ones for 1.28 and draining the old 1.27 ones.
The One Thing Most People Don’t Know
When you drain a node, kubectl drain evicts pods and then waits for them to terminate gracefully. If a pod sets terminationGracePeriodSeconds to a very high value (e.g., hours), or its preStop hook takes a long time to run, the drain blocks for that whole period, and because kubectl drain has no timeout by default it can look like it has hung indefinitely. Monitor your kubectl drain commands, pass --timeout so a stuck drain fails instead of waiting forever, and be prepared to force the issue (for example by overriding the grace period with --grace-period), though this should be a last resort.
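You can spot likely offenders before starting a drain by listing the grace period of every pod on the node. A minimal sketch: the helper name slow_pods_on_node and the 300 second default threshold are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list pods on a node whose terminationGracePeriodSeconds
# exceeds a threshold (default 300 seconds, an arbitrary choice).
# Output format: <namespace>/<pod> <grace-period-seconds>
slow_pods_on_node() {
  node="$1"
  limit="${2:-300}"
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$node" \
    --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,GRACE:.spec.terminationGracePeriodSeconds |
    awk -v limit="$limit" '$3 + 0 > limit { print $1 "/" $2 " " $3 }'
}

# Example (not run here):
# slow_pods_on_node <old-node-name> 600
```

This only covers the grace period; a slow preStop hook is bounded by the same value, so pods that pass this check should not hold a drain for longer than the threshold.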
After all nodes are upgraded and workloads are running on the new version, you’ll be ready to tackle the next version.