The Horizontal Pod Autoscaler (HPA) doesn’t scale your EKS cluster itself; it only adjusts the number of pod replicas in a Deployment or StatefulSet. The Cluster Autoscaler is a separate component that adds or removes worker nodes, based largely on whether pods are stuck pending for lack of capacity.
Let’s see it in action. Imagine we have a simple web application deployed to EKS.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m" # 100 millicpu
              memory: "128Mi" # 128 Mebibytes
This deployment starts with just one web-app pod. Now, we want to tell Kubernetes to automatically scale this Deployment based on CPU utilization. We create an HPA resource:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50 # Target 50% CPU utilization across all pods
This HPA object tells Kubernetes: "For the web-app Deployment, keep the number of replicas between 1 and 10. If the average CPU utilization across all web-app pods rises above 50% of their CPU requests, add more pods. If it falls below 50%, remove pods." Scale-down is deliberately slower than scale-up: by default the controller waits out a five-minute stabilization window before removing pods.
To make this work, every container in your pods must declare a CPU request, as shown in the Deployment YAML above. The HPA calculates utilization relative to these requests. If a pod requests 100m of CPU and is currently using 75m, its utilization is 75%. The HPA aggregates this across all pods for the target Deployment.
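The arithmetic behind that percentage can be sketched in a few lines of shell. This is only an illustration of the calculation described above, not the controller's actual code; the function names and the `USAGE:REQUEST` argument format are made up for this example, and all values are in millicores.

```shell
# utilization_percent USAGE_M REQUEST_M
# Prints one pod's CPU usage as an integer percentage of its request.
utilization_percent() {
  echo $(( $1 * 100 / $2 ))
}

# average_utilization "USAGE_M:REQUEST_M" ...
# Sums usage and requests over all pods, then divides once.
# The body runs in a subshell so the helper variables stay local.
average_utilization() (
  total_usage=0
  total_request=0
  for pod in "$@"; do
    total_usage=$(( total_usage + ${pod%%:*} ))
    total_request=$(( total_request + ${pod##*:} ))
  done
  echo $(( total_usage * 100 / total_request ))
)

utilization_percent 75 100          # one pod using 75m of a 100m request: 75
average_utilization 75:100 25:100   # two pods, (75 + 25) / (100 + 100): 50
```

With one busy pod and one idle pod, the average sits exactly on the 50% target, so no scaling would be needed.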
Now, how do we generate load to trigger scaling? We can use a tool like hey or wrk to hit our application. This assumes the Deployment is exposed through a LoadBalancer Service named web-app-service, which is not shown above. First, we need to get that Service’s endpoint.
kubectl get service web-app-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
Let’s assume this returns web-app-service.elb.amazonaws.com. We can then send traffic:
# Install hey if you don't have it: go install github.com/rakyll/hey@latest
hey -z 5m -c 100 http://web-app-service.elb.amazonaws.com
This command keeps 100 concurrent workers sending requests to our service for 5 minutes. As the web-app pods consume more CPU to handle this load, the HPA controller, which runs within the Kubernetes control plane, periodically checks the metrics. When the CPU utilization of the initial single pod exceeds 50%, the HPA raises the replica count of the web-app Deployment.
You’ll see this change by running:
kubectl get deployment web-app
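The READY column will climb from 1/1 as the HPA raises the replica count and the new pods become ready (older kubectl versions show the target as a DESIRED column). The count can jump by more than one at a time, and it keeps rising while the load persists and the average CPU utilization remains above the target. Running kubectl get hpa web-app-hpa shows the current and target utilization alongside the replica count. Kubernetes then creates the new pods. If existing nodes have sufficient capacity, the pods are scheduled there and start up. If there aren’t enough resources on existing nodes, the Cluster Autoscaler (if configured) will kick in to provision new EKS worker nodes.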
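How far the replica count moves in one step follows from the scaling formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A minimal sketch of that formula in integer shell arithmetic follows; the function name and the sample utilization figures are illustrative, not measurements from this cluster.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL
# Integer form of ceil(current_replicas * current_util / target_util).
desired_replicas() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

desired_replicas 1 150 50   # one pod at 150% against a 50% target: 3
desired_replicas 3 60 50    # three pods averaging 60%: ceil(3.6) = 4
```

This is why a single overloaded pod can take the Deployment straight from 1 to 3 replicas rather than stepping through 2.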
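The HPA controller’s default sync period is 15 seconds. On each sync it fetches resource metrics through the metrics API, which is served by metrics-server (which itself scrapes metrics from the kubelet on each node). On EKS you typically need to install metrics-server yourself; without it the HPA has no CPU data to act on. The averageUtilization metric is calculated as sum(current_usage) / sum(requested_resources) across all pods for the target.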
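The controller does not react to every small deviation it sees on a sync. If the ratio of current to target utilization is within a tolerance of 1.0 (0.1 by default, set cluster-wide on the controller manager), it leaves the replica count alone. The check can be sketched as follows; the function name is hypothetical and the comparison is rewritten in integer form so plain shell arithmetic can evaluate it.

```shell
# within_tolerance CURRENT_UTIL TARGET_UTIL [TOLERANCE_PERCENT]
# Succeeds (exit status 0) when |current/target - 1| <= tolerance,
# i.e. when no scaling would be triggered. Tolerance defaults to 10 (%).
within_tolerance() (
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  # |current - target| / target <= tol/100, multiplied out to stay in integers
  [ $(( diff * 100 )) -le $(( $2 * ${3:-10} )) ]
)

for util in 45 54 56; do
  if within_tolerance "$util" 50; then
    echo "${util}%: hold"
  else
    echo "${util}%: rescale"
  fi
done
```

Against a 50% target, readings of 45% and 54% are held and 56% triggers a recalculation, which keeps the replica count from flapping around the target.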
What surprises many people is that an HPA has no notion of cluster capacity. It is purely about managing pod counts for a given workload. If the HPA asks for more pods but no existing worker node has enough free CPU or memory, the new pods remain in a Pending state until nodes are added. This is where the Cluster Autoscaler becomes crucial for EKS environments: it watches for these Pending pods and raises the desired capacity of the node group’s Auto Scaling group, so that AWS launches new EC2 instances that join the cluster as worker nodes.
When you configure an HPA, you are essentially setting a policy for how your application’s resource consumption dictates its replica count. The minReplicas and maxReplicas fields act as hard boundaries, ensuring you never scale below a baseline or above a cost/capacity limit. The target metric (like averageUtilization) is the dynamic trigger, allowing your application to handle fluctuating demand by adjusting its footprint.
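The effect of those boundaries is a simple clamp applied to whatever replica count the metric calls for. A small sketch, with a hypothetical function name and the 1 to 10 range from the HPA above:

```shell
# clamp_replicas DESIRED MIN MAX
# Prints DESIRED limited to the inclusive range [MIN, MAX].
clamp_replicas() {
  if [ "$1" -lt "$2" ]; then
    echo "$2"
  elif [ "$1" -gt "$3" ]; then
    echo "$3"
  else
    echo "$1"
  fi
}

clamp_replicas 14 1 10   # metric asks for 14, maxReplicas caps it: 10
clamp_replicas 0 1 10    # idle workload, minReplicas keeps a baseline: 1
clamp_replicas 4 1 10    # inside the range, passed through unchanged: 4
```

If the workload regularly sits at maxReplicas while utilization stays above target, the ceiling rather than the metric is what is limiting it.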
One thing that often trips people up is the difference between the target types in HPA metrics. Utilization (e.g., 50% CPU) expresses usage as a percentage of the pods’ resource requests and applies only to resource metrics such as CPU and memory. Value and AverageValue compare against an absolute quantity (e.g., a specific number of requests per second) and are what you use for custom or external metrics; AverageValue can also be used for resource metrics. For Utilization to work, every container in the pod must set resources.requests for the metric in question. Without them, Kubernetes can’t calculate the percentage: the HPA reports the metric as unknown (kubectl get hpa shows <unknown> in the TARGETS column) and takes no scaling action for that metric, no matter how much CPU the pods use.
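For comparison, here is a sketch of the same HPA using an AverageValue target, which compares raw per-pod CPU usage against an absolute quantity instead of a percentage of the request. The 50m figure is chosen to match 50% of the 100m request used earlier, and the function name is made up for this example. The function only prints the manifest, so it can be inspected before being piped to kubectl.

```shell
# print_average_value_hpa
# Prints an HPA manifest that targets 50m of CPU per pod on average.
print_average_value_hpa() {
  cat <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: AverageValue
          averageValue: 50m # absolute per-pod average, not a percentage
EOF
}

print_average_value_hpa
```

To apply it against a cluster, pipe the output to kubectl: `print_average_value_hpa | kubectl apply -f -`.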
The next thing you’ll likely want to configure is scaling based on custom metrics, such as the number of items in a Redis queue or the latency of your API.