Scale Pods Automatically with HPA in AKS (2026)

Kubernetes Horizontal Pod Autoscaler (HPA) can actually make your application less available if you don’t understand its core assumptions about resource utilization.

Let’s see it in action. Imagine we have a simple web service deployed in AKS.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m" # Request 0.1 CPU core
            memory: "128Mi" # Request 128 MiB of memory
          limits:
            cpu: "200m" # Limit to 0.2 CPU cores
            memory: "256Mi" # Limit to 256 MiB of memory

Now, we want to autoscale this based on CPU utilization. We’ll create an HPA object:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50 # Target 50% CPU utilization

The targetCPUUtilizationPercentage is the key. The HPA monitors the CPU usage of the pods in the web-app deployment and compares it against the requests.cpu defined in the pod’s spec. If the average CPU utilization across all pods exceeds 50% of their requested CPU (i.e., 50% of 100m, which is 50m), the HPA adds replicas, up to maxReplicas. Conversely, if it drops below 50%, the HPA removes replicas in proportion to the shortfall, never going below minReplicas. Scale-down is also delayed by a stabilization window (five minutes by default), so a brief dip in load does not immediately remove pods.
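The replica count the controller aims for follows the formula given in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), bounded by minReplicas and maxReplicas. The Python sketch below works through it for the HPA above. The function name and the sample numbers are illustrative assumptions, and the real controller additionally applies a tolerance band, a stabilization window, and special handling for pods that are not ready or have no metrics.

```python
import math

def desired_replicas(current_replicas, current_util_pct, target_util_pct=50,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA formula: scale proportionally, then clamp to the bounds."""
    wanted = math.ceil(current_replicas * current_util_pct / target_util_pct)
    return max(min_replicas, min(max_replicas, wanted))

# 1 pod at 150% of its request, target 50% -> 3 pods
print(desired_replicas(1, 150))
# 4 pods at 25% -> 2 pods (a proportional step, not a drop to minReplicas)
print(desired_replicas(4, 25))
# 2 pods at 400% would want 16, but maxReplicas caps it at 10
print(desired_replicas(2, 400))
```

The defaults mirror web-app-hpa (target 50, minReplicas 1, maxReplicas 10), so the three calls show a scale-up, a scale-down, and a request that hits the ceiling.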

The problem is that "utilization" here is calculated as current_cpu / requested_cpu, so the result depends as much on the request you chose as on the work the pod is doing. If you set your CPU requests too low, even a small amount of actual work will look like high utilization, triggering unnecessary scaling. If you set them too high, the pods may never reach the target utilization even under real load, and scaling won’t happen.
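To make the sensitivity to requests concrete, here is a small Python sketch that holds actual CPU usage fixed at 80m (an assumed figure) and varies only the request. The helper name is hypothetical; the arithmetic is simply usage divided by request.

```python
def cpu_utilization_pct(usage_millicores, request_millicores):
    """CPU utilization as the HPA sees it: usage relative to the request."""
    return 100 * usage_millicores / request_millicores

usage = 80  # millicores actually consumed; identical in every case below
for request in (50, 100, 500):
    pct = cpu_utilization_pct(usage, request)
    print(f"request={request}m -> utilization={pct:.0f}%")
```

Against a 50% target, the first two configurations look overloaded while the third looks nearly idle, even though the container is doing identical work in all three.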

Here’s how it works internally:

1. Metrics Collection: The Kubernetes Metrics Server (which AKS provides and manages) collects resource metrics (CPU and memory) from all pods.
2. HPA Controller: The HPA controller periodically (every 15 seconds by default) queries the Metrics Server for the relevant metrics for the scaleTargetRef.
3. Calculation: For CPU, it measures each pod’s current CPU usage against that pod’s CPU request and averages the result across all pods of the target deployment. Because pods from the same Deployment share one request value, this equals total usage divided by total requests.
4. Scaling Decision: The controller divides this average utilization by the targetCPUUtilizationPercentage. A ratio above 1 means scale up and a ratio below 1 means scale down, with the new replica count computed in proportion to the ratio and rounded up. If the ratio is within a tolerance of 1 (0.1 by default), the controller does nothing.
5. Scale Operation: If a scaling decision is made, the HPA controller updates the replicas field in the scaleTargetRef (our Deployment). Kubernetes then handles the creation or deletion of pods to match the new replica count.
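The steps above can be sketched as a single pass of the control loop. This is a simplified Python model, not the controller’s actual code: the pod names, the usage figures, and the reconcile function are assumptions made for illustration, and it leaves out the stabilization window and the handling of unready pods or missing metrics. It also assumes every pod has the same CPU request, as pods from one Deployment do.

```python
import math

def reconcile(pod_usage_m, request_m, target_pct, min_replicas, max_replicas,
              tolerance=0.1):
    """One simplified HPA pass: per-pod CPU usage in, desired replica count out."""
    current = len(pod_usage_m)
    # Calculation: total usage over total requests, expressed as a percentage.
    utilization_pct = 100 * sum(pod_usage_m.values()) / (request_m * current)
    ratio = utilization_pct / target_pct
    # Scaling decision: do nothing while the ratio stays inside the tolerance band.
    if abs(ratio - 1.0) <= tolerance:
        return current, utilization_pct
    desired = math.ceil(current * ratio)
    # Scale operation: the new count is clamped to the HPA's bounds.
    return max(min_replicas, min(max_replicas, desired)), utilization_pct

# Metrics collection: usage in millicores, as the Metrics Server might report it.
usage = {"web-app-a": 90, "web-app-b": 30, "web-app-c": 60}
replicas, utilization = reconcile(usage, request_m=100, target_pct=50,
                                  min_replicas=1, max_replicas=10)
print(replicas, utilization)
```

The busy pod at 90m is averaged with two quieter ones, so the HPA sees 60% overall and asks for four replicas rather than reacting to the hottest pod alone.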

The most surprising thing about HPA is that it doesn’t directly measure how busy your application is in terms of actual work done, but rather how much CPU it’s consuming relative to what it asked for. This is why correctly setting resources.requests is absolutely critical. If your pods are consistently using 150m of CPU, but you’ve only requested 100m, your utilization will appear to be 150%, triggering a scale-up even if the pod is perfectly capable of handling more load within its limits. Conversely, if you request 500m and the pod only uses 50m, you’re at 10% utilization, and HPA won’t scale up.
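The 150m example can be checked with a few lines of Python. The sketch below (the function name is hypothetical) reports the same pod’s usage relative to its request and relative to its limit; only the first figure is what the HPA compares against the target.

```python
def cpu_views(usage_m, request_m, limit_m):
    """Return usage as a percentage of the request and of the limit."""
    return {
        "pct_of_request": 100 * usage_m / request_m,  # what the HPA evaluates
        "pct_of_limit": 100 * usage_m / limit_m,      # ignored by the HPA
    }

# Requests and limits from the web-app Deployment, with 150m of actual usage.
print(cpu_views(usage_m=150, request_m=100, limit_m=200))
```

The pod still has a quarter of its limit to spare, yet at 150% of its request it sits at three times the 50% target.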

When you’re defining your HPA, remember that the targetCPUUtilizationPercentage is a percentage of the requested CPU, not the limited CPU, and not some absolute measure of "busyness." This distinction is crucial for effective autoscaling. If you want to scale on anything other than CPU utilization, you need the autoscaling/v2 API, since autoscaling/v1 only supports a CPU target. The v2 API adds memory targets as well as custom and external metrics (like requests per second or queue length), which also require a metrics adapter and are more complex to set up.

The next issue you’ll likely run into is memory autoscaling, which has a similar but distinct set of gotchas related to how memory utilization is calculated and the fact that applications do not release memory as readily as they release CPU.