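The most surprising thing about right-sizing AKS pods with Vertical Pod Autoscaler (VPA) is that it doesn’t scale pods in the way you might expect: it never adds replicas. Instead, it recommends CPU and memory requests for the pods you already run and, if you allow it, applies them, typically by evicting a pod so that its replacement starts with the new values.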

Let’s see it in action. Imagine we have a simple Nginx deployment running, but we suspect it’s hogging resources or not getting enough.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"

We want VPA to tell us if these requests and limits are appropriate. First, we need to enable VPA. On AKS, the easiest way is the managed VPA feature, which is switched on with a flag on az aks update:

az aks update --name myAKSCluster --resource-group myResourceGroup --enable-vpa

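Once enabled, VPA runs as a set of pods in the kube-system namespace. It watches the pods of any workload targeted by a VerticalPodAutoscaler object and, based on their historical resource usage, recommends new requests. Limits are not recommended separately; they are scaled to keep each container’s original limit-to-request ratio.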

Now, let’s tell VPA to manage our Nginx deployment. We create a VerticalPodAutoscaler resource:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    updateMode: "Off" # Start with "Off" to just get recommendations

Apply this: kubectl apply -f nginx-vpa.yaml.

After a few minutes, VPA will have collected some usage data. We can check its recommendations:

kubectl get vpa nginx-vpa -o yaml

Look for the status.recommendation.containerRecommendations section. You’ll see something like this:

status:
  conditions:
  - lastTransitionTime: "2023-10-27T10:30:00Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: nginx
      lowerBound:
        cpu: 10m
        memory: 180Mi
      target:
        cpu: 10m
        memory: 180Mi
      uncappedTarget:
        cpu: 10m
        memory: 180Mi
      upperBound:
        cpu: 20m
        memory: 300Mi

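This output tells us that, for the nginx container, VPA recommends requests of 10m CPU and 180Mi memory (the target). The target is a percentile-based estimate drawn from the observed usage history, not the single highest value seen. lowerBound and upperBound mark the range within which VPA considers the current requests acceptable; they are not limits. When VPA applies a target, it scales each limit to preserve the container’s original limit-to-request ratio.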

The Problem VPA Solves: Over-provisioning or under-provisioning pod resources is a common issue. Over-provisioning wastes valuable compute capacity on your cluster, driving up costs. Under-provisioning leads to performance degradation, application instability, and OOMKilled (Out Of Memory) errors. Manual tuning is tedious, error-prone, and doesn’t adapt to changing workloads.

How it Works Internally: VPA consists of three components: a recommender, an updater, and a mutating admission webhook (the admission controller). The recommender reads usage for the pods it manages from the Metrics Server, maintains decaying histograms of CPU and memory usage, and derives recommended requests from high percentiles of those histograms, adding a safety margin and extra memory headroom after OOM kills. The updater compares running pods with the recommendation and, in "Auto" or "Recreate" mode, evicts pods whose requests have drifted too far from it. The admission webhook intercepts pod creation requests and, unless updateMode is "Off", injects the recommended requests (and proportionally scaled limits) into the pod’s container specifications.
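
As a rough sketch of what the webhook’s rewrite amounts to, here is how the nginx container’s resources might look on a freshly admitted pod. It assumes the example target of 10m CPU and 180Mi memory, and VPA’s default behavior of preserving the limit-to-request ratio (2:1 for both resources in the Deployment above). The fragment is illustrative, not captured from a real cluster.

```yaml
# Illustrative fragment of the admitted pod spec; not real cluster output.
containers:
- name: nginx
  image: nginx:latest
  resources:
    requests:
      cpu: "10m"       # VPA target (was 100m)
      memory: "180Mi"  # VPA target (was 128Mi)
    limits:
      cpu: "20m"       # 2 x request, same ratio as the original 200m:100m
      memory: "360Mi"  # 2 x request, same ratio as the original 256Mi:128Mi
```

The Deployment object itself is not modified; only the pods created from it carry the new values.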

Levers You Control:

  • targetRef: Specifies which workload (Deployment, StatefulSet, DaemonSet, ReplicaSet) VPA should monitor.
  • updatePolicy.updateMode:
    • "Off": VPA only provides recommendations; no automatic changes are made. This is great for initial analysis.
    • "Initial": VPA applies recommendations only when a pod is first created. It won’t adjust resources for existing pods.
    • "Recreate": VPA applies recommendations at pod creation and also evicts running pods whose requests differ significantly from the recommendation, so that their replacements pick up the new values.
    • "Auto": VPA applies recommendations at pod creation and updates running pods using whatever mechanism it prefers. In current releases that means eviction, so it behaves the same as "Recreate". This is the most powerful mode but can be disruptive, because pods are restarted to apply changes.
  • resourcePolicy: Allows you to set minimum and maximum values for CPU and memory that VPA can recommend, providing guardrails.
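
The levers above can be combined in a single manifest. The following is a minimal sketch, not a recommended configuration: the minAllowed and maxAllowed figures are hypothetical placeholders you would choose from your own load tests, and updateMode stays at "Off" so nothing is applied automatically.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    updateMode: "Off"              # recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      controlledResources: ["cpu", "memory"]
      # RequestsAndLimits is the default; RequestsOnly leaves limits untouched.
      controlledValues: RequestsAndLimits
      minAllowed:                  # hypothetical floor
        cpu: "50m"
        memory: "128Mi"
      maxAllowed:                  # hypothetical ceiling
        cpu: "500m"
        memory: "512Mi"
```

With bounds like these in place, the recommendation’s target is clamped to the allowed range, while uncappedTarget in the status still shows what VPA would have suggested without them.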

The one thing most people don’t realize is that VPA’s "recommendation" is based on observed usage, not necessarily demand. If the observation window covers only periods of low activity, VPA may recommend less than the application needs at peak. This is why updateMode: "Off" is so valuable: it shows you what VPA thinks the workload needs before anything changes, so you can adjust requests manually or set resourcePolicy bounds.

The next concept you’ll likely explore is how VPA interacts with Horizontal Pod Autoscaler (HPA) and when to use one over the other, or even combine them.
