Azure Kubernetes Service (AKS) is a managed Kubernetes offering that simplifies deploying, managing, and scaling containerized applications. While it offers significant advantages in terms of agility and scalability, the associated costs can quickly escalate if not managed proactively. This article explores proven strategies to optimize AKS costs, ensuring you get the most value from your cloud investment.

The most surprising thing about AKS cost optimization is that it often starts outside of Kubernetes itself, focusing on the underlying Azure infrastructure.

Let’s dive into how this works with a practical example. Imagine you have a microservices-based application running on AKS. You’ve noticed your monthly bill is higher than anticipated.

1. Right-Sizing Your Node Pools

The most significant cost driver in AKS is the virtual machines (VMs) that form your node pools. Running oversized VMs, or more of them than you need, means paying for capacity your applications never use.

  • Diagnosis: Use the kubectl top nodes command to see CPU and memory utilization for your nodes.
    kubectl top nodes
    
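    Look for nodes consistently running at low utilization (e.g., <30% CPU, <50% memory). The command shows a point-in-time snapshot, so repeat it over a representative period before drawing conclusions.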
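  • Fix: Reduce the number of nodes, move to a smaller VM size, or both. For instance, if you have a Standard_D4s_v3 node pool with 5 nodes running at low utilization, either keep Standard_D4s_v3 and scale down to 3 nodes (az aks nodepool scale --node-count 3), or move to 5 Standard_D2s_v3 nodes. The VM size of an existing node pool cannot be changed in place, so the second option means adding a new node pool with the smaller size, draining the old nodes, and then deleting the old pool.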
  • Why it works: You pay for the VM instance size and count. Matching VM resources to actual workload demands directly reduces compute costs.
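
The utilization check from the Diagnosis step can be scripted. The following is a minimal sketch, assuming the usual five-column output of kubectl top nodes (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). The function name low_util_nodes is hypothetical, and the default thresholds are the example figures used above.

```shell
#!/usr/bin/env bash
# low_util_nodes: hypothetical helper, not part of kubectl.
# Reads `kubectl top nodes` output on stdin and prints the names of nodes
# whose CPU% and MEMORY% are both below the given thresholds.
low_util_nodes() {
  local cpu_max="${1:-30}" mem_max="${2:-50}"
  awk -v cpu_max="$cpu_max" -v mem_max="$mem_max" '
    # Skip the header and any row without numeric percentages (e.g. <unknown>).
    $3 !~ /^[0-9]+%$/ || $5 !~ /^[0-9]+%$/ { next }
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 < cpu_max + 0 && mem + 0 < mem_max + 0) print $1
    }'
}

# Against a live cluster (needs metrics-server, which AKS deploys by default):
#   kubectl top nodes | low_util_nodes 30 50
```

A node that appears in this list across many samples is a candidate for removal or for a smaller VM size; a single appearance may simply reflect a quiet moment.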

2. Leveraging Azure Spot Virtual Machines

Spot VMs offer significant discounts (up to 90%) on unused Azure capacity. They are ideal for fault-tolerant or non-critical workloads.

  • Diagnosis: Review your current node pools. Identify workloads that can tolerate interruptions, such as batch processing, CI/CD jobs, or stateless web applications that can be scaled down and up quickly.
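  • Fix: Create a new user node pool configured with Spot VMs (a Spot pool cannot serve as the cluster’s system node pool). Setting --spot-max-price to -1 caps the price at the pay-as-you-go rate, so nodes are evicted only for capacity reasons. For example:
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name spotpool \
        --node-count 3 \
        --node-vm-size Standard_D2s_v3 \
        --priority Spot \
        --eviction-policy Delete \
        --spot-max-price -1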
    
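  • Why it works: You pay a discounted rate for spare Azure capacity, optionally capped at a maximum price you set. Azure can evict these VMs on short notice when it needs the capacity back or when the Spot price rises above your cap, but the cost savings are substantial for workloads that tolerate interruption.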
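
AKS applies the taint kubernetes.azure.com/scalesetpriority=spot:NoSchedule to Spot node pools, so pods are scheduled there only if they tolerate it. The sketch below writes a Job manifest that opts in. The Job name, container name, image, and command are placeholders, and the nodeSelector is an optional addition that pins the Job to Spot nodes rather than merely permitting them.

```shell
#!/usr/bin/env bash
# Write a Job manifest that is allowed (toleration) and required (nodeSelector)
# to run on Spot nodes. All names, the image, and the command are placeholders.
cat > spot-batch-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-batch-job
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "echo processing batch; sleep 30"]
EOF

# Apply it against a cluster that has a Spot node pool:
#   kubectl apply -f spot-batch-job.yaml
```

Workloads without the toleration stay on regular node pools, which keeps critical services away from evictable capacity.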

3. Optimizing Persistent Storage

The type and size of your persistent storage (Azure Disks) can be a hidden cost. Over-provisioning disk space or using premium tiers when standard is sufficient adds up.

  • Diagnosis: Inspect your Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) using kubectl get pv and kubectl get pvc. Check the StorageClass and Capacity fields.
    kubectl get pv
    kubectl get pvc --all-namespaces
    
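    Identify PVs whose StorageClass provisions StandardSSD_LRS or Premium_LRS disks that could be Standard_LRS if I/O demands are low, or Premium_ZRS when zone redundancy isn’t strictly necessary. The STORAGECLASS column shows the class name rather than the disk SKU; kubectl describe storageclass <name> shows the parameters behind it. Also, look for PVCs that are much larger than their actual data usage (for example, by running df -h inside the pod that mounts the volume).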
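  • Fix: If a PVC is underutilized, create a new, smaller PVC and migrate the data; an existing volume can be expanded but not shrunk. For less demanding workloads, point the volumeClaimTemplates of your StatefulSet, or the PVC your Deployment mounts, at a StorageClass backed by a cheaper disk SKU. Note that storageClassName takes the name of a StorageClass object, not a disk SKU: on AKS the built-in managed-csi class provisions StandardSSD_LRS disks, while Standard_LRS requires a custom StorageClass. A StatefulSet’s volumeClaimTemplates cannot be edited in place, so this change means recreating the StatefulSet and its PVCs.
    # Example StatefulSet snippet
    volumeClaimTemplates:
    - metadata:
        name: my-pvc
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: managed-csi # StandardSSD_LRS; changed from managed-csi-premium (Premium_LRS)
        resources:
          requests:
            storage: 10Gi # Reduced from 50Gi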
    
  • Why it works: Different Azure Disk types have varying costs per GB and IOPS/throughput. Choosing the right tier and capacity directly impacts storage expenditure.
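
The built-in AKS managed-disk classes cover StandardSSD_LRS and Premium_LRS, so using Standard_LRS (HDD) disks means defining a class yourself. The following is a minimal sketch, assuming the Azure Disk CSI driver (disk.csi.azure.com) is in use; the class name managed-csi-hdd is hypothetical.

```shell
#!/usr/bin/env bash
# Write a custom StorageClass that provisions Standard_LRS (HDD) managed disks.
# The class name managed-csi-hdd is a made-up example, not a built-in class.
cat > managed-csi-hdd.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-hdd
provisioner: disk.csi.azure.com
parameters:
  skuName: Standard_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

# Apply it, then confirm it is listed alongside the built-in classes:
#   kubectl apply -f managed-csi-hdd.yaml
#   kubectl get storageclass
```

A PVC then selects the cheaper tier with storageClassName: managed-csi-hdd. Because allowVolumeExpansion permits growing a volume later, it is reasonable to start with a small request and expand only when usage demands it.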

4. Setting Resource Requests and Limits

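Without proper resource requests and limits, pods can consume more CPU and memory than the scheduler accounted for, which threatens node stability. Requests set far above real usage are just as costly: they reserve capacity that sits idle and force the cluster to scale up unnecessarily.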

  • Diagnosis: Use kubectl top pods --all-namespaces to see actual resource consumption by pods. Compare this to the requests and limits defined in your pod specifications.
    kubectl top pods --all-namespaces
    
    Look for pods with no requests or limits set, pods whose actual usage significantly exceeds their requests, and pods whose requests are far above their actual usage.
  • Fix: Define realistic requests and limits for CPU and memory in your pod YAMLs.
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "200m"
        memory: "256Mi"
    
  • Why it works: requests tell the Kubernetes scheduler how much capacity to reserve for a pod, which prevents nodes from being overcommitted. limits cap what a runaway pod can consume (CPU is throttled, and a container that exceeds its memory limit is terminated), ensuring predictable performance and avoiding unnecessary node scaling.
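
Comparing usage to requests, as the Diagnosis step suggests, is simple arithmetic once CPU quantities are normalized. The helpers below are a hypothetical sketch, not a kubectl feature: they convert Kubernetes CPU quantities to millicores and express observed usage as a percentage of the request.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing observed CPU usage with a pod's request.

# Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to whole millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;
  esac
}

# Print observed usage as a percentage of the request (integer, rounded down).
request_utilization() {
  local used req
  used=$(cpu_to_millicores "$1")
  req=$(cpu_to_millicores "$2")
  echo $(( used * 100 / req ))
}

# Usage comes from `kubectl top pods`, the request from the pod spec.
request_utilization 20m 100m   # prints 20
```

A result far below 100 suggests the request reserves capacity the pod never uses; a result well above 100 means the pod depends on burst headroom and the request is likely too low.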

5. Utilizing Autoscaling Wisely

Both the Cluster Autoscaler (CA) and the Horizontal Pod Autoscaler (HPA) are powerful tools, but misconfiguration can lead to overspending.

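  • Diagnosis: Check the TARGETS column of kubectl get hpa to compare current metric values against their targets, and use kubectl describe hpa to see the configured metrics and recent scaling events. Check the CA’s events (kubectl get events -n kube-system) to see why nodes are being added or removed.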
    kubectl get hpa -n <your-namespace>
    kubectl get events -n kube-system | grep -i cluster-autoscaler
    
    Look for scenarios where the CA is adding nodes that are consistently underutilized, or where HPA is scaling pods aggressively without a clear need.
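  • Fix: Tune HPA thresholds (e.g., targetCPUUtilizationPercentage) to match actual application performance needs. Set the CA’s minimum and maximum node counts per node pool (--min-count and --max-count on az aks nodepool update --update-cluster-autoscaler) to prevent excessive scaling. Consider the Vertical Pod Autoscaler (VPA) for more granular control over pod resource requests, but avoid letting VPA and HPA act on the same CPU or memory metric for the same workload.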
  • Why it works: Fine-tuning autoscaling ensures that compute resources are provisioned only when truly needed and are sized appropriately, preventing idle resources from incurring costs.
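
To see why the HPA target matters for cost, it helps to work through the scaling formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric value / target metric value). The function below is a simplified sketch of that calculation; its name is hypothetical, and it ignores the HPA’s tolerance band, stabilization window, and min/max replica bounds.

```shell
#!/usr/bin/env bash
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT
# Simplified form of the HPA formula: ceil(replicas * current / target).
hpa_desired_replicas() {
  local replicas="$1" current="$2" target="$3"
  echo $(( (replicas * current + target - 1) / target ))   # integer ceiling
}

# 4 replicas averaging 90% CPU:
hpa_desired_replicas 4 90 50   # prints 8
hpa_desired_replicas 4 90 70   # prints 6
```

With the same load, a 50% target asks for two more replicas than a 70% target. A target that is more conservative than the application needs therefore translates directly into extra pods and, through the Cluster Autoscaler, extra nodes.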

6. Employing Reserved Instances or Savings Plans

For predictable baseline workloads, Azure Reserved Instances (RIs) or Azure Savings Plans for compute offer substantial discounts compared to pay-as-you-go pricing.

  • Diagnosis: Analyze your historical AKS VM usage patterns. Identify the consistent baseline demand that your cluster typically requires.
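  • Fix: Purchase Azure Reserved Instances or a Savings Plan that covers this predictable baseline compute consumption. Both require a 1-year or 3-year commitment: a reservation applies to a specific VM series and region that should match your AKS node pools, while a Savings Plan commits to a fixed hourly spend that applies across compute services.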
  • Why it works: Committing to a term length provides Azure with predictable revenue, and in return, you receive significant discounts on your compute costs for the duration of the commitment.
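
One simple way to estimate the baseline from the Diagnosis step is to take the lowest node count observed over a representative period. The sketch below assumes you have exported one node count per sampling interval (how you export it, for example from Azure Monitor, is up to you); the function name and the sample figures are hypothetical.

```shell
#!/usr/bin/env bash
# baseline_coverage: hypothetical helper. Reads one observed node count per
# line on stdin and prints "<baseline nodes> <percent of node-hours covered>",
# where the baseline is the minimum count seen.
baseline_coverage() {
  awk '
    NF == 0 { next }                                   # ignore blank lines
    { n++; total += $1; if (n == 1 || $1 < min) min = $1 }
    END {
      if (n == 0 || total == 0) exit 1                 # no usable samples
      printf "%d %d\n", min, int(min * n * 100 / total)
    }'
}

# Made-up samples: the cluster never drops below 3 nodes.
printf '%s\n' 3 3 5 8 4 3 | baseline_coverage   # prints "3 69"
```

In this made-up sample, committing to 3 nodes would cover about 69% of total node-hours. The remaining burst capacity is better left on pay-as-you-go or Spot pricing, since a commitment is billed whether or not the capacity is used.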

The next challenge you’ll likely face after optimizing node costs is managing egress traffic costs, which can be surprisingly high for chatty microservices or applications that serve large amounts of data externally.
