AKS control plane upgrades are designed to be non-disruptive, but sometimes a surge of activity during the upgrade can cause issues.

Common Causes and Fixes for Control Node Pool Upgrade Surges

The control plane upgrade in AKS is a critical operation that ensures your Kubernetes cluster stays up-to-date with the latest features and security patches. However, a sudden spike in resource consumption or API requests during this process can lead to performance degradation or even temporary unavailability of cluster services. This is often perceived as a "surge" in activity that overwhelms the control plane’s capacity.

Here are the most common reasons this happens and how to address them:

  1. High API Server Load Before Upgrade:

    • Diagnosis: Before initiating an upgrade, monitor the API server’s request rate and latency. You can do this by checking the apiserver_request_total and apiserver_request_duration_seconds_bucket metrics in Azure Monitor or Prometheus. A sustained high rate of requests (e.g., thousands per second) or increasing latency indicates a busy API server.
    • Fix: Identify and scale down or pause non-essential workloads that make frequent API calls, such as intensive CI/CD pipelines, autoscalers that constantly adjust pod counts, or noisy monitoring agents. For example, if a Horizontal Pod Autoscaler is scaling pods aggressively, you might temporarily set minReplicas equal to maxReplicas to pin the replica count, or lengthen behavior.scaleDown.stabilizationWindowSeconds, so that fewer scale-up/scale-down events occur during the upgrade.
    • Why it works: Reducing the number of concurrent API requests frees up resources on the control plane, allowing it to handle the overhead of the upgrade process more smoothly.
  2. Excessive Watch Operations:

    • Diagnosis: Many controllers and applications use Kubernetes watch operations to monitor changes to resources. A large number of active watches, or watches that are established and torn down frequently, can strain the API server. If your RBAC permissions allow it, run kubectl get --raw /metrics to read the API server’s own metrics and look at watch-related series such as apiserver_longrunning_requests (filtered on verb="WATCH") and apiserver_watch_events_total; exact metric names vary by Kubernetes version. Otherwise, use cluster-level monitoring tools that expose API server watch metrics.
    • Fix: Optimize applications that rely heavily on watches. Instead of watching all resources, filter watches to only the specific types and namespaces needed. For instance, if a controller only needs to know about Pod events in the default namespace, ensure its watch is configured accordingly, not for all pods across all namespaces. Consider using informer patterns more efficiently in custom controllers.
    • Why it works: Each watch requires the API server to maintain state and push updates, so reducing the number of active watches lessens this persistent load.
  3. Resource Constraints on the Control Plane:

    • Diagnosis: Although AKS manages the control plane, it still runs with finite CPU and memory. If it is already operating near those limits before the upgrade, it will struggle. Azure Monitor metrics for the AKS cluster can give an indication, though control plane CPU and memory figures are often aggregated and less granular for the user. Because the control plane components run outside your node pools, per-container metrics such as container_cpu_usage_seconds_total and container_memory_working_set_bytes are normally available only for pods scheduled on your nodes; where you lack that visibility, treat rising API server latency and error rates as a proxy for control plane pressure.
    • Fix: Scale up your node pools before the upgrade. While this doesn’t directly increase control plane resources, it can shift some workload pressure away from the control plane if certain operations (like pod scheduling) are heavily reliant on node availability and state. For example, if you have many pending pods due to insufficient node capacity, scaling up your node pools can resolve this and reduce control plane churn.
    • Why it works: More available nodes mean pods can be scheduled faster, reducing the backlog of pending pods and the associated API calls the control plane has to manage.
  4. Large Number of Pods or Resources:

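    • Diagnosis: A very large cluster with tens of thousands of pods, services, or other Kubernetes objects inherently puts more strain on the control plane during any operation, including upgrades. Get a rough estimate of your cluster’s object count with kubectl get all --all-namespaces --no-headers | wc -l, keeping in mind that "all" covers only a subset of resource types (it omits ConfigMaps, Secrets, and custom resources, for example). For a specific type, query it directly, e.g., kubectl get pods --all-namespaces --no-headers | wc -l. Without --no-headers, the header rows are counted as well.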
    • Fix: If possible, prune unnecessary resources or consider distributing workloads across multiple clusters. For ongoing management, use ResourceQuota objects with object-count limits (for example, count/pods or count/configmaps) to prevent runaway resource creation in a namespace. Ensure that any garbage collection or cleanup processes are running efficiently to remove stale objects.
    • Why it works: A smaller, more manageable set of active objects reduces the amount of state the control plane needs to track and update during an upgrade.
  5. Inefficient Admission Controllers:

    • Diagnosis: Custom admission webhooks, and even built-in admission controllers, that perform complex validation or mutation can add significant latency to API requests. Monitor the duration of API requests, paying close attention to requests for resources that admission controllers act on. Where API server metrics are available, apiserver_admission_webhook_admission_duration_seconds breaks this latency down per webhook; API server logs and request tracing can also show it if your monitoring is set up for them.
    • Fix: Review and optimize any custom admission controllers. If they are performing expensive lookups or computations, cache results where possible or delegate heavy lifting to asynchronous processes. If a built-in controller is causing issues, consider if its functionality is strictly necessary for your current cluster state.
    • Why it works: Faster admission controller responses mean API requests complete more quickly, reducing the overall load on the API server.
  6. Network Latency or Bandwidth Issues:

    • Diagnosis: While less common for control plane internal operations, significant network latency between the control plane components or between the control plane and the nodes can exacerbate performance issues. Monitor network metrics for your AKS cluster nodes and any services communicating with the control plane.
    • Fix: Ensure your AKS cluster is deployed in a region with good network connectivity. If using private clusters, verify the VNet peering and any network security group rules are not introducing bottlenecks.
    • Why it works: Reliable and fast network communication ensures that API requests and responses are processed promptly, preventing delays that can compound during an upgrade.
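
The request-rate check from item 1 can be approximated offline. The following is a minimal Python sketch, assuming you have saved two snapshots of the API server metrics in Prometheus text format (for example, the output of kubectl get --raw /metrics, where your permissions allow it) a known number of seconds apart. The function names total_requests and requests_per_second are hypothetical and chosen for illustration only.

```python
import re

# Matches one sample line of the counter in Prometheus text format, e.g.:
#   apiserver_request_total{code="200",verb="GET"} 1234
# Assumes label values contain no "}" character, which holds for the
# labels this metric normally carries (code, verb, resource, and so on).
_SAMPLE = re.compile(r'^apiserver_request_total(?:\{[^}]*\})?\s+(\S+)')


def total_requests(metrics_text: str) -> float:
    """Sum apiserver_request_total across all label combinations."""
    total = 0.0
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)
        if match:
            total += float(match.group(1))
    return total


def requests_per_second(before: str, after: str, interval_seconds: float) -> float:
    """Average request rate between two snapshots taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = total_requests(after) - total_requests(before)
    # A negative delta means the counter was reset (e.g., an API server restart).
    return max(delta, 0.0) / interval_seconds
```

A sustained result in the thousands per second would match the "busy API server" signal described in item 1. In a live setup, a Prometheus rate() query over the same counter gives the equivalent figure without manual snapshots.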

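For the object-count check in item 4, a per-kind breakdown is often more useful than a single line count. The sketch below assumes you have captured the JSON output of a command such as kubectl get pods,services,configmaps --all-namespaces -o json, which returns a list whose entries each carry a "kind" field. The function name count_by_kind is hypothetical.

```python
import json
from collections import Counter


def count_by_kind(kubectl_json: str) -> Counter:
    """Tally objects per kind from `kubectl get ... -o json` list output."""
    doc = json.loads(kubectl_json)
    # Each entry in "items" is a full object; fall back to "Unknown"
    # if an entry is missing its "kind" field.
    return Counter(item.get("kind", "Unknown") for item in doc.get("items", []))
```

Calling count_by_kind(output).most_common(5) then shows which resource types dominate the cluster and are the best candidates for pruning.
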
After resolving these issues, the next error you might encounter is a NodeNotReady status for some of your worker nodes if the control plane was sufficiently impacted to lose communication with them.
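
To check for that condition after an upgrade, one option is to scan the node list for nodes whose Ready condition is not "True". This is a minimal sketch, assuming the input is the JSON output of kubectl get nodes -o json; the function name not_ready_nodes is hypothetical.

```python
import json


def not_ready_nodes(nodes_json: str) -> list[str]:
    """Return names of nodes whose Ready condition is missing or not "True"."""
    names = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # Status is "True", "False", or "Unknown"; anything but "True"
        # means the control plane does not consider the node ready.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names
```

An empty result suggests the worker nodes kept, or have regained, contact with the control plane.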
