AKS clusters are designed for high availability and are typically left running 24/7. However, for development, testing, or other non-production workloads, this can lead to significant, unnecessary costs. A simple and effective way to cut these costs is to stop your AKS clusters when they are not in use and start them again when they are needed.
Let’s see this in action. Imagine you have a development cluster that’s only needed during business hours, say 9 AM to 5 PM, Monday to Friday.
# Start the cluster at the beginning of the workday (for example, 9 AM)
az aks start --resource-group myDevRG --name myDevCluster
# ... work is done ...
# Stop the cluster at the end of the workday (for example, 5 PM)
az aks stop --resource-group myDevRG --name myDevCluster
The az aks start and az aks stop command pair is your primary tool. When you stop a cluster, both the control plane and the worker nodes are stopped, and AKS saves the cluster state so that it can be restored on the next start. This means you stop paying for the compute resources consumed by the nodes. You’ll still be charged for resources that persist while the cluster is stopped, such as the disks backing persistent volumes and reserved public IP addresses, but this is usually small compared to the cost of running the nodes.
The core problem this solves is the "always-on" cost of cloud infrastructure for intermittent workloads. Many services, especially in development and testing, don’t require continuous availability. Leaving them running incurs compute costs for idle resources. By dynamically stopping and starting clusters, you align infrastructure spending directly with actual usage.
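As a rough illustration of that alignment, the arithmetic below compares a cluster that runs all week with one that runs only during the business hours from the earlier example. It is a sketch that assumes 8 hours a day, 5 days a week, and it counts node hours only; storage, public IPs, and startup time are ignored.

```shell
#!/usr/bin/env bash
# Rough node-hour comparison; the schedule values are assumptions for illustration.
hours_always_on=$((24 * 7))   # 168 hours per week
hours_business=$((8 * 5))     # 40 hours per week (9 AM to 5 PM, Monday to Friday)
hours_saved=$((hours_always_on - hours_business))
# Integer percentage of node hours avoided per week
percent_saved=$((hours_saved * 100 / hours_always_on))
echo "Node hours avoided per week: ${hours_saved} of ${hours_always_on} (~${percent_saved}%)"
# prints: Node hours avoided per week: 128 of 168 (~76%)
```

The real saving depends on your VM sizes and on how consistently the cluster is actually stopped outside working hours.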
Internally, when you execute az aks stop, Azure performs the following:
- Deallocates Worker Nodes: The virtual machines that make up your worker nodes are stopped and deallocated. This is the primary cost saving: you are no longer billed for the CPU and RAM of these VMs.
- Preserves Cluster State: The control plane (the Kubernetes API server, etcd, and related components) is stopped as well, so the API server is not reachable while the cluster is stopped. AKS saves the cluster state, which means your cluster’s configuration, deployed applications, and persistent volumes are preserved. The exception is standalone pods (pods not managed by a controller such as a Deployment), which are not restored. When you start the cluster again, the control plane and worker nodes are brought back and your workloads are rescheduled onto the nodes.
- Network Configuration Intact: Network security groups, load balancers, and DNS records associated with your cluster are generally unaffected, ensuring seamless resumption of services.
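The effect of a stop can be checked from the command line by reading the cluster's power state. The sketch below wraps that check in two helper functions; the function names are hypothetical, and it assumes the Azure CLI is installed and logged in.

```shell
#!/usr/bin/env bash
# Print the cluster's power state as reported by Azure ("Running" or "Stopped").
aks_power_state() {
  local rg="$1" name="$2"
  az aks show --resource-group "$rg" --name "$name" --query powerState.code --output tsv
}

# Stop the cluster only when it is not already stopped.
stop_if_running() {
  local rg="$1" name="$2" state
  state="$(aks_power_state "$rg" "$name")" || return 1
  if [ "$state" = "Stopped" ]; then
    echo "Cluster '$name' is already stopped."
  else
    az aks stop --resource-group "$rg" --name "$name" && echo "Cluster '$name' stopped."
  fi
}

# Example usage (commented out so the definitions can be sourced safely):
# stop_if_running myDevRG myDevCluster
```

Checking the state first keeps the operation safe to run repeatedly, which matters once it is triggered on a schedule.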
To implement this effectively, you’ll want to automate the start and stop process. Azure Automation Runbooks or Azure Functions are excellent choices for this. You can schedule these runbooks to execute at specific times.
Here’s an example of an Azure Automation Runbook script to stop a cluster:
# Azure Automation Runbook script (PowerShell)
param(
    [string] $ResourceGroupName,
    [string] $AKSClusterName
)
Write-Output "Attempting to stop AKS cluster '$AKSClusterName' in resource group '$ResourceGroupName'..."
try {
    # Authenticate with the Automation Account's managed identity.
    # If you use a service principal instead, replace this with the matching Connect-AzAccount call.
    # The identity needs a role that allows stopping the cluster (for example, Contributor on the resource group).
    Connect-AzAccount -Identity -ErrorAction Stop | Out-Null
    $aks = Get-AzAksCluster -ResourceGroupName $ResourceGroupName -Name $AKSClusterName -ErrorAction Stop
    # In recent Az.Aks versions the state is exposed as PowerState.Code ("Running" or "Stopped");
    # verify the property name against the module version installed in your Automation Account.
    if ($aks.PowerState.Code -eq "Stopped") {
        Write-Output "AKS cluster '$AKSClusterName' is already stopped."
    } else {
        Stop-AzAksCluster -ResourceGroupName $ResourceGroupName -Name $AKSClusterName -ErrorAction Stop
        Write-Output "AKS cluster '$AKSClusterName' stopped successfully."
    }
} catch {
    Write-Error "Failed to stop AKS cluster '$AKSClusterName'. Error: $($_.Exception.Message)"
    # Rethrow so the runbook job is marked as failed
    throw
}
And a corresponding script to start it:
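# Azure Automation Runbook script (PowerShell)
param(
    [string] $ResourceGroupName,
    [string] $AKSClusterName
)
Write-Output "Attempting to start AKS cluster '$AKSClusterName' in resource group '$ResourceGroupName'..."
try {
    # Authenticate with the Automation Account's managed identity (or adapt for a service principal)
    Connect-AzAccount -Identity -ErrorAction Stop | Out-Null
    $aks = Get-AzAksCluster -ResourceGroupName $ResourceGroupName -Name $AKSClusterName -ErrorAction Stop
    # In recent Az.Aks versions the state is exposed as PowerState.Code ("Running" or "Stopped");
    # verify the property name against the module version installed in your Automation Account.
    if ($aks.PowerState.Code -eq "Running") {
        Write-Output "AKS cluster '$AKSClusterName' is already running."
    } else {
        Start-AzAksCluster -ResourceGroupName $ResourceGroupName -Name $AKSClusterName -ErrorAction Stop
        Write-Output "AKS cluster '$AKSClusterName' started successfully."
    }
} catch {
    Write-Error "Failed to start AKS cluster '$AKSClusterName'. Error: $($_.Exception.Message)"
    # Rethrow so the runbook job is marked as failed
    throw
}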
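You would then create schedules for these runbooks within Azure Automation. For instance, to match the business-hours example, a "StartDevCluster" schedule could run every weekday at 8:55 AM and a "StopDevCluster" schedule every weekday at 5:05 PM. If the cluster only needs to be off over the weekend, a Monday-morning start and a Friday-evening stop are enough, at the price of paying for idle nodes on weeknights.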
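The scheduling rule itself is simple enough to express as a small predicate, which can be handy if you drive the start and stop from a script or a timer-triggered function rather than from fixed schedules. The sketch below is an illustration only: the function name is hypothetical and the 9 AM to 5 PM, Monday to Friday window is taken from the earlier example.

```shell
#!/usr/bin/env bash
# Return 0 (true) when the cluster should be running: Monday to Friday, 9 AM to 5 PM.
# $1: ISO day of week, 1 (Monday) to 7 (Sunday), as printed by `date +%u`
# $2: hour of day, 00 to 23, as printed by `date +%H`
should_be_running() {
  local dow="$1"
  local hour="${2#0}"   # normalize "09" to "9" so the comparison does not depend on leading zeros
  hour="${hour:-0}"     # "0" with its zero stripped becomes empty; restore it
  [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 17 ]
}

# Report the desired state for the current local time.
if should_be_running "$(date +%u)" "$(date +%H)"; then
  echo "Desired state: Running"
else
  echo "Desired state: Stopped"
fi
```

The predicate uses the local time of the machine it runs on, so the time zone of the scheduler needs to match the time zone your team works in.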
Crucially, when you stop an AKS cluster, the worker nodes are deallocated, but the underlying Azure Virtual Machine Scale Set (VMSS) is not deleted. This is what allows services to resume quickly. The VMSS configuration, including the OS image, extensions, and desired node count, is maintained. When you issue the start command, Azure brings the control plane back up and starts the nodes again from this configuration. The start time varies, but it’s typically a few minutes for a standard cluster.
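Because a start takes a few minutes, scripts that depend on the cluster usually need to wait until the nodes are ready. The following sketch starts the cluster, refreshes the local kubeconfig, and blocks until every node reports Ready. The function name and the default timeout are assumptions, and it presumes both the Azure CLI and kubectl are installed.

```shell
#!/usr/bin/env bash
# Start the cluster and wait for all nodes to become Ready.
# $1: resource group, $2: cluster name, $3: optional kubectl timeout (default 600s)
start_and_wait() {
  local rg="$1" name="$2" timeout="${3:-600s}"
  # az aks start returns once the start operation has completed
  az aks start --resource-group "$rg" --name "$name" || return 1
  # Refresh kubeconfig so kubectl talks to the restarted cluster
  az aks get-credentials --resource-group "$rg" --name "$name" --overwrite-existing || return 1
  # Block until every node reports the Ready condition, or the timeout expires
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout"
}

# Example usage (commented out so the definition can be sourced safely):
# start_and_wait myDevRG myDevCluster 900s
```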
One common pitfall is not granting the Automation Account’s Managed Identity (or the Service Principal used for authentication) the necessary permissions. The identity needs a role that permits start and stop operations on the cluster, such as "Contributor" scoped to the subscription or resource group containing the AKS cluster. If you prefer a narrower role such as "Azure Kubernetes Service Contributor Role", verify that it actually allows the start and stop operations before relying on it. Without these permissions, your scheduled jobs will fail.
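A role assignment of this kind can be created with the Azure CLI. The sketch below scopes the assignment to the resource group; the function name is hypothetical, the principal ID is the object ID of the managed identity or service principal, and the default role is an assumption you may want to narrow.

```shell
#!/usr/bin/env bash
# Grant an identity a role on the resource group that contains the AKS cluster.
# $1: object (principal) ID of the identity, $2: resource group, $3: optional role (default Contributor)
grant_cluster_access() {
  local principal_id="$1" rg="$2" role="${3:-Contributor}"
  local scope
  # Resolve the resource group's full resource ID to use as the assignment scope
  scope="$(az group show --name "$rg" --query id --output tsv)" || return 1
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" \
    --scope "$scope"
}

# Example usage (commented out; the ID below is a placeholder):
# grant_cluster_access "<principal-object-id>" myDevRG
```

Scoping to the resource group rather than the whole subscription keeps the automation identity's permissions limited to the clusters it manages.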
Once your clusters are reliably starting and stopping, you’ll likely encounter the challenge of managing the IP addresses of services that were exposed via LoadBalancer type Kubernetes Services. When a cluster stops and starts, the public IP address assigned to the LoadBalancer might change if it was dynamically allocated.
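One way to avoid a changing address is to reserve a static public IP up front and have the Service use it. The sketch below only covers the reservation step; the function name and the IP resource name are hypothetical, and it assumes the address is created in the cluster's node resource group.

```shell
#!/usr/bin/env bash
# Reserve a static Standard-SKU public IP in the cluster's node resource group
# and print the allocated address.
# $1: resource group, $2: cluster name, $3: name for the public IP resource
reserve_static_ip() {
  local rg="$1" cluster="$2" ip_name="$3"
  local node_rg
  # Look up the resource group that holds the cluster's infrastructure resources
  node_rg="$(az aks show --resource-group "$rg" --name "$cluster" --query nodeResourceGroup --output tsv)" || return 1
  az network public-ip create \
    --resource-group "$node_rg" \
    --name "$ip_name" \
    --sku Standard \
    --allocation-method Static \
    --query publicIp.ipAddress \
    --output tsv
}

# Example usage (commented out; the IP name is a placeholder):
# reserve_static_ip myDevRG myDevCluster myDevServiceIP
```

The Kubernetes Service then has to be pointed at the reserved address, for example through the Azure load balancer annotations on the Service; check the current AKS documentation for the exact annotation names before relying on them.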