AKS cluster creation is surprisingly easy to get wrong in ways that only bite you months or years later.
Let’s walk through setting up a production-ready AKS cluster.
First, we need to think about networking. AKS can use Azure CNI or Kubenet. For production, Azure CNI is the way to go. It gives each pod its own IP address from the VNet subnet, which means better network policy enforcement and direct pod-to-pod communication without NAT.
Here’s an az aks create command that provides a good starting point:
az aks create \
--resource-group myResourceGroup \
--name myAKSCluster \
--node-count 3 \
--enable-addons monitoring \
--generate-ssh-keys \
--network-plugin azure \
--vnet-subnet-id "/subscriptions/YOUR_SUBSCRIPTION_ID/resourceGroups/myResourceGroup/providers/Microsoft.Network/virtualNetworks/myVNet/subnets/myAKSNetworkSubnet" \
--dns-service-ip 10.240.0.10 \
--service-cidr 10.240.0.0/24 \
--docker-bridge-address 172.17.0.1/16 \
--attach-acr myACRRegistryName
Let’s break down the important bits for production:
--node-count 3: Starting with at least 3 nodes lets your workloads survive the loss of a single node, provided the pods are replicated and spread across nodes.
--enable-addons monitoring: This sets up Azure Monitor for containers, giving you crucial visibility into cluster performance, logs, and metrics. You’ll need this for troubleshooting and capacity planning.
--network-plugin azure: As mentioned, this is Azure CNI. It gives pods routable VNet IP addresses, which is the basis for granular network security. Note that enforcing Kubernetes network policies also requires a policy engine, which is selected separately with --network-policy.
--vnet-subnet-id ...: Use a dedicated subnet for your AKS nodes. Don’t share it with other resources. With Azure CNI in its traditional mode, pods take their IPs from this same subnet and each node pre-allocates addresses for its maximum pod count (30 by default), so the subnet must cover your initial nodes, future scaling, and all of their pods. A /24 is a reasonable starting point for a small cluster, but it fills up quickly.
--dns-service-ip 10.240.0.10: This is the IP address for the Kubernetes DNS service (CoreDNS). It must fall inside the --service-cidr range and must not be the first address in that range.
--service-cidr 10.240.0.0/24: This defines the virtual IP address range for Kubernetes services. It must not overlap with your VNet, the node subnet, or any network connected to the VNet. A /24 provides 256 addresses, which is usually sufficient for services.
--docker-bridge-address 172.17.0.1/16: This was the bridge network for Docker on the nodes. Recent AKS versions run containerd rather than Docker, and the flag is deprecated there and has no effect. If you do set it, ensure it doesn’t conflict with any other network ranges in your VNet.
--attach-acr myACRRegistryName: If you’re using Azure Container Registry (ACR) for your images, attaching it here grants the AKS cluster pull (read-only) access, simplifying image pulls.
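To make the subnet sizing concrete, here is a rough back-of-the-envelope calculation in plain bash. It is a sketch under stated assumptions: traditional Azure CNI (not Overlay or dynamic IP allocation), the default of 30 pods per node, one extra node for upgrade surge, and all node pools sharing the same subnet. The function names required_ips and usable_ips are made up for this illustration.

```shell
#!/usr/bin/env bash
# Rough subnet sizing for traditional Azure CNI (assumption: every pod IP
# comes from the node subnet and is pre-allocated per node).

# IPs needed = (nodes + surge nodes) * (max pods per node + 1 for the node itself)
required_ips() {
  local nodes=$1 max_pods=$2 surge=${3:-1}
  echo $(( (nodes + surge) * (max_pods + 1) ))
}

# Usable addresses in a subnet of the given prefix length.
# Azure reserves 5 addresses in every subnet.
usable_ips() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

echo "3 nodes, 30 pods each: $(required_ips 3 30) IPs needed"
echo "8 nodes, 30 pods each: $(required_ips 8 30) IPs needed"
echo "/24 subnet: $(usable_ips 24) usable IPs"
```

Under these assumptions the initial 3 nodes need 124 addresses and fit comfortably in a /24 (251 usable), but 8 nodes would need 279 and would not, so it is worth running the numbers against your expected maximum node count before creating the subnet.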
Beyond these initial settings, consider a managed ingress option such as the application routing add-on (the older HTTP application routing add-on is not recommended for production use), and configure separate node pools for different workload types (e.g., GPU nodes, high-CPU nodes). For true production, you’ll also want to consider a private cluster for enhanced security, meaning the API server is not exposed to the public internet.
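As an illustration of a dedicated node pool for one workload type, the following sketch adds a GPU pool that only accepts pods which explicitly tolerate it. The pool name, VM size, label, and taint values are assumptions chosen for the example, not recommendations; check which GPU sizes are available in your region before using it.

```shell
# Sketch only: gpupool, Standard_NC4as_T4_v3, workload=gpu and sku=gpu
# are illustrative values, not taken from the cluster above.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpupool \
  --mode User \
  --node-count 1 \
  --node-vm-size Standard_NC4as_T4_v3 \
  --labels workload=gpu \
  --node-taints sku=gpu:NoSchedule
```

The taint keeps general workloads off the expensive nodes, while the label gives GPU workloads something to select on with a nodeSelector plus a matching toleration.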
One thing that surprises people about AKS autoscaling is that the limits are enforced per node pool. If your workloads would need 10 nodes but the node pool’s autoscaler maximum is 5, the pool stops at 5 nodes regardless of pod resource requests, and the pods that don’t fit stay in the Pending state.
Here’s an example of configuring a user node pool with specific VM sizes and autoscaling enabled:
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name userpool \
--node-count 2 \
--node-vm-size Standard_DS3_v2 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 5
In this setup, the userpool will start with 2 Standard_DS3_v2 VMs. The cluster autoscaler will ensure there’s always at least 1 node and will scale up to a maximum of 5 nodes in this pool if needed, based on pod scheduling demands.
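The per-pool cap described earlier can be modeled as a simple clamp. The snippet below is a toy model in plain bash, not the autoscaler’s actual algorithm; effective_nodes is a hypothetical helper name. The commented command at the end shows one way to raise the cap on the existing pool with az aks nodepool update, where 10 is just an example value.

```shell
#!/usr/bin/env bash
# Toy model: the autoscaler keeps a pool's node count within [min, max],
# no matter how many nodes the pending pods would require.
effective_nodes() {
  local wanted=$1 min=$2 max=$3
  if (( wanted < min )); then
    echo "$min"
  elif (( wanted > max )); then
    echo "$max"
  else
    echo "$wanted"
  fi
}

effective_nodes 10 1 5   # workloads want 10 nodes, pool is capped at 5
effective_nodes 0 1 5    # idle pool still keeps its minimum of 1 node

# To raise the cap on the userpool from the example:
# az aks nodepool update \
#   --resource-group myResourceGroup \
#   --cluster-name myAKSCluster \
#   --name userpool \
#   --update-cluster-autoscaler \
#   --min-count 1 \
#   --max-count 10
```

If pods sit in Pending while the pool is already at its maximum, raising --max-count (or adding another pool) is usually the fix, not tuning pod requests.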
One aspect that trips many people up is the kube-system namespace. While you can deploy workloads there, it’s strongly discouraged for production applications. This namespace is reserved for core Kubernetes components and AKS-managed services. Deploying your own applications here can lead to conflicts and make it harder to manage cluster upgrades.
The next thing you’ll likely want to tackle is implementing robust CI/CD pipelines to deploy your applications reliably.