Azure VM Scaling Secrets: Beyond CPU

Azure VMs give you a lot of power, but managing their availability and scale can feel like juggling chainsaws. You’ve got Availability Sets, Virtual Machine Scale Sets (VMSS), and Auto-Scale Rules, all seemingly doing similar things. The key is understanding they solve different problems, and often, you’ll use them in combination.

Let’s see what happens when you actually use these things. Imagine you’re running a web application and want to handle traffic spikes and ensure uptime.

First, you’d deploy your VMs. Let’s say you have a basic setup with two VMs.

# Create a resource group
az group create --name my-vm-scale-rg --location eastus

# Create an Availability Set
az vm availability-set create --resource-group my-vm-scale-rg --name my-availability-set --platform-fault-domain-count 2 --platform-update-domain-count 3

# Create VM 1 within the Availability Set
az vm create \
  --resource-group my-vm-scale-rg \
  --name vm-app-01 \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --availability-set my-availability-set

# Create VM 2 within the Availability Set
az vm create \
  --resource-group my-vm-scale-rg \
  --name vm-app-02 \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --availability-set my-availability-set

At this point, vm-app-01 and vm-app-02 are in my-availability-set. Azure spreads them across different fault domains (hardware that shares a power source and network switch, roughly a physical rack) and update domains (groups that are rebooted together during planned maintenance). As a result, a single hardware failure or maintenance event should affect no more than one of these VMs at a time. You can confirm that both VMs belong to the Availability Set with:

az vm show -g my-vm-scale-rg -n vm-app-01 -d --query "{Name:name, AS:availabilitySet.id}"
az vm show -g my-vm-scale-rg -n vm-app-02 -d --query "{Name:name, AS:availabilitySet.id}"
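To make the fault domain and update domain idea concrete, here is a small sketch that models placement as simple round-robin over the counts passed to the Availability Set (2 fault domains, 3 update domains). Round-robin is an assumption made for illustration, and place_vms is a hypothetical helper, not part of the Azure CLI; Azure decides the real placement.

```shell
# Hypothetical model: round-robin placement over the domain counts used above.
# Azure chooses the actual placement; this only illustrates the idea.
FAULT_DOMAINS=2
UPDATE_DOMAINS=3

# Print the assumed fault domain (fd) and update domain (ud) for each of N VMs.
# Argument: number of VMs.
place_vms() {
  i=0
  while [ "$i" -lt "$1" ]; do
    echo "vm-$i fd=$((i % FAULT_DOMAINS)) ud=$((i % UPDATE_DOMAINS))"
    i=$((i + 1))
  done
}

place_vms 2
# vm-0 fd=0 ud=0
# vm-1 fd=1 ud=1
```

With two VMs, each lands in its own fault domain and its own update domain, which is why one rack failure or one maintenance wave touches only one of them. Try place_vms 4 to see domains being shared once there are more VMs than domains.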

This setup gives you high availability for a fixed number of VMs. But what if traffic increases, and two VMs aren’t enough? That’s where Virtual Machine Scale Sets (VMSS) come in. VMSS are designed to manage and scale a group of identical, load-balanced VMs. Instead of manually creating VMs one by one, you define a desired state.

Let’s create a VMSS. Notice we don’t explicitly mention availability sets here; VMSS handles that automatically.

# Create a VMSS
az vmss create \
  --resource-group my-vm-scale-rg \
  --name my-vmss \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --instance-count 2 \
  --load-balancer my-load-balancer \
  --vnet-name my-vnet \
  --subnet my-subnet

This command creates a VMSS named my-vmss with two initial instances. Azure spreads these instances for high availability, much as it does for an Availability Set. The --load-balancer flag places the instances in the backend pool of a load balancer (created under that name if it doesn't already exist), so incoming traffic can be distributed across them once load-balancing rules are in place. You can scale this group manually:

# Scale up to 5 instances
az vmss scale --resource-group my-vm-scale-rg --name my-vmss --new-capacity 5

# Scale down to 3 instances
az vmss scale --resource-group my-vm-scale-rg --name my-vmss --new-capacity 3
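After a manual scale operation it can be useful to read the capacity back rather than assume it took effect. The sketch below wraps az vmss show in two helpers; vmss_capacity and vmss_has_capacity are hypothetical names introduced here, and the query assumes the instance count is reported under sku.capacity in the command's output.

```shell
# Hypothetical helpers, not part of the Azure CLI.

# Print the current instance count of a scale set.
# Arguments: resource group, scale set name.
vmss_capacity() {
  az vmss show --resource-group "$1" --name "$2" --query sku.capacity --output tsv
}

# Succeed only when the scale set reports the expected instance count.
# Arguments: resource group, scale set name, expected count.
vmss_has_capacity() {
  [ "$(vmss_capacity "$1" "$2")" = "$3" ]
}
```

For example, after scaling down you could run vmss_has_capacity my-vm-scale-rg my-vmss 3 && echo "scaled to 3" to confirm the result.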

Now, the real magic for handling unpredictable traffic comes with Auto-Scale Rules. These rules dynamically adjust the number of VMSS instances based on performance metrics.

# Get the resource ID of the VMSS
vmss_resource_id=$(az vmss show --resource-group my-vm-scale-rg --name my-vmss --query id -o tsv)

# Create an auto-scale setting
az monitor autoscale create \
  --resource-group my-vm-scale-rg \
  --name my-vmss-autoscale \
  --resource "$vmss_resource_id" \
  --min-count 2 \
  --max-count 10 \
  --count 3 # Default count

# Add a scale-out rule (add 2 instances when average CPU exceeds 75% over 5 minutes)
az monitor autoscale rule create \
  --resource-group my-vm-scale-rg \
  --autoscale-name my-vmss-autoscale \
  --condition "Percentage CPU > 75 avg 5m" \
  --scale out 2

# Add a scale-in rule (remove 1 instance when average CPU drops below 25% over 5 minutes)
az monitor autoscale rule create \
  --resource-group my-vm-scale-rg \
  --autoscale-name my-vmss-autoscale \
  --condition "Percentage CPU < 25 avg 5m" \
  --scale in 1

Here, my-vmss-autoscale keeps the scale set between 2 and 10 instances. If average CPU across the VMSS instances stays above 75% over a 5-minute window, autoscale adds 2 instances. If it drops below 25% over the same window, autoscale removes 1 instance. The --count parameter of az monitor autoscale create sets the default instance count, which autoscale falls back to when it cannot read the metrics it needs. The --min-count and --max-count values bound every scaling action, and the individual rules decide when instances are added or removed and how many.
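The interaction between the two rules and the min/max boundaries is easier to see as a small model. The sketch below is not how Azure implements autoscale; next_capacity is a hypothetical function that only reproduces the thresholds, step sizes, and limits described above, takes whole-number CPU percentages, and ignores cooldown periods and any other safeguards the service applies.

```shell
# Hypothetical model of the two rules; not an Azure API.
MIN_COUNT=2
MAX_COUNT=10

# Arguments: current instance count, average CPU percent (integer) over the window.
# Prints the instance count the rules would target.
next_capacity() {
  target=$1
  if [ "$2" -gt 75 ]; then
    target=$(($1 + 2))   # scale-out rule: add 2 instances
  elif [ "$2" -lt 25 ]; then
    target=$(($1 - 1))   # scale-in rule: remove 1 instance
  fi
  # Clamp the result to the autoscale setting's boundaries.
  if [ "$target" -lt "$MIN_COUNT" ]; then target=$MIN_COUNT; fi
  if [ "$target" -gt "$MAX_COUNT" ]; then target=$MAX_COUNT; fi
  echo "$target"
}

next_capacity 3 80   # prints 5
next_capacity 9 90   # prints 10 (clamped to the maximum)
next_capacity 2 10   # prints 2 (already at the minimum)
```

Note the asymmetry: scaling out by 2 and in by 1 means the set grows quickly under load and shrinks cautiously afterwards, which is a common way to avoid removing capacity too eagerly.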

A point that often surprises people is that a VMSS spreads its own instances for availability, so you don’t need to create a separate Availability Set for it. When you create a VMSS, Azure distributes its instances across fault domains (and, in Uniform orchestration mode, update domains) to keep the group available. The scale set’s platformFaultDomainCount property records how many fault domains are used; its default depends on the orchestration mode and on what the Azure region supports.

When you configure auto-scale rules, you’re essentially telling Azure to dynamically manage the number of these highly available instances. You define the minimum and maximum capacity, and the conditions (like CPU utilization, network in/out, or custom metrics) that trigger scaling events. The system then orchestrates the creation or deletion of VM instances to meet these demands, all while maintaining the underlying availability guarantees. The load balancer is crucial here; it ensures that as instances are added or removed, traffic is seamlessly routed to healthy ones.

The az vmss create command applies this distribution by default, without any extra flags. If you need finer control over the number of fault domains, you can pass --platform-fault-domain-count during VMSS creation; the update domain count is not something az vmss create lets you set. For most use cases, the default distribution managed by Azure is sufficient.

The next concept you’ll likely encounter is managing stateful applications within VMSS, which requires more advanced configurations like using shared storage or implementing distributed databases.