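Consul’s Autopilot helps you upgrade a Consul cluster without service interruptions, but the name is a bit of a misnomer. In open-source Consul it gives you an assisted upgrade rather than a fully automatic one: Autopilot watches server health and cleans up dead servers, while you still roll out the new binaries yourself.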
Let’s see it in action. Imagine you have a Consul cluster running on three servers: consul-server-1, consul-server-2, and consul-server-3. You want to upgrade from version 1.10.0 to 1.11.0.
First, confirm that Autopilot is active on your existing Consul servers. There is no agent flag to switch it on: Autopilot has run by default on server agents since Consul 0.8, and you tune it through the autopilot block in the agent configuration or with consul operator autopilot set-config. The servers in this example are started as follows.
On consul-server-1:
consul agent -node="consul-server-1" -server -bootstrap-expect=3 -config-dir="/etc/consul.d" -enable-local-script-checks=true
On consul-server-2:
consul agent -node="consul-server-2" -server -bootstrap-expect=3 -config-dir="/etc/consul.d" -enable-local-script-checks=true
On consul-server-3:
consul agent -node="consul-server-3" -server -bootstrap-expect=3 -config-dir="/etc/consul.d" -enable-local-script-checks=true
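To see what Autopilot is currently configured to do, you can read its settings from a running server. The sketch below is a small helper, assuming that consul operator autopilot get-config prints one "Key = value" pair per line (for example CleanupDeadServers = true). The function name autopilot_cleanup_enabled is made up for this example.

```shell
# Succeeds if the Autopilot configuration read from stdin has dead-server
# cleanup turned on. Assumes "Key = value" lines, one setting per line.
autopilot_cleanup_enabled() {
  grep -Eq '^CleanupDeadServers[[:space:]]*=[[:space:]]*true[[:space:]]*$'
}

# Usage (on a live cluster):
#   consul operator autopilot get-config | autopilot_cleanup_enabled \
#     && echo "dead-server cleanup is on"
```

Reading the configuration from stdin keeps the helper independent of how you reach the cluster, so the same check works on saved output.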
Autopilot does not perform the rolling restart for you, so upgrade one server at a time. Stop the Consul agent on a server, replace the binary with the new version, and then restart it. A common practice is to upgrade the followers first and the current leader last.
On consul-server-1:
# Stop the running agent before replacing the binary
sudo systemctl stop consul
# Download and install the Consul 1.11.0 binary
sudo apt-get update && sudo apt-get install -y unzip
wget https://releases.hashicorp.com/consul/1.11.0/consul_1.11.0_linux_amd64.zip
unzip -o consul_1.11.0_linux_amd64.zip
sudo mv consul /usr/local/bin/
# Confirm the installed version, then start the agent again
/usr/local/bin/consul version
sudo systemctl start consul
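Before installing a downloaded binary on a server, it is worth checking the archive against the published checksums. The following is a minimal sketch, assuming the sha256sum tool is available and that the checksum file uses the usual "<hash>  <filename>" layout. The function name verify_release and the checksum URL in the usage comment are not from the steps above and should be treated as assumptions.

```shell
# verify_release ZIP SUMS
# Succeeds only if the SHA-256 of ZIP matches its entry in SUMS.
# SUMS is expected to contain "<hash>  <filename>" lines.
verify_release() {
  zip_path=$1
  sums_path=$2
  zip_name=$(basename "$zip_path")
  # Look up the expected hash for this file name.
  expected=$(awk -v f="$zip_name" '$2 == f { print $1 }' "$sums_path")
  # Compute the actual hash of the downloaded archive.
  actual=$(sha256sum "$zip_path" | awk '{ print $1 }')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (URL is an assumption based on the release naming scheme):
#   wget https://releases.hashicorp.com/consul/1.11.0/consul_1.11.0_SHA256SUMS
#   verify_release consul_1.11.0_linux_amd64.zip consul_1.11.0_SHA256SUMS \
#     || echo "checksum mismatch, do not install"
```

The function fails when the file has no entry in the checksum list, so a wrong file name is treated the same as a bad hash.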
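After the agent restarts with the new binary, it rejoins the cluster and catches up on the Raft log. Autopilot tracks the health of every server, including whether it is in contact with the leader and how far behind it is in the log. The key is that quorum is maintained throughout: with three servers, quorum is two, so only one server may be offline at any moment. Wait for the restarted server to report as healthy before you touch the next one. Consul Enterprise goes a step further with automated upgrade migration, where Autopilot itself moves voting rights from old-version servers to new-version ones.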
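Because only one server should be offline at a time, a rolling upgrade script needs to pause until the restarted server is back. One way to sketch that is a small retry helper. Both wait_until and server_is_alive are hypothetical names, and server_is_alive assumes that consul members prints the node name in the first column and its status in the third.

```shell
# wait_until ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it succeeds, sleeping DELAY seconds between tries.
# Returns 0 on success, 1 if every attempt failed.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# server_is_alive NODE
# Succeeds if NODE is listed as "alive" in `consul members` output
# (assumed layout: node name in column 1, status in column 3).
server_is_alive() {
  consul members | awk -v n="$1" '$1 == n && $3 == "alive" { ok = 1 } END { exit !ok }'
}

# Usage: give consul-server-1 up to 30 tries, 2 seconds apart.
#   wait_until 30 2 server_is_alive consul-server-1 || echo "server did not come back"
```

A status of alive only shows that the agent has rejoined the gossip pool, so treat it as a first gate rather than proof that the server has caught up on the Raft log.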
To monitor the upgrade, you can use the Consul CLI:
consul members -detailed
You’ll see the servers listed with their respective build versions. Wait until the upgraded server shows as alive, then repeat the stop, replace, and restart steps on consul-server-2 and finally on consul-server-3.
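Checking versions by eye gets tedious on larger clusters, so the same check can be scripted. The sketch below reads consul members output from stdin and assumes the default (non-detailed) table layout with the node type in the fourth column and the build version in the fifth. The function name servers_at_version is hypothetical.

```shell
# servers_at_version VERSION
# Reads `consul members` output on stdin and succeeds only if at least one
# server is listed and every server reports VERSION in the Build column.
servers_at_version() {
  awk -v want="$1" '
    NR > 1 && $4 == "server" {
      seen++
      if ($5 != want) { bad++ }
    }
    END {
      if (seen > 0 && bad == 0) exit 0
      exit 1
    }
  '
}

# Usage:
#   consul members | servers_at_version 1.11.0 && echo "all servers upgraded"
```

Client agents are skipped on purpose, because they are upgraded separately from the servers.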
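The zero-downtime property comes from Raft quorum rather than from any magic. A cluster of N servers needs (N/2) + 1 of them (using integer division) to elect a leader and commit writes, so a three-server cluster tolerates exactly one server being offline. While a server is stopped for its upgrade, it stays in the Raft peer set but is unreachable; if it happened to be the leader, the remaining servers elect a new one. Autopilot tracks the health of every server and, when new servers are added, keeps them as non-voters until they have been healthy for the configured stabilization time. In Consul Enterprise, upgrade migration builds on this: new-version servers join as non-voters, and once enough of them are present Autopilot promotes them and demotes the old-version servers. Taking down more than one server at a time in a three-server cluster loses quorum, which causes service discovery and configuration lookups that need a leader to fail.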
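The quorum arithmetic is easy to get wrong under pressure, so here is a small worked sketch of the (N/2) + 1 rule using shell integer arithmetic. The function names quorum_size and failure_tolerance are made up for this example.

```shell
# quorum_size N: servers required for quorum in an N-server cluster.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# failure_tolerance N: servers that can be offline without losing quorum.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for a few common cluster sizes.
for n in 3 5 7; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerance=$(failure_tolerance "$n")"
done
```

For the three-server cluster in this walkthrough, this gives a quorum of 2 and a tolerance of 1, which is why the servers are upgraded strictly one at a time.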
Here’s the crucial part most people miss: replacing the binary is only half of the restart. When a server comes back with a new Consul binary, it loads its latest local Raft snapshot and then catches up on any log entries it missed. Autopilot’s health checks account for this. A server that has not been in contact with the leader within last_contact_threshold, or that trails the leader by more than max_trailing_logs entries, is reported as unhealthy. If a server is significantly behind, the leader sends it a full snapshot instead of individual log entries to speed up its recovery; that part is Raft’s own replication rather than Autopilot.
Once all servers have been upgraded and restarted with the new binary, the cluster will be running Consul 1.11.0 across all servers. Autopilot keeps running afterwards, so there is nothing to switch off.
After a successful Autopilot upgrade, the next thing you’ll likely encounter is the need to upgrade your Consul clients or explore more advanced monitoring configurations.