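A Consul snapshot is more than a copy of files on disk: it is a point-in-time serialization of the state that the Consul servers replicate through Raft, including key-value entries, the service catalog, prepared queries, sessions, and ACLs. It does not include agent configuration files, which you need to back up separately.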

Let’s see what a snapshot looks like in action. Imagine you have a running Consul cluster and you want to back up its current state.

consul snapshot save -token=your-secret-token consul-backup-$(date +%Y%m%d%H%M%S).snap

This command will generate a file, something like consul-backup-20231027103000.snap, containing the servers’ replicated state at that exact moment. This includes service registrations, health check statuses, key-value store contents, and node information.
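A backup is only useful if the file can be read back, so it helps to inspect each snapshot right after saving it. The following is a minimal sketch, not a hardened script: the function name backup_consul and its directory argument are hypothetical, and it assumes the ACL token, if one is needed, is supplied through the CONSUL_HTTP_TOKEN environment variable.

```shell
#!/usr/bin/env bash
# Sketch: take a timestamped snapshot, then inspect it to confirm it is readable.
# backup_consul and its directory argument are hypothetical names.

backup_consul() {
  local backup_dir="${1:-.}"
  local file
  file="${backup_dir}/consul-backup-$(date +%Y%m%d%H%M%S).snap"

  # Save the snapshot. Command output goes to stderr so that stdout
  # carries only the resulting file path.
  consul snapshot save "$file" >&2 || return 1

  # inspect reads the local file only; it does not contact the cluster.
  consul snapshot inspect "$file" >&2 || return 1

  printf '%s\n' "$file"
}
```

Calling backup_consul /var/backups/consul (an example path) prints the path of the new snapshot on success and returns a non-zero status if either step fails.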

The core problem Consul snapshots solve is guaranteeing cluster consistency during recovery. Unlike a simple file copy, a snapshot captures the state machine at a single Raft index. When you restore from a snapshot, you’re not just putting files back; you’re rewinding the state machine to a specific, consistent point. This is crucial because Consul is a distributed system where data is replicated. Copying the data directory of a running server might grab metadata from one point and data from another, leading to an inconsistent and unrecoverable state.

Internally, Consul uses Raft for consensus. A snapshot is not a copy of the Raft log; it is a serialized representation of the state machine that results from applying the log up to a certain index. When you create a snapshot, Consul:

  1. Forwards the request to the current Raft leader (unless -stale is set), so the snapshot reflects all committed writes.
  2. Serializes the entire state machine (FSM).
  3. Packages this serialized state with metadata, including the Raft index and term it corresponds to.

When you restore, the servers replace their state machine with the serialized state from the snapshot. Writes that were committed after the snapshot was taken are not replayed; they are discarded. The cluster comes back in exactly the consistent state it had at the snapshot’s index.
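The index recorded with a snapshot can be read back with consul snapshot inspect. The sketch below extracts it; snapshot_index is a hypothetical helper name, and the parsing assumes the plain-text output lists the metadata as "Index" followed by a number before any other table, which you should verify against your Consul version.

```shell
#!/usr/bin/env bash
# Sketch: print the Raft index recorded in a snapshot file.
# snapshot_index is a hypothetical helper name. It assumes the plain-text
# output of `consul snapshot inspect` has a metadata line "Index <number>"
# that appears before any per-type table.

snapshot_index() {
  local file="$1"
  # Take the first line whose first field is exactly "Index", then stop,
  # so later rows that happen to start with the same word are ignored.
  consul snapshot inspect "$file" | awk '$1 == "Index" { print $2; exit }'
}
```

Comparing the indexes of two snapshot files is a quick way to tell which one is newer without relying on file names or modification times.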

The primary levers you control are the -stale flag and the ACL token.

  • -stale: By default the snapshot is taken from the leader. With -stale, any server can produce it. This lets you take a snapshot when the cluster has no leader, which is useful during an outage, at the cost of possibly missing the most recent writes.
  • -token: This is the ACL token used to authorize the request. If your Consul cluster has ACLs enabled, snapshot operations require a token with management privileges. Without one, the command fails with a permission denied error.

Instead of passing -token on the command line, you can set the CONSUL_HTTP_TOKEN environment variable or point -token-file at a file containing the token, which keeps the secret out of your shell history.
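Passing a token as a command-line argument exposes it in shell history and process listings. One alternative is the -token-file option of the Consul CLI, sketched below. The function name save_with_token_file and both of its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: save a snapshot with the ACL token read from a file.
# save_with_token_file and its two arguments are hypothetical names.

save_with_token_file() {
  local token_file="$1" out_file="$2"

  # Fail early if the token file is missing or unreadable.
  if [ ! -r "$token_file" ]; then
    echo "token file not readable: $token_file" >&2
    return 1
  fi

  consul snapshot save -token-file="$token_file" "$out_file"
}
```

Restricting the token file to the user that runs the backup (for example with chmod 600) keeps the secret off the command line and away from other accounts.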

Restoring is just as straightforward:

consul snapshot restore -token=your-secret-token consul-backup-20231027103000.snap

This command sends the snapshot to a running agent over the HTTP API, and the servers then replace their state with the contents of the snapshot file. The agent is not stopped or restarted. Restore is a low-level Raft operation that is not designed to tolerate server failures while it runs, so it is intended mainly for disaster recovery. Rather than restoring into a live cluster that is still serving traffic, you typically bring up a fresh cluster of servers and restore into it. The restore only needs to run once; Raft replicates the restored state to the other servers.
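Since a restore replaces the cluster’s state, it is worth validating the snapshot file before sending it anywhere. The sketch below assumes the restore is issued against a reachable agent through the Consul CLI; restore_consul is a hypothetical helper name, and it relies on consul snapshot inspect returning a non-zero status for a file it cannot read.

```shell
#!/usr/bin/env bash
# Sketch: validate a snapshot file locally, then restore it.
# restore_consul is a hypothetical helper name.

restore_consul() {
  local file="$1"

  if [ ! -f "$file" ]; then
    echo "snapshot file not found: $file" >&2
    return 1
  fi

  # inspect only reads the local file. If it fails, stop here,
  # before anything is sent to the cluster.
  consul snapshot inspect "$file" >/dev/null || return 1

  consul snapshot restore "$file"
}
```

The function returns the exit status of the restore itself, so a calling script can decide whether to continue with any follow-up checks.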

The most surprising aspect of Consul snapshot restoration is how it handles the Raft log: nothing that happened after the snapshot is replayed. The restore installs the snapshot at an index higher than the cluster’s current one, which forces every follower to adopt the snapshot as well. This means that if you have a snapshot from index 1000 and the current Raft log goes up to index 1200, restoring brings the cluster back to the state at index 1000, and the effects of the operations at indices 1001 through 1200 are lost. A restore is a rollback, not a catch-up, so any writes made between the snapshot and the restore have to be re-applied by other means.

Restoring requires a running agent and a cluster with an elected leader, because the command works through the HTTP API rather than by editing files on disk. If the agent is unreachable or no leader is available, the restore fails. You do not need to stop Consul to restore, and you should not overwrite files in the data directory by hand.
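Because consul snapshot restore goes through the HTTP API, it needs a reachable agent and an elected leader. A minimal preflight sketch follows; cluster_has_leader is a hypothetical helper name, and it assumes the output of consul operator raft list-peers shows the word "leader" in the State column of the leading server and nowhere else.

```shell
#!/usr/bin/env bash
# Sketch: confirm the target cluster has a leader before restoring.
# cluster_has_leader is a hypothetical helper name. It assumes that
# `consul operator raft list-peers` prints the word "leader" only in
# the State column of the leading server.

cluster_has_leader() {
  # Returns 0 if a line containing the whole word "leader" is found,
  # non-zero if the command fails or no such line exists.
  consul operator raft list-peers 2>/dev/null | grep -qw leader
}
```

A restore script could call this first and exit with a clear message when it returns non-zero, instead of letting the restore itself fail.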

The next logical step after mastering snapshot backups and restores is understanding how to automate this process and integrate it into a disaster recovery plan.
