The most surprising thing about Elasticsearch snapshots is that they aren’t just for disaster recovery; they’re also a dependable way to move onto a freshly built cluster for upgrades and complex data migrations.

Let’s say you have an Elasticsearch cluster running version 7.10.2, and you want to upgrade to 7.17.8. You can’t just stop the old cluster, start a new one, and hope for the best. You need a way to safely transfer all your data and cluster state. This is where snapshots come in.

Here’s a simplified view of how it works in action. Imagine you have a cluster with a few indices: logs-2023-10-26, users, and products.

First, you need a place to store your snapshots. This is a "snapshot repository." It’s typically a shared filesystem location or an object storage service like S3.

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots-bucket-12345",
    "region": "us-east-1",
    "role_arn": "arn:aws:iam::123456789012:role/ElasticsearchSnapshotRole"
  }
}

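This tells Elasticsearch to use an S3 bucket named my-es-snapshots-bucket-12345 in us-east-1 as the repository, assuming an IAM role ElasticsearchSnapshotRole has the necessary permissions. Note that role_arn (and region as a repository-level setting) follow the convention of Amazon’s managed service; a self-managed cluster using the repository-s3 plugin usually takes its credentials from the Elasticsearch keystore or the instance’s IAM role instead.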

Now, to take a snapshot of everything:

PUT _snapshot/my_s3_repository/snapshot_before_upgrade?wait_for_completion=true
{
  "indices": "_all",
  "ignore_unavailable": false,
  "include_global_state": true
}

This command initiates a snapshot named snapshot_before_upgrade in my_s3_repository. indices: "_all" means it will capture all indices. include_global_state: true is crucial; it captures cluster-wide state such as persistent cluster settings, index templates, and ILM policies. wait_for_completion=true makes the command block until the snapshot is done, which is convenient for a walkthrough but can hold the HTTP connection open for a long time on a large cluster.

Once the snapshot is complete, you have a point-in-time copy of your entire cluster state and data.
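Before relying on the snapshot, it is worth confirming that it actually finished cleanly. The sketch below assumes you have already fetched the JSON body of GET _snapshot/my_s3_repository/snapshot_before_upgrade and parsed it into a dict; the helper name and the sample response are illustrative, and the sample is trimmed to the fields the check reads.

```python
def snapshot_is_complete(response: dict, name: str) -> bool:
    """Return True only if the named snapshot succeeded with no failed shards."""
    for snap in response.get("snapshots", []):
        if snap.get("snapshot") == name:
            shards = snap.get("shards", {})
            # "SUCCESS" is the state reported for a fully successful snapshot;
            # other states (for example "PARTIAL" or "FAILED") are rejected.
            return snap.get("state") == "SUCCESS" and shards.get("failed", 0) == 0
    return False  # the snapshot is not present in the response


# Hypothetical response body, reduced to the fields used above.
sample = {
    "snapshots": [
        {
            "snapshot": "snapshot_before_upgrade",
            "state": "SUCCESS",
            "shards": {"total": 9, "failed": 0, "successful": 9},
        }
    ]
}

if __name__ == "__main__":
    print(snapshot_is_complete(sample, "snapshot_before_upgrade"))
```

Treating anything other than SUCCESS with zero failed shards as incomplete is a deliberately strict choice for an upgrade scenario.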

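To upgrade, you’d spin up a new Elasticsearch cluster with the target version (7.17.8). This new cluster would also have the my_s3_repository configured, ideally with readonly set to true in the repository settings, because only one cluster at a time should have write access to a given repository.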
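When two clusters point at the same repository, a common precaution is to register it as read-only on the cluster that only restores from it. The following is a minimal sketch that builds, but does not send, such a registration request using only the standard library. The helper name, the localhost URL, and the absence of authentication are assumptions for illustration; readonly is the repository setting being demonstrated.

```python
import json
import urllib.request


def readonly_repo_request(base_url, repo, settings, repo_type="s3"):
    """Build a PUT _snapshot/<repo> request that registers the repo read-only."""
    body = {"type": repo_type, "settings": {**settings, "readonly": True}}
    return urllib.request.Request(
        url=f"{base_url}/_snapshot/{repo}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed address of the new cluster; adjust (and add credentials) as needed.
req = readonly_repo_request(
    "http://localhost:9200",
    "my_s3_repository",
    {"bucket": "my-es-snapshots-bucket-12345"},
)

if __name__ == "__main__":
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```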

Then, you’d restore the snapshot to the new cluster:

POST _snapshot/my_s3_repository/snapshot_before_upgrade/_restore
{
  "indices": "_all",
  "ignore_unavailable": false,
  "include_global_state": true,
  "rename_pattern": "(.+)",
  "rename_replacement": "$1_restored"
}

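This command restores all indices from snapshot_before_upgrade. The rename_pattern and rename_replacement are shown as an example of restoring under a different naming convention, for instance when restoring alongside indices that already exist; here every index gets a _restored suffix. On an empty new cluster you would normally omit them so the indices keep their original names. If an index with the same name already exists on the target, it must be closed (or deleted) first and must have the same number of shards as the index in the snapshot.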
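To see what a rename will produce before running the restore, you can approximate it locally. This sketch assumes the rename is applied as a regex replace-all over each index name, with $1-style group references in the replacement; it translates those references into Python’s syntax, so it is only an approximation of the Java regex behaviour and does not handle escaped dollar signs or backslashes. The function name is illustrative.

```python
import re


def preview_rename(index_names, rename_pattern, rename_replacement):
    """Approximate how rename_pattern/rename_replacement would map index names."""
    # Translate Java-style "$1" group references into Python's "\g<1>" form.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return {
        name: re.sub(rename_pattern, py_replacement, name)
        for name in index_names
    }


if __name__ == "__main__":
    mapping = preview_rename(
        ["logs-2023-10-26", "users", "products"], "(.+)", "$1_restored"
    )
    for old, new in mapping.items():
        print(f"{old} -> {new}")
```

A narrower pattern such as logs-(.+) leaves names that do not match unchanged, which is a quick way to confirm that only the intended indices are renamed.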

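The mental model here is that a snapshot is an immutable, logically independent point-in-time copy. Under the hood, snapshots in a repository are incremental and share unchanged segment files, but each one can be restored or deleted on its own. When you restore, Elasticsearch reads the snapshot and recreates the indices and, if requested, the cluster state. The restore is not transactional: if it fails midway, the indices that were being restored can be left red or only partially recovered, and you need to delete them and retry rather than expect an automatic rollback.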
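After a restore, one way to confirm the target cluster is in a sane state is to look at per-index health. The sketch below assumes you have parsed the body of GET _cluster/health?level=indices into a dict; the helper name and the sample response are illustrative, and the sample keeps only the status field.

```python
def red_indices(health: dict) -> list:
    """Return the names of indices whose health status is red, sorted."""
    return sorted(
        name
        for name, info in health.get("indices", {}).items()
        if info.get("status") == "red"
    )


# Hypothetical response body, reduced to the fields used above.
sample_health = {
    "status": "red",
    "indices": {
        "users": {"status": "green"},
        "products": {"status": "yellow"},
        "logs-2023-10-26": {"status": "red"},
    },
}

if __name__ == "__main__":
    for name in red_indices(sample_health):
        print(f"needs attention: {name}")
```

Any index reported here is a candidate for deletion and a fresh restore attempt once the underlying cause is fixed.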

The ignore_unavailable setting is a common point of confusion. It concerns missing or closed indices, not shards: if set to true, any index named in the request that doesn’t exist is silently skipped, and if set to false, the request fails instead. Whether a snapshot may proceed when some primary shards are unavailable is controlled by a separate option, partial, which defaults to false. For critical operations like upgrades, you almost always want both left at false to ensure a complete copy.

Understanding how Elasticsearch manages shard allocation during a restore is key. When you restore, Elasticsearch doesn’t put shards back on the nodes they came from. It creates the index metadata, the cluster’s allocation logic assigns each primary shard to a node just as it would for a new index, and that node then pulls the shard’s files from the snapshot repository. Replicas are built from the restored primaries afterwards.
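To watch where restored shards end up, the cat recovery API is a convenient source. The sketch below assumes you have parsed the output of GET _cat/recovery?format=json into a list of dicts, and that each row carries type, stage, and target_node fields with snapshot-based recoveries reported as type "snapshot"; treat those field names and values as assumptions to check against your version. The helper name and sample rows are illustrative.

```python
from collections import Counter


def restore_summary(rows):
    """Summarise snapshot-based shard recoveries: progress and node placement."""
    snapshot_rows = [r for r in rows if r.get("type") == "snapshot"]
    done = sum(1 for r in snapshot_rows if r.get("stage") == "done")
    per_node = Counter(r.get("target_node") for r in snapshot_rows)
    return {"done": done, "total": len(snapshot_rows), "per_node": dict(per_node)}


# Hypothetical rows, reduced to the fields used above.
sample_rows = [
    {"index": "users", "shard": "0", "type": "snapshot", "stage": "done", "target_node": "node-1"},
    {"index": "products", "shard": "0", "type": "snapshot", "stage": "index", "target_node": "node-2"},
    {"index": "users", "shard": "0", "type": "peer", "stage": "done", "target_node": "node-2"},
]

if __name__ == "__main__":
    summary = restore_summary(sample_rows)
    print(f"{summary['done']}/{summary['total']} restored shards done")
    print(summary["per_node"])
```

Peer recoveries are excluded on purpose, since those are replicas being copied from already restored primaries rather than shards being read from the repository.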

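The real value of snapshots lies in their ability to capture cluster state as well as data, including settings that are not part of any individual index. This means things like persistent cluster settings, index templates, and ILM policies are captured and restored. Natively managed users and roles live in the .security system index rather than in the cluster state, so they come along only if that index is part of the snapshot, and file-based configuration on each node is never included. Having one consistent, repeatable source for all of this is what makes snapshot-based upgrades and migrations predictable.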

Because restored shards go through the normal allocator, the usual rules apply: disk watermarks, allocation filtering, and awareness settings all influence where shards land. The new cluster therefore needs enough nodes and free disk space for the restored indices before you start the restore.

The next concept you’ll likely encounter is optimizing snapshot performance, particularly understanding how the repository settings max_snapshot_bytes_per_sec and max_restore_bytes_per_sec throttle throughput.
