Elastic APM data, like any other data in Elasticsearch, can be backed up using snapshots.
Here’s how it works:
First, let’s see a snapshot in action. Imagine you have an Elasticsearch cluster with APM data. You want to back up this data.
PUT _snapshot/my_backup_repository/snapshot_2023-10-27_10-00-00
{
  "indices": "apm-*-2023.10.*",
  "include_global_state": false,
  "ignore_unavailable": true
}
This command starts a snapshot. my_backup_repository is the name of a registered repository where your snapshots are stored, and snapshot_2023-10-27_10-00-00 is the name of this specific snapshot. Snapshot names must be lowercase, so the timestamp avoids the uppercase T of the ISO 8601 format. indices: "apm-*-2023.10.*" backs up all APM indices that match this pattern (e.g., apm-7.17.0-transaction-2023.10.25). include_global_state: false means we’re not backing up cluster-wide state such as persistent settings and index templates. ignore_unavailable: true tells Elasticsearch to skip indices that are missing or closed instead of failing the request. It does not cover unavailable shards; that is what the separate partial option is for.
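The request above can also be assembled programmatically, which helps when snapshot names are generated from the clock. The following is a minimal sketch using only the Python standard library. The helper names snapshot_name and snapshot_request are hypothetical, and the code only builds the request rather than sending it to a cluster.

```python
import json
from datetime import datetime, timezone


def snapshot_name(prefix, when):
    """Build a timestamped snapshot name.

    The result is lowercased because Elasticsearch rejects snapshot
    names that contain uppercase letters.
    """
    return f"{prefix}_{when.strftime('%Y-%m-%d_%H-%M-%S')}".lower()


def snapshot_request(repository, name, index_pattern):
    """Return (method, path, JSON body) for a create-snapshot call."""
    body = {
        "indices": index_pattern,
        "include_global_state": False,
        "ignore_unavailable": True,
    }
    return "PUT", f"/_snapshot/{repository}/{name}", json.dumps(body, indent=2)


if __name__ == "__main__":
    when = datetime(2023, 10, 27, 10, 0, 0, tzinfo=timezone.utc)
    method, path, body = snapshot_request(
        "my_backup_repository", snapshot_name("snapshot", when), "apm-*-2023.10.*"
    )
    print(method, path)
    print(body)
```

The printed method, path, and body can be pasted into Kibana Dev Tools or sent with any HTTP client.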
The primary problem this solves is data loss. If your cluster fails, indices get corrupted, or you accidentally delete data, a snapshot allows you to restore your APM data to a previous state. This is critical for debugging historical performance issues or auditing past events.
Internally, Elasticsearch snapshots are stored in a repository, which must be registered before the first snapshot is taken. This repository can be a shared filesystem (like an NFS mount), an Amazon S3 bucket, a Google Cloud Storage bucket, or Azure Blob Storage. When you take a snapshot, Elasticsearch copies the index files (the actual data) and metadata for the specified indices to this repository. Snapshots are incremental: only files that are not already in the repository are copied. Even so, each snapshot is logically self-contained, so it can be restored or deleted without affecting the others.
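As an illustration of repository registration, here is a minimal sketch that builds the body for a shared filesystem repository. The function name fs_repository_body and the path /mnt/backups/apm are assumptions made for the example. On a real cluster the location also has to be listed under path.repo in elasticsearch.yml on every node.

```python
import json


def fs_repository_body(location, compress=True):
    """Body for PUT /_snapshot/<repository> with the shared filesystem type.

    `location` is a hypothetical mount point here; it must be reachable
    from every node and allowed via path.repo.
    """
    return {
        "type": "fs",
        "settings": {"location": location, "compress": compress},
    }


if __name__ == "__main__":
    print("PUT /_snapshot/my_backup_repository")
    print(json.dumps(fs_repository_body("/mnt/backups/apm"), indent=2))
```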
The key levers you control are:
- Repository Configuration: Where your snapshots are stored. This is crucial for security, accessibility, and cost.
- Snapshot Naming: A clear naming convention helps you identify and manage your snapshots. Including timestamps is a common practice.
- Index Selection: Deciding which indices to back up. For APM, you’ll typically want to back up apm-* indices, often with a time-based pattern.
- Snapshot Frequency: How often you take snapshots. This depends on your recovery point objective (RPO) – how much data you can afford to lose.
- Retention Policies: How long you keep snapshots. This impacts storage costs and compliance.
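Frequency and retention can be automated with snapshot lifecycle management (SLM), available in Elasticsearch 7.4 and later. The sketch below builds a policy body for PUT /_slm/policy/<policy-id>. The function name slm_policy, the policy id, the schedule, and the retention numbers are all illustrative assumptions, not recommendations.

```python
import json


def slm_policy(repository, index_pattern, schedule, expire_after, min_count, max_count):
    """Build an SLM policy body combining naming, selection, frequency, and retention."""
    return {
        # Cron expression controlling how often a snapshot is taken.
        "schedule": schedule,
        # Date math in the name yields one dated snapshot name per run.
        "name": "<apm-snap-{now/d}>",
        "repository": repository,
        "config": {
            "indices": [index_pattern],
            "include_global_state": False,
            "ignore_unavailable": True,
        },
        # Retention: delete snapshots older than expire_after, but always
        # keep at least min_count and never more than max_count.
        "retention": {
            "expire_after": expire_after,
            "min_count": min_count,
            "max_count": max_count,
        },
    }


if __name__ == "__main__":
    policy = slm_policy("my_backup_repository", "apm-*", "0 30 1 * * ?", "30d", 5, 50)
    print("PUT /_slm/policy/apm-daily-snapshots")
    print(json.dumps(policy, indent=2))
```

With these example values, a snapshot would be taken daily at 01:30 and kept for 30 days.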
When you restore, you point Elasticsearch to the snapshot in the repository and specify which indices you want to restore. You can restore to the same cluster or a different one. For APM, restoring the apm-* indices will bring back your transaction traces, errors, metrics, and other collected data.
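A restore request might look like the following sketch. The helper restore_request and the snapshot name used in the example are hypothetical. The optional rename is included because a restore fails if an open index with the same name already exists in the target cluster, so restoring under a prefix is a common way to inspect old data side by side.

```python
import json


def restore_request(repository, snapshot, index_pattern, rename_prefix=None):
    """Return (method, path, body) for a restore-snapshot call."""
    body = {
        "indices": index_pattern,
        "include_global_state": False,
    }
    if rename_prefix:
        # Capture the whole index name and prepend the prefix on restore.
        body["rename_pattern"] = "(.+)"
        body["rename_replacement"] = f"{rename_prefix}$1"
    return "POST", f"/_snapshot/{repository}/{snapshot}/_restore", body


if __name__ == "__main__":
    method, path, body = restore_request(
        "my_backup_repository",
        "snapshot_2023-10-27_10-00-00",  # example snapshot name
        "apm-*-2023.10.*",
        rename_prefix="restored-",
    )
    print(method, path)
    print(json.dumps(body, indent=2))
```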
The most surprising thing for many is how straightforward it is to move snapshots between clusters, even when the clusters are in different cloud providers or on-premises. By registering a repository that both clusters can access (e.g., an S3 bucket reachable from both AWS and your on-prem data center), you can back up data from one cluster and restore it to another, which enables disaster recovery and data migration. Two caveats apply: only one cluster should have write access to a shared repository, so register it as read-only on the others, and the target cluster must run a version compatible with the one that took the snapshot.
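To make the shared-repository setup concrete, the sketch below builds two registration bodies for the same S3 bucket: a writable one for the source cluster and a read-only one for the cluster that restores. The function name s3_repository_body and the bucket name are assumptions, and credentials (normally kept in the Elasticsearch keystore) are not shown.

```python
import json


def s3_repository_body(bucket, readonly):
    """Body for PUT /_snapshot/<repository> with the S3 repository type."""
    return {
        "type": "s3",
        "settings": {"bucket": bucket, "readonly": readonly},
    }


if __name__ == "__main__":
    bucket = "my-apm-snapshots"  # hypothetical bucket name

    # Source cluster: takes snapshots, so it needs write access.
    print("Source cluster: PUT /_snapshot/my_backup_repository")
    print(json.dumps(s3_repository_body(bucket, readonly=False), indent=2))

    # Target cluster: only restores, so it registers the bucket read-only.
    print("Target cluster: PUT /_snapshot/my_backup_repository")
    print(json.dumps(s3_repository_body(bucket, readonly=True), indent=2))
```

Registering the repository as read-only on the restoring cluster prevents two clusters from writing to the same repository, which can corrupt its contents.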
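If you’re restoring APM data into a new cluster and notice that certain agents are still reporting to the old, potentially lost, deployment, remember that a snapshot restores data only, not the ingest path. You’ll need to point your agents at an APM Server that writes to the new or restored cluster, typically by updating each agent’s server URL setting.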