AWS Disaster Recovery: Meet Your RTO and RPO

The most surprising thing about AWS Disaster Recovery (DR) is that it’s not a single product or feature. It is a strategy you build by combining several AWS services, often more efficiently than you might expect.

Let’s see it in action. Imagine a simple web application running on EC2 instances behind an Application Load Balancer (ALB), with data in an RDS database.

// Example EC2 Instance Configuration
{
  "InstanceId": "i-0123456789abcdef0",
  "InstanceType": "t3.medium",
  "ImageId": "ami-0abcdef1234567890",
  "SubnetId": "subnet-0123456789abcdef0",
  "SecurityGroupIds": ["sg-0123456789abcdef0"],
  "Tags": [
    {"Key": "Name", "Value": "WebApp-Primary"},
    {"Key": "Environment", "Value": "Production"}
  ]
}

// Example RDS Instance Configuration
{
  "DBInstanceIdentifier": "webapp-db-primary",
  "DBInstanceClass": "db.t3.medium",
  "Engine": "postgres",
  "AllocatedStorage": 100,
  "MultiAZ": true,
  "PubliclyAccessible": false,
  "VpcSecurityGroupIds": ["sg-0123456789abcdef1"],
  "Tags": [
    {"Key": "Name", "Value": "WebApp-DB-Primary"},
    {"Key": "Environment", "Value": "Production"}
  ]
}

// Example ALB Configuration
{
  "LoadBalancerArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/webapp-alb/abcdef1234567890",
  "DNSName": "webapp-alb-1234567890.us-east-1.elb.amazonaws.com",
  "Scheme": "internet-facing",
  "Type": "application",
  "VpcId": "vpc-0123456789abcdef0",
  "State": {"Code": "active"},
  "AvailabilityZones": [
    {"ZoneName": "us-east-1a", "SubnetId": "subnet-0123456789abcdef0"},
    {"ZoneName": "us-east-1b", "SubnetId": "subnet-0123456789abcdef1"}
  ]
}

To achieve DR, we need to think about two key metrics:

  • Recovery Time Objective (RTO): How quickly must the application be available after a disaster?
  • Recovery Point Objective (RPO): How much data loss is acceptable (measured in time)?
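
To make the two metrics concrete, here is a minimal sketch that scores a DR drill against its targets. The names (DrTargets, evaluate_drill) and the timestamps are hypothetical and chosen only for illustration; the arithmetic simply follows the definitions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTargets:
    rto: timedelta  # maximum tolerated downtime
    rpo: timedelta  # maximum tolerated window of lost data


def evaluate_drill(targets, disaster_at, last_replicated_at, service_restored_at):
    """Compare what a drill (or real incident) achieved against the targets."""
    # Data written after the last replicated point is lost.
    achieved_rpo = disaster_at - last_replicated_at
    # Downtime runs from the disaster until service is restored.
    achieved_rto = service_restored_at - disaster_at
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= targets.rpo,
        "rto_met": achieved_rto <= targets.rto,
    }


# Hypothetical targets: back online within 1 hour, lose at most 5 minutes of data.
targets = DrTargets(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
result = evaluate_drill(
    targets,
    disaster_at=datetime(2024, 1, 1, 12, 0, 0),
    last_replicated_at=datetime(2024, 1, 1, 11, 58, 0),
    service_restored_at=datetime(2024, 1, 1, 12, 45, 0),
)
print(result["rpo_met"], result["rto_met"])  # prints: True True
```

In this drill, 2 minutes of data were lost and the service was down for 45 minutes, so both targets were met.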

AWS offers a spectrum of DR strategies, from simple backups to multi-region active-active deployments. The choice depends entirely on your RTO and RPO requirements and your budget.

At its core, DR on AWS is about having a viable, up-to-date copy of your infrastructure and data in a separate location (typically another AWS Region) that can be activated if your primary environment fails.

Here’s how you might build a DR strategy for our web app:

  1. Data Replication: For RDS, the easiest way to meet a low RPO is Cross-Region Read Replicas. You can create a read replica of your primary RDS instance in a different AWS Region. AWS handles the replication automatically. If your primary database fails, you promote the read replica to a standalone instance.

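    • Command Example (run against the DR region; a cross-region source is identified by its ARN, and --kms-key-id is only needed for an encrypted source, using a key in the DR region): aws rds create-db-instance-read-replica --db-instance-identifier webapp-db-replica-us-west-2 --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:webapp-db-primary --source-region us-east-1 --region us-west-2 --kms-key-id arn:aws:kms:us-west-2:123456789012:key/your-kms-key-id --tags Key=Name,Value=WebApp-DB-Replica-US-West-2
    • Why it works: RDS replicates changes from the primary to the replica asynchronously, so the data you can lose is bounded by the replica lag at the moment of failure. Promotion turns the replica into a standalone, writable instance without restoring from a backup.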
  2. Infrastructure Replication: For EC2 instances, you can use AWS Backup with cross-region copy capabilities or Amazon Machine Image (AMI) creation and copying. You can automate the creation of AMIs from your running instances and copy these AMIs to your DR region.

    • Command Example (create AMI): aws ec2 create-image --instance-id i-0123456789abcdef0 --name "WebApp-Primary-AMI-$(date +%Y-%m-%d-%H-%M-%S)" --no-reboot
    • Command Example (copy AMI; use the AMI ID returned by create-image, and set the destination with --region): aws ec2 copy-image --source-image-id ami-0abcdef1234567890 --source-region us-east-1 --name "WebApp-Primary-AMI-Copied-US-West-2" --region us-west-2
    • Why it works: An EBS-backed AMI references snapshots of your instance’s root volume and any attached EBS data volumes. Copying it to another region gives you an image with the operating system and application already installed, ready to launch.
  3. Networking: In your DR region, you’ll need a VPC, subnets, security groups, and potentially an ALB. You can pre-configure these or use Infrastructure as Code (IaC) tools like AWS CloudFormation or Terraform to deploy them rapidly. For DNS failover, Amazon Route 53 is your best friend, using health checks to automatically reroute traffic.

    • Route 53 Health Check Example: Configure a health check that monitors a specific endpoint on your application and attach it to the primary record of a failover routing policy. If the check fails, Route 53 starts answering DNS queries with the DR endpoint instead.
    • Why it works: Pre-provisioned or rapidly deployable network resources ensure your application has a place to run and be accessible. Route 53 automates the critical step of directing users to the healthy environment.
  4. Pilot Light vs. Warm Standby vs. Multi-Site:

    • Pilot Light: The DR region has minimal infrastructure running (e.g., just the replicated database and copied AMIs). When a disaster strikes, you launch EC2 instances from the AMIs, promote the replicated database, and update DNS. This is cost-effective but has a higher RTO.
    • Warm Standby: A scaled-down version of your production environment runs in the DR region (e.g., a few small EC2 instances and an RDS read replica). It’s ready to scale up quickly. Lower RTO than pilot light, higher cost.
    • Multi-Site (Active-Active): Your application runs in both regions simultaneously. Traffic is load-balanced across both. Highest availability, lowest RTO/RPO, and highest cost.
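
The DNS failover from step 3 can be sketched as a Route 53 change batch with one PRIMARY and one SECONDARY alias record. This is a minimal illustration, not a definitive setup: the domain name, the DR load balancer DNS name, the hosted zone IDs, and the health check ID are all placeholders, and failover_alias_record is a hypothetical helper.

```python
import json


def failover_alias_record(name, role, alb_dns_name, alb_hosted_zone_id,
                          health_check_id=None):
    """Build one UPSERT change for a failover alias record pointing at an ALB."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        # SetIdentifier distinguishes records that share a name and type.
        "SetIdentifier": f"webapp-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_hosted_zone_id,
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Failover routing for the web app (illustrative values)",
    "Changes": [
        failover_alias_record(
            "app.example.com.",  # placeholder domain
            "PRIMARY",
            "webapp-alb-1234567890.us-east-1.elb.amazonaws.com",
            "PRIMARY_ALB_HOSTED_ZONE_ID",  # placeholder
            health_check_id="PRIMARY_HEALTH_CHECK_ID",  # placeholder
        ),
        failover_alias_record(
            "app.example.com.",
            "SECONDARY",
            "webapp-alb-dr-0987654321.us-west-2.elb.amazonaws.com",  # placeholder
            "DR_ALB_HOSTED_ZONE_ID",  # placeholder
        ),
    ],
}

print(json.dumps(change_batch, indent=2))
```

The printed JSON has the shape expected by aws route53 change-resource-record-sets (via --change-batch) once the placeholders are replaced with real values.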

One thing many people don’t realize is the power of AWS Organizations and Service Control Policies (SCPs) in a DR scenario. You can define SCPs that deny certain actions in the DR region unless a specific condition is met, for example a "disaster declared" tag being present on the IAM role making the request. This guards against accidental provisioning or deletion of DR resources, so they are only used when intended.
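
As a rough sketch of such a guardrail, the snippet below builds an SCP document that denies a few actions in the DR region unless the calling principal carries a tag. The tag key DisasterDeclared, the list of guarded actions, and the helper build_dr_guard_scp are assumptions made for illustration; the condition keys aws:RequestedRegion and aws:PrincipalTag are standard IAM global condition keys.

```python
import json


def build_dr_guard_scp(dr_region, guarded_actions, tag_key="DisasterDeclared"):
    """Return an SCP that blocks guarded actions in dr_region unless the
    calling principal is tagged <tag_key>=true (hypothetical tag name)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDrChangesUntilDisasterDeclared",
                "Effect": "Deny",
                "Action": list(guarded_actions),
                "Resource": "*",
                # Both conditions must hold for the Deny to apply:
                # the request targets the DR region AND the principal
                # is not tagged as acting under a declared disaster.
                "Condition": {
                    "StringEquals": {"aws:RequestedRegion": dr_region},
                    "StringNotEquals": {f"aws:PrincipalTag/{tag_key}": "true"},
                },
            }
        ],
    }


policy = build_dr_guard_scp(
    "us-west-2",
    ["ec2:RunInstances", "ec2:TerminateInstances", "rds:DeleteDBInstance"],
)
print(json.dumps(policy, indent=2))
```

Test a policy like this in a non-production organizational unit first, and make sure it does not block the automation that keeps the DR copies up to date.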

The next challenge you’ll likely face is orchestrating the failover and failback process, especially for complex applications with many dependencies.
