Disaster Recovery (DR) is less about surviving the disaster and more about how quickly you can get back to business after one.

Let’s say you’re running an e-commerce site. A critical database server fails. Your Recovery Time Objective (RTO) is the maximum time you can tolerate before the site is back online for customers. Your Recovery Point Objective (RPO) is the maximum amount of data you’re willing to lose, measured in time. If your RPO is 15 minutes, you can afford to lose up to 15 minutes of transactions. Both are targets; what you actually achieve in an incident is measured against them.

Imagine this scenario:

# Current state: Single database server, no replication, daily backups
# User traffic peaks at 10 AM PST

2023-10-27 09:55:00 PST - User places order, payment processed.
2023-10-27 09:55:15 PST - Database server experiences catastrophic hardware failure.
2023-10-27 10:00:00 PST - Site is down.
2023-10-27 11:30:00 PST - DR team restores from last night's backup (00:00:00 PST).
2023-10-27 11:35:00 PST - Site is back online.

# Analysis:
# Recovery time achieved: about 1 hour 40 minutes (from failure at 09:55:15 to site online at 11:35:00)
# Recovery point achieved: about 9 hours 55 minutes (from last backup at 00:00:00 to failure at 09:55:15)

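This is a disaster. Your recovery time is far too long, and your data loss is astronomical: every order placed since midnight, including the one paid for at 09:55:00, is gone. Your customers are furious, and your business is bleeding money.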
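
The figures in the analysis can be checked by computing the intervals directly from the timeline. The following is a minimal sketch; the variable names are illustrative, and the timestamps are copied from the scenario above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps from the scenario timeline (all PST, so no timezone math is needed).
last_backup = datetime.strptime("2023-10-27 00:00:00", FMT)
failure = datetime.strptime("2023-10-27 09:55:15", FMT)
site_online = datetime.strptime("2023-10-27 11:35:00", FMT)

# Recovery time achieved: from the failure until customers can use the site again.
recovery_time = site_online - failure

# Recovery point achieved: everything written after the last restorable backup is lost.
data_loss_window = failure - last_backup

print("Recovery time:", recovery_time)        # 1:39:45
print("Data loss window:", data_loss_window)  # 9:55:15
```

Comparing these two numbers against the RTO and RPO targets is what tells you whether a recovery met its objectives.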

To design an effective DR plan, you need to understand the components of your system and their dependencies.

1. Identify Critical Systems and Data:

  • Application Servers: Web servers, API gateways, microservices.
  • Databases: Transactional databases, data warehouses, caches.
  • Storage: File shares, object storage, block storage.
  • Networking: Load balancers, firewalls, DNS.
  • Third-Party Integrations: Payment gateways, shipping APIs, CRM.

For each, ask:

  • What is the business impact if this system is unavailable?
  • What is the maximum acceptable downtime (RTO)?
  • What is the maximum acceptable data loss (RPO)?

2. Define RTO and RPO Targets:

These targets are driven by business requirements, not just technical feasibility. A small startup might tolerate a higher RTO/RPO than a global financial institution.

  • RTO: For critical systems, this might be minutes or a few hours. For less critical ones, days.
  • RPO: For transactional systems, this is often near-zero (seconds or minutes). For analytical systems, it might be hours.

3. Choose a DR Strategy:

  • Backup and Restore: The simplest. Data is backed up regularly, and in a disaster, a new environment is provisioned, and data is restored. High RTO/RPO.
  • Pilot Light: A minimal version of your critical systems runs in the DR region. In a disaster, you scale up this minimal environment. Lower RTO/RPO than backup/restore.
  • Warm Standby: A scaled-down but fully functional version of your production environment runs in the DR region, with data being replicated. Lower RTO/RPO than pilot light.
  • Hot Standby (Active-Active/Active-Passive): A fully scaled production environment runs in the DR region, ready to take traffic immediately. Near-zero RTO/RPO. Most expensive.
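
As a rough illustration of how targets map to a strategy, the sketch below picks the cheapest tier from the list above that plausibly fits a pair of targets. The function name and the cut-off values are hypothetical placeholders, not industry standards; real thresholds depend on your platform, data volume, and budget. Using the tighter of the two targets is also a simplification.

```python
from datetime import timedelta

def suggest_dr_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Suggest a DR strategy tier for the given targets.

    The cut-offs below are assumptions chosen for illustration only.
    """
    # Simplification: let the tighter of the two targets drive the choice.
    tightest = min(rto, rpo)
    if tightest <= timedelta(minutes=1):
        return "hot standby"
    if tightest <= timedelta(minutes=30):
        return "warm standby"
    if tightest <= timedelta(hours=4):
        return "pilot light"
    return "backup and restore"

# A site that can be down for an hour but can lose only 15 minutes of orders.
print(suggest_dr_strategy(rto=timedelta(hours=1), rpo=timedelta(minutes=15)))  # warm standby
```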

4. Implement Replication and Failover:

This is where you technically achieve your RPO and RTO.

  • Database Replication:
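    • Synchronous Replication: Writes are committed to both primary and replica databases before acknowledgment. This gives zero data loss for acknowledged transactions (RPO=0) but increases write latency and requires low network latency between regions. Example: PostgreSQL with synchronous_standby_names set and synchronous_commit = on or remote_apply. (remote_write waits only until the standby has written the data to its operating system, not flushed it to disk, so it is a weaker guarantee.)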
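    • Asynchronous Replication: Writes are committed to the primary, then sent to the replica. Lower write latency but can result in data loss if the primary fails before the replica receives the data. Example: MySQL's standard binary log replication, which is asynchronous by default. Settings such as binlog_format = ROW (row-based logging) and relay_log_recovery = 1 (automatic relay log recovery after a replica crash) make it more robust but do not make it synchronous.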
    • Multi-Master Replication: Writes can occur on any node, and changes are propagated. Complex to manage, potential for conflicts.
  • Application State Replication:
    • Shared Storage: If your application servers rely on shared file systems (e.g., NFS, GlusterFS), ensure this storage is replicated or accessible from the DR site.
    • Distributed Caches: Use clustered caches (e.g., Redis Cluster) with replication enabled.
    • Message Queues: Ensure your message queue supports multi-AZ or multi-region deployments with durable message persistence.
  • Infrastructure as Code (IaC): Use tools like Terraform or CloudFormation to define and provision your DR environment. This ensures consistency and speed when bringing up the DR site.
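    • Example Terraform resource for a cross-region DR read replica (the ARN, names, and referenced subnet group and security group are placeholders):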
      resource "aws_db_instance" "dr_replica" {
        identifier           = "myapp-dr-db"
        engine               = "postgres"
        instance_class       = "db.t3.medium" # Scaled down for cost
        allocated_storage    = 100
        storage_type         = "gp2"
        db_subnet_group_name = aws_db_subnet_group.dr_subnet.name
        vpc_security_group_ids = [aws_security_group.dr_sg.id]
        # Crucially, enable read replica functionality pointing to primary
        replicate_source_db  = "arn:aws:rds:us-east-1:123456789012:db:myapp-primary-db"
        skip_final_snapshot  = true
      }
      
  • Automated Failover:
    • DNS Failover: Use services like AWS Route 53 with health checks to automatically redirect traffic to the DR site if the primary site becomes unhealthy. A failover record pair, written as a Route 53 change batch, might look like this (the domain name, IP addresses, and health check ID are placeholders):
      {
        "Comment": "Failover routing: primary site with health check, DR site as secondary",
        "Changes": [
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "www.example.com.",
              "Type": "A",
              "SetIdentifier": "primary",
              "Failover": "PRIMARY",
              "TTL": 60,
              "HealthCheckId": "00000000-0000-0000-0000-000000000000",
              "ResourceRecords": [{ "Value": "192.0.2.10" }]
            }
          },
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "www.example.com.",
              "Type": "A",
              "SetIdentifier": "dr",
              "Failover": "SECONDARY",
              "TTL": 60,
              "ResourceRecords": [{ "Value": "198.51.100.10" }]
            }
          }
        ]
      }
      
    • Load Balancer Health Checks: Configure load balancers to stop sending traffic to unhealthy instances in the primary region.
    • Orchestration Scripts: Custom scripts or tools like Ansible can automate the process of promoting DR resources (e.g., promoting a read replica to a standalone master).
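
The decision logic behind automated failover can be sketched in a few lines. This is a simplified model, not how Route 53 or any particular load balancer is implemented: the class name, the threshold of three consecutive failed checks, and the choice to keep failback manual are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class FailoverMonitor:
    # Hypothetical threshold: consecutive failed checks before failing over.
    failure_threshold: int = 3
    consecutive_failures: int = 0
    active_site: str = "primary"

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should receive traffic."""
        if self.active_site == "dr":
            # Failback stays manual in this sketch, so a recovering primary
            # does not cause traffic to flap between sites.
            return self.active_site
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "dr"
        return self.active_site

monitor = FailoverMonitor()
checks = [True, False, False, True, False, False, False, True]
print([monitor.record_check(healthy) for healthy in checks])
# ['primary', 'primary', 'primary', 'primary', 'primary', 'primary', 'dr', 'dr']
```

Requiring several consecutive failures trades a slightly longer detection time for fewer false failovers; that detection time counts toward your RTO.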

5. Test, Test, Test:

A DR plan is useless if it hasn’t been tested. Conduct regular DR drills.

  • Tabletop Exercises: Walk through the DR plan verbally.
  • Partial Failover Tests: Test failover for a single component.
  • Full DR Drills: Simulate a complete outage and failover to the DR site. Measure RTO and RPO.

6. Document and Train:

  • Keep the DR plan documentation up-to-date.
  • Train your team on their roles and responsibilities during a disaster.

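A key aspect of achieving low RTO/RPO is understanding the lag in your replication. With synchronous replication, an acknowledged write is already on the replica, so the cost shows up as write latency rather than lag. With asynchronous replication, lag is often milliseconds to a few seconds but can grow much larger under heavy write load or network problems. Your DR strategy must account for this potential lag. If your RPO is 5 minutes, and your replication lag is consistently 4 minutes, you’re already at the edge: one spike in lag and you miss the target. The failover process itself counts against your RTO rather than your RPO: if it takes 10 minutes, your RTO cannot be lower than 10 minutes, however fresh the replica is.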
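
One way to reason about these budgets: worst-case data loss is bounded by the replication lag at the moment of failure, while downtime is roughly detection time plus failover time. The sketch below applies that model; the function name, the 15-minute RTO, and the 2-minute detection time are assumptions added for illustration, while the 5-minute RPO, 4-minute lag, and 10-minute failover come from the example in the text.

```python
from datetime import timedelta

def check_dr_budget(rpo, rto, replication_lag, detection_time, failover_time):
    """Compare expected data loss and downtime against the RPO and RTO targets."""
    data_loss = replication_lag                # writes the replica has not yet received
    downtime = detection_time + failover_time  # time until the DR site serves traffic
    return {
        "rpo_met": data_loss <= rpo,
        "rpo_headroom": rpo - data_loss,
        "rto_met": downtime <= rto,
        "rto_headroom": rto - downtime,
    }

result = check_dr_budget(
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=15),            # assumed target
    replication_lag=timedelta(minutes=4),
    detection_time=timedelta(minutes=2),  # assumed detection delay
    failover_time=timedelta(minutes=10),
)
print(result["rpo_met"], result["rpo_headroom"])  # True 0:01:00
print(result["rto_met"], result["rto_headroom"])  # True 0:03:00
```

Small headroom values like these are a warning sign: a modest spike in lag or a slow failover step is enough to miss the target.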

The next step is to consider how you handle a disaster in which your DR region itself becomes unavailable.
