Fintech · 2024-09

Fintech: Active-Passive Multi-Region Disaster Recovery, RTO < 15 Minutes

A payments processing company had a regulatory requirement for a 15-minute RTO and 5-minute RPO after a regional AWS outage. Their architecture ran entirely in us-east-1 with no DR capability. We designed and implemented an active-passive multi-region setup in 6 weeks.

Deploy Time: N/A

Deploy Frequency: N/A

Incidents: Before: no DR capability, single region. After: RTO: 12 min, RPO: 3 min - regulatory requirement met

Cost Impact: Regulatory approval obtained, $4M enterprise contract unblocked

The Challenge

Financial regulators required documented DR capability with tested failover. The existing architecture had seven interdependent services, an Aurora PostgreSQL cluster, an ElastiCache cluster, and three S3 buckets - all in a single region. The team had never run a DR exercise. The regulatory review was 10 weeks out.

The Approach

We chose active-passive over active-active. Active-active is more resilient, but it adds significant latency and complexity to payment processing and doubles infrastructure costs. Active-passive with automated failover meets the RTO/RPO requirements at roughly 30% of the cost of active-active.

The Implementation

Aurora Global Database

We converted the Aurora PostgreSQL cluster to a Global Database with a secondary cluster in eu-west-1. Aurora Global Database typically replicates with sub-second lag, and the observed RPO during testing was consistently under 90 seconds, well inside the 5-minute requirement. Failover promotes the secondary cluster in under 60 seconds.
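One way to turn lag measurements into RPO evidence is to compare the worst observed lag against the 5-minute target. The sketch below is illustrative only: `LagSample` and `rpo_report` are hypothetical names, and the sample values are made up rather than taken from the actual tests. In practice the samples could come from a source such as the CloudWatch metric `AuroraGlobalDBReplicationLag`, which reports milliseconds and would need converting to seconds first.

```python
from dataclasses import dataclass

RPO_TARGET_S = 5 * 60  # 5-minute regulatory RPO stated in the case study


@dataclass
class LagSample:
    taken_at_s: float  # seconds since the start of the test run
    lag_s: float       # replication lag observed at that moment, in seconds


def rpo_report(samples, rpo_target_s=RPO_TARGET_S):
    """Summarise the worst observed lag and whether it stays inside the RPO target."""
    if not samples:
        raise ValueError("need at least one lag sample")
    worst = max(s.lag_s for s in samples)
    return {
        "worst_lag_s": worst,
        "within_rpo": worst <= rpo_target_s,
        "headroom_s": rpo_target_s - worst,
    }


# Hypothetical samples for illustration, not measured data.
samples = [LagSample(0, 0.4), LagSample(60, 0.9), LagSample(120, 45.0)]
report = rpo_report(samples)
print(report)
```

Reporting the worst case rather than the average keeps the evidence conservative: a single lag spike at the moment of failover is what determines the data actually lost.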

AWS Aurora Global Database · PostgreSQL · Terraform

Cross-region ElastiCache replication

We deployed a Global Datastore for ElastiCache (Redis). Session tokens and rate-limit state replicate asynchronously to eu-west-1. On failover, the secondary region accepts connections within 2 minutes. Replication lag was under 500ms in all load tests.

AWS ElastiCache Global Datastore · Redis · Terraform

Route 53 health-check failover

We configured Route 53 health checks on the primary region load balancer with a 10-second request interval and a failure threshold of three. After three consecutive failed checks, Route 53 marks the primary unhealthy and begins answering DNS queries with the secondary region endpoint. The DNS TTL is set to 30 seconds.
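As a rough sanity check on the DNS share of the RTO budget, the worst-case time for clients to move can be estimated from the health-check settings and the TTL. This sketch assumes the 10-second figure is the interval between checks and that three consecutive failures mark the endpoint unhealthy. The function and parameter names are illustrative, and the estimate ignores resolvers that cache answers beyond the TTL.

```python
def worst_case_dns_failover_seconds(check_interval_s: int,
                                    failure_threshold: int,
                                    ttl_s: int) -> int:
    """Upper-bound estimate: outage detection time plus expiry of cached DNS answers."""
    if check_interval_s <= 0 or failure_threshold <= 0 or ttl_s < 0:
        raise ValueError("interval and threshold must be positive, TTL non-negative")
    # Detection needs `failure_threshold` consecutive failed checks.
    detection_s = check_interval_s * failure_threshold
    # A client may have cached the old answer just before the switch.
    return detection_s + ttl_s


estimate = worst_case_dns_failover_seconds(check_interval_s=10,
                                           failure_threshold=3,
                                           ttl_s=30)
print(estimate)  # 60
```

Under these assumptions DNS accounts for about one minute of the 15-minute RTO budget, which is why a short TTL matters as much as a fast health check.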

AWS Route 53 · AWS Health Checks · Terraform

DR runbook and quarterly exercises

We wrote a DR runbook with step-by-step failover instructions and ran two full DR exercises: the first manual (to validate the process), the second automated (to validate the tooling). The second exercise achieved 12-minute RTO from initial alert to full traffic in eu-west-1.
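Since reviewers ask for test evidence, it helps to compute the RTO directly from a timestamped exercise log rather than from memory. The sketch below shows one possible way to do that. The `measure_rto` helper is a hypothetical name, and the timestamps and intermediate steps are invented for illustration; only the overall span from initial alert to full traffic is chosen to match the 12-minute result reported above.

```python
from datetime import datetime, timedelta

RTO_TARGET = timedelta(minutes=15)  # regulatory RTO stated in the case study


def measure_rto(events):
    """Return (timeline, total): each step's offset from the first event, and the overall RTO.

    `events` is a list of (ISO-8601 timestamp, label) pairs in any order.
    """
    parsed = sorted((datetime.fromisoformat(ts), label) for ts, label in events)
    start = parsed[0][0]
    timeline = [(label, when - start) for when, label in parsed]
    total = parsed[-1][0] - start
    return timeline, total


# Hypothetical exercise log, not the real one.
events = [
    ("2024-01-01T10:00:00", "initial alert"),
    ("2024-01-01T10:03:00", "failover decision"),
    ("2024-01-01T10:04:00", "Aurora secondary promoted"),
    ("2024-01-01T10:06:00", "ElastiCache secondary promoted"),
    ("2024-01-01T10:12:00", "full traffic in eu-west-1"),
]

timeline, total = measure_rto(events)
for label, offset in timeline:
    print(f"{offset}  {label}")
print("RTO met:", total <= RTO_TARGET)
```

Keeping the per-step offsets alongside the total shows where the time went, which makes it easier to see which runbook step to shorten before the next exercise.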

AWS Systems Manager · Notion · PagerDuty

Key Takeaways

Aurora Global Database is the right choice for multi-region PostgreSQL - the replication lag and failover time are genuinely production-grade
Active-passive is the right default for most DR requirements - active-active adds significant complexity that rarely pays off below $10M ARR
Route 53 health-check failover must be tested under load - DNS behaviour during high traffic is different from lab conditions
Regulatory reviewers want to see test evidence, not architecture diagrams - run the DR exercise before the review

