Fintech: Active-Passive Multi-Region Disaster Recovery, RTO < 15 Minutes
A payments processing company had a regulatory requirement for a 15-minute RTO and 5-minute RPO after a regional AWS outage. Their architecture ran entirely in us-east-1 with no DR capability. We designed and implemented an active-passive multi-region setup in 6 weeks.
The Challenge
Financial regulators required documented DR capability with tested failover. The existing architecture had seven interdependent services, an Aurora PostgreSQL cluster, an ElastiCache cluster, and three S3 buckets - all in a single region. The team had never run a DR exercise. The regulatory review was 10 weeks out.
The Approach
We chose active-passive over active-active. Active-active is more resilient, but it adds significant latency and complexity to payment processing and doubles infrastructure costs. Active-passive with automated failover meets the RTO/RPO requirements at roughly 30% of the cost of active-active.
The Implementation
Aurora Global Database
We converted the Aurora PostgreSQL cluster to a Global Database with a secondary cluster in eu-west-1. Aurora Global Database typically replicates with sub-second lag; the RPO we observed during testing was consistently under 90 seconds, well inside the 5-minute requirement. Failover promotes the secondary in under 60 seconds.
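To make the RPO claim checkable, the comparison between observed replication lag and the 5-minute budget can be sketched as below. This is a minimal illustration, not the tooling used in the engagement: `check_rpo`, `RpoReport`, and the sample values are hypothetical, and the lag readings are assumed to arrive in milliseconds (for example from an export of the CloudWatch `AuroraGlobalDBReplicationLag` metric).

```python
from dataclasses import dataclass
from typing import Sequence

RPO_BUDGET_SECONDS = 5 * 60  # the 5-minute RPO from the regulatory requirement


@dataclass(frozen=True)
class RpoReport:
    worst_lag_seconds: float
    budget_seconds: float

    @property
    def within_budget(self) -> bool:
        # The worst observed lag bounds how much data could be lost on failover.
        return self.worst_lag_seconds <= self.budget_seconds

    @property
    def headroom_seconds(self) -> float:
        return self.budget_seconds - self.worst_lag_seconds


def check_rpo(lag_samples_ms: Sequence[float],
              budget_seconds: float = RPO_BUDGET_SECONDS) -> RpoReport:
    """Compare the worst observed replication lag with the RPO budget.

    lag_samples_ms is a hypothetical input: lag readings in milliseconds.
    """
    if not lag_samples_ms:
        raise ValueError("need at least one lag sample")
    worst_seconds = max(lag_samples_ms) / 1000.0
    return RpoReport(worst_lag_seconds=worst_seconds,
                     budget_seconds=budget_seconds)


if __name__ == "__main__":
    # Illustrative values only, not measurements from the project.
    samples_ms = [420.0, 610.0, 88_000.0]
    report = check_rpo(samples_ms)
    print(report.within_budget, report.headroom_seconds)
```

Using the maximum rather than the average is deliberate: an RPO is a worst-case bound, so a single slow sample matters more than a healthy mean.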
Cross-region ElastiCache replication
We deployed a Global Datastore for ElastiCache (Redis). Session tokens and rate-limit state replicate asynchronously to eu-west-1. On failover, the secondary region accepts connections within 2 minutes. Replication lag was under 500ms in all load tests.
Route 53 health-check failover
We configured Route 53 health checks on the primary region load balancer with a 10-second request interval and a failure threshold of three. After three consecutive failed checks, Route 53 automatically starts answering DNS queries with the secondary region endpoint. The DNS TTL is set to 30 seconds.
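The timing implied by these settings can be worked through in a few lines. The sketch below builds a health-check configuration in the shape that Route 53's CreateHealthCheck call accepts and estimates the DNS cutover time. The hostname, the `/healthz` path, and the helper names are assumptions for illustration; the estimate is a rough upper bound, since real resolvers do not always honour TTLs.

```python
REQUEST_INTERVAL_SECONDS = 10  # Route 53 accepts 10 or 30
FAILURE_THRESHOLD = 3          # consecutive failed checks before "unhealthy"
DNS_TTL_SECONDS = 30


def health_check_config(fqdn: str, path: str = "/healthz",
                        interval: int = REQUEST_INTERVAL_SECONDS,
                        threshold: int = FAILURE_THRESHOLD) -> dict:
    """Build a HealthCheckConfig dict (hypothetical endpoint and path)."""
    if interval not in (10, 30):
        raise ValueError("RequestInterval must be 10 or 30 seconds")
    if not 1 <= threshold <= 10:
        raise ValueError("FailureThreshold must be between 1 and 10")
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": interval,
        "FailureThreshold": threshold,
    }


def detection_seconds(interval: int = REQUEST_INTERVAL_SECONDS,
                      threshold: int = FAILURE_THRESHOLD) -> int:
    """Approximate time for the health check to flip to unhealthy."""
    return interval * threshold


def worst_case_dns_cutover_seconds(interval: int = REQUEST_INTERVAL_SECONDS,
                                   threshold: int = FAILURE_THRESHOLD,
                                   ttl: int = DNS_TTL_SECONDS) -> int:
    """Detection time plus one full TTL for clients holding a cached answer."""
    return detection_seconds(interval, threshold) + ttl


if __name__ == "__main__":
    config = health_check_config("payments.example.com")
    print(config["RequestInterval"], config["FailureThreshold"])
    print(detection_seconds(), worst_case_dns_cutover_seconds())
```

With a 10-second interval, a threshold of three, and a 30-second TTL, detection takes about 30 seconds and well-behaved clients should reach the secondary region within roughly a minute, which leaves most of the 15-minute RTO for the data-tier promotion.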
DR runbook and quarterly exercises
We wrote a DR runbook with step-by-step failover instructions and ran two full DR exercises: the first manual (to validate the process), the second automated (to validate the tooling). The second exercise achieved a 12-minute RTO, measured from the initial alert to full traffic in eu-west-1.
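An automated exercise of this kind boils down to running the promotion steps in order and timing them against the RTO budget. The following is a minimal sketch under stated assumptions, not the tooling built for this project. The clients are injected and are assumed to be boto3 `rds` and `elasticache` clients; the method and parameter names reflect my understanding of the current boto3 API and should be verified against your SDK version. All identifiers and helper names are hypothetical.

```python
import time
from typing import Any, Callable, List, Sequence, Tuple

RTO_BUDGET_SECONDS = 15 * 60  # the 15-minute RTO requirement


def promote_aurora(rds_client: Any, global_cluster_id: str,
                   secondary_cluster_arn: str) -> None:
    """Request an unplanned failover of the Aurora global cluster."""
    rds_client.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=secondary_cluster_arn,
        AllowDataLoss=True,  # unplanned failover; the primary is assumed down
    )


def promote_cache(elasticache_client: Any, global_group_id: str,
                  secondary_region: str, secondary_group_id: str) -> None:
    """Request promotion of the secondary Global Datastore member."""
    elasticache_client.failover_global_replication_group(
        GlobalReplicationGroupId=global_group_id,
        PrimaryRegion=secondary_region,
        PrimaryReplicationGroupId=secondary_group_id,
    )


def run_failover(steps: Sequence[Tuple[str, Callable[[], None]]],
                 budget_seconds: float = RTO_BUDGET_SECONDS,
                 clock: Callable[[], float] = time.monotonic,
                 ) -> Tuple[List[Tuple[str, float]], float, bool]:
    """Run each step in order and record how long it took.

    Returns (per-step timings, total seconds, whether total fits the budget).
    Both API calls above return before promotion completes, so real tooling
    would also poll for completion inside each step.
    """
    timings: List[Tuple[str, float]] = []
    start = clock()
    for name, action in steps:
        step_start = clock()
        action()
        timings.append((name, clock() - step_start))
    total = clock() - start
    return timings, total, total <= budget_seconds
```

Passing the clients and the clock in as arguments keeps the sequence testable without touching AWS: a dry run can substitute recording fakes, and the same step list can then be pointed at real clients during an exercise. DNS is left out of the step list because Route 53 shifts traffic on its own once the health check fails.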
Key Takeaways
- Aurora Global Database is the right choice for multi-region PostgreSQL - in our tests the observed RPO stayed under 90 seconds and promotion of the secondary took under 60 seconds
- Active-passive is the right default for most DR requirements - active-active adds significant complexity that rarely pays off below $10M ARR
- Route 53 health-check failover must be tested under load - DNS behaviour during high traffic is different from lab conditions
- Regulatory reviewers want to see test evidence, not architecture diagrams - run the DR exercise before the review