Mobile Gaming Studio: Zero-Downtime Auto-Scaling for 2M Player Launch
A mobile game studio was preparing for a global launch expected to spike to 2 million concurrent players. Their backend - a Go API on fixed EC2 instances - had failed a 200K user load test two months before launch. We rebuilt the scaling layer in 5 weeks.
The Challenge
The game backend had been built to run, not to scale. Database connections were not pooled: each API instance opened its own connections directly to RDS. The game state service held in-memory session state that could not be shared across instances. The 200K-user load test brought down the database within 8 minutes. Launch was 8 weeks away.
The Approach
We worked backwards from the launch capacity target: 2M concurrent players, P99 API latency under 100ms, zero data loss on instance termination. The critical path was connection pooling, stateless session management, and horizontal scaling with proper health checks. We did not redesign the application - we changed the infrastructure layer around it.
The Implementation
PgBouncer connection pooling
We deployed PgBouncer in transaction pooling mode in front of RDS. Server-side connection count dropped from one per application worker (2,400 at 200 instances × 12 connections each) to a fixed pool of 100 connections. The database stopped timing out under load.
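The key settings are a minimal sketch of what a transaction-mode PgBouncer config for this setup might look like. The hostname, database name, and exact limits are illustrative, not the studio's actual values:

```ini
[databases]
; hypothetical names; point at the RDS endpoint
gamedb = host=mydb.example.us-east-1.rds.amazonaws.com port=5432 dbname=gamedb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; server connection held only for the transaction
default_pool_size = 100      ; fixed server-side pool, regardless of client count
max_client_conn = 5000       ; thousands of app connections share the pool above
server_idle_timeout = 60
```

Transaction mode is what makes the 2,400-to-100 reduction possible: a server connection is borrowed only for the duration of each transaction, so thousands of mostly idle client connections multiplex onto a small fixed pool.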
Redis session migration
We moved game session state from in-memory Go maps to Redis Cluster on ElastiCache. The migration took 2 days - the Go API changes were localized to a single session.Store interface. Instances became stateless and could scale horizontally.
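The interface boundary is what kept the migration to 2 days. The sketch below is an assumption of what a `session.Store` interface of this shape might look like, shown with an in-memory implementation mirroring the original design; the production fix swaps in a Redis-backed type (e.g. built on a client like go-redis) implementing the same interface, so handler code never changes:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store is a hypothetical sketch of the session.Store interface;
// the method set is an assumption, not the studio's actual code.
type Store interface {
	Get(sessionID string) ([]byte, error)
	Set(sessionID string, state []byte) error
	Delete(sessionID string) error
}

var errNotFound = errors.New("session not found")

// memoryStore mirrors the original in-memory Go map approach.
// A Redis-backed implementation of the same interface replaces it
// in production, making each API instance stateless.
type memoryStore struct {
	mu       sync.RWMutex
	sessions map[string][]byte
}

func newMemoryStore() *memoryStore {
	return &memoryStore{sessions: make(map[string][]byte)}
}

func (m *memoryStore) Get(id string) ([]byte, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	s, ok := m.sessions[id]
	if !ok {
		return nil, errNotFound
	}
	return s, nil
}

func (m *memoryStore) Set(id string, state []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	// copy so callers can't mutate stored state after Set returns
	m.sessions[id] = append([]byte(nil), state...)
	return nil
}

func (m *memoryStore) Delete(id string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.sessions, id)
	return nil
}

func main() {
	var store Store = newMemoryStore()
	store.Set("player-42", []byte(`{"level":7}`))
	state, _ := store.Get("player-42")
	fmt.Println(string(state))
}
```

Because handlers depend only on the interface, the Redis cutover is a one-line change at wiring time, which is what makes this kind of migration localized.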
ECS auto-scaling with predictive warm-up
We migrated the Go API from fixed EC2 to ECS Fargate with target-tracking auto-scaling on CPU and concurrent connection metrics. We added a launch-day scaling schedule to pre-warm 500 Fargate tasks 1 hour before the global launch window.
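A scheduled pre-warm like this can be expressed through Application Auto Scaling. The Terraform fragment below is a sketch under assumed names (cluster, service, timestamp, and capacity bounds are all illustrative): it raises the service's minimum task count to 500 an hour before the launch window, so target tracking scales from warm capacity instead of cold-starting tasks into the spike:

```hcl
# Hypothetical cluster/service names; capacities and timestamp are illustrative.
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/game-cluster/game-api"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 20
  max_capacity       = 2000
}

# Raise the floor to 500 Fargate tasks 1 hour before the launch window.
resource "aws_appautoscaling_scheduled_action" "prewarm" {
  name               = "launch-prewarm"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  schedule           = "at(2024-03-01T11:00:00)" # illustrative launch-minus-1h

  scalable_target_action {
    min_capacity = 500
    max_capacity = 2000
  }
}
```

Raising `min_capacity` rather than pinning a desired count lets target tracking keep scaling above the pre-warmed floor if launch traffic exceeds the forecast.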
Load test validation
We ran three progressive load tests: 200K, 800K, and 1.5M simulated concurrent players using k6 with a realistic session simulation script. All three passed. The 1.5M test ran for 20 minutes at steady state with P99 latency of 47ms.
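The shape of one such k6 run can be sketched as below. This is an illustration, not the actual test script: the endpoint, session flow, and per-runner target are assumptions (reaching 1.5M simulated players requires a distributed k6 fleet, with each runner handling a slice like this), and it runs under the k6 binary rather than Node:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

// One runner's slice of a progressive test (numbers illustrative).
export const options = {
  stages: [
    { duration: '10m', target: 5000 }, // ramp up
    { duration: '20m', target: 5000 }, // steady state
    { duration: '5m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<100'],  // launch target: P99 under 100ms
  },
};

const BASE = 'https://api.example-game.test'; // hypothetical endpoint

export default function () {
  // Simulate a minimal game session: login, a few state writes, logout.
  const login = http.post(`${BASE}/v1/session`, JSON.stringify({ device: __VU }));
  check(login, { 'session created': (r) => r.status === 201 });
  for (let i = 0; i < 5; i++) {
    http.put(`${BASE}/v1/state`, JSON.stringify({ tick: i }));
    sleep(1); // think time between player actions
  }
  http.del(`${BASE}/v1/session`);
}
```

Encoding the latency target as a k6 threshold makes each run self-judging: the test fails automatically if P99 drifts above 100ms, instead of relying on someone reading dashboards.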
Key Takeaways
- Database connection exhaustion is the most common scaling failure - PgBouncer is a one-day fix that should be standard for any API using RDS
- Stateless application design is a prerequisite for horizontal scaling - in-memory session state is the most common blocker
- Pre-warming ECS tasks before a known traffic spike eliminates the cold-start lag in auto-scaling response
- Load test at 150% of target capacity, not 100% - production always surprises you in the first hour
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.