Mobile Gaming Studio: Zero-Downtime Auto-Scaling for 2M Player Launch
A mobile game studio was preparing for a global launch expected to spike to 2 million concurrent players. Their backend - a Go API on fixed EC2 instances - had failed a 200K user load test two months before launch. We rebuilt the scaling layer in 5 weeks.
The Challenge
The game backend had been built to run, not to scale. Database connections were not pooled: each API instance opened its own connections directly to RDS. The game state service held in-memory session state that could not be shared across instances. The 200K-user load test brought down the database within 8 minutes. Launch was 8 weeks away.
The Approach
We worked backwards from the launch capacity target: 2M concurrent players, P99 API latency under 100ms, zero data loss on instance termination. The critical path was connection pooling, stateless session management, and horizontal scaling with proper health checks. We did not redesign the application - we changed the infrastructure layer around it.
The Implementation
PgBouncer connection pooling
We deployed PgBouncer in transaction pooling mode in front of RDS. Server-side connection count dropped from one per application worker (2,400 at 200 instances × 12 connections each) to a fixed pool of 100 connections. The database stopped timing out under load.
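The key settings are a minimal sketch of what a transaction-mode PgBouncer config for this setup might look like. The hostname, database name, and exact limits are illustrative, not the studio's actual values:

```ini
[databases]
; hypothetical names; point at the RDS endpoint
gamedb = host=mydb.example.us-east-1.rds.amazonaws.com port=5432 dbname=gamedb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; server connection held only for the transaction
default_pool_size = 100      ; fixed server-side pool, regardless of client count
max_client_conn = 5000       ; thousands of app connections share the pool above
server_idle_timeout = 60
```

Transaction mode is what makes the 2,400-to-100 reduction possible: a server connection is borrowed only for the duration of each transaction, so thousands of mostly idle client connections multiplex onto a small fixed pool.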
Redis session migration
We moved game session state from in-memory Go maps to Redis Cluster on ElastiCache. The migration took 2 days - the Go API changes were localized to a single session.Store interface. Instances became stateless and could scale horizontally.
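The interface boundary is what kept the migration to 2 days. The sketch below is an assumption of what a `session.Store` interface of this shape might look like, shown with an in-memory implementation mirroring the original design; the production fix swaps in a Redis-backed type (e.g. built on a client like go-redis) implementing the same interface, so handler code never changes:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store is a hypothetical sketch of the session.Store interface;
// the method set is an assumption, not the studio's actual code.
type Store interface {
	Get(sessionID string) ([]byte, error)
	Set(sessionID string, state []byte) error
	Delete(sessionID string) error
}

var errNotFound = errors.New("session not found")

// memoryStore mirrors the original in-memory Go map approach.
// A Redis-backed implementation of the same interface replaces it
// in production, making each API instance stateless.
type memoryStore struct {
	mu       sync.RWMutex
	sessions map[string][]byte
}

func newMemoryStore() *memoryStore {
	return &memoryStore{sessions: make(map[string][]byte)}
}

func (m *memoryStore) Get(id string) ([]byte, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	s, ok := m.sessions[id]
	if !ok {
		return nil, errNotFound
	}
	return s, nil
}

func (m *memoryStore) Set(id string, state []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	// copy so callers can't mutate stored state after Set returns
	m.sessions[id] = append([]byte(nil), state...)
	return nil
}

func (m *memoryStore) Delete(id string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.sessions, id)
	return nil
}

func main() {
	var store Store = newMemoryStore()
	store.Set("player-42", []byte(`{"level":7}`))
	state, _ := store.Get("player-42")
	fmt.Println(string(state))
}
```

Because handlers depend only on the interface, the Redis cutover is a one-line change at wiring time, which is what makes this kind of migration localized.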
ECS auto-scaling with predictive warm-up
We migrated the Go API from fixed EC2 to ECS Fargate with target-tracking auto-scaling on CPU and concurrent connection metrics. We added a launch-day scaling schedule to pre-warm 500 Fargate tasks 1 hour before the global launch window.
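A scheduled pre-warm like this can be expressed through Application Auto Scaling. The Terraform fragment below is a sketch under assumed names (cluster, service, timestamp, and capacity bounds are all illustrative): it raises the service's minimum task count to 500 an hour before the launch window, so target tracking scales from warm capacity instead of cold-starting tasks into the spike:

```hcl
# Hypothetical cluster/service names; capacities and timestamp are illustrative.
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/game-cluster/game-api"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 20
  max_capacity       = 2000
}

# Raise the floor to 500 Fargate tasks 1 hour before the launch window.
resource "aws_appautoscaling_scheduled_action" "prewarm" {
  name               = "launch-prewarm"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  schedule           = "at(2024-03-01T11:00:00)" # illustrative launch-minus-1h

  scalable_target_action {
    min_capacity = 500
    max_capacity = 2000
  }
}
```

Raising `min_capacity` rather than pinning a desired count lets target tracking keep scaling above the pre-warmed floor if launch traffic exceeds the forecast.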
Load test validation
We ran three progressive load tests: 200K, 800K, and 1.5M simulated concurrent players using k6 with a realistic session simulation script. All three passed. The 1.5M test ran for 20 minutes at steady state with P99 latency of 47ms.
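The shape of one such k6 run can be sketched as below. This is an illustration, not the actual test script: the endpoint, session flow, and per-runner target are assumptions (reaching 1.5M simulated players requires a distributed k6 fleet, with each runner handling a slice like this), and it runs under the k6 binary rather than Node:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

// One runner's slice of a progressive test (numbers illustrative).
export const options = {
  stages: [
    { duration: '10m', target: 5000 }, // ramp up
    { duration: '20m', target: 5000 }, // steady state
    { duration: '5m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<100'],  // launch target: P99 under 100ms
  },
};

const BASE = 'https://api.example-game.test'; // hypothetical endpoint

export default function () {
  // Simulate a minimal game session: login, a few state writes, logout.
  const login = http.post(`${BASE}/v1/session`, JSON.stringify({ device: __VU }));
  check(login, { 'session created': (r) => r.status === 201 });
  for (let i = 0; i < 5; i++) {
    http.put(`${BASE}/v1/state`, JSON.stringify({ tick: i }));
    sleep(1); // think time between player actions
  }
  http.del(`${BASE}/v1/session`);
}
```

Encoding the latency target as a k6 threshold makes each run self-judging: the test fails automatically if P99 drifts above 100ms, instead of relying on someone reading dashboards.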
Key Takeaways
- Database connection exhaustion is the most common scaling failure - PgBouncer is a one-day fix that should be standard for any API using RDS
- Stateless application design is a prerequisite for horizontal scaling - in-memory session state is the most common blocker
- Pre-warming ECS tasks before a known traffic spike eliminates the cold-start lag in auto-scaling response
- Load test at 150% of target capacity, not 100% - production always surprises you in the first hour
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.