RetailTech Platform: $22K AWS Bill to $10.8K in 6 Weeks
A retail analytics SaaS company had watched their AWS bill grow from $8K to $22K/month over two years without a corresponding growth in customers or revenue. A structured cost audit identified $11K/month in waste across five categories. The fix took 6 weeks and required no application code changes.
The Challenge
The engineering team had optimised for velocity and never looked at the bill systematically. The result: a data pipeline whose On-Demand r5.4xlarge instances ran 24/7 even though the daily batch job took only 4 hours, ElastiCache clusters sized for the previous year's peak traffic, RDS instances running entirely On-Demand with no reserved capacity, and a log aggregation setup shipping application logs straight to CloudWatch Logs at $0.50/GB ingestion.
The Approach
We ran a 3-day cost discovery sprint: pull the AWS Cost and Usage Report, tag every resource with service and team, identify the top-10 cost drivers, then model the savings from each optimisation. We presented a prioritised fix list sorted by impact-to-effort ratio and worked through it in that order.
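The prioritisation step is simple enough to sketch in a few lines. The figures and fix names below are illustrative placeholders, not the client's actual numbers:

```python
# Rank candidate fixes by estimated monthly savings per day of effort.
# Savings and effort estimates here are illustrative, not real audit data.

def prioritise(fixes):
    """Sort fixes by impact-to-effort ratio, highest first."""
    return sorted(fixes, key=lambda f: f["savings_usd"] / f["effort_days"],
                  reverse=True)

candidates = [
    {"name": "logs-to-s3",              "savings_usd": 2250, "effort_days": 3},
    {"name": "pipeline-to-spot",        "savings_usd": 4300, "effort_days": 10},
    {"name": "elasticache-right-size",  "savings_usd": 3200, "effort_days": 2},
    {"name": "rds-reserved",            "savings_usd": 1400, "effort_days": 1},
]

for fix in prioritise(candidates):
    print(f'{fix["name"]}: ${fix["savings_usd"]}/mo, {fix["effort_days"]}d effort')
```

Note that ranking by ratio rather than raw savings is deliberate: a cheap two-day fix that frees $3K/month ships before a ten-day migration that frees $4K.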
The Implementation
Data pipeline: on-demand to Spot with EMR
The Spark data pipeline's job ran for 4 hours every day, but its 8× r5.4xlarge On-Demand instances stayed up around the clock. We migrated it to EMR on EC2 with a mixed Spot/On-Demand instance fleet (80/20), provisioned only during processing windows via a scheduled EventBridge trigger. The pipeline now runs the same job at 73% lower cost.
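The arithmetic behind that kind of saving can be sketched as follows. The On-Demand rate and average Spot discount below are assumptions for illustration (check current us-east-1 pricing), not figures from the engagement:

```python
# Rough cost model: always-on On-Demand fleet vs a scheduled 80/20
# Spot/On-Demand fleet. Prices are illustrative assumptions.
OD_HOURLY = 1.008        # assumed r5.4xlarge On-Demand $/hr (us-east-1)
SPOT_DISCOUNT = 0.70     # assumed average Spot discount vs On-Demand
INSTANCES = 8
HOURS_PER_MONTH = 730

def always_on_cost():
    """Monthly cost of the fleet running 24/7 On-Demand."""
    return INSTANCES * OD_HOURLY * HOURS_PER_MONTH

def scheduled_mixed_cost(job_hours_per_day=4, spot_fraction=0.8):
    """Monthly cost when provisioned only for the job, 80% Spot."""
    hours = job_hours_per_day * 30
    spot = INSTANCES * spot_fraction * hours * OD_HOURLY * (1 - SPOT_DISCOUNT)
    on_demand = INSTANCES * (1 - spot_fraction) * hours * OD_HOURLY
    return spot + on_demand

before, after = always_on_cost(), scheduled_mixed_cost()
print(f"before ~${before:,.0f}/mo, after ~${after:,.0f}/mo")
```

Under these assumptions most of the saving comes from scheduling, not Spot: shutting the fleet down outside the 4-hour window cuts ~83% on its own, and Spot trims the remainder.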
ElastiCache right-sizing and reserved nodes
Three ElastiCache clusters were sized at the previous year's peak with no autoscaling. We right-sized two from cache.r6g.2xlarge to cache.r6g.large (current P95 memory usage was 22% of capacity), and purchased 1-year Reserved Nodes for all three. Total ElastiCache savings: $3,200/month.
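The right-sizing check reduces to one question: would the P95 working set fit on the smaller node with headroom? A minimal sketch, where the usable-memory figures approximate AWS's published `maxmemory` values for these node types and the 90% headroom threshold is our own assumption:

```python
# Decide whether an ElastiCache cluster can drop to a smaller node type.
# Usable-memory figures are approximate; verify against current AWS specs.
NODE_GIB = {
    "cache.r6g.large": 13.07,
    "cache.r6g.2xlarge": 52.82,
}

def fits(current, target, p95_fraction, headroom=0.90):
    """True if the P95 working set fits on the target node with headroom."""
    used_gib = p95_fraction * NODE_GIB[current]
    return used_gib <= headroom * NODE_GIB[target]

# A 2xlarge at 22% P95 memory usage holds ~11.6 GiB, which squeezes
# into a cache.r6g.large (13.07 GiB) with the assumed 90% headroom.
print(fits("cache.r6g.2xlarge", "cache.r6g.large", 0.22))
```

Worth noting: at 22% of a 2xlarge, the downsized `large` node runs close to 90% full, so this is the aggressive end of the trade-off; a quieter target would drop only one size.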
CloudWatch Logs to S3 + Athena
Application logs were being shipped directly to CloudWatch Logs at $0.50/GB ingestion. At 200GB/day that was $3,000/month in ingestion alone. We implemented a tiered logging strategy: errors and warnings to CloudWatch Logs (50GB/day), full logs delivered via Kinesis Data Firehose to S3 and queried via Athena when needed. Log ingestion cost dropped from $3,000 to $750/month.
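The ingestion arithmetic is worth making explicit, because it is the part teams never compute. A sketch using CloudWatch's standard $0.50/GB ingestion list price (us-east-1 at the time of writing; S3 and Firehose per-GB costs, which replace it for the full-log tier, are one to two orders of magnitude lower):

```python
# Monthly CloudWatch Logs ingestion cost at the standard-class rate.
CW_INGEST_PER_GB = 0.50   # us-east-1 list price; confirm current pricing

def monthly_ingest_cost(gb_per_day, days=30):
    """Ingestion cost only; excludes storage and query charges."""
    return gb_per_day * days * CW_INGEST_PER_GB

# Tiered strategy: only the errors/warnings tier (50 GB/day) stays in
# CloudWatch Logs; full logs flow to S3 via Firehose instead.
print(monthly_ingest_cost(50))   # cost of the errors-only tier
```

The lever here is volume routed at $0.50/GB, not retention settings: retention changes storage cost, but ingestion is charged the moment a byte arrives.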
RDS Reserved Instances and idle instance decommission
Five RDS instances had no reserved capacity - all On-Demand. We purchased 1-year Reserved Instances for three production databases (34% discount) and identified two development databases that had not received a connection in 47 days. We created snapshots and terminated both, freeing $1,400/month.
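Idle detection boils down to counting trailing zero days on the `DatabaseConnections` CloudWatch metric. A minimal sketch over plain lists; in practice the daily maxima would come from a CloudWatch `GetMetricStatistics` call, and the 30-day threshold is our own assumption, not an AWS default:

```python
# Flag RDS instances whose DatabaseConnections metric has been zero
# for a sustained trailing window. Sample data is illustrative.

def idle_days(daily_max_connections):
    """Count consecutive trailing days with zero connections."""
    days = 0
    for count in reversed(daily_max_connections):
        if count > 0:
            break
        days += 1
    return days

def decommission_candidate(daily_max_connections, threshold_days=30):
    """True once the instance has been idle past the threshold."""
    return idle_days(daily_max_connections) >= threshold_days

metrics = [3, 5, 2] + [0] * 47   # active, then idle for 47 days
print(idle_days(metrics))
print(decommission_candidate(metrics))
```

Snapshot-then-terminate keeps this reversible: if a flagged database turns out to matter, it can be restored from the final snapshot.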
Key Takeaways
- CloudWatch Logs ingestion cost is invisible until you look - at $0.50/GB it is the single most common surprise in AWS cost audits
- Spot instances for batch workloads are the highest-ROI change in most data infrastructure audits - 60–80% savings with zero application changes
- Reserved Instances and Savings Plans should be reviewed every quarter - teams consistently leave this on the table for 12–18 months
- Two idle RDS instances sat unused for 47 days while consuming $1,400/month - a weekly cost audit would have caught them in week two
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.