SaaS · 2025-09

SaaS Platform: Scaling from 5 to 50 Engineers

A B2B SaaS company grew from 5 to 50 engineers over 18 months. The deployment infrastructure did not scale with the team - by the time they engaged us, deploys were blocking multiple teams and the monolith was starting to split into services.

Deploy Time: 25 minutes → 4 minutes
Deploy Frequency: 8/week (all teams combined) → 80+/week
Platform Bottleneck: platform team blocked 40% of deploys → platform team involved in <2% of deploys
Cost Impact: - → $12K/month saved via Spot instances

The Challenge

What works for 5 engineers deploying one application breaks catastrophically at 50 engineers deploying 12 services. The team had a working CI pipeline but no GitOps, no environment management, and no way for multiple teams to deploy independently without stepping on each other. The platform team was spending 40% of their time unblocking other teams on deployment issues instead of building platform capabilities.

The Approach

This engagement was less about fixing broken infrastructure and more about building a scalable platform that multiple product teams could use independently. We introduced GitOps with ArgoCD, built per-team namespaces in Kubernetes, standardized on Helm charts, and created a self-service deployment process that eliminated the platform team as a bottleneck.

The Implementation

Kubernetes cluster setup on EKS

We provisioned an EKS cluster using Terraform with separate node groups for production workloads, batch jobs, and spot instances. We set up cluster autoscaler, AWS Load Balancer Controller, external-dns, and cert-manager.
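The node-group split can be sketched in Terraform. This is an illustrative sketch assuming the community terraform-aws-modules/eks module; the cluster name, instance types, and sizes are placeholders, not the client's actual configuration.

```hcl
# Sketch: one EKS cluster, three managed node groups.
# Sizes, instance types, and names are illustrative assumptions.
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "platform"
  cluster_version = "1.29"

  eks_managed_node_groups = {
    production = {
      instance_types = ["m6i.xlarge"]
      min_size       = 3
      max_size       = 12
      labels         = { workload = "production" }
    }
    batch = {
      instance_types = ["c6i.2xlarge"]
      min_size       = 0
      max_size       = 20
      labels         = { workload = "batch" }
      # Taint keeps general workloads off the batch nodes.
      taints = [{ key = "workload", value = "batch", effect = "NO_SCHEDULE" }]
    }
    spot = {
      capacity_type  = "SPOT"
      # Multiple instance types widen the Spot pool and reduce interruptions.
      instance_types = ["m6i.xlarge", "m5.xlarge", "m6a.xlarge"]
      min_size       = 0
      max_size       = 30
      labels         = { workload = "spot" }
    }
  }
}
```

Non-critical workloads tolerate Spot interruptions because the cluster autoscaler replaces reclaimed capacity; this split is what produced the $12K/month saving.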

AWS EKS · Terraform · Helm · cluster-autoscaler

GitOps with ArgoCD

We deployed ArgoCD and set up an Application of Applications pattern. Each team owns their ArgoCD Application manifest. When they merge to main, ArgoCD automatically syncs their services to the cluster. No central approval bottleneck.
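The root of the app-of-apps pattern looks roughly like the following. The repository URL, paths, and names are placeholders; the sync policy shown is an assumption about a typical setup, not the client's exact manifest.

```yaml
# Illustrative root Application: it points at a directory where each
# team commits its own ArgoCD Application manifest.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: teams-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-config  # placeholder
    targetRevision: main
    path: teams            # one Application manifest per team lives here
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true          # deletions merged to main are removed from the cluster
      selfHeal: true       # manual drift is reverted to the state in Git
```

With `automated` sync on the child Applications too, a merge to main is all a team needs to ship; no one files a ticket with the platform team.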

ArgoCD · Helm · GitHub

Standardized Helm chart library

We created a base Helm chart that encodes the company's deployment standards: resource requests/limits, liveness/readiness probes, PodDisruptionBudget, HPA configuration, and security contexts. Teams extend the base chart rather than writing from scratch.
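Extending the base chart is a one-line dependency plus value overrides. The chart names, registry URL, and version constraint below are hypothetical, but the shape matches the standard Helm dependency mechanism against an OCI registry.

```yaml
# Hypothetical Chart.yaml for a team-owned service.
# The team writes no templates of its own; it pulls the shared
# base chart and overrides values.
apiVersion: v2
name: billing-api
version: 0.1.0
dependencies:
  - name: service-base                          # the shared base chart
    version: "1.x.x"                            # track the base chart's 1.x line
    repository: oci://registry.example.com/charts  # placeholder registry
```

Probes, resource limits, PodDisruptionBudgets, and security contexts come from `service-base` defaults, so a new service inherits the company standards on day one and only declares what is genuinely service-specific.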

Helm · OCI Registry

Observability stack

We deployed Prometheus, Grafana, and Loki, built per-team dashboards with SLO tracking, and set up PagerDuty routing so each team owns on-call for its own services.
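Per-team routing hinges on alert labels. The sketch below assumes the Prometheus Operator's `PrometheusRule` CRD; the team name, namespace, metric, and threshold are illustrative, not the client's actual SLOs.

```yaml
# Illustrative per-team SLO alert. Alertmanager matches on the
# `team` label and routes to that team's PagerDuty service.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: billing-slo
  namespace: team-billing
spec:
  groups:
    - name: billing.slo
      rules:
        - alert: BillingHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="team-billing",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{namespace="team-billing"}[5m])) > 0.01
          for: 10m
          labels:
            team: billing      # routing key for Alertmanager -> PagerDuty
            severity: page
```

Because every alert carries a `team` label, a single Alertmanager config fans out to each team's PagerDuty service, which is what makes per-team on-call ownership workable.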

Prometheus · Grafana · Loki · PagerDuty

Key Takeaways

  • GitOps eliminates the platform team as a deploy bottleneck - the most impactful architecture decision in this engagement
  • Standardized Helm base charts reduce per-service setup time from days to hours
  • Spot instances for non-critical workloads saved $12K/month without any reliability impact
  • Per-team on-call ownership combined with team-specific dashboards reduced MTTR by 70%

Facing Similar Challenges?

Book a free 30-minute audit and I will tell you what I see.

Book Free Audit
All case studies