The typical startup discovers its cloud spend problem one of two ways: the CFO sees a $40,000 AWS bill for the first time, or someone gets paged because the AWS budget alert fired at 180% of plan.
By that point, costs have been growing for months. The pattern is always the same: infrastructure provisioned for a load test that was never torn down, oversized RDS instances from an early architect who "wanted headroom", and Kubernetes clusters running at 15% average utilisation.
FinOps is the practice of making cloud cost a first-class engineering concern - tracked, owned, and optimised continuously by the engineering team, not discovered quarterly by finance.
The Three Questions FinOps Answers
- What are we spending? - Complete, tagged, attributable spend data
- What are we getting for it? - Unit economics (cost per request, cost per customer)
- What should we cut? - Idle resources, over-provisioned instances, unused reservations
Most teams can answer question 1 poorly and have no answer for questions 2 and 3.
Step 1: Tag Everything
Cost attribution requires tags. Without tags, you know you spent $80,000 last month - but not which team, service, or environment caused it.
Enforce tags in Terraform:
```hcl
# Define required tags in a locals block
locals {
  required_tags = {
    Environment = var.environment   # production, staging, dev
    Team        = var.team          # backend, data, platform
    Service     = var.service_name  # api, worker, ml-pipeline
    ManagedBy   = "terraform"
  }
}

# Apply to all resources via provider default_tags
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = local.required_tags
  }
}
```
Enable AWS Cost Allocation Tags for Environment, Team, Service in the billing console. It takes 24 hours for tags to appear in Cost Explorer.
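Before enforcement is in place, it helps to know how compliant the existing fleet is. A minimal audit sketch in Python - the resource dicts and ARNs are illustrative; in practice they would come from the AWS Resource Groups Tagging API:

```python
# Flag resources missing any required cost-allocation tag.
# The resource dicts below are illustrative; real inputs would come from
# the Resource Groups Tagging API (get_resources).

REQUIRED_TAGS = {"Environment", "Team", "Service"}

def missing_tags(resource: dict) -> set:
    """Return the set of required tag keys absent from a resource."""
    return REQUIRED_TAGS - set(resource.get("Tags", {}))

def untagged_report(resources: list) -> list:
    """List (resource ARN, missing tag keys) for non-compliant resources."""
    return [(r["Arn"], missing_tags(r)) for r in resources if missing_tags(r)]

if __name__ == "__main__":
    fleet = [
        {"Arn": "arn:aws:ec2:us-east-1:123:instance/i-aaa",
         "Tags": {"Environment": "production", "Team": "backend", "Service": "api"}},
        {"Arn": "arn:aws:rds:us-east-1:123:db/orders",
         "Tags": {"Environment": "production"}},
    ]
    for arn, missing in untagged_report(fleet):
        print(f"{arn} missing: {sorted(missing)}")
```

Run this weekly and the compliance gap becomes a number you can drive to zero.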
Use AWS Organizations Service Control Policies to prevent creating resources without tags:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTagsOnEC2",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": {
          "aws:RequestTag/Environment": "true"
        }
      }
    }
  ]
}
```

Note the condition key: for tags applied at creation time (RunInstances), the request is matched with aws:RequestTag, not ec2:ResourceTag - the resource does not exist yet.
Step 2: Build a Cost Dashboard
AWS Cost Explorer is good but not actionable. Build a Grafana dashboard that shows cost alongside engineering metrics - request volume, active users, deployments.
Use the AWS Cost and Usage Report (CUR) to get detailed billing data, and Athena to query it:
```sql
-- Daily cost by service (last 30 days)
SELECT
  line_item_product_code,
  resource_tags_user_service,
  DATE(line_item_usage_start_date) AS date,
  SUM(line_item_unblended_cost) AS cost
FROM cur_report
WHERE line_item_usage_start_date >= DATE_ADD('day', -30, NOW())
  AND line_item_line_item_type = 'Usage'
GROUP BY 1, 2, 3
ORDER BY date DESC, cost DESC
```
Connect this to Grafana via Athena data source, and you have a daily cost breakdown per service on your engineering dashboards.
The goal: engineers open the same dashboard for latency, error rate, and cloud cost. Cost is an engineering metric, not a finance report.
Step 3: Find and Kill Idle Resources
This is where most of the savings are. The common culprits:
Idle EC2 instances:
```bash
# Find instances with <5% CPU over the last 2 weeks
# (daily datapoints - average them to judge idleness; GNU date syntax)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --period 86400 \
  --statistics Average \
  --dimensions Name=InstanceId,Value=i-1234567890 \
  --start-time $(date -d '14 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
```
Or use AWS Compute Optimizer - it analyses usage patterns and recommends rightsizing automatically.
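Whichever tool fetches the metrics, the decision rule is the same: average the CPU datapoints over the window and flag anything below a threshold. A sketch using the 5% cut-off from above - the datapoint shape mirrors the CloudWatch get-metric-statistics response, and the sample values are illustrative:

```python
# Flag an instance as idle when its average CPU over the window is
# below a threshold. Datapoints mirror the CloudWatch response shape
# ({"Average": ...}); sample values are illustrative.

IDLE_CPU_THRESHOLD = 5.0  # percent

def average_cpu(datapoints: list) -> float:
    """Mean of the Average field across CloudWatch datapoints."""
    if not datapoints:
        return 0.0
    return sum(d["Average"] for d in datapoints) / len(datapoints)

def is_idle(datapoints: list, threshold: float = IDLE_CPU_THRESHOLD) -> bool:
    return average_cpu(datapoints) < threshold

if __name__ == "__main__":
    two_weeks = [{"Average": 3.1}, {"Average": 2.4}, {"Average": 4.0}]
    print(is_idle(two_weeks))  # averaging ~3% CPU: a termination candidate
```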
Unattached EBS volumes:
```bash
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query "Volumes[*].{ID:VolumeId,SizeGiB:Size,Type:VolumeType}"
```
A status of available means the volume is not attached to any instance. These are often orphans left behind by terminated instances.
Oversized RDS instances:
RDS is usually the biggest cost after EC2/EKS. Compare your instance class against actual CPU and connection metrics in CloudWatch. An r6g.4xlarge at 5% average CPU is an r6g.large waiting to happen.
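The arithmetic behind that claim: size the target instance to actual load plus headroom. A sketch, assuming a 2x safety factor (my assumption, not a rule) and the published vCPU counts for the r6g family:

```python
# Rough rightsizing: needed vCPUs = current vCPUs * utilisation * headroom.
# vCPU counts are the published r6g sizes; 2x headroom is an assumption.
R6G_VCPUS = {"large": 2, "xlarge": 4, "2xlarge": 8, "4xlarge": 16, "8xlarge": 32}

def suggest_size(current_size: str, avg_cpu_pct: float, headroom: float = 2.0) -> str:
    needed = R6G_VCPUS[current_size] * (avg_cpu_pct / 100) * headroom
    # Smallest size whose vCPU count covers the needed capacity
    for size, vcpus in sorted(R6G_VCPUS.items(), key=lambda kv: kv[1]):
        if vcpus >= needed:
            return size
    return current_size

print(suggest_size("4xlarge", avg_cpu_pct=5))  # 16 vCPUs at 5% fits in "large"
```

Even with generous headroom, chronically idle databases fall several sizes.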
Unused Elastic IPs:
```bash
aws ec2 describe-addresses \
  --query "Addresses[?!AssociationId].[AllocationId,PublicIp]"
```
Unassociated EIPs cost $0.005/hour each (roughly $3.60/month) - small individually but worth cleaning up.
Zombie environments: Run a weekly audit of all EC2, RDS, and EKS clusters. Any environment without a corresponding active GitHub branch or business reason gets flagged for deletion.
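The branch-matching part of that audit reduces to a set difference. A sketch - the environment and branch names are illustrative; real inputs would come from your tagging data and the GitHub API, and the allowlist of long-lived environments is an assumption:

```python
# Environments that are neither long-lived nor backed by an active branch
# are flagged as zombies. Names are illustrative.

def zombie_environments(environments: set, active_branches: set,
                        allowlist: frozenset = frozenset({"production", "staging"})) -> set:
    """Environments with no active branch and no standing business reason."""
    return environments - active_branches - allowlist

envs = {"production", "staging", "feature-login", "loadtest-2023"}
branches = {"main", "feature-login"}
print(sorted(zombie_environments(envs, branches)))  # ['loadtest-2023']
```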
Step 4: Right-size Kubernetes
Kubernetes over-provisioning is silent but expensive. The issue: teams set resource requests too high "to be safe", and the cluster autoscaler scales up nodes to accommodate the requests - even though actual usage is a fraction of the request.
Check actual vs requested CPU per namespace:
```bash
kubectl top pods -A --sort-by=cpu | head -30
```
Then compare against resource requests:
```bash
kubectl describe pod <pod-name> | grep -A3 "Limits\|Requests"
```
If actual CPU usage is consistently 10–20% of the request, cut the request. The cluster can fit more pods per node, reducing the node count.
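The node-count impact of cutting requests is simple arithmetic. A sketch, assuming uniform pods, CPU as the binding resource, and ~3.8 allocatable cores on a 4-vCPU node after system reservations (all assumptions - real schedulers also bin-pack memory):

```python
import math

# Nodes needed = ceil(total requested CPU / allocatable CPU per node).
# Cutting requests to match actual usage shrinks the numerator.
# 3800m allocatable per node is an assumed figure for a 4-vCPU node.

def nodes_needed(pods: int, cpu_request_m: int, node_allocatable_m: int = 3800) -> int:
    """Node count to fit `pods` pods at a given CPU request (millicores)."""
    return math.ceil(pods * cpu_request_m / node_allocatable_m)

# 60 pods requesting 1000m each, but actually using ~150m:
before = nodes_needed(60, 1000)  # sized to the inflated request
after = nodes_needed(60, 300)    # request cut to 2x actual usage
print(before, "->", after)       # 16 -> 5 nodes
```

Cutting the request from 1000m to 300m here drops the cluster from 16 nodes to 5 - roughly a two-thirds reduction in node spend for those workloads.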
Use Goldilocks to get automated recommendations:
```bash
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm upgrade --install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace

kubectl label namespace production goldilocks.fairwinds.com/enabled=true
```
Goldilocks runs VPA (Vertical Pod Autoscaler) in recommendation mode and generates a dashboard showing optimal resource requests per pod based on actual usage. Visit localhost:8080 after port-forwarding:
```bash
kubectl port-forward svc/goldilocks-dashboard 8080:80 -n goldilocks
```
Step 5: Savings Plans and Reserved Instances
For stable, predictable workloads, Compute Savings Plans give up to 66% discount vs On-Demand in exchange for a 1- or 3-year commitment.
The analysis:
- Look at your 30-day average On-Demand spend in Cost Explorer
- Identify the stable baseline - the compute you run 24/7 regardless of load
- Buy Savings Plans to cover that baseline
- Let Spot instances cover the variable load on top
Do not commit more than your stable baseline. Savings Plans have no refunds. The right coverage for most startups is 50–70% of average compute spend.
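The baseline analysis above can be sketched numerically: take hourly On-Demand spend, treat a low percentile as the always-on floor, and commit to a fraction of it. The 60% coverage default follows the 50-70% guidance in the text; the p10 floor and the sample spend profile are my assumptions:

```python
# Pick a Savings Plan commitment from hourly On-Demand spend history.
# The floor is a low percentile of hourly spend (the always-on baseline);
# the commitment covers only part of it, per the 50-70% guidance.

def sp_commitment(hourly_spend: list, coverage: float = 0.6) -> float:
    """$/hour commitment: coverage fraction of roughly the p10 hour."""
    floor = sorted(hourly_spend)[len(hourly_spend) // 10]  # ~10th percentile
    return round(floor * coverage, 2)

# Illustrative week: nights at ~$40/h, daytime peaks at ~$100/h
spend = [40.0] * 80 + [70.0] * 60 + [100.0] * 28
print(sp_commitment(spend))  # commits against the $40/h floor, not the peak
```

The point of the percentile: one quiet holiday week should not drag your commitment below the true baseline, and one launch spike should never inflate it.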
The Unit Economics Number That Matters
Every FinOps practice ultimately aims at one number: cost per unit of value delivered.
Depending on your business model:
- SaaS: cost per customer per month (cloud cost / active customers)
- API: cost per 1,000 API calls
- Marketplace: cost per transaction processed
Build this metric into your engineering dashboard. When engineering decisions increase feature velocity but also increase cloud cost proportionally, that is fine. When cloud cost grows faster than the unit of value, that is the signal to investigate.
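As a concrete sketch of that signal: compute the ratio each month and flag when cost per unit rose. The dollar and customer figures below are illustrative:

```python
# Cost per customer, plus the month-over-month check described above:
# cloud cost growing faster than the customer base is the signal.

def cost_per_customer(cloud_cost: float, active_customers: int) -> float:
    return round(cloud_cost / active_customers, 2)

def unit_cost_regressing(prev_cost: float, prev_customers: int,
                         cur_cost: float, cur_customers: int) -> bool:
    """True when cost per customer rose month over month."""
    return (cost_per_customer(cur_cost, cur_customers)
            > cost_per_customer(prev_cost, prev_customers))

# Illustrative: spend grew 50% while the customer base grew only 20%
print(cost_per_customer(80_000, 4_000))                      # $20/customer
print(unit_cost_regressing(80_000, 4_000, 120_000, 4_800))   # True
```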
For most Series A startups, a healthy range is $10–$30 cloud cost per customer per month (depending on infrastructure-intensity of the product).
Where to Start This Week
- Enable AWS Cost Anomaly Detection - free, and it will email you when spend spikes unexpectedly. Set a threshold at 20% over the week-over-week average.
- Tag your top 5 most expensive resources - go to Cost Explorer today, find the five biggest line items, trace them to a team and service, and tag them.
- Find your three biggest idle resources - unattached EBS volumes, stopped EC2 instances, dev environments running on weekends.
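The 20% week-over-week threshold from the first step is easy to express. A conceptual sketch of the comparison (not how Cost Anomaly Detection is implemented internally; the daily figures are illustrative):

```python
# Spike check: alert when this week's average daily spend exceeds last
# week's by more than the threshold (20%, matching the suggestion above).

def spend_spiked(last_week_daily: list, this_week_daily: list,
                 threshold: float = 0.20) -> bool:
    prev = sum(last_week_daily) / len(last_week_daily)
    cur = sum(this_week_daily) / len(this_week_daily)
    return cur > prev * (1 + threshold)

print(spend_spiked([1000.0] * 7, [1300.0] * 7))  # a 30% jump trips the alert
```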
These three steps take a day and typically surface $3K–$8K/month in immediate savings.
Want a full FinOps audit of your AWS/GCP/Azure environment? Book a free audit - we will identify your top cost reduction opportunities and give you a prioritised fix list.