The average funded startup running Kubernetes on EKS overpays by 40–60% on compute. The reason is almost always the same: everything runs on On-Demand instances, and the cluster is sized for peak load all the time.
Here is how to fix it without rewriting your applications.
Why Spot Instances Save So Much
AWS Spot Instances are spare EC2 capacity sold at a discount - typically 60–90% cheaper than On-Demand. The trade-off: AWS can reclaim them with a 2-minute warning. In Kubernetes, that warning triggers a graceful node drain. Your pods reschedule. Traffic continues.
For most workloads - web APIs, background workers, data pipelines - this interruption is invisible to users if handled correctly.
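To make the discount concrete, here is a back-of-the-envelope sketch. The hourly rate and discount are illustrative assumptions, not current AWS pricing - real Spot prices vary by region, AZ, and instance type:

```python
# Illustrative rates only - check the EC2 pricing page for real numbers.
ON_DEMAND_HOURLY = 0.192   # assumed m5.xlarge On-Demand rate, USD/hr
SPOT_DISCOUNT = 0.70       # assume a 70% discount, mid-range of 60-90%
HOURS_PER_MONTH = 730

on_demand_monthly = ON_DEMAND_HOURLY * HOURS_PER_MONTH
spot_monthly = on_demand_monthly * (1 - SPOT_DISCOUNT)

# At these assumed rates: roughly $140/mo vs $42/mo per instance
print(f"On-Demand: ${on_demand_monthly:.2f}/mo, Spot: ${spot_monthly:.2f}/mo")
```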
Why Karpenter Instead of Cluster Autoscaler
Cluster Autoscaler (CA) was the default for years. It works but has limitations:
- You define node groups upfront with fixed instance types
- Scaling decisions are reactive and slow (often 1–2 minutes to add capacity)
- Spot diversification requires multiple node groups and complex configuration
Karpenter is the modern replacement:
- Provisions nodes directly via the EC2 Fleet API - no node groups needed
- Selects the most cost-effective available instance type automatically
- Responds to pending pods in under 30 seconds
- Built-in Spot interruption handling via its SQS interruption queue
The Setup
1. Install Karpenter
```bash
# Karpenter v1 charts are published as OCI artifacts on ECR
# (the old charts.karpenter.sh Helm repo is deprecated)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --set settings.clusterName=$CLUSTER_NAME \
  --set settings.interruptionQueue=$SQS_QUEUE_NAME \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi
```
You need an IAM role for Karpenter with EC2 Fleet, EC2 Spot, and IAM permissions. Use the official Terraform module - do not hand-write this.
2. Create Node Pools
The key is creating separate node pools for different workload types:
```yaml
# Spot pool for stateless workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  template:
    metadata:
      labels:
        node-type: spot
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]  # compute, general-purpose, memory
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]  # modern instance families only
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
---
# On-Demand pool for stateful / critical workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-critical
spec:
  template:
    metadata:
      labels:
        node-type: on-demand
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```
3. Configure Workload Placement
Tell each deployment which pool it belongs in:
```yaml
# Stateless API - runs on Spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      nodeSelector:
        node-type: spot
      tolerations:
        - key: karpenter.sh/capacity-type
          value: spot
          effect: NoSchedule
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: your-api:latest
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
```
The topologySpreadConstraints section is critical - it spreads pods across availability zones, so a Spot interruption in one AZ does not take down all your replicas.
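To see what maxSkew: 1 actually enforces, here is the arithmetic the scheduler applies, sketched outside Kubernetes: skew is the gap between the most- and least-populated eligible zones, and a placement that would push skew above the limit is rejected under whenUnsatisfiable: DoNotSchedule.

```python
def zone_skew(pod_zones, all_zones):
    """Skew = max pods in any eligible zone minus min pods in any eligible zone."""
    counts = {z: 0 for z in all_zones}
    for z in pod_zones:
        counts[z] += 1
    return max(counts.values()) - min(counts.values())

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

# One replica per AZ: skew 0, satisfies maxSkew: 1
even = zone_skew(["us-east-1a", "us-east-1b", "us-east-1c"], zones)

# Two replicas piled into one AZ: skew 2, violates maxSkew: 1
uneven = zone_skew(["us-east-1a", "us-east-1a", "us-east-1b"], zones)

print(even, uneven)  # 0 2
```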
4. Handle Spot Interruptions Gracefully
Install the AWS Node Termination Handler to process Spot interruption notices before AWS reclaims the node:
```bash
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true \
  --set enableSqsTerminationDraining=true \
  --set queueURL=$SQS_QUEUE_URL
```
With this in place, when AWS signals a Spot interruption:
- The handler cordons the node (no new pods are scheduled onto it)
- It drains the node (existing pods are gracefully evicted)
- Karpenter provisions a replacement node in parallel
- Your pods reschedule - typically within 60–90 seconds
For most web APIs, this is below the timeout threshold of upstream load balancers. Users see nothing.
What to Keep On-Demand
Not everything should run on Spot. Keep On-Demand for:
- Stateful workloads (databases, if you run them in K8s - which you probably should not, but if you do)
- System-critical pods - cert-manager, external-dns, Karpenter itself
- Jobs that cannot be interrupted mid-run - billing processors, compliance audit jobs, one-off migrations
A reasonable target is 70–80% Spot, 20–30% On-Demand.
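The blended savings from that split can be sketched as follows. The rates below are assumptions for illustration; plug in your actual instance pricing:

```python
# Assumed rates for illustration - not current AWS pricing.
ON_DEMAND_HOURLY = 0.192   # e.g. m5.xlarge On-Demand
SPOT_HOURLY = 0.058        # assumed ~70% Spot discount on the same type
HOURS = 730
NODES = 15

def monthly_cost(spot_fraction):
    """Blended monthly cost for a fleet split between Spot and On-Demand."""
    spot_nodes = NODES * spot_fraction
    od_nodes = NODES - spot_nodes
    return HOURS * (spot_nodes * SPOT_HOURLY + od_nodes * ON_DEMAND_HOURLY)

all_od = monthly_cost(0.0)
blended = monthly_cost(0.75)   # 75% Spot, 25% On-Demand
print(f"All On-Demand: ${all_od:.0f}/mo, 75/25 blend: ${blended:.0f}/mo")
```

At these assumed rates the 75/25 blend roughly halves the bill from the mix alone; consolidation and right-sizing push the total savings further.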
Real Numbers
Here is what this looks like for a typical Series A startup:
| Before | After |
|---|---|
| 15x m5.xlarge On-Demand 24/7 | Karpenter mix: 70% Spot, 30% On-Demand |
| ~$3,200/month compute | ~$900/month compute |
| Manual scaling | Auto-scaling in <30s |
| Fixed instance types | Best available from c5, m5, m6i, r5 families |
That is ~$2,300/month saved - $27,600/year - without changing a line of application code.
Common Mistakes
Mistake 1: Not setting resource requests/limits. Karpenter sizes nodes based on pod resource requests. If your pods have no requests set, Karpenter cannot make accurate decisions and will either over-provision or leave pods pending.
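A pod with no requests looks to Karpenter like it needs zero resources. The sketch below shows the check to run against your pod specs - in practice you would feed it objects parsed from `kubectl get pods -o json`:

```python
def containers_missing_requests(pod_spec):
    """Return names of containers with no CPU or memory request set."""
    missing = []
    for container in pod_spec.get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests or "memory" not in requests:
            missing.append(container["name"])
    return missing

pod = {
    "containers": [
        {"name": "api",
         "resources": {"requests": {"cpu": "500m", "memory": "512Mi"}}},
        {"name": "sidecar", "resources": {}},  # no requests: invisible to Karpenter
    ]
}
print(containers_missing_requests(pod))  # ['sidecar']
```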
Mistake 2: Using too few instance types.
If you only allow m5.large, Spot availability becomes unpredictable. Open up 3–5 instance families and let Karpenter pick the cheapest available.
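The effect of widening requirements can be sketched as a filter over candidate instance types. The list and attributes below are a tiny illustrative sample, not live AWS data - the point is that broader categories mean a much deeper pool of Spot capacity to draw from:

```python
# (category, generation, name) - illustrative sample only, not live AWS data
CANDIDATES = [
    ("m", 5, "m5.large"), ("m", 6, "m6i.large"),
    ("c", 5, "c5.large"), ("c", 6, "c6i.large"),
    ("r", 5, "r5.large"), ("r", 6, "r6i.large"),
    ("t", 3, "t3.large"),
]

def eligible(categories, min_generation):
    """Mimic NodePool requirements: category In ..., generation Gt ..."""
    return [name for cat, gen, name in CANDIDATES
            if cat in categories and gen > min_generation]

print(len(eligible({"m"}, 2)))            # 2 candidates: m-family only
print(len(eligible({"c", "m", "r"}, 2)))  # 6 candidates: far better Spot odds
```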
Mistake 3: Running stateful apps on Spot without a disruption budget. Set a PodDisruptionBudget to ensure at least one replica is always running during node drains:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: api
```
Mistake 4: Forgetting consolidation.
Karpenter's consolidation policy removes underutilized nodes automatically. Enable WhenEmptyOrUnderutilized on Spot pools to avoid paying for nodes running at 10% utilization.
Getting Started
If you are starting from scratch: install Karpenter, create a Spot node pool and an On-Demand node pool, then migrate workloads over one deployment at a time. Start with your least critical services.
If you already have Cluster Autoscaler and node groups: Karpenter can coexist with CA during migration. Run them in parallel, migrate services to Karpenter-managed nodes, then remove the CA node groups once the migration is complete.
Want us to do this for your cluster? Book a free audit - we will review your current EKS setup and tell you exactly what the savings opportunity looks like.