Cloud Cost · March 8, 2026 · 6 min read

Cut Your EKS Bill by 60–80% with Spot Instances and Karpenter

Most EKS clusters run entirely on On-Demand instances and overpay by 60% or more. Here's the exact setup - Karpenter node pools, spot interruption handling, and workload placement - that we use to reduce bills without touching application code.

The average funded startup running Kubernetes on EKS overpays by 40–60% on compute. The reason is almost always the same: everything runs on On-Demand instances, and the cluster is sized for peak load all the time.

Here is how to fix it without rewriting your applications.

Why Spot Instances Save So Much

AWS Spot Instances are spare EC2 capacity sold at a discount - typically 60–90% cheaper than On-Demand. The trade-off: AWS can reclaim them with a 2-minute warning. In Kubernetes, that warning triggers a graceful node drain. Your pods reschedule. Traffic continues.

For most workloads - web APIs, background workers, data pipelines - this interruption is invisible to users if handled correctly.
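You can see the warning directly on an instance: AWS publishes the notice at the IMDS path `/latest/meta-data/spot/instance-action`, which returns 404 until an interruption is scheduled and a small JSON body once one is. A minimal parsing sketch - the response body is inlined here with a hypothetical timestamp, so no EC2 instance is needed to run it:

```shell
# On a real instance you would fetch this with:
#   curl -s http://169.254.169.254/latest/meta-data/spot/instance-action
# Sample response body (hypothetical timestamp) inlined for illustration:
response='{"action": "terminate", "time": "2026-03-08T12:00:00Z"}'

# Extract the scheduled termination time (naive parse; jq would be cleaner)
term_time=$(echo "$response" | sed -n 's/.*"time": "\([^"]*\)".*/\1/p')
echo "spot interruption at: $term_time"
```

In a Karpenter/NTH setup you never have to poll this endpoint yourself - the tooling described below consumes these events for you.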

Why Karpenter Instead of Cluster Autoscaler

Cluster Autoscaler (CA) was the default for years. It works but has limitations:

  • You define node groups upfront with fixed instance types
  • Scaling decisions are reactive and slow (1–2 minutes)
  • Spot diversification requires multiple node groups and complex configuration

Karpenter is the modern replacement:

  • Provisions nodes directly via EC2 Fleet API - no node groups needed
  • Selects the most cost-effective available instance type automatically
  • Responds to pending pods in under 30 seconds
  • Built-in Spot interruption handling via an SQS interruption queue - no separate tooling required for basic draining

The Setup

1. Install Karpenter

```bash
# Karpenter v1 charts are published to the public ECR OCI registry
# (the old charts.karpenter.sh repo only hosts pre-v1 releases)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --set settings.clusterName=$CLUSTER_NAME \
  --set settings.interruptionQueue=$SQS_QUEUE_NAME \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi
```

You need an IAM role for Karpenter with EC2 Fleet, EC2 Spot, and IAM permissions. Use the official Terraform module - do not hand-write this.

2. Create Node Pools

The key is creating separate node pools for different workload types:

```yaml
# Spot pool for stateless workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  template:
    metadata:
      labels:
        node-type: spot
    spec:
      # Keep workloads off Spot unless they explicitly tolerate it
      taints:
        - key: karpenter.sh/capacity-type
          value: spot
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"] # compute (c), general-purpose (m), memory-optimized (r)
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"] # modern instance families only
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
---
# On-Demand pool for stateful / critical workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-critical
spec:
  template:
    metadata:
      labels:
        node-type: on-demand
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```

3. Configure Workload Placement

Tell each deployment which pool it belongs in:

```yaml
# Stateless API - runs on Spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      nodeSelector:
        node-type: spot
      tolerations:
        - key: karpenter.sh/capacity-type
          value: spot
          effect: NoSchedule
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: your-api:latest
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
```

The topologySpreadConstraints section is critical - it spreads pods across availability zones, so a Spot interruption in one AZ does not take down all your replicas.

4. Handle Spot Interruptions Gracefully

Install the AWS Node Termination Handler to process Spot interruption notices before AWS reclaims the node. (Karpenter's interruption queue already drains Karpenter-provisioned nodes; NTH adds coverage for any remaining non-Karpenter node groups.)

```bash
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true \
  --set enableSqsTerminationDraining=true \
  --set queueURL=$SQS_QUEUE_URL
```

With this in place, when AWS signals a Spot interruption:

  1. The handler cordons the node (no new pods scheduled)
  2. Drains the node (existing pods gracefully evicted)
  3. Karpenter provisions a replacement node in parallel
  4. Your pods reschedule - typically within 60–90 seconds

For most web APIs, this is below the timeout threshold of upstream load balancers. Users see nothing.
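To make the most of the 2-minute window, give pods time to finish in-flight requests during the drain. A minimal fragment to merge into your deployment spec - the sleep length and grace period here are illustrative, not prescriptive:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      # Total time a pod gets between SIGTERM and SIGKILL during a drain
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                # Brief pause so the load balancer stops routing new
                # traffic before the process receives SIGTERM
                command: ["sh", "-c", "sleep 10"]
```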

What to Keep On-Demand

Not everything should run on Spot. Keep On-Demand for:

  • Stateful workloads (databases if running in K8s, which you probably should not be, but if you are)
  • System-critical pods - cert-manager, external-dns, Karpenter itself
  • Jobs that cannot be interrupted mid-run - billing processors, compliance audit jobs, one-off migrations

A reasonable target is 70–80% Spot, 20–30% On-Demand.
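Karpenter labels every node it provisions with `karpenter.sh/capacity-type`, so you can measure your current split from `kubectl get nodes -L karpenter.sh/capacity-type`. A quick sketch of the pipeline - sample output is inlined here (hypothetical node names) so it runs standalone:

```shell
# Sample output of: kubectl get nodes -L karpenter.sh/capacity-type --no-headers
# (last column is the capacity type; node names are hypothetical)
sample='node-a Ready <none> 5d v1.29.0 spot
node-b Ready <none> 5d v1.29.0 spot
node-c Ready <none> 5d v1.29.0 spot
node-d Ready <none> 5d v1.29.0 on-demand'

# Count capacity types and print the Spot percentage
echo "$sample" | awk '{count[$NF]++; total++}
  END {printf "spot: %d%%\n", 100*count["spot"]/total}'
```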

Real Numbers

Here is what this looks like for a typical Series A startup:

| Before | After |
| --- | --- |
| 15x m5.xlarge On-Demand 24/7 | Karpenter mix: 70% Spot, 30% On-Demand |
| ~$3,200/month compute | ~$900/month compute |
| Manual scaling | Auto-scaling in <30s |
| Fixed instance types | Best available from c5, m5, m6i, r5 families |

That is ~$2,300/month saved - $27,600/year - without changing a line of application code.
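As a back-of-the-envelope sanity check, the capacity mix alone accounts for most but not all of that: with an assumed average Spot discount of 70%, a 70/30 mix cuts the bill to roughly half. The rest of the gap down to ~$900 comes from consolidation and right-sizing, which this simple calculation ignores:

```shell
# Blended-rate estimate (discount and mix are assumptions, not quotes)
awk 'BEGIN {
  before = 3200          # monthly On-Demand spend from the table above
  spot_share = 0.70      # target Spot fraction
  spot_discount = 0.70   # assumed average Spot discount vs On-Demand
  after = before * ((1 - spot_share) + spot_share * (1 - spot_discount))
  printf "blended estimate: $%.0f/month\n", after
}'
```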

Common Mistakes

Mistake 1: Not setting resource requests/limits. Karpenter sizes nodes based on pod resource requests. If your pods have no requests set, Karpenter cannot make accurate decisions and will either over-provision or leave pods pending.
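One way to audit this is `kubectl get pods` with custom columns, flagging any pod whose containers have no CPU request. A sketch of the pipeline - sample command output is inlined (hypothetical pod names) so it runs standalone:

```shell
# List pods and their CPU requests:
#   kubectl get pods -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,CPU:.spec.containers[*].resources.requests.cpu'
# Sample output captured below; "<none>" means no CPU request is set.
sample='NS        NAME       CPU
default   api-1      500m
default   worker-1   <none>
default   api-2      500m'

# Flag pods Karpenter cannot size nodes for
echo "$sample" | awk 'NR>1 && $3=="<none>" {print $2 " is missing CPU requests"}'
```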

Mistake 2: Using too few instance types. If you only allow m5.large, Spot availability becomes unpredictable. Open up 3–5 instance families and let Karpenter pick the cheapest available.
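Widening the pool is a single requirement in the NodePool spec. For example (the family list is illustrative; pick families that fit your workloads):

```yaml
# In the NodePool's spec.template.spec.requirements:
- key: karpenter.k8s.aws/instance-family
  operator: In
  values: ["c5", "c6i", "m5", "m6i", "r5", "r6i"]
```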

Mistake 3: Running replicated apps on Spot without a disruption budget. Set a PodDisruptionBudget to ensure at least one replica is always running during node drains:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: api
```

Mistake 4: Forgetting consolidation. Karpenter's consolidation policy removes underutilized nodes automatically. Enable WhenEmptyOrUnderutilized on Spot pools to avoid paying for nodes running at 10% utilization.

Getting Started

If you are starting from scratch: install Karpenter, create a Spot node pool and an On-Demand node pool, then migrate workloads over one deployment at a time. Start with your least critical services.

If you already have Cluster Autoscaler and node groups: Karpenter can coexist with CA during migration. Run them in parallel, migrate services to Karpenter-managed nodes, then remove the CA node groups once the migration is complete.


Want us to do this for your cluster? Book a free audit - we will review your current EKS setup and tell you exactly what the savings opportunity looks like.

RKSSH LLP
DevOps Engineer · rkssh.com

I help funded startups fix their CI/CD pipelines and Kubernetes infrastructure. If this post was useful and you want to talk through your specific situation, book a free 30-minute audit.
