The average funded startup running Kubernetes on EKS overpays by 40–60% on compute. The reason is almost always the same: everything runs on On-Demand instances, and the cluster is sized for peak load all the time.
Here is how to fix it without rewriting your applications.
Why Spot Instances Save So Much
AWS Spot Instances are spare EC2 capacity sold at a discount - typically 60–90% cheaper than On-Demand. The trade-off: AWS can reclaim them with a 2-minute warning. In Kubernetes, that warning triggers a graceful node drain. Your pods reschedule. Traffic continues.
For most workloads - web APIs, background workers, data pipelines - this interruption is invisible to users if handled correctly.
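To make the discount concrete, here is a back-of-the-envelope sketch. The hourly rate and discount are illustrative assumptions, not current AWS pricing - real Spot prices vary by region, AZ, and instance type:

```python
# Illustrative rates only - check the EC2 pricing page for real numbers.
ON_DEMAND_HOURLY = 0.192   # assumed m5.xlarge On-Demand rate, USD/hr
SPOT_DISCOUNT = 0.70       # assume a 70% discount, mid-range of 60-90%
HOURS_PER_MONTH = 730

on_demand_monthly = ON_DEMAND_HOURLY * HOURS_PER_MONTH
spot_monthly = on_demand_monthly * (1 - SPOT_DISCOUNT)

# At these assumed rates: roughly $140/mo vs $42/mo per instance
print(f"On-Demand: ${on_demand_monthly:.2f}/mo, Spot: ${spot_monthly:.2f}/mo")
```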
Why Karpenter Instead of Cluster Autoscaler
Cluster Autoscaler (CA) was the default for years. It works but has limitations:
- You define node groups upfront with fixed instance types
- Scaling decisions are reactive and slow (often 1–2 minutes to add capacity)
- Spot diversification requires multiple node groups and complex configuration
Karpenter is the modern replacement:
- Provisions nodes directly via the EC2 Fleet API - no node groups needed
- Selects the most cost-effective available instance type automatically
- Responds to pending pods in under 30 seconds
- Built-in Spot interruption handling via its SQS interruption queue
The Setup
1. Install Karpenter
```bash
# Karpenter v1 charts are published as OCI artifacts on ECR
# (the old charts.karpenter.sh Helm repo is deprecated)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --set settings.clusterName=$CLUSTER_NAME \
  --set settings.interruptionQueue=$SQS_QUEUE_NAME \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi
```
You need an IAM role for Karpenter with EC2 Fleet, EC2 Spot, and IAM permissions. Use the official Terraform module - do not hand-write this.
2. Create Node Pools
The key is creating separate node pools for different workload types:
```yaml
# Spot pool for stateless workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  template:
    metadata:
      labels:
        node-type: spot
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]  # compute, general-purpose, memory
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]  # modern instance families only
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
---
# On-Demand pool for stateful / critical workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-critical
spec:
  template:
    metadata:
      labels:
        node-type: on-demand
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```
3. Configure Workload Placement
Tell each deployment which pool it belongs in:
```yaml
# Stateless API - runs on Spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      nodeSelector:
        node-type: spot
      tolerations:
        - key: karpenter.sh/capacity-type
          value: spot
          effect: NoSchedule
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: your-api:latest
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
```
The topologySpreadConstraints section is critical - it spreads pods across availability zones, so a Spot interruption in one AZ does not take down all your replicas.
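To see what maxSkew: 1 actually enforces, here is the arithmetic the scheduler applies, sketched outside Kubernetes: skew is the gap between the most- and least-populated eligible zones, and a placement that would push skew above the limit is rejected under whenUnsatisfiable: DoNotSchedule.

```python
def zone_skew(pod_zones, all_zones):
    """Skew = max pods in any eligible zone minus min pods in any eligible zone."""
    counts = {z: 0 for z in all_zones}
    for z in pod_zones:
        counts[z] += 1
    return max(counts.values()) - min(counts.values())

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

# One replica per AZ: skew 0, satisfies maxSkew: 1
even = zone_skew(["us-east-1a", "us-east-1b", "us-east-1c"], zones)

# Two replicas piled into one AZ: skew 2, violates maxSkew: 1
uneven = zone_skew(["us-east-1a", "us-east-1a", "us-east-1b"], zones)

print(even, uneven)  # 0 2
```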
4. Handle Spot Interruptions Gracefully
Install the AWS Node Termination Handler to process Spot interruption notices before AWS reclaims the node:
```bash
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true \
  --set enableSqsTerminationDraining=true \
  --set queueURL=$SQS_QUEUE_URL
```
With this in place, when AWS signals a Spot interruption:
- The handler cordons the node (no new pods are scheduled onto it)
- It drains the node (existing pods are gracefully evicted)
- Karpenter provisions a replacement node in parallel
- Your pods reschedule - typically within 60–90 seconds
For most web APIs, this is below the timeout threshold of upstream load balancers. Users see nothing.
What to Keep On-Demand
Not everything should run on Spot. Keep On-Demand for:
- Stateful workloads (databases, if you run them in K8s - which you probably should not, but if you do)
- System-critical pods - cert-manager, external-dns, Karpenter itself
- Jobs that cannot be interrupted mid-run - billing processors, compliance audit jobs, one-off migrations
A reasonable target is 70–80% Spot, 20–30% On-Demand.
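The blended savings from that split can be sketched as follows. The rates below are assumptions for illustration; plug in your actual instance pricing:

```python
# Assumed rates for illustration - not current AWS pricing.
ON_DEMAND_HOURLY = 0.192   # e.g. m5.xlarge On-Demand
SPOT_HOURLY = 0.058        # assumed ~70% Spot discount on the same type
HOURS = 730
NODES = 15

def monthly_cost(spot_fraction):
    """Blended monthly cost for a fleet split between Spot and On-Demand."""
    spot_nodes = NODES * spot_fraction
    od_nodes = NODES - spot_nodes
    return HOURS * (spot_nodes * SPOT_HOURLY + od_nodes * ON_DEMAND_HOURLY)

all_od = monthly_cost(0.0)
blended = monthly_cost(0.75)   # 75% Spot, 25% On-Demand
print(f"All On-Demand: ${all_od:.0f}/mo, 75/25 blend: ${blended:.0f}/mo")
```

At these assumed rates the 75/25 blend roughly halves the bill from the mix alone; consolidation and right-sizing push the total savings further.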
Real Numbers
Here is what this looks like for a typical Series A startup:
| Before | After |
|---|---|
| 15x m5.xlarge On-Demand 24/7 | Karpenter mix: 70% Spot, 30% On-Demand |
| ~$3,200/month compute | ~$900/month compute |
| Manual scaling | Auto-scaling in <30s |
| Fixed instance types | Best available from c5, m5, m6i, r5 families |
That is ~$2,300/month saved - $27,600/year - without changing a line of application code.
Common Mistakes
Mistake 1: Not setting resource requests/limits. Karpenter sizes nodes based on pod resource requests. If your pods have no requests set, Karpenter cannot make accurate decisions and will either over-provision or leave pods pending.
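A pod with no requests looks to Karpenter like it needs zero resources. The sketch below shows the check to run against your pod specs - in practice you would feed it objects parsed from `kubectl get pods -o json`:

```python
def containers_missing_requests(pod_spec):
    """Return names of containers with no CPU or memory request set."""
    missing = []
    for container in pod_spec.get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests or "memory" not in requests:
            missing.append(container["name"])
    return missing

pod = {
    "containers": [
        {"name": "api",
         "resources": {"requests": {"cpu": "500m", "memory": "512Mi"}}},
        {"name": "sidecar", "resources": {}},  # no requests: invisible to Karpenter
    ]
}
print(containers_missing_requests(pod))  # ['sidecar']
```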
Mistake 2: Using too few instance types.
If you only allow m5.large, Spot availability becomes unpredictable. Open up 3–5 instance families and let Karpenter pick the cheapest available.
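The effect of widening requirements can be sketched as a filter over candidate instance types. The list and attributes below are a tiny illustrative sample, not live AWS data - the point is that broader categories mean a much deeper pool of Spot capacity to draw from:

```python
# (category, generation, name) - illustrative sample only, not live AWS data
CANDIDATES = [
    ("m", 5, "m5.large"), ("m", 6, "m6i.large"),
    ("c", 5, "c5.large"), ("c", 6, "c6i.large"),
    ("r", 5, "r5.large"), ("r", 6, "r6i.large"),
    ("t", 3, "t3.large"),
]

def eligible(categories, min_generation):
    """Mimic NodePool requirements: category In ..., generation Gt ..."""
    return [name for cat, gen, name in CANDIDATES
            if cat in categories and gen > min_generation]

print(len(eligible({"m"}, 2)))            # 2 candidates: m-family only
print(len(eligible({"c", "m", "r"}, 2)))  # 6 candidates: far better Spot odds
```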
Mistake 3: Running stateful apps on Spot without a disruption budget. Set a PodDisruptionBudget to ensure at least one replica is always running during node drains:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: api
```
Mistake 4: Forgetting consolidation.
Karpenter's consolidation policy removes underutilized nodes automatically. Enable WhenEmptyOrUnderutilized on Spot pools to avoid paying for nodes running at 10% utilization.
Getting Started
If you are starting from scratch: install Karpenter, create a Spot node pool and an On-Demand node pool, then migrate workloads over one deployment at a time. Start with your least critical services.
If you already have Cluster Autoscaler and node groups: Karpenter can coexist with CA during migration. Run them in parallel, migrate services to Karpenter-managed nodes, then remove the CA node groups once the migration is complete.
Want us to do this for your cluster? Book a free audit - we will review your current EKS setup and tell you exactly what the savings opportunity looks like.