The question has shifted. It used to be: "can we fine-tune a model?" Now it is: "how do we serve it to 10,000 users at acceptable latency without a $50,000/month GPU bill?"
This guide covers the production infrastructure side - Kubernetes, vLLM, GPU node provisioning, autoscaling, and request routing. Not the model training side.
## Why vLLM
vLLM is the de facto standard for serving open-weight LLMs in production. Its core innovation is PagedAttention - a memory management technique borrowed from operating system paging that allows the KV cache to be stored in non-contiguous blocks. The result:
- Up to 24× higher throughput than naive HuggingFace Transformers serving
- Higher GPU memory utilisation (more concurrent requests per GPU)
- OpenAI-compatible API - drop-in replacement for applications using the OpenAI client
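The paging idea can be sketched in a few lines of Python - a toy illustration of the concept, not vLLM's actual implementation. Each sequence's KV cache lives in fixed-size blocks drawn from a shared pool, so a sequence's memory need not be contiguous, and freed blocks are immediately reusable by other requests:

```python
class PagedKVCache:
    """Toy model of PagedAttention-style block allocation (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size         # tokens per physical block
        self.free = list(range(num_blocks))  # shared pool of physical block IDs
        self.tables = {}                     # seq_id -> list of block IDs
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # current block full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks - request must queue")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the pool straight away.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("req-1")        # 20 tokens span 2 blocks of 16
print(len(cache.tables["req-1"]))      # -> 2
cache.release("req-1")
print(len(cache.free))                 # -> 4
```

Contiguous pre-allocation would have to reserve space for the maximum sequence length up front; block-level allocation is what lets vLLM pack far more concurrent requests into the same VRAM.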
Most AI startups we work with use vLLM to serve Llama, Mistral, Qwen, or fine-tuned variants of these models.
## Prerequisites
You need:
- A Kubernetes cluster with GPU node support (EKS with g4dn/g5/p4d instances, GKE with A100/L4 nodes, or AKS with NC-series)
- NVIDIA GPU Operator installed on the cluster
- A model downloaded to a shared volume or accessible via S3/GCS
- At least one GPU with 16GB+ VRAM for 7B models (80GB for 70B models)
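A quick sanity check on those VRAM numbers: at 16-bit precision, model weights alone take roughly 2 bytes per parameter, and the KV cache, activations, and CUDA overhead come on top. Back-of-envelope only - note that a 70B model at fp16 needs ~140GB of weights, so fitting it on a single 80GB GPU implies 8-bit quantization or tensor parallelism across GPUs:

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Approximate VRAM for weights alone at fp16: 2 bytes per parameter."""
    return params_billion * 2  # 1e9 params x 2 bytes = 2 GB per billion params

print(fp16_weight_gb(7))    # -> 14  (why 16GB is a floor for 7B models)
print(fp16_weight_gb(70))   # -> 140 (fp16 70B does not fit one 80GB GPU)
```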
## Step 1: GPU Node Pool
On EKS, add a GPU node group:
```hcl
module "eks_gpu_nodegroup" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"

  name         = "gpu-inference"
  cluster_name = module.eks.cluster_name

  instance_types = ["g5.xlarge"] # 1x A10G GPU, 24GB VRAM - good for 7B models
  min_size       = 0
  max_size       = 10
  desired_size   = 1

  labels = {
    "node-type" = "gpu"
    "workload"  = "inference"
  }

  taints = {
    gpu = {
      key    = "nvidia.com/gpu"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }

  ami_type = "AL2_x86_64_GPU"
}
```
Install the NVIDIA GPU Operator:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true
```
Verify GPUs are visible:
```bash
kubectl get nodes -l node-type=gpu
kubectl describe node <gpu-node> | grep -A5 "Allocatable:"
# Should show: nvidia.com/gpu: 1
```
## Step 2: Model Storage
Store your model on S3 and use an init container to pull it, or use a PersistentVolume backed by EFS/NFS for shared access across replicas.
For EFS-based model storage (recommended for multi-replica setups):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: inference
spec:
  accessModes:
    - ReadWriteMany # EFS supports multi-pod access
  storageClassName: efs-sc
  resources:
    requests:
      storage: 50Gi
```
Pre-populate the model:
```bash
# One-time: download model to PVC via a temporary pod
kubectl run model-loader \
  --image=python:3.11 \
  --overrides='{"spec":{"volumes":[{"name":"models","persistentVolumeClaim":{"claimName":"model-storage"}}],"containers":[{"name":"model-loader","image":"python:3.11","command":["bash","-c","pip install huggingface_hub && python -c \"from huggingface_hub import snapshot_download; snapshot_download(repo_id='"'"'meta-llama/Llama-3.1-8B-Instruct'"'"', local_dir='"'"'/models/llama-3.1-8b'"'"')\""],"volumeMounts":[{"name":"models","mountPath":"/models"}]}]}}' \
  --restart=Never
```
## Step 3: vLLM Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        node-type: gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi # shared memory for tensor parallel
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3.1-8b
            - --host=0.0.0.0
            - --port=8000
            - --tensor-parallel-size=1 # increase for multi-GPU
            - --max-model-len=8192
            - --max-num-batched-tokens=32768
            - --gpu-memory-utilization=0.90
            - --served-model-name=llama-3.1-8b
          env:
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-secrets
                  key: api-key
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 32Gi
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: 24Gi
              cpu: "4"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120 # model loading takes time
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
```
Service and Ingress:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: inference
spec:
  selector:
    app: vllm
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  rules:
    - host: llm-api.yourdomain.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
```
## Step 4: Autoscaling with KEDA
Standard HPA cannot scale on GPU utilisation or request queue depth. Use KEDA (Kubernetes Event-Driven Autoscaling) with a Prometheus metric:
```bash
helm repo add kedacore https://kedacore.github.io/charts
helm upgrade --install keda kedacore/keda --namespace keda --create-namespace
```
Scale based on pending request queue (via Prometheus metric from vLLM):
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 5
  cooldownPeriod: 300 # seconds before scale-down (GPU nodes are expensive)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_pending_requests
        query: sum(vllm_request_waiting_total)
        threshold: "10" # scale up when 10+ requests waiting
```
## Step 5: Cost Controls
GPU instances are expensive. A g5.xlarge is ~$1/hour on On-Demand. Controls to implement:
**Scale to zero when idle:** Set `minReplicaCount: 0` in KEDA so GPU nodes scale down when no inference is happening. A new request then triggers a scale-up; budget a cold start of a few minutes for GPU node provisioning plus model loading.
**Use Spot GPU instances for non-latency-critical workloads:** For batch inference or background processing, Spot g5 instances save 60–70% versus On-Demand. Spot capacity can be reclaimed with two minutes' notice, so make sure the workload checkpoints its progress or tolerates restarts.
**Set request rate limits at the ingress layer:**
```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"
    nginx.ingress.kubernetes.io/limit-connections: "50"
```
## Calling the API
vLLM exposes an OpenAI-compatible API. Your application code needs zero changes if it already uses the OpenAI client:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-vllm-api-key",
    base_url="https://llm-api.yourdomain.com/v1",
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "user", "content": "Explain Kubernetes in one paragraph."}
    ],
    max_tokens=512,
    temperature=0.7,
)
```
## What This Costs
For a typical AI startup serving 1M tokens/day on Llama 3.1 8B:
- 1× g5.xlarge at ~$1/hr, running ~12 hours/day (KEDA scale-to-zero overnight)
- ~$360/month compute
- Plus EFS storage (~$15/month for a 50GB model)
- Plus data transfer
Compare to the OpenAI API. At ~$0.15/1M input + $0.60/1M output tokens (GPT-4o-mini-class pricing), 1M tokens/day works out to only ~$5–$18/month - an 8B model rarely beats the cheapest hosted tier on raw price. Against GPT-4o-class pricing (~$2.50/1M input + $10/1M output), the same volume runs $75–$300/month - and either way, a hosted API gives you no data privacy guarantees.
Self-hosted LLMs on Kubernetes make financial sense when you are displacing the larger hosted models or serving several million tokens/day - and make privacy sense from day one.
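The self-hosted side of that comparison is simple arithmetic - a sketch using the approximate, region-dependent rates quoted in this section:

```python
gpu_hourly_usd = 1.00   # g5.xlarge On-Demand, approximate
hours_per_day = 12      # KEDA scales to zero overnight
days_per_month = 30
efs_monthly_usd = 15    # ~50GB of model weights on EFS

compute = gpu_hourly_usd * hours_per_day * days_per_month
total = compute + efs_monthly_usd
print(f"compute ${compute:.0f}/mo, total before data transfer ${total:.0f}/mo")
# -> compute $360/mo, total before data transfer $375/mo
```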
Building an AI product and need production-grade LLM infrastructure? Book a free audit - we will scope the infrastructure for your specific model and traffic profile.