The question has shifted. It used to be: "can we fine-tune a model?" Now it is: "how do we serve it to 10,000 users at acceptable latency without a $50,000/month GPU bill?"
This guide covers the production infrastructure side - Kubernetes, vLLM, GPU node provisioning, autoscaling, and request routing. Not the model training side.
## Why vLLM
vLLM is the de facto standard for serving open-weight LLMs in production. Its core innovation is PagedAttention - a memory management technique borrowed from operating system paging that allows the KV cache to be stored in non-contiguous blocks. The result:
- Up to 24× higher throughput than naive HuggingFace Transformers serving
- Higher GPU memory utilisation (more concurrent requests per GPU)
- OpenAI-compatible API - drop-in replacement for applications using the OpenAI client
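The paging idea can be sketched in a few lines of Python - a toy illustration of the concept, not vLLM's actual implementation. Each sequence's KV cache lives in fixed-size blocks drawn from a shared pool, so a sequence's memory need not be contiguous, and freed blocks are immediately reusable by other requests:

```python
class PagedKVCache:
    """Toy model of PagedAttention-style block allocation (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size         # tokens per physical block
        self.free = list(range(num_blocks))  # shared pool of physical block IDs
        self.tables = {}                     # seq_id -> list of block IDs
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # current block full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks - request must queue")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the pool straight away.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("req-1")        # 20 tokens span 2 blocks of 16
print(len(cache.tables["req-1"]))      # -> 2
cache.release("req-1")
print(len(cache.free))                 # -> 4
```

Contiguous pre-allocation would have to reserve space for the maximum sequence length up front; block-level allocation is what lets vLLM pack far more concurrent requests into the same VRAM.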
Most AI startups we work with use vLLM to serve Llama, Mistral, Qwen, or fine-tuned variants of these models.
## Prerequisites
You need:
- A Kubernetes cluster with GPU node support (EKS with g4dn/g5/p4d instances, GKE with A100/L4 nodes, or AKS with NC-series)
- NVIDIA GPU Operator installed on the cluster
- A model downloaded to a shared volume or accessible via S3/GCS
- At least one GPU with 16GB+ VRAM for 7B models (80GB for 70B models)
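A quick sanity check on those VRAM numbers: at 16-bit precision, model weights alone take roughly 2 bytes per parameter, and the KV cache, activations, and CUDA overhead come on top. Back-of-envelope only - note that a 70B model at fp16 needs ~140GB of weights, so fitting it on a single 80GB GPU implies 8-bit quantization or tensor parallelism across GPUs:

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Approximate VRAM for weights alone at fp16: 2 bytes per parameter."""
    return params_billion * 2  # 1e9 params x 2 bytes = 2 GB per billion params

print(fp16_weight_gb(7))    # -> 14  (why 16GB is a floor for 7B models)
print(fp16_weight_gb(70))   # -> 140 (fp16 70B does not fit one 80GB GPU)
```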
## Step 1: GPU Node Pool
On EKS, add a GPU node group:
```hcl
module "eks_gpu_nodegroup" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"

  name         = "gpu-inference"
  cluster_name = module.eks.cluster_name

  instance_types = ["g5.xlarge"] # 1x A10G GPU, 24GB VRAM - good for 7B models
  min_size       = 0
  max_size       = 10
  desired_size   = 1

  labels = {
    "node-type" = "gpu"
    "workload"  = "inference"
  }

  taints = {
    gpu = {
      key    = "nvidia.com/gpu"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }

  ami_type = "AL2_x86_64_GPU"
}
```
Install the NVIDIA GPU Operator:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true
```
Verify GPUs are visible:
```bash
kubectl get nodes -l node-type=gpu
kubectl describe node <gpu-node> | grep -A5 "Allocatable:"
# Should show: nvidia.com/gpu: 1
```
## Step 2: Model Storage
Store your model on S3 and use an init container to pull it, or use a PersistentVolume backed by EFS/NFS for shared access across replicas.
For EFS-based model storage (recommended for multi-replica setups):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: inference
spec:
  accessModes:
    - ReadWriteMany # EFS supports multi-pod access
  storageClassName: efs-sc
  resources:
    requests:
      storage: 50Gi
```
Pre-populate the model:
```bash
# One-time: download model to PVC via a temporary pod
kubectl run model-loader \
  --image=python:3.11 \
  --overrides='{"spec":{"volumes":[{"name":"models","persistentVolumeClaim":{"claimName":"model-storage"}}],"containers":[{"name":"model-loader","image":"python:3.11","command":["bash","-c","pip install huggingface_hub && python -c \"from huggingface_hub import snapshot_download; snapshot_download(repo_id='"'"'meta-llama/Llama-3.1-8B-Instruct'"'"', local_dir='"'"'/models/llama-3.1-8b'"'"')\""],"volumeMounts":[{"name":"models","mountPath":"/models"}]}]}}' \
  --restart=Never
```
## Step 3: vLLM Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        node-type: gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi # shared memory for tensor parallel
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3.1-8b
            - --host=0.0.0.0
            - --port=8000
            - --tensor-parallel-size=1 # increase for multi-GPU
            - --max-model-len=8192
            - --max-num-batched-tokens=32768
            - --gpu-memory-utilization=0.90
            - --served-model-name=llama-3.1-8b
          env:
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-secrets
                  key: api-key
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 32Gi
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: 24Gi
              cpu: "4"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120 # model loading takes time
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
```
Service and Ingress:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: inference
spec:
  selector:
    app: vllm
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  rules:
    - host: llm-api.yourdomain.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
```
## Step 4: Autoscaling with KEDA
Standard HPA cannot scale on GPU utilisation or request queue depth. Use KEDA (Kubernetes Event-Driven Autoscaling) with a Prometheus metric:
```bash
helm repo add kedacore https://kedacore.github.io/charts
helm upgrade --install keda kedacore/keda --namespace keda --create-namespace
```
Scale based on pending request queue (via Prometheus metric from vLLM):
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 5
  cooldownPeriod: 300 # seconds before scale-down (GPU nodes are expensive)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_pending_requests
        query: sum(vllm_request_waiting_total)
        threshold: "10" # scale up when 10+ requests waiting
```
## Step 5: Cost Controls
GPU instances are expensive. A g5.xlarge is ~$1/hour on On-Demand. Controls to implement:
**Scale to zero when idle:** Set `minReplicaCount: 0` in KEDA so GPU nodes scale down when no inference is happening. A new request then triggers a scale-up; budget a cold start of a few minutes for GPU node provisioning plus model loading.
**Use Spot GPU instances for non-latency-critical workloads:** For batch inference or background processing, Spot g5 instances save 60–70% versus On-Demand. Spot capacity can be reclaimed with two minutes' notice, so make sure the workload checkpoints its progress or tolerates restarts.
**Set request rate limits at the ingress layer:**
```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"
    nginx.ingress.kubernetes.io/limit-connections: "50"
```
## Calling the API
vLLM exposes an OpenAI-compatible API. Your application code needs zero changes if it already uses the OpenAI client:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-vllm-api-key",
    base_url="https://llm-api.yourdomain.com/v1",
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "user", "content": "Explain Kubernetes in one paragraph."}
    ],
    max_tokens=512,
    temperature=0.7,
)
```
## What This Costs
For a typical AI startup serving 1M tokens/day on Llama 3.1 8B:
- 1× g5.xlarge at ~$1/hr, running ~12 hours/day (KEDA scale-to-zero overnight)
- ~$360/month compute
- Plus EFS storage (~$15/month for a 50GB model)
- Plus data transfer
Compare to the OpenAI API. At ~$0.15/1M input + $0.60/1M output tokens (GPT-4o-mini-class pricing), 1M tokens/day works out to only ~$5–$18/month - an 8B model rarely beats the cheapest hosted tier on raw price. Against GPT-4o-class pricing (~$2.50/1M input + $10/1M output), the same volume runs $75–$300/month - and either way, a hosted API gives you no data privacy guarantees.
Self-hosted LLMs on Kubernetes make financial sense when you are displacing the larger hosted models or serving several million tokens/day - and make privacy sense from day one.
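The self-hosted side of that comparison is simple arithmetic - a sketch using the approximate, region-dependent rates quoted in this section:

```python
gpu_hourly_usd = 1.00   # g5.xlarge On-Demand, approximate
hours_per_day = 12      # KEDA scales to zero overnight
days_per_month = 30
efs_monthly_usd = 15    # ~50GB of model weights on EFS

compute = gpu_hourly_usd * hours_per_day * days_per_month
total = compute + efs_monthly_usd
print(f"compute ${compute:.0f}/mo, total before data transfer ${total:.0f}/mo")
# -> compute $360/mo, total before data transfer $375/mo
```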
Building an AI product and need production-grade LLM infrastructure? Book a free audit - we will scope the infrastructure for your specific model and traffic profile.