AI SaaS: GPU Inference Infrastructure for a Document Processing Platform
A document intelligence startup was serving OCR and NLP models from a single A100 instance via a Flask app in a tmux session. P99 latency was 8.4 seconds, the instance cost $14,000/month, and a single bad request could crash the model server. We rebuilt the inference layer on EKS with vLLM, request batching, and autoscaling - cutting latency by 78% and cost by 60%.
The Challenge
The Flask inference server processed requests synchronously, one at a time, on a single A100. At the company's peak volume of 120 rps, the server was the primary bottleneck. The model was loaded fully into VRAM on every worker startup - if gunicorn spawned a second worker to handle a spike, the GPU ran out of VRAM and the server crashed. The team had tried Celery for async processing, but it only moved requests into a queue - adding latency without improving throughput.
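The crash mode is simple arithmetic: each synchronous worker holds its own full copy of the model, so VRAM demand grows linearly with worker count. A minimal sketch of that constraint - the sizes below are illustrative, not the startup's actual numbers:

```python
def fits_in_vram(workers: int, model_gb: float, vram_gb: float,
                 overhead_gb: float = 2.0) -> bool:
    """Each synchronous worker loads a full model copy plus some
    CUDA/runtime overhead; return whether the GPU can hold them all."""
    return workers * (model_gb + overhead_gb) <= vram_gb

# One worker fits on a 40 GB A100; a second copy of a large model does not.
print(fits_in_vram(1, 28.0, 40.0))  # True
print(fits_in_vram(2, 28.0, 40.0))  # False
```

This is exactly why "spawn more gunicorn workers" cannot scale a GPU-bound service: the second worker does not add throughput, it exhausts the device.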
The Approach
The fix was not more GPU instances - it was better utilisation of the one they had. vLLM with continuous batching meant the GPU processed multiple requests simultaneously instead of sequentially. We containerised the inference server, deployed it on EKS with a GPU node pool, and added KEDA autoscaling keyed on request queue depth. Total effort: 3 weeks.
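As a back-of-the-envelope model of why batching, not more hardware, was the fix: if a batched forward pass costs roughly the same wall time as a single-request pass, throughput scales with batch size. A deliberately simplified sketch - the 70 ms step time and 0.9 efficiency factor are illustrative assumptions, and real batching scales sub-linearly:

```python
def sequential_rps(step_s: float) -> float:
    """Throughput when each request occupies the GPU alone."""
    return 1.0 / step_s

def batched_rps(step_s: float, batch_size: int, efficiency: float = 1.0) -> float:
    """Throughput when batch_size requests share one forward pass;
    efficiency < 1 models the sub-linear scaling of real batching."""
    return batch_size * efficiency / step_s

# With a ~70 ms forward pass: ~14 rps sequentially, an order of
# magnitude more once a dozen-plus requests share each pass.
print(round(sequential_rps(0.07), 1))            # 14.3
print(round(batched_rps(0.07, 14, 0.9), 1))      # 180.0
```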
The Implementation
vLLM inference server with PagedAttention
We replaced the Flask+model.generate() pattern with vLLM's OpenAI-compatible server. Continuous batching, backed by PagedAttention's paged KV-cache management, let the GPU serve 12–18 concurrent requests per scheduling step instead of one at a time. The same A100 that maxed out at 14 rps with Flask served 180 rps with vLLM - a 12× throughput improvement on identical hardware.
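A sketch of the replacement client path, assuming a vLLM OpenAI-compatible server started with something like `vllm serve <model> --port 8000`. The endpoint URL and model id below are placeholders, not the startup's actual configuration:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
MODEL_ID = "my-org/doc-model-7b"                   # placeholder model id

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI-style completion payload; vLLM's scheduler merges payloads
    from concurrent callers into continuously batched forward passes."""
    return {"model": MODEL_ID, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0.0}

def complete(prompt: str) -> str:
    """POST to the vLLM server and return the generated text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Because the server speaks the OpenAI API, the application code needed no custom protocol - any OpenAI-compatible client works, and concurrency is handled server-side.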
EKS GPU node pool with NVIDIA operator
We containerised the vLLM server and deployed it on EKS with a g5.12xlarge node (4× A10G GPUs, 96GB VRAM total). The NVIDIA GPU Operator installed drivers and the device plugin automatically. We enabled vLLM's tensor parallel mode to shard the model across all four GPUs, reducing per-GPU memory pressure.
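Why tensor parallelism relieves per-GPU memory pressure, in rough numbers: sharding the weights divides each GPU's weight footprint by the TP degree, leaving more room for KV cache (and therefore more concurrent requests). A sketch assuming fp16 weights and ignoring KV cache and activations:

```python
def per_gpu_weight_gb(params_billion: float, tp_size: int,
                      bytes_per_param: int = 2) -> float:
    """Weight memory per GPU under tensor parallelism (fp16 = 2 bytes)."""
    total_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return total_gb / tp_size

# A 7B fp16 model: ~14 GB on one GPU, ~3.5 GB per GPU at TP=4 -
# most of each 24 GB A10G is then free for KV cache.
print(per_gpu_weight_gb(7, 1))  # 14.0
print(per_gpu_weight_gb(7, 4))  # 3.5
```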
Request queue and KEDA autoscaling
We introduced SQS as a request queue between the API and the inference service. KEDA scaled the vLLM Deployment on SQS queue depth - one replica at baseline, up to 3 at 50+ pending requests, and down to zero during overnight low-traffic windows. Karpenter then deprovisioned the idle GPU node, saving $4,200/month.
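KEDA's queue-depth scaler effectively targets ceil(queue_depth / queueLength) replicas, clamped between the configured minimum and maximum. A sketch of that decision logic - the target of 25 messages per replica and the min/max bounds here are illustrative, not the exact ScaledObject values:

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int = 25,
                     min_replicas: int = 0, max_replicas: int = 3) -> int:
    """Mirror KEDA's target calculation for a queue-depth scaler:
    ceil(depth / target), clamped to [min, max]; empty queue scales to min."""
    if queue_depth <= 0:
        return min_replicas
    return max(min_replicas,
               min(max_replicas, math.ceil(queue_depth / target_per_replica)))

print(desired_replicas(0))    # 0 - overnight scale-to-zero
print(desired_replicas(10))   # 1
print(desired_replicas(75))   # 3
```

Queue depth is what makes this work: unlike CPU or GPU utilisation, it measures demand the service has not yet absorbed, so scaling reacts before latency degrades.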
Model warm-up and health checks
Cold start on the 7B-parameter model takes 90 seconds. We added a pre-warming Job that runs before traffic shifts to a new deployment, so users never hit a cold start. The readinessProbe checks /health and passes only after the model is loaded - eliminating the previous pattern of requests arriving before the model was ready.
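A sketch of the readiness gate, assuming a small wrapper around the model server (class and method names are illustrative): /health returns 503 until the model has loaded and a warm-up generation has run, so Kubernetes never routes traffic to a cold pod.

```python
class ReadinessGate:
    """Tracks model load and warm-up state for a Kubernetes readinessProbe."""

    def __init__(self) -> None:
        self.model_loaded = False
        self.warmed_up = False

    def mark_loaded(self) -> None:
        """Called once model weights are resident in VRAM."""
        self.model_loaded = True

    def mark_warmed(self) -> None:
        """Called after the pre-warm Job's dummy request succeeds."""
        self.warmed_up = True

    def health(self) -> tuple[int, str]:
        """What GET /health would return: 200 only when fully ready."""
        if self.model_loaded and self.warmed_up:
            return 200, "ready"
        return 503, "loading"

gate = ReadinessGate()
print(gate.health())   # (503, 'loading') - pod gets no traffic yet
gate.mark_loaded()
gate.mark_warmed()
print(gate.health())   # (200, 'ready')
```

Gating on warm-up as well as load matters: the first generation after load often pays one-off compilation and cache-allocation costs that would otherwise land on a user request.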
Key Takeaways
- vLLM's continuous batching delivers 10–20× throughput improvement over naive synchronous serving on the same hardware - it is not optional for production LLM serving
- GPU scale-to-zero during low-traffic periods is the highest-ROI cost change for AI startups with predictable traffic patterns
- SQS as a queue between API and inference enables autoscaling that direct HTTP cannot - queue depth is a better scaling signal than CPU
- Model warm-up before traffic shift is non-negotiable - a cold deployment that receives requests while loading will fail under any real load
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.