AI SaaS: GPU Inference Infrastructure for a Document Processing Platform
A document intelligence startup was serving OCR and NLP models from a single A100 instance via a Flask app in a tmux session. P99 latency was 8.4 seconds, the instance cost $14,000/month, and a single bad request could crash the model server. We rebuilt the inference layer on EKS with vLLM, request batching, and autoscaling - cutting latency by 78% and cost by 60%.
The Challenge
The Flask inference server processed requests synchronously, one at a time, on a single A100. At the company's peak volume of 120 rps, the server was the primary bottleneck. The model was loaded fully into VRAM on every worker startup - if gunicorn spawned a second worker to handle a spike, the GPU ran out of VRAM and the server crashed. The team had tried Celery for async processing, but it only moved requests into a queue - adding latency without improving throughput.
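The crash mode is simple arithmetic: each synchronous worker holds its own full copy of the model, so VRAM demand grows linearly with worker count. A minimal sketch of that constraint - the sizes below are illustrative, not the startup's actual numbers:

```python
def fits_in_vram(workers: int, model_gb: float, vram_gb: float,
                 overhead_gb: float = 2.0) -> bool:
    """Each synchronous worker loads a full model copy plus some
    CUDA/runtime overhead; return whether the GPU can hold them all."""
    return workers * (model_gb + overhead_gb) <= vram_gb

# One worker fits on a 40 GB A100; a second copy of a large model does not.
print(fits_in_vram(1, 28.0, 40.0))  # True
print(fits_in_vram(2, 28.0, 40.0))  # False
```

This is exactly why "spawn more gunicorn workers" cannot scale a GPU-bound service: the second worker does not add throughput, it exhausts the device.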
The Approach
The fix was not more GPU instances - it was better utilisation of the one they had. vLLM with continuous batching meant the GPU processed multiple requests simultaneously instead of sequentially. We containerised the inference server, deployed it on EKS with a GPU node pool, and added KEDA autoscaling keyed on request queue depth. Total effort: 3 weeks.
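As a back-of-the-envelope model of why batching, not more hardware, was the fix: if a batched forward pass costs roughly the same wall time as a single-request pass, throughput scales with batch size. A deliberately simplified sketch - the 70 ms step time and 0.9 efficiency factor are illustrative assumptions, and real batching scales sub-linearly:

```python
def sequential_rps(step_s: float) -> float:
    """Throughput when each request occupies the GPU alone."""
    return 1.0 / step_s

def batched_rps(step_s: float, batch_size: int, efficiency: float = 1.0) -> float:
    """Throughput when batch_size requests share one forward pass;
    efficiency < 1 models the sub-linear scaling of real batching."""
    return batch_size * efficiency / step_s

# With a ~70 ms forward pass: ~14 rps sequentially, an order of
# magnitude more once a dozen-plus requests share each pass.
print(round(sequential_rps(0.07), 1))            # 14.3
print(round(batched_rps(0.07, 14, 0.9), 1))      # 180.0
```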
The Implementation
vLLM inference server with PagedAttention
We replaced the Flask+model.generate() pattern with vLLM's OpenAI-compatible server. Continuous batching, backed by PagedAttention's paged KV-cache management, let the GPU serve 12–18 concurrent requests per scheduling step instead of one at a time. The same A100 that maxed out at 14 rps with Flask served 180 rps with vLLM - a 12× throughput improvement on identical hardware.
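A sketch of the replacement client path, assuming a vLLM OpenAI-compatible server started with something like `vllm serve <model> --port 8000`. The endpoint URL and model id below are placeholders, not the startup's actual configuration:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
MODEL_ID = "my-org/doc-model-7b"                   # placeholder model id

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI-style completion payload; vLLM's scheduler merges payloads
    from concurrent callers into continuously batched forward passes."""
    return {"model": MODEL_ID, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0.0}

def complete(prompt: str) -> str:
    """POST to the vLLM server and return the generated text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Because the server speaks the OpenAI API, the application code needed no custom protocol - any OpenAI-compatible client works, and concurrency is handled server-side.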
EKS GPU node pool with NVIDIA operator
We containerised the vLLM server and deployed it on EKS with a g5.12xlarge node (4× A10G GPUs, 96GB VRAM total). The NVIDIA GPU Operator installed drivers and the device plugin automatically. We enabled vLLM's tensor parallel mode to shard the model across all four GPUs, reducing per-GPU memory pressure.
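Why tensor parallelism relieves per-GPU memory pressure, in rough numbers: sharding the weights divides each GPU's weight footprint by the TP degree, leaving more room for KV cache (and therefore more concurrent requests). A sketch assuming fp16 weights and ignoring KV cache and activations:

```python
def per_gpu_weight_gb(params_billion: float, tp_size: int,
                      bytes_per_param: int = 2) -> float:
    """Weight memory per GPU under tensor parallelism (fp16 = 2 bytes)."""
    total_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return total_gb / tp_size

# A 7B fp16 model: ~14 GB on one GPU, ~3.5 GB per GPU at TP=4 -
# most of each 24 GB A10G is then free for KV cache.
print(per_gpu_weight_gb(7, 1))  # 14.0
print(per_gpu_weight_gb(7, 4))  # 3.5
```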
Request queue and KEDA autoscaling
We introduced SQS as a request queue between the API and the inference service. KEDA scaled the vLLM Deployment on SQS queue depth - one replica at baseline, up to 3 at 50+ pending requests, and down to zero during overnight low-traffic windows. Karpenter then deprovisioned the idle GPU node, saving $4,200/month.
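KEDA's queue-depth scaler effectively targets ceil(queue_depth / queueLength) replicas, clamped between the configured minimum and maximum. A sketch of that decision logic - the target of 25 messages per replica and the min/max bounds here are illustrative, not the exact ScaledObject values:

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int = 25,
                     min_replicas: int = 0, max_replicas: int = 3) -> int:
    """Mirror KEDA's target calculation for a queue-depth scaler:
    ceil(depth / target), clamped to [min, max]; empty queue scales to min."""
    if queue_depth <= 0:
        return min_replicas
    return max(min_replicas,
               min(max_replicas, math.ceil(queue_depth / target_per_replica)))

print(desired_replicas(0))    # 0 - overnight scale-to-zero
print(desired_replicas(10))   # 1
print(desired_replicas(75))   # 3
```

Queue depth is what makes this work: unlike CPU or GPU utilisation, it measures demand the service has not yet absorbed, so scaling reacts before latency degrades.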
Model warm-up and health checks
Cold start on the 7B-parameter model takes 90 seconds. We added a pre-warming Job that runs before traffic shifts to a new deployment, so users never hit a cold start. The readinessProbe checks /health and passes only after the model is loaded - eliminating the previous pattern of requests arriving before the model was ready.
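A sketch of the readiness gate, assuming a small wrapper around the model server (class and method names are illustrative): /health returns 503 until the model has loaded and a warm-up generation has run, so Kubernetes never routes traffic to a cold pod.

```python
class ReadinessGate:
    """Tracks model load and warm-up state for a Kubernetes readinessProbe."""

    def __init__(self) -> None:
        self.model_loaded = False
        self.warmed_up = False

    def mark_loaded(self) -> None:
        """Called once model weights are resident in VRAM."""
        self.model_loaded = True

    def mark_warmed(self) -> None:
        """Called after the pre-warm Job's dummy request succeeds."""
        self.warmed_up = True

    def health(self) -> tuple[int, str]:
        """What GET /health would return: 200 only when fully ready."""
        if self.model_loaded and self.warmed_up:
            return 200, "ready"
        return 503, "loading"

gate = ReadinessGate()
print(gate.health())   # (503, 'loading') - pod gets no traffic yet
gate.mark_loaded()
gate.mark_warmed()
print(gate.health())   # (200, 'ready')
```

Gating on warm-up as well as load matters: the first generation after load often pays one-off compilation and cache-allocation costs that would otherwise land on a user request.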
Key Takeaways
- vLLM's continuous batching delivers 10–20× throughput improvement over naive synchronous serving on the same hardware - it is not optional for production LLM serving
- GPU scale-to-zero during low-traffic periods is the highest-ROI cost change for AI startups with predictable traffic patterns
- SQS as a queue between API and inference enables autoscaling that direct HTTP cannot - queue depth is a better scaling signal than CPU
- Model warm-up before traffic shift is non-negotiable - a cold deployment that receives requests while loading will fail under any real load
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.