AI SaaS: GPU Inference Infrastructure for a Document Processing Platform
A document intelligence startup was serving OCR and NLP models from a single A100 instance via a Flask app in a tmux session. P99 latency was 8.4 seconds, the instance cost $14,000/month, and a single bad request could crash the model server. We rebuilt the inference layer on EKS with vLLM, request batching, and autoscaling - cutting latency by 78% and cost by 60%.
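The request batching mentioned above can be sketched as a server-side micro-batcher: instead of running one GPU forward pass per request, incoming requests are queued briefly and executed as a batch. This is a minimal illustrative sketch, not the platform's actual code; the `MicroBatcher` class, its parameters, and the `infer_fn` callback are all hypothetical names chosen for this example (vLLM provides its own continuous batching internally).

```python
import threading
import queue
import time


class MicroBatcher:
    """Collects concurrent requests into batches so the model runs one
    forward pass per batch instead of one per request (hypothetical sketch)."""

    def __init__(self, infer_fn, max_batch=8, max_wait_s=0.02):
        self.infer_fn = infer_fn      # callable: list of inputs -> list of outputs
        self.max_batch = max_batch    # flush when this many requests are queued
        self.max_wait_s = max_wait_s  # or when the oldest request has waited this long
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Called from request-handler threads; blocks until the batch runs."""
        done = threading.Event()
        slot = {"item": item, "done": done, "result": None}
        self.q.put(slot)
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            # Block for the first request, then gather more until the
            # batch is full or the wait deadline passes.
            slots = [self.q.get()]
            deadline = time.monotonic() + self.max_wait_s
            while len(slots) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    slots.append(self.q.get(timeout=remaining))
                except queue.Empty:
                    break
            # One batched inference call, then fan results back out.
            results = self.infer_fn([s["item"] for s in slots])
            for slot, result in zip(slots, results):
                slot["result"] = result
                slot["done"].set()
```

Batching trades a small amount of queueing delay (`max_wait_s`) for much higher GPU utilization, which is where most of the latency and cost reduction in a setup like this typically comes from.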
| Metric | Before | After |
| --- | --- | --- |
| Deploy time | Manual restart in tmux, ~8 min, risky | Rolling deploy with warmup, ~6 min, automated |
| Deploy frequency | Avoided deploys (fear of downtime) | Weekly model updates |
| P99 latency | 8.4 s | 1.8 s |
| Monthly cost | $14K | $5.6K |