AI Startup: Production MLOps Platform in 6 Weeks
A computer vision startup had five models trained in Jupyter notebooks and served by ad-hoc scripts on a single GPU instance. Models were unversioned, retraining was manual, and two models had been silently degrading for weeks. We built a production MLOps platform from scratch in 6 weeks.
The Challenge
The ML team was strong. The infrastructure was not. Models were trained locally, artifacts were copied to S3 with names like model_final_v2_REAL.pkl, served by a Flask app in a screen session on an EC2 instance, and retrained when someone remembered to. The engineering team wanted to scale to 20+ models - something the current setup could not support.
The Approach
We scoped the engagement around three missing capabilities: experiment tracking and model versioning, a repeatable training pipeline, and production monitoring with drift detection. We built the infrastructure layer around the existing ML work without redesigning any models.
The Implementation
MLflow experiment tracking and model registry
We deployed MLflow on Kubernetes with a PostgreSQL backend and S3 artifact store. The data science team instrumented their training scripts in two days. The model registry replaced the S3 bucket naming convention with a proper versioning and staging system.
Kubeflow training pipelines
We built Kubeflow Pipelines for the three highest-frequency training jobs. Each pipeline pulls versioned training data, runs preprocessing as a containerised step, trains on a GPU node, evaluates against a validation set, and registers the model if it meets the quality threshold.
BentoML model serving
We replaced the Flask scripts with BentoML-packaged model servers deployed as Kubernetes Deployments. Each model runs as an independent service pulling the production-stage model from the MLflow registry on startup.
Evidently drift monitoring
A daily job computes drift metrics between the training distribution and live prediction inputs, alerting when thresholds are exceeded. Two models triggered alerts in the first week - both were retrained and redeployed within 48 hours.
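To make the idea concrete, here is one drift metric such a job computes - the population stability index - sketched in plain NumPy. This is an illustration of the statistic, not Evidently's implementation; the 0.2 alert threshold is a common rule of thumb, not a universal constant:

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training-time feature
    sample and live prediction inputs. Rule of thumb: PSI > 0.2
    suggests the live distribution has drifted enough to investigate."""
    # Bin edges come from the training (expected) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions, flooring to avoid log(0).
    e = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

Identical distributions score near zero; a shifted input distribution scores well above the alert threshold, which is exactly the signal that caught the two degraded models.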
Key Takeaways
- MLflow is the right first investment for any ML team - experiment tracking pays off immediately
- Training-serving skew is the hardest bug to debug and the easiest to prevent
- Drift monitoring found two underperforming models in the first week that the team had not noticed
- BentoML versioning means deployments are reproducible - the exact model artifact and dependencies are versioned together
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.