Production ML is fundamentally different from research ML. The models are the easy part — it's everything around them that determines success. After building and maintaining production ML systems across multiple organizations, we've distilled our approach into a set of principles that consistently produce reliable, maintainable pipelines.
Data Quality Gates
Every pipeline needs automated data quality checks before training data reaches your models. Without these gates, a single bad data batch can silently degrade model performance for weeks before anyone notices. We implement three layers of validation:
- Schema validation: Does the data match the expected format? Are required fields present? Are data types correct? This catches integration failures and upstream schema changes immediately.
- Statistical validation: Are feature distributions within expected ranges? Has the volume of incoming data changed dramatically? We use statistical tests to detect distribution shifts that might indicate upstream problems.
- Business logic validation: Do the values make sense in a business context? A negative revenue figure passes schema validation but should trigger an alert. These rules encode domain knowledge that statistical tests alone can't capture.
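The three layers can be sketched as plain validation functions that each return a list of errors. This is a minimal illustration using only the standard library; the field names, baseline statistics, and the 4-sigma threshold are assumptions you would replace with your own schema and tooling:

```python
# Sketch of a three-layer data quality gate. Field names ("revenue"),
# baseline stats, and thresholds are illustrative, not prescriptive.

def validate_schema(rows, required_fields, field_types):
    """Layer 1: every row has the required fields with the right types."""
    errors = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if field not in row:
                errors.append(f"row {i}: missing field '{field}'")
            elif not isinstance(row[field], field_types[field]):
                errors.append(f"row {i}: '{field}' has wrong type")
    return errors

def validate_statistics(values, baseline_mean, baseline_std, z_threshold=4.0):
    """Layer 2: flag a batch whose mean drifts far from the baseline."""
    mean = sum(values) / len(values)
    z = abs(mean - baseline_mean) / (baseline_std or 1e-9)
    return [] if z < z_threshold else [f"batch mean {mean:.2f} is {z:.1f} sigma off baseline"]

def validate_business_rules(rows):
    """Layer 3: domain rules -- revenue can never be negative."""
    return [f"row {i}: negative revenue {row['revenue']}"
            for i, row in enumerate(rows) if row["revenue"] < 0]

batch = [{"revenue": 120.0}, {"revenue": -5.0}]
errors = (validate_schema(batch, ["revenue"], {"revenue": float})
          + validate_statistics([r["revenue"] for r in batch],
                                baseline_mean=100.0, baseline_std=30.0)
          + validate_business_rules(batch))
```

Note the ordering: the negative revenue here passes both schema and statistical checks, which is exactly why the third layer exists.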
Model Monitoring and Drift Detection
Drift detection, performance tracking, and automated alerting are non-negotiable in production. Models degrade over time as the real world changes, and without continuous monitoring you're flying blind.
We track four categories of drift:
- Feature drift: Changes in input data distributions that may indicate upstream issues or genuine shifts in the population being served.
- Prediction drift: Changes in the distribution of model outputs, even when inputs appear stable. This often catches subtle model degradation early.
- Concept drift: Changes in the underlying relationship between features and targets. This is the hardest to detect and usually requires ground-truth labels to confirm.
- Performance drift: Degradation in the business KPIs the model is meant to improve. This is the ultimate measure but often has the longest feedback loop.
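Feature and prediction drift are the two categories you can compute without labels. One common way to score them is the Population Stability Index (PSI), comparing a serving window against a reference window. The implementation below is a self-contained sketch; the 10-bucket binning and the 0.2 alert threshold are common conventions, not the only valid choices:

```python
# Minimal Population Stability Index (PSI) for feature/prediction drift.
# PSI near 0 means stable; values above ~0.2 are a common alert threshold.
import math

def psi(expected, actual, buckets=10):
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(buckets + 1)]
    edges[-1] = float("inf")  # catch serving values above the reference max

    def fractions(values):
        counts = [0] * buckets
        for v in values:
            for b in range(buckets):
                if v < edges[b + 1]:
                    counts[b] += 1
                    break
        # Smooth empty buckets so the log term stays finite.
        return [(c or 0.5) / len(values) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(1000)]   # reference window
live = [v + 3 for v in baseline]            # serving window, shifted upward
drift_score = psi(baseline, live)
```

Run the same function over model outputs instead of a feature column and you have a prediction-drift check; concept and performance drift still need labels or KPI data to confirm.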
Versioning Everything
In production ML, you need to reproduce any historical prediction. That means versioning not just the model weights, but the training data, feature engineering code, hyperparameters, and serving configuration. We use a combination of Git for code, DVC or similar tools for data and model artifacts, and experiment tracking platforms to link everything together into a reproducible lineage.
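Tools like DVC and experiment trackers handle this for real, but the core idea fits in a few lines: a manifest that pins the code commit, a content hash of the data, the hyperparameters, and a hash of the resulting artifact. The field names below are illustrative:

```python
# Sketch of a training-run manifest linking code, data, config, and model.
# hashlib + json only; DVC/MLflow-style tools do this with far more rigor.
import hashlib
import json

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_manifest(git_commit, data_blob, hyperparams, model_blob):
    return {
        "git_commit": git_commit,            # feature/training code version
        "data_sha256": sha256_of(data_blob), # exact training data snapshot
        "hyperparams": hyperparams,
        "model_sha256": sha256_of(model_blob),
    }

manifest = build_manifest("abc1234", b"training data...", {"lr": 0.01}, b"weights...")
print(json.dumps(manifest, indent=2))
```

Store the manifest alongside the model artifact and any historical prediction can be traced back to the exact data, code, and config that produced it.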
Automated Retraining
Manual retraining doesn't scale. We implement automated retraining pipelines that trigger based on drift detection thresholds or scheduled intervals. Every retrained model goes through the same validation suite — automated tests on held-out data, performance benchmarks against the current production model, and canary deployments before full rollout.
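The two decision points in that pipeline, when to retrain and whether to promote, reduce to small guard functions. Every threshold below is an assumption to tune per model; the metric is assumed higher-is-better (e.g. AUC):

```python
# Illustrative retraining trigger and promotion gate; all thresholds
# (PSI cutoff, max model age, metric floor) are assumptions to tune.

def needs_retrain(feature_psi, days_since_train,
                  psi_threshold=0.2, max_age_days=30):
    """Retrain on drift or on a scheduled interval, whichever fires first."""
    return feature_psi > psi_threshold or days_since_train >= max_age_days

def should_promote(candidate_metric, production_metric,
                   min_gain=0.0, abs_floor=0.70):
    """Promote only if the candidate beats production on held-out data
    and clears an absolute quality floor (metric higher-is-better)."""
    return (candidate_metric >= abs_floor
            and candidate_metric - production_metric >= min_gain)
```

A candidate that passes `should_promote` would still go through a canary rollout before taking full traffic.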
Incident Response for ML Systems
When a model fails in production, the response needs to be as structured as any other incident. We help teams build ML-specific runbooks covering common failure modes: data pipeline outages, model performance degradation, serving infrastructure failures, and the decision framework for when to roll back to a previous model version versus serving a rule-based fallback.
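The rollback-versus-fallback decision in those runbooks can be made explicit as a small decision helper. The signals and action names here are hypothetical, a sketch of the framework rather than a complete runbook:

```python
# Sketch of the runbook's rollback-vs-fallback decision. The inputs
# (recent metrics, pipeline health flag) and action names are illustrative.

def incident_action(current_metric, previous_metric, data_pipeline_healthy):
    if not data_pipeline_healthy:
        # Bad inputs poison every model version equally: no rollback
        # helps, so serve the rule-based fallback until data recovers.
        return "rule_based_fallback"
    if previous_metric > current_metric:
        # The previous version scores better on recent data: roll back.
        return "rollback_previous_model"
    # Current model is still the best available: keep serving, investigate.
    return "keep_current_and_investigate"
```

Encoding the decision this way keeps the 3 a.m. on-call engineer from improvising under pressure.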