Rob Colantuoni

August 15, 2017

Tags: Machine Learning, AI, Infrastructure

ML Is About to Leave the Lab

The infrastructure gap is the real story

I’ve been watching the machine learning space pretty closely over the past couple of years. The research is incredible – image recognition, NLP, reinforcement learning, all making huge strides. But the thing that keeps nagging at me isn’t the research. It’s the gap between what’s possible in a Jupyter notebook on a researcher’s laptop and what most organizations can actually put into production.

That gap is about to close. And when it does, the implications are going to be much bigger than most people expect.

The handoff problem

Here’s what ML looks like at most companies right now: a small data science team builds models in notebooks, trains them on whatever GPUs they managed to get budget for, evaluates against some held-out test set, and then… throws the model over the wall to an engineering team and says “productionize this.”

That handoff is where everything falls apart. The model was built in Python 3.5 with specific library versions that nobody documented. It expects input data in a format that doesn’t match the production schema. It was trained on a static dataset that’s already stale. There’s no monitoring for when the model’s predictions start drifting. No rollback mechanism. No way to A/B test a new model against the old one.

So most ML models either never make it to production, or they make it there once and then silently degrade until someone notices months later that the recommendations are garbage.

What’s missing

The missing piece is infrastructure. We need the same kind of tooling and discipline for ML models that we’ve built up over the years for regular software. I’ve started calling this the “MLOps” layer in conversations, though I’m not sure that term is going to stick. (Edit from the future: it stuck.)

You need reproducibility – version the data and the training environment, not just the code. You need continuous training pipelines so models get retrained on fresh data automatically. You need monitoring that goes beyond the usual latency and error rate stuff to track things like prediction distribution shift and feature drift. And you need a serving layer that can handle batching, caching, versioning, and graceful degradation. That last part is a systems engineering problem, not a data science problem.

Why I think this is about to change

A few things are converging. Cloud GPUs are getting accessible enough that you don’t need to buy your own hardware to train a model. Transfer learning means you can take a big pre-trained model and fine-tune it for your specific task with way less data than you’d need from scratch – which brings ML within reach of companies that don’t have Google-scale datasets. Containers are making the deployment story tractable (if you can package your model and its dependencies in a Docker image, you’ve solved half the ops problem). And TensorFlow and PyTorch are maturing fast enough that the gap between “research code” and “production code” is shrinking.
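The transfer learning point is worth making concrete. The recipe is: keep the pretrained model's feature extractor frozen, and train only a small task-specific head on your (small) dataset. Here's a toy, dependency-free sketch of that idea — the `backbone` function, the dataset, and the training loop are all invented for illustration; in practice the backbone would be a real pretrained network with its weights frozen.

```python
import math

def backbone(x):
    """Stand-in for a frozen pretrained feature extractor.

    In a real setup this would be a large network's penultimate
    layer (weights frozen); here it's just a fixed nonlinear map.
    """
    return [math.tanh(x), math.tanh(2 * x - 1)]

# A tiny labeled dataset for the new task -- the point of transfer
# learning is that this can be far smaller than what training the
# whole model from scratch would require.
data = [(x / 10.0, 1 if x > 5 else 0) for x in range(11)]

# Fine-tune only the small task-specific head (logistic regression);
# the backbone's "weights" are never touched.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(2000):
    for x, y in data:
        f = backbone(x)                    # frozen features
        z = w[0] * f[0] + w[1] * f[1] + b  # head's linear score
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y                          # gradient of log loss w.r.t. z
        w = [wi - lr * g * fi for wi, fi in zip(w, f)]
        b -= lr * g

def predict(x):
    f = backbone(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0
```

Swap the toy backbone for a pretrained ImageNet model and the head for a dense layer or two, and this is structurally the fine-tuning workflow the frameworks support.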

My prediction

Five years from now, ML is going to be a standard part of the software engineering toolkit. Not every app will use it, but every org of any real scale will have models running in production. And the discipline of operating those models will be as established as DevOps is today.

The organizations that invest in ML infrastructure now are going to have a big advantage. Not because they’ll have better models – the algorithms are increasingly a commodity – but because they’ll be able to deploy, monitor, and iterate on their models faster and more reliably. The winners aren’t going to be the ones with the best papers. They’re going to be the ones with the best infrastructure.

A note for software engineers

If you’re a software engineer who’s been looking at ML from the sidelines, this is your on-ramp. The skills you need to build ML infrastructure – distributed systems, data pipelines, monitoring, deployment automation – are the same skills you already have. You don’t need a PhD. You need to understand systems. Start playing with TensorFlow Serving. Set up a training pipeline in containers. Build monitoring for model predictions. The tooling is rough right now, and that’s exactly where the opportunity is.
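As a starting point for that last suggestion, here's a minimal sketch of the versioning-and-rollback half of a serving layer. `ModelRegistry` and everything in it is hypothetical — it's not any real library's API, just an illustration of the shape of the problem.

```python
class ModelRegistry:
    """Toy sketch of a versioned serving layer: register model
    versions, promote one to live, and roll back if it misbehaves."""

    def __init__(self):
        self._versions = {}  # version name -> predict function
        self._history = []   # promotion order; the tail is live

    def register(self, name, predict_fn):
        self._versions[name] = predict_fn

    def promote(self, name):
        if name not in self._versions:
            raise KeyError(f"unknown model version: {name}")
        self._history.append(name)

    def rollback(self):
        # Rollback is just repointing at the previous promotion --
        # which is only possible because old versions are kept around.
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()

    @property
    def live(self):
        return self._history[-1]

    def predict(self, features, version=None):
        # Passing an explicit version lets you shadow-test a candidate
        # model against the live one on the same input.
        name = version if version is not None else self.live
        return self._versions[name](features)
```

A real serving layer adds batching, caching, and traffic splitting on top, but the core discipline — every version addressable, the live pointer mutable, rollback cheap — is the part most ML deployments are missing today.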