MLOps, or DevOps for Machine Learning

Over the past year or so, I've been spending a lot of time in the world of machine learning and data science. Not as a data scientist, but as a DevOps practitioner*.

I wanted to know -

  • How do machine learning projects work?
  • Who is involved?
  • What is the process like from idea through to running in production?
  • Most importantly - can DevOps help?

I want to learn all the things

*As an aside, I tried about 4 different words before "practitioner". It feels so businessey. But here we are.

Learning Machine Learning

When I first encountered any kind of machine learning, it was at university in the form of "deep neural networks". It wasn't my project, but another student's.

He had great success with it, so I spoke to him about what was involved. After five minutes, I was lost. It felt so inaccessible.

Non-linear activation functions, multi-dimensional calculus, stochastic gradient descent... I would have spent as much time with a dictionary as with a computer.

What even is this I'm so confused

So, as any good student would do... I dropped it and started my career writing Visual Basic for the government.

Fast forward a few years, and I'm working at Microsoft as a Cloud Advocate. I'm here because I've apparently convinced the company that I know about DevOps and can communicate those ideas well. Sure. I'll take it.

The great thing about being in this role is I have access to fantastic experts in other fields, who are great at communicating. Enter Seth Juarez.

If I had to put my knowledge of AI and ML on a scale against Seth's, it would look a bit like this:

Damo vs Seth - knowing stuff about ML

It's taken a while, and a lot of experimentation and conversations, but I feel like I've got a handle on what's involved in a machine learning project.

I mean, you probably shouldn't hire me to build a predictive model for you just yet, but I'm no longer completely clueless.

About that process

One of the things that all these conversations taught me was that the majority of ML projects don't have a great story when it comes to DevOps.

But let's back up a bit - isn't DevOps traditionally used to help deliver software?

Yes. And machine learning is software.

It's true. ML is software.

"MLOps", or "Machine Learning Operations", seems to be the term that much of the industry has landed on. I don't love it - it implies a narrow focus on operationalizing the results of a machine learning project and not the broader idea of creating value - but if that's the term used, I'll play along.

I really prefer "DevOps for Machine Learning". The reason? "DevOps" can absolutely apply to machine learning projects - without even stretching the definition.

DevOps

Let's quickly look at Microsoft's DevOps definition:

DevOps is the union of people, process, and products to enable continuous delivery of value to our end users

The important part of this definition is the word "value". DevOps isn't about features or bug fixes or infrastructure as code, it's about value. ML is software that provides value.

For example, if you're working on producing a predictive model for evaluating lending risk for a bank, what's more valuable than improving the accuracy of that model?

So valuable! High Five!

Back to the process

DevOps is full of great practices and patterns, and most of these can be applied to ML projects.

For me, the main differences between a machine learning project and a software development project come down to the following two distinctions:

There's a lot more experimentation

When writing a new feature in a piece of software, you generally know what you need to do up front.

For example, for a new "customer orders" report, you know there's probably a database query, some data processing logic, and a few screens or pages of some kind of reporting UI. There's a direction right away, even if you may have to try a few things.

ML projects often involve a lot more experimentation. While the end goal is generally known, the path to that goal is often far less clear.

So many options it's hard to know which way to go

There are countless algorithms you could use and limitless ways to transform the data and select features. Undiscovered outliers, biases in the data, or poorly-tuned hyperparameters could push you off course for weeks.

Experience is important, and great data scientists have a feel for what will work, what to look for, and how to solve these tricky problems.

Data science is a specialist art form.

Data science is an art

All that said, the actual execution of these experiments is ripe for automation.

The training, scoring, evaluation and testing are all things that require scripts and processing. And DevOps helps us with this.
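As a rough illustration of what "scripted evaluation" could look like, here is a minimal sketch of a quality gate that a pipeline might run after training. The function names (`accuracy`, `passes_gate`) and the `min_gain` threshold are made up for this example, and a real project would likely use richer metrics than plain accuracy.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def passes_gate(candidate_accuracy, production_accuracy, min_gain=0.0):
    """Hypothetical gate: promote the candidate only if it is at least
    as accurate as the production model plus a required margin."""
    return candidate_accuracy >= production_accuracy + min_gain


if __name__ == "__main__":
    candidate = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
    print("candidate accuracy:", candidate)
    print("promote?", passes_gate(candidate, production_accuracy=0.70))
```

Because the gate is just code, it can run unattended on every training run and fail the pipeline when a candidate model is worse than what is already deployed.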

Training runs are like really expensive builds

When we build our software on a build server, it's usually a process measured in minutes. Maybe hours if we're unlucky. That means CI is definitely something we should do.

Producing a predictive model on an ML project might take weeks.

Each training run may require many thousands of compute hours on expensive GPU-laden machines and cost a senior developer's salary each time.

Training can be expensive

So you may not want to do a full training run for every small change (like you would with continuous integration). You may want to do a run with a subset of data to prove the idea first. Or maybe you just want to run tests or linting (it is code after all)?
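One way to picture that trade-off is a small policy that picks the cheapest useful check for each change. This is only a sketch under assumptions: the `train/` and `data/` folder names, the run labels, and the `merged_to_main` flag are all hypothetical and not taken from any particular tool.

```python
import random


def sample_subset(rows, fraction, seed=0):
    """Return a reproducible random subset of rows for a quick trial run."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed so reruns see the same subset
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


def choose_run(changed_files, merged_to_main=False):
    """Hypothetical policy: keep expensive full training for merged changes."""
    if merged_to_main:
        return "full-training"
    touches_training = any(
        f.startswith(("train/", "data/")) for f in changed_files
    )
    return "subset-training" if touches_training else "lint-and-tests"


if __name__ == "__main__":
    rows = list(range(1000))
    print(len(sample_subset(rows, 0.5)), "rows in the trial subset")
    print(choose_run(["train/model.py"]))
    print(choose_run(["README.md"]))
```

The fixed seed matters more than it looks: if the subset changes between runs, it gets hard to tell whether a code change or a data change moved the result.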

Even so, the idea of a "training run" as a parallel to a "build" makes sense to me.

A build is essentially a compute-intensive process that produces an artifact that can be deployed or used by other software.

A training run is the same - except it's also very data-intensive.

The same DevOps techniques around CI, pull request builds, and even security, performance, and load testing can apply.

Same same, but different

The skills required to produce a good result may be very different. The cadence of work may be slower at first, and less cleanly broken down. The tools used day to day might be very different.

But ultimately it's all writing code, running automated tasks, and deploying the artifacts safely.

If it's not all in code and in source control, it should be!

If model training requires manual work, it should be automated!

If deployment is manual and scary, it should be automated and gradual!
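To make "gradual" a bit more concrete, here is a minimal sketch of a canary-style rollout, where a small, stable slice of requests is sent to the new model. The names `pick_model`, `"candidate"`, and `"current"` are invented for illustration. Managed platforms usually offer traffic splitting as a built-in feature, so this is about the idea rather than something you would need to hand-roll.

```python
import hashlib


def pick_model(request_id: str, canary_percent: int) -> str:
    """Route a stable percentage of requests to the candidate model.

    Hashing the request ID (rather than rolling a random number) means
    the same ID always lands on the same model version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "candidate" if bucket < canary_percent else "current"


if __name__ == "__main__":
    ids = [f"user-{i}" for i in range(1000)]
    share = sum(pick_model(i, 10) == "candidate" for i in ids) / len(ids)
    print(f"share of traffic on the candidate model: {share:.1%}")
```

Raising `canary_percent` in small steps, while watching the model's metrics, is what turns a scary deployment into a boring one.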

Finally, if the project is considered "done" as soon as the model is in production, the model can get stale and stop delivering value. There's a great opportunity to continue delivering value by using telemetry, monitoring, and ideally automated retraining strategies.
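A retraining trigger can start out very simple. The sketch below assumes you already collect some accuracy measurements from production. The function name `needs_retraining` and the `tolerance` value are placeholders for illustration, not a recommendation.

```python
from statistics import mean


def needs_retraining(baseline_accuracy, recent_accuracies, tolerance=0.02):
    """Return True when mean recent accuracy has dropped more than
    `tolerance` below the accuracy measured at deployment time."""
    if not recent_accuracies:
        return False  # no telemetry yet, so nothing to act on
    return baseline_accuracy - mean(recent_accuracies) > tolerance


if __name__ == "__main__":
    print(needs_retraining(0.90, [0.91, 0.89, 0.90]))
    print(needs_retraining(0.90, [0.85, 0.84, 0.86]))
```

A check like this could run on a schedule and kick off the same automated training pipeline used during development.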

Apply DevOps practices to your ML projects. Work on delivering value more effectively on a continuous basis.

Friends don't let friends deploy models manually

Find out more

There will be more blog posts drilling into some of these ideas in much more detail.

I've been speaking about MLOps or DevOps for Machine Learning at a few events recently, and there are more to come.

You can find out what Microsoft has been doing in the MLOps space on our Azure Machine Learning docs pages. We have some amazing tooling to help solve a lot of MLOps problems, and there's so much more to come!

Related events

Microsoft Ignite

One of the big events on the Microsoft calendar is Microsoft Ignite. It's less than a week away - November 4-8.


I'll be delivering a session on exactly this topic on Thursday afternoon. I'd love you to come along!

Also...

If you're going to Ignite and have an interest in MLOps, I'd love to speak to you!

I'll even have some of those "friends don't let friends" stickers in the image above. Just ask! 😀

Send me a message on Twitter, or just come by the Debug Bar in the Hub. I'll be there 11am-1pm Monday and Tuesday, from 1pm-4pm Wednesday, and from 12pm-2pm Thursday.

Microsoft Ignite The Tour

Microsoft Ignite The Tour is an event that Microsoft is running in 30 cities across the world, starting just after Ignite and running through to May next year.

The AIML50 session is focused completely on MLOps. I won't be delivering any of the tour sessions, but the presenters are amazing. Even though it won't be me saying the words and running the demos, the session is my work and I'm really proud of it.

Don't miss the (free) event when it comes to your city!

Damian Brady

I'm an Australian developer, speaker, and author specialising in DevOps, MLOps, developer process, and software architecture. I love Azure DevOps, GitHub Actions, and reducing process waste.
