MLOps [6] - How does MLOps differ from DevOps?

This video and post are part of a One Dev Question series on MLOps - DevOps for Machine Learning. See the full video playlist here, and the rest of the blog posts here.

While machine learning projects and traditional software development have many similarities, there are some important differences. This means blindly applying the practices of one to the other is unlikely to be successful.

In the last blog post I talked about how the day-to-day work of a data scientist is usually very different from the day-to-day work of a traditional software engineer. Even so, the processes around that work can benefit greatly from DevOps practices, techniques, and ideas.

To understand where MLOps differs from DevOps, it's worth drilling into that first point - what is the difference when it comes to the day-to-day work?

Experimentation

The first difference is to do with experimentation. While it's not unusual for a traditional software engineer to experiment and try things out, these "spikes" are usually relatively short-lived and isolated from the main project. It's rare that a developer will do a week's worth of work, realise it's not the right direction, then abandon it. More importantly, these spikes generally don't need to be kept around or tracked. They're not pertinent to the rest of the work going on.

That's not the case in machine learning projects. It's quite common for a data scientist to experiment with solving a problem a particular way, only to stop and try another method several days or weeks down the track.

Of course there are tools to help with this experimentation process. Azure Machine Learning has an automated machine learning feature allowing you to run a whole lot of experiments in parallel and compare them to get the best performing models quicker.

Regardless of how you run all these experiments, the very point of experimentation is to find those methods that work and those that don't. The "failures" are very rarely mistakes; rather, they represent the reality of machine learning. You're trying to solve a problem where you can't personally envisage the solution.

It actually represents a more agile way of working than we usually see with traditional software development. You frequently don't know what you'll find until you try something, and it's important to be able to change course if it's the sensible thing to do.

But if it might take weeks of experimentation to find a feasible path, does that mean all the failed attempts should be discarded? Well, no. It's important that these experiments are still tracked, for a few reasons. It can be helpful to the rest of your team to see what's been tried and what went wrong. Alternatively, these experiments might be more successful than you originally thought and worth another look later in the project. Ultimately, an "experiment" represents a method used to find a solution, and every one of them is important, in exactly the same way as in any other scientific field.
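As a rough illustration of what "tracking" could look like at its very simplest, here's a minimal sketch that appends every experiment, abandoned or not, to a JSON Lines file. The function names (`log_experiment`, `load_experiments`), the record fields, and the file format are all assumptions made for this example; this is not the Azure Machine Learning API, and a real tracking tool would capture far more.

```python
import json
import time
from pathlib import Path


def log_experiment(log_path, name, params, metrics, status, notes=""):
    """Append one experiment record to a JSON Lines file and return it.

    Hypothetical helper: the field names are illustrative only.
    """
    record = {
        "name": name,
        "params": params,
        "metrics": metrics,
        "status": status,  # e.g. "promising" or "abandoned"
        "notes": notes,    # why it was abandoned, what to try next
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    # Append mode: earlier experiments are never overwritten.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_experiments(log_path):
    """Read every record back, abandoned attempts included."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `log_experiment("experiments.jsonl", "gradient-boosting", {"depth": 6}, {"auc": 0.71}, "abandoned", notes="overfits on small classes")` leaves a record that a teammate can find weeks later with `load_experiments`, which is the whole point: the "failed" attempt stays visible.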

Working with data

As the name suggests, data scientists work primarily with data. And data is unquestionably the most important component of a machine learning project. Erroneous, insufficient, or unrepresentative data leads to bad predictive models. In other words, data that doesn't accurately reflect the real world will ultimately lead to models that don't work in the real world. It's also important to realise and acknowledge that the machine learning process can perpetuate and even exaggerate biases that exist in that data.

Yes, there is code. But if you put all the code required for an ML project alongside all the data required, the code is likely to be a fraction of a fraction of a percent of the total.

However, with traditional software development, it's all about the code. There may be data involved, but all the work is focused around the code. And managing code and managing data are quite different.

Aspects like versioning and reproducibility are much easier to handle with code than they are with data. Source control for even several hundred megabytes of text-based files is comparatively trivial when contrasted with terabytes of data for an ML project.

What this means in practice is that provenance is easier with code. With established source control and good CI/CD practices, it's relatively easy to say this code created this artefact which was deployed to this environment. A commit SHA can be all you need to identify that whole chain from code change to running application.

But it's not quite so easy when there are huge amounts of data in the mix. You also have to account for what the data looked like at the time you trained your model. Getting a snapshot of the code (including infrastructure) at a point in time can be as simple as running git checkout [sha], but getting a snapshot of data at a point in time depends very much on the way it's been stored. If you're lucky and have considered this with your up-front data engineering, it can be a simple query. If not, it can be borderline impossible. This makes provenance harder, and that matters because explainability of a model is extremely important in many implementations.
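One way to picture the gap is to try to give data the same kind of identifier a commit SHA gives code. The sketch below is only an illustration under some big assumptions: the training data lives as files in a single directory, it's small enough to read in full, and the helper names (`fingerprint_dataset`, `build_manifest`) and manifest fields are made up for this example.

```python
import hashlib
from pathlib import Path


def fingerprint_dataset(data_dir):
    """Return one SHA-256 digest covering every file under data_dir.

    Hypothetical helper. It detects that the data changed; it does not
    store a copy, so it cannot bring an old snapshot back.
    """
    root = Path(data_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    digest = hashlib.sha256()
    # Sort by relative path so the result doesn't depend on listing order.
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as f:
            # Read in 1 MiB chunks to keep memory use flat.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_name, code_sha, data_dir):
    """Tie a trained model to the code and the data that produced it."""
    return {
        "model": model_name,
        "code_sha": code_sha,  # e.g. the output of `git rev-parse HEAD`
        "data_sha256": fingerprint_dataset(data_dir),
    }
```

Storing a manifest like this next to each trained model at least lets you say "this code plus this data produced this model". Reading terabytes just to compute a hash isn't practical, though, and a hash alone won't let you get the old data back, which is why the up-front data engineering matters so much.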

With DevOps, using source control and CI/CD to enable collaboration, reproducibility, and provenance is standard practice.

With MLOps, a lot more thought and planning may need to go into data engineering if you truly want these same benefits.

CI/CD

I mentioned good CI/CD practices, and CI/CD is one of the other areas where MLOps can differ greatly from DevOps - particularly the continuous integration part.

A typical build for a traditional software engineering project might take a few minutes. Maybe an hour for very large projects. Additionally, most builds can be performed on relatively low-spec machines or Docker containers.

At that speed and with those resource requirements, it's feasible to do a full build for every commit that's made - continuous integration. Every change can be committed and there's a very fast feedback loop telling the engineer whether the change they made is "good".

With a machine learning project, a training run may take days or weeks, and might require a powerful machine with multiple GPUs. Even with massively scalable compute offered by the cloud, full training runs can be extremely expensive and time-consuming. It's simply not feasible to do a full training run for every code or data change. Even if money were no object, getting feedback on a change you made would be far from immediate. You lose the connection between cause and effect.

The good news is you can still stay true to the fundamental goals of DevOps without doing a full training run on every change. The idea of "shifting left" - where you try to discover issues as early as possible in the lifecycle of a change - can still be applied pragmatically.

For example, you might want to implement continuous integration that runs quality checks and tests over your code, and maybe does a mini training run with a very limited set of data - just to catch any glaring issues. If that succeeds, the next stage might be a training run over a larger set of data that you abandon after an hour, ensuring that the metrics are still heading in the right direction. Then a nightly training run that takes a few hours with a full evaluation, then a weekly run over all the data and so on.
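The staged approach described above could be sketched roughly like this. Everything here is an assumption for illustration: the `Stage` fields, the data fractions, the time budgets other than the one-hour cut-off, and the deliberately crude "is the loss still heading down?" check. A real pipeline would live in your CI system and use the metrics that matter for your model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    trigger: str                # when the stage runs
    data_fraction: float        # share of the full dataset used
    max_minutes: Optional[int]  # time budget; None means run to completion


# Hypothetical numbers, ordered from fastest feedback to most thorough.
STAGES = [
    Stage("smoke", "every commit", 0.001, 10),
    Stage("trend-check", "after smoke passes", 0.10, 60),
    Stage("nightly", "every night", 0.50, 240),
    Stage("weekly", "every week", 1.00, None),
]


def loss_is_improving(losses, min_drop=0.0):
    """Return True when the second half of the run has a lower mean loss
    than the first half by more than min_drop.

    Too few points to judge (fewer than 4) counts as not improving.
    """
    if len(losses) < 4:
        return False
    mid = len(losses) // 2
    first_half = sum(losses[:mid]) / mid
    second_half = sum(losses[mid:]) / (len(losses) - mid)
    return first_half - second_half > min_drop
```

The trend check is what lets you abandon the larger run after an hour: if `loss_is_improving` comes back `False` for the losses recorded so far, the change is flagged long before anyone pays for a nightly or weekly run.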

Ultimately, you can control the balance between fast feedback and the cadence for your production-ready models.

For me, these are the three key areas to be aware of when applying DevOps practices to machine learning projects:

  1. Treating experimentation as a necessary and important part of the work
  2. Being conscious of data engineering and versioning to cater for effective collaboration, repeatability, and explainability
  3. Applying continuous integration pragmatically to avoid unreasonable cost and time delays

One tool that's extremely good at managing the MLOps lifecycle is Azure Machine Learning. As well as features that make creating models easier than ever, it can manage pipelines, track experiment and model versions, and even package and deploy to production.

If you want to get started with Azure Machine Learning, the best place to get hands-on experience (for free) is Microsoft Learn. The "Build AI solutions with Azure Machine Learning" learning path is an awesome in-depth resource.

And of course, there's the usual collection of comprehensive documentation on Microsoft Docs to answer any additional questions you have along the way.