MLOps [7] - What parts of an ML project should be in source control?
This video and post are part of a One Dev Question series on MLOps - DevOps for Machine Learning. See the full video playlist here, and the rest of the blog posts here.
With traditional software development, it's important that everything goes in source control. The same can be said for ML projects, but there are some important considerations!
Any DevOps evangelist will tell you everything should go in source control. Your code, infrastructure definitions, dev environment setup, database schema, even lookup data. And this is generally good advice.
The problem comes when this data is too large to be reasonably stored in traditional source control.
When Microsoft moved the Windows codebase to a Git repository, they ultimately had to write a whole new "virtual file system" (VFS for Git) to handle the size. As it stood, Git couldn't handle the 300 GB size of the repository. Many commands would take upwards of an hour to run.
300 GB is not a particularly large amount of data for many ML projects. As mentioned previously in this series, machine learning projects rely very heavily on data as well as the code. That data is often massive. Like, petabytes massive. This much data just won't work in a Git repository. Even with a virtual file system, you'll still have to pull all those files down to do a training run. It's just not feasible.
Everything except the data
There are solutions for data - which we'll get to - but to be clear, everything else should be in source control.
That means:
- All the code required to train your model
- Code for testing and validation of your code and trained models
- Definition of your data and training pipelines
- Definition and configuration of your training environment
- Definition and configuration of your inferencing/production environment
- Maybe a subset of the data so data scientists can prove out ideas locally
Ultimately, your source repository should contain everything a data scientist could reasonably require to start working on your project and be productive.
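If you do keep a subset of the data in the repository, it helps if that subset is reproducible, so everyone works from the same sample. Here's a minimal sketch of one way to do that, assuming your records fit in memory as a list. The function name `sample_subset`, the 5% fraction, and the seed value are all illustrative choices, not something prescribed by any tool.

```python
import random


def sample_subset(records, fraction, seed=42):
    """Return a reproducible random subset of `records`.

    The same records, fraction, and seed always give the same subset.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not records:
        return []
    # Always keep at least one record, and never more than we have.
    k = min(len(records), max(1, round(len(records) * fraction)))
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(records, k)


# Hypothetical in-memory dataset standing in for your real data.
rows = [{"id": i, "label": i % 2} for i in range(1000)]
subset = sample_subset(rows, fraction=0.05)
print(len(subset))  # 50
```

Committing the seed and fraction alongside the code means the sample can be regenerated from the full dataset whenever needed.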
Another (darker) way of thinking about it is what you'd lose if your local machine, training environments, and production environments all stopped working. Could you get started again quickly? Or at all?
Keeping everything in source control also allows you to go back in time. This is important for explainability or diagnosing issues, because you may not know that there's a problem with a model in production until well after deployment. By that point, your current code and data likely look very different from what was used to train the deployed model. If there is an issue - for example with bias, harmful predictions, or even just unexpected failures - it's important to be able to identify what caused that issue.
What about the data?
As I mentioned, there are solutions for storing and versioning large amounts of data.
From a pure data-versioning standpoint, products like DVC and Pachyderm are good options. They allow you to attach versions to your data at specific points in time.
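To give a feel for the general idea, here's a rough sketch of the "pointer file" pattern: the large data file stays outside Git, and a small metadata file containing its hash is committed instead. DVC, for example, keeps small metadata files in Git that reference data stored elsewhere. To be clear, this is not DVC's or Pachyderm's actual file format or API. The function names and the JSON layout are assumptions made up for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path, pointer_path):
    """Write a small JSON file describing the data file; commit this to Git."""
    data_path = Path(data_path)
    pointer = {
        "path": data_path.name,
        "size_bytes": data_path.stat().st_size,
        "sha256": file_sha256(data_path),
    }
    Path(pointer_path).write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer


def matches_pointer(data_path, pointer_path):
    """Check that the data on disk is the version the pointer refers to."""
    pointer = json.loads(Path(pointer_path).read_text())
    return file_sha256(data_path) == pointer["sha256"]


# Small demo using a temporary directory and a made-up data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"
    pointer_file = Path(tmp) / "train.csv.pointer.json"
    data_file.write_bytes(b"a,b\n1,2\n")
    write_pointer(data_file, pointer_file)
    unchanged = matches_pointer(data_file, pointer_file)  # True
    data_file.write_bytes(b"a,b\n1,3\n")                  # data changes
    changed = matches_pointer(data_file, pointer_file)    # False
```

Because the pointer file is tiny, it versions happily in Git, and checking out an old commit tells you exactly which version of the data that code was trained against.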
If you're using Databricks (on Azure or elsewhere), you can make use of Delta Lake to give you "time travel".
If you're using Azure Machine Learning, you can also use the built-in versioning capabilities of datasets - a native part of Azure ML.
And of course if your data is stored elsewhere in Azure, you can make use of built-in versioning capabilities like Blob versioning or Point-in-time Restore for Azure SQL.
One final note - many of these options require some thought about data engineering up front. One common strategy is making your data immutable and append-only: you never modify existing records, you only write new ones and timestamp them appropriately. That allows you to run queries to identify the state of the data at any point in time. Event Sourcing is one popular way to implement this, where the focus is on ordered state changes rather than the current state.
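As a minimal sketch of the append-only idea, here's a tiny in-memory store where records are only ever added, and the state at any point in time is rebuilt by replaying them. The class and method names (`AppendOnlyLog`, `state_as_of`) and the sample values are hypothetical. A real system would sit on a database or event store rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record can't be changed once written
class Record:
    key: str
    value: float
    recorded_at: datetime


class AppendOnlyLog:
    def __init__(self):
        self._records = []

    def append(self, key, value, recorded_at=None):
        """Add a new record; existing records are never updated or deleted."""
        ts = recorded_at or datetime.now(timezone.utc)
        self._records.append(Record(key, value, ts))

    def state_as_of(self, when):
        """Replay records in time order to get the latest value per key at `when`."""
        state = {}
        for r in sorted(self._records, key=lambda r: r.recorded_at):
            if r.recorded_at <= when:
                state[r.key] = r.value
        return state


log = AppendOnlyLog()
log.append("price", 10.0, datetime(2021, 1, 1, tzinfo=timezone.utc))
log.append("price", 12.5, datetime(2021, 3, 1, tzinfo=timezone.utc))

feb = log.state_as_of(datetime(2021, 2, 1, tzinfo=timezone.utc))
apr = log.state_as_of(datetime(2021, 4, 1, tzinfo=timezone.utc))
print(feb)  # {'price': 10.0}
print(apr)  # {'price': 12.5}
```

The "update" in March didn't overwrite anything, so you can still ask what the data looked like in February, which is exactly what you need to reproduce an old training run.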