1. Introduction

“Data” in a Machine Learning (ML) project can mean many things:

The train/validation/test data needed to train/validate/test a machine learning model;
A special/proprietary tokenizer for Natural Language Processing (NLP) projects;
A not-so-small map between entities/variables;
An “auxiliary” model, usually static, such as those used for generating embeddings, object detection, etc.;
and more…

In this blog post, “data” has a loose definition: any file or artifact that is usually not produced by the developer, with a very wide range of sizes (from KB for small data sets to GB for big NLP models).

With that in mind, a Data Scientist (DS) or Machine Learning Engineer (MLE) needs a solution to store, version, and deploy the data. In an enterprise setting, these concerns become the following questions:

Where is the data stored? How can we configure the access control?
Can we have version control? Do we even need that?
How is the data deployed to Production and synced with the Development environment? Do we need the environments synced?

Suggestion when starting

When working on an ML project beyond a Proof of Concept (POC), consider building a list of all data dependencies (not only the external ones).
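One lightweight way to build such a list is a small manifest kept next to the code. The sketch below is only an illustration: the `DataDependency` fields and the example entries (names, locations, sizes) are hypothetical, not a standard format.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class DataDependency:
    """One entry in the project's list of data dependencies (hypothetical schema)."""
    name: str              # short identifier used by the team
    kind: str              # e.g. "dataset", "tokenizer", "mapping", "auxiliary-model"
    location: str          # where the artifact lives (repo path, bucket, table name)
    approx_size_mb: float  # rough size, useful when choosing a storage tool
    external: bool         # True if the artifact is produced outside the team


# Example entries; every value here is made up for illustration.
DEPENDENCIES = [
    DataDependency("train-set", "dataset", "data/train.parquet", 850.0, False),
    DataDependency("tokenizer", "tokenizer", "artifacts/tokenizer.json", 2.5, True),
    DataDependency("entity-map", "mapping", "data/entity_map.csv", 0.3, False),
]


def to_manifest(deps):
    """Serialize the dependency list to JSON so it can be reviewed and committed."""
    ordered = sorted(deps, key=lambda d: d.name)
    return json.dumps([asdict(d) for d in ordered], indent=2)


if __name__ == "__main__":
    print(to_manifest(DEPENDENCIES))
```

Keeping the size and the internal/external flag in the list makes the storage and versioning questions from the introduction easier to answer per artifact.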

2. Do you need data version control?

From the questions raised in the introduction, you may be thinking: Do I even need such a thing as version control on all my data?

The not-so-surprising answer is: It depends.

What is the motivation for data version control then?

Incident recovery and rollbacks;
Reproducibility of model training and data pipelines;
Bonus: running historical analyses on past versions of the data.

OK, these seem like good capabilities to have, especially the second one, which is crucial from the ML operations (MLOps) point of view. But having version control for all data may be troublesome and add no real value. So, in which cases is version control not necessary?

If the data doesn’t change frequently or doesn’t change at all;
If new data is only appended, not deleted or updated;
If the data changes frequently and the updates make more sense being modeled explicitly (e.g. when receiving events from unreliable external sources, it makes more sense to keep all events in a table with their timestamps).
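The last case can be sketched as an append-only list of timestamped events from which any past state is reconstructed on demand, so no separate versioning tool is needed. This is a minimal illustration in plain Python; the `Event` fields, the `state_as_of` helper, and the sample values are assumptions made for the example.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Event(NamedTuple):
    entity_id: str
    value: float
    received_at: datetime  # when the event arrived; events are never updated in place


def state_as_of(events, entity_id, moment) -> Optional[float]:
    """Return the latest value known for entity_id at `moment`, or None if there is none."""
    candidates = [
        e for e in events
        if e.entity_id == entity_id and e.received_at <= moment
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.received_at).value


# Sample events (made up); a correction is appended instead of overwriting the old row.
EVENTS = [
    Event("sensor-1", 10.0, datetime(2023, 1, 1)),
    Event("sensor-1", 12.5, datetime(2023, 1, 5)),
    Event("sensor-2", 7.0, datetime(2023, 1, 3)),
]

if __name__ == "__main__":
    print(state_as_of(EVENTS, "sensor-1", datetime(2023, 1, 2)))
```

Because old rows are never modified, querying with an earlier `moment` plays the role that checking out an older version would play in a version control tool.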

3. Tools for dealing with data in ML projects

To answer the three starting questions, you may consider some of the following tools.

3.1 Git, Git-LFS

Where is the data stored? On Git remote servers, either external providers (e.g. GitHub, GitLab, GitBucket, Bitbucket…) or self-hosted.
Can we have version control? Very strong and familiar version control for developers.
Deployment and sync between environments? Given that most projects rely on container images, keeping the data with the code is the easiest way to deploy it between environments: you just need to deploy the container image.

Other points to consider:

Strong versioning and very difficult history deletion;
Data format agnostic;
Container image size: depending on the data size, the Continuous Delivery (CD) pipelines will slow down and might break due to disk space usage;
Default data comparison: if the format is not human-readable, you will need to download both file versions and compare them yourself. Examples of human-readable formats: CSV and JSON. Examples of non-human-readable formats: binary files and Parquet;
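Max file size: around 100 MB per file on plain Git (for example, GitHub rejects larger files) and around 10 GB on Git-LFS, depending on the hosting provider and plan.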
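To find out early whether a repository is about to hit the file size limit, a small check can be run before committing or in CI. The sketch below is an assumption about how such a check might look: `oversized_files` and `LIMIT_BYTES` are hypothetical names, and the threshold should be adjusted to whatever your Git provider enforces.

```python
from pathlib import Path

# 100 MB threshold mentioned in the text; adjust to your provider's actual limit.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root, limit_bytes=LIMIT_BYTES):
    """Return sorted (relative_path, size_in_bytes) for files under root above the limit.

    Anything inside a .git directory is skipped, since it is not working-tree data.
    """
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                found.append((relative.as_posix(), size))
    return sorted(found)


if __name__ == "__main__":
    for name, size in oversized_files("."):
        print(f"{name}: {size / 1024 / 1024:.1f} MB")
```

Files reported by this check are candidates for Git-LFS or for one of the tools described in the next sections.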

3.2 Data Version Control (DVC)

Where is the data stored? It accepts many backends (this is pretty awesome by the way).
Can we have version control? Yes, very git-like.
Deployment and sync between environments? Very similar to Git: after a “git pull”, a “dvc pull” downloads the matching data version.

Other points to consider:

Git-like interface;
Data format agnostic;
Ability to use multiple backends;

Personal note: It is a great tool for migrating from the limitations of Git and Git-LFS, but not so good for big data environments.
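The idea behind this Git-like workflow can be illustrated with a toy content-addressed store: the data file is copied to a cache keyed by its hash, and only a small pointer file is committed to Git. This is a simplified sketch of the concept in plain Python, not DVC's actual implementation or file format; `track`, `restore`, and the `.ptr` suffix are hypothetical names.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track(data_path, cache_dir):
    """Copy the file into a hash-keyed cache and write a small pointer file next to it."""
    data_path, cache_dir = Path(data_path), Path(cache_dir)
    # Reads the whole file in memory; fine for a sketch, not for very large files.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(data_path, cached)
    pointer = data_path.with_name(data_path.name + ".ptr")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_path.name}))
    return pointer  # this small file is what would be committed to Git


def restore(pointer_path, cache_dir):
    """Recreate the data file described by a pointer file from the cache."""
    pointer_path = Path(pointer_path)
    meta = json.loads(pointer_path.read_text())
    target = pointer_path.with_name(meta["path"])
    shutil.copyfile(Path(cache_dir) / meta["sha256"], target)
    return target
```

Checking out an older commit brings back an older pointer file, and `restore` then fetches the matching data. In DVC itself the corresponding steps are the `dvc add`, `dvc push`, and `dvc pull` commands, with the cache living on the configured remote backend.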

3.3 Delta Tables (Databricks)

Where is the data stored? Multiple backends, usually some data lake (ADLS, S3, etc).
Can we have version control? Yes, through the Delta table history (time travel).
Deployment and sync between environments? Needs to be implemented.

Other points to consider:

Easy data comparison: for example, one can query multiple table versions in the same SQL query;
The retention policy and the schedule of VACUUM commands are configurable, so we can control how far back the data history goes and, consequently, its size;
Better suited to be used with Spark;
Easy deletion of table history;
Fixed data format: tables.
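The version comparison and the history clean-up mentioned in the list can be sketched as two small helpers that build Delta SQL statements. The helper names and the `events` table are hypothetical; the statements rely on Delta's `VERSION AS OF` time travel syntax and the `VACUUM ... RETAIN ... HOURS` command, and they assume the table name comes from trusted code rather than user input.

```python
def compare_versions_sql(table, old_version, new_version):
    """Build a query returning rows present in new_version but absent from old_version."""
    return (
        f"SELECT * FROM {table} VERSION AS OF {int(new_version)} "
        f"EXCEPT "
        f"SELECT * FROM {table} VERSION AS OF {int(old_version)}"
    )


def vacuum_sql(table, retain_hours):
    """Build a VACUUM statement that removes unreferenced files older than retain_hours."""
    return f"VACUUM {table} RETAIN {int(retain_hours)} HOURS"


if __name__ == "__main__":
    # On a Spark session with Delta Lake enabled, these strings would be passed to spark.sql(...).
    print(compare_versions_sql("events", 3, 5))
    print(vacuum_sql("events", 168))
```

Once VACUUM has removed the files of an old version, that version can no longer be queried, which is exactly how the retention period bounds the size of the history.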

3.4 Honorable mentions

MLflow: best option when thinking about versioning models;

Pachyderm: its core feature is to run and version data-driven pipelines. It seems to want to do many things at once, which makes it too convoluted for solving the three starting problems;

lakeFS: versions the whole data lake, so it seems like a tool for a company’s data teams rather than for a single ML project.

Dolt: a SQL database that feels like a git repository. The problem is that it is a database in itself, too big of a solution.

4. Conclusion

Given my experience and the analysis in this article, my rules of thumb are:

For data below 100 MB, use Git. Above that you could use Git-LFS, but it slows down and may break CD pipelines. Git is robust and time-proven, and deploying data to environments is automatic, as we usually build container images with all the files of the code repository.
For bigger data, give preference to Delta Tables, as we can control the table history range and consequently its size. As our team usually works with Spark, DVC is not very compelling when compared to Delta Tables.