Xvc for Machine Learning

Machine learning projects accumulate large datasets, trained models, and the scripts that connect them. Over time it becomes hard to answer simple questions: which model was trained on which version of the data, and which preprocessing produced it? Xvc versions data and models alongside your code and can rerun the steps that turn one into the other, so your experiments stay reproducible.

This guide shows how to version data and models and how to build a pipeline that retrains a model when its inputs change. It uses an image-classification project as a running example.

Initialize the repository

Xvc works on top of Git. In your project directory, initialize both:

$ git init
$ xvc init

Xvc follows the Unix convention of staying quiet on success. If you want more detail about what a command is doing, add -v; repeat it (-vv, -vvv) to increase the verbosity.

By default, Xvc commits the metadata it creates to Git for you, so you don't have to stage or commit anything under .xvc/ yourself.

Track data as links to save space

Datasets are often large and read-only during training, so copying every file into your workspace wastes space. Track them as symlinks instead:

$ xvc file track --as symlink .

Xvc stores one copy of each file's content in its cache and links every occurrence in the workspace to that copy. Identical files are deduplicated automatically, which is useful when a dataset contains many duplicates. To track only a specific file or directory, pass its path instead of the dot (.), which stands for the current directory.
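
The deduplication described above can be pictured with a short Python sketch. It is not Xvc's implementation: the store_and_link helper, the flat cache layout, and the sample file names are hypothetical, and SHA-256 is used only because it ships with Python (Xvc has its own cache layout and digest settings).

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_and_link(workspace_file: Path, cache_dir: Path) -> Path:
    """Move a file's content into a content-addressed cache and leave a
    symlink in its place. Hypothetical helper, not part of Xvc."""
    digest = hashlib.sha256(workspace_file.read_bytes()).hexdigest()
    cached = cache_dir / digest
    if cached.exists():
        # Identical content is already cached: drop the duplicate.
        workspace_file.unlink()
    else:
        # First occurrence: move the content into the cache, read-only.
        os.replace(workspace_file, cached)
        cached.chmod(0o444)
    workspace_file.symlink_to(cached)
    return cached


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    data = root / "data"
    cache.mkdir()
    data.mkdir()

    # Two files with identical content and one with different content.
    (data / "cat-1.png").write_bytes(b"same pixels")
    (data / "cat-1-copy.png").write_bytes(b"same pixels")
    (data / "cat-2.png").write_bytes(b"other pixels")

    targets = {store_and_link(path, cache) for path in sorted(data.iterdir())}
    cached_copies = len(list(cache.iterdir()))
    all_links = all(path.is_symlink() for path in data.iterdir())

    print(f"{cached_copies} cached copies for 3 workspace files")
```

Three workspace files end up as links to two cached copies, because the two identical files resolve to the same digest.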

Choosing a recheck method

The way a workspace file points at the cache is called its recheck method. Xvc supports four:

copy (the default) duplicates the content as a normal, writable file. Use it for files that change.
symlink creates a read-only symbolic link to the cache. Links are easy to spot and there is no limit on how many can point to the same content.
hardlink creates a read-only hardlink. Like symlinks, it consumes no extra space, but hardlinks are harder to tell apart from regular files.
reflink creates a writable copy-on-write copy. It is available only on filesystems that support it and only when Xvc is built with the reflink feature. It behaves like a copy but shares data blocks until the file is modified.
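
To get a feel for how the first three methods differ on disk, the following Python sketch reproduces them with standard library calls. It only illustrates filesystem behavior and does not call Xvc; the file names are made up, and reflink is left out because the standard library has no portable call for it.

```python
import os
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cached-content"  # stands in for a file in the cache
    cached.write_bytes(b"model weights")

    as_copy = root / "copy.bin"
    as_symlink = root / "symlink.bin"
    as_hardlink = root / "hardlink.bin"

    shutil.copy(cached, as_copy)    # copy: an independent duplicate
    as_symlink.symlink_to(cached)   # symlink: a pointer to the cached path
    os.link(cached, as_hardlink)    # hardlink: a second name for the same inode

    cached_inode = cached.stat().st_ino
    copy_shares_data = as_copy.stat().st_ino == cached_inode
    hardlink_shares_data = as_hardlink.stat().st_ino == cached_inode

    # A symlink is visibly a link; a hardlink looks like a regular file.
    symlink_is_visible = as_symlink.is_symlink()
    hardlink_is_visible = as_hardlink.is_symlink()

    # Writing to the copy leaves the cached content untouched.
    as_copy.write_bytes(b"fine-tuned weights")
    cache_unchanged = cached.read_bytes() == b"model weights"
```

The copy gets its own inode and can be modified freely, while the hardlink shares the cached inode and cannot be told apart from a regular file by a link check.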

Xvc records the recheck method per file, so different files can use different methods. For example, if models should be writable, track the models/ directory as copies while the data stays in read-only symlinks:

$ xvc file track --recheck-method copy models/

This replaces the previous links with copies only for the files under models/.

Keep scripts in Git

Your scripts can live in the same repository as the data and models they use, so they are versioned together. Track them with ordinary Git commands. Xvc does not track files that Git already tracks, so the two tools stay out of each other's way.

Build a pipeline

An Xvc pipeline is a set of steps that Xvc reruns only when their inputs change. Consider a pipeline that preprocesses images, trains a model, evaluates it, and deploys it when the new model is the best so far:

graph LR
    cats["data/cats/"] --> pp_train["preprocess-train"]
    cats --> pp_test["preprocess-test"]
    pp_train --> train["train"]
    params["params.yaml"] --> train
    ratings["cat-contest.csv"] --> train
    train --> model["models/model.bin"]
    model --> test["test"]
    pp_test --> test
    test --> best["best-model.json"]
    best --> deploy["deploy"]

Xvc has a default pipeline, so you can start adding steps right away. Create additional pipelines with xvc pipeline new if you need them.

Create the steps

Each step has a name and a command. Create them with xvc pipeline step new:

$ xvc pipeline step new --step-name preprocess-train --command 'python3 src/preprocess.py --train data/cats data/pp-train/'
$ xvc pipeline step new --step-name preprocess-test  --command 'python3 src/preprocess.py --test  data/cats data/pp-test/'
$ xvc pipeline step new --step-name train --command 'python3 src/train.py data/pp-train/'
$ xvc pipeline step new --step-name test  --command 'python3 src/test.py data/pp-test/ metrics.json'
$ xvc pipeline step new --step-name deploy --command 'python3 src/deploy.py models/model.bin /var/server/files/model.bin'

Declare dependencies

A step reruns when any of its dependencies change. Xvc offers many dependency types; this example uses globs, files, a parameter, and a regular expression. Attach them with xvc pipeline step dependency:

$ xvc pipeline step dependency --step-name preprocess-train --glob 'data/cats/*' --file src/preprocess.py
$ xvc pipeline step dependency --step-name preprocess-test  --glob 'data/cats/*' --file src/preprocess.py
$ xvc pipeline step dependency --step-name train --glob 'data/pp-train/*' --file src/train.py --param 'params.yaml::learning_rate' --regex 'cat-contest.csv:/^5,.*'
$ xvc pipeline step dependency --step-name test  --glob 'data/pp-test/*'  --file src/test.py --file models/model.bin
$ xvc pipeline step dependency --step-name deploy --file best-model.json

A few things worth noting:

A --param dependency watches a single value in a JSON, YAML, or TOML file. The train step reruns when learning_rate changes, even if other values in params.yaml do not.
A --regex dependency watches only the lines of a file that match a pattern — here, the five-star ratings in cat-contest.csv. You can also depend on specific line ranges with --lines.
Quote globs and regular expressions so your shell passes them through unexpanded; Xvc expands them itself.
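
The idea behind a --regex dependency can be sketched in a few lines of Python: only the matching lines take part in the comparison, so edits elsewhere in the file do not invalidate the step. The matching_lines_digest helper and the sample ratings are hypothetical, and SHA-256 stands in for whatever digest Xvc uses internally.

```python
import hashlib
import re


def matching_lines_digest(text: str, pattern: str) -> str:
    """Digest only the lines that match the pattern. Hypothetical helper
    that mimics the idea of a regex dependency; not Xvc's implementation."""
    regex = re.compile(pattern)
    matched = [line for line in text.splitlines() if regex.search(line)]
    return hashlib.sha256("\n".join(matched).encode()).hexdigest()


# Made-up contest data: the rating comes first, then the cat's name.
original = "5,whiskers\n3,tom\n5,luna\n"
low_rating_edited = "5,whiskers\n2,tom\n5,luna\n"  # a three-star line changed
five_star_added = original + "5,felix\n"            # a new five-star rating

pattern = r"^5,.*"
baseline = matching_lines_digest(original, pattern)

rerun_after_low_edit = matching_lines_digest(low_rating_edited, pattern) != baseline
rerun_after_five_star = matching_lines_digest(five_star_added, pattern) != baseline

print("rerun after low-rating edit:", rerun_after_low_edit)
print("rerun after new five-star rating:", rerun_after_five_star)
```

Editing a lower rating leaves the digest unchanged, while adding a five-star rating changes it. This is the behavior the train step relies on.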

Declare outputs

Record what each step produces with xvc pipeline step output. Outputs let Xvc rerun a step when its output is missing and let later steps depend on it:

$ xvc pipeline step output --step-name train --output-file models/model.bin
$ xvc pipeline step output --step-name test  --output-metric metrics.json --output-file best-model.json

Because test depends on models/model.bin and train produces it, Xvc orders the two steps automatically.
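
As a rough mental model of how declared outputs affect the decision to run a step, consider the Python sketch below. The should_run helper is hypothetical and much simpler than what Xvc does; it assumes the question of whether dependencies changed has already been answered elsewhere.

```python
import tempfile
from pathlib import Path


def should_run(outputs, dependencies_changed):
    """Return True when a dependency changed or at least one declared
    output does not exist. Hypothetical helper for illustration only."""
    output_missing = any(not Path(path).exists() for path in outputs)
    return dependencies_changed or output_missing


with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "models" / "model.bin"

    # Nothing changed, but the model was never produced: train must run.
    run_when_missing = should_run([model], dependencies_changed=False)

    model.parent.mkdir()
    model.write_bytes(b"weights")

    # The output exists and the inputs are unchanged: train can be skipped.
    run_when_present = should_run([model], dependencies_changed=False)

    # The output exists, but an input changed: train must run again.
    run_when_inputs_change = should_run([model], dependencies_changed=True)
```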

Run the pipeline

Run the whole pipeline with xvc pipeline run:

$ xvc pipeline run

Xvc checks the dependency graph for cycles (which are not allowed), sorts the steps so that each one runs after the steps it depends on, and runs independent steps in parallel. In this example, preprocess-train and preprocess-test can run at the same time.
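
The ordering and cycle check can be illustrated with Python's standard graphlib module. This is a sketch of the general technique (topological sorting), not of Xvc's scheduler: the step graph is transcribed by hand from the example in this guide, and the parallel_batches helper is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each step maps to the steps it depends on.
pipeline = {
    "preprocess-train": set(),
    "preprocess-test": set(),
    "train": {"preprocess-train"},
    "test": {"train", "preprocess-test"},
    "deploy": {"test"},
}


def parallel_batches(graph):
    """Group steps into batches. Steps in the same batch do not depend on
    each other and could run at the same time. Raises CycleError on a cycle."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # the cycle check happens here
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(pipeline)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")

# A cycle is rejected before anything runs.
try:
    parallel_batches({"train": {"test"}, "test": {"train"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

The first batch contains both preprocessing steps, and the remaining steps follow one at a time because each depends on the one before it.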

Inspect and edit the pipeline

See the pipeline as a diagram with xvc pipeline dag, which prints a Mermaid or Graphviz graph you can paste into your notes or a pull request:

$ xvc pipeline dag --format mermaid

To edit a pipeline in bulk, export it to JSON or YAML with xvc pipeline export, change it in your editor, and import it back with xvc pipeline import. Importing under a different name is a convenient way to test a variation before running it.

Next steps

Add shell completions so that TAB completes step names, storage names, and Xvc-tracked paths.
See Create a Data Pipeline for a complete, runnable end-to-end example.
The xvc pipeline step dependency reference lists every dependency type.

The Xvc Book