Introduction to Xvc

Xvc is a command line utility to track large files with Git, define dependencies between files so that commands run only when these dependencies change, and run experiments by making small changes in these files for later comparison. It's used mostly in machine learning scenarios where data and model files are large, code files depend on them, and experiments must be compared via various metrics.

Xvc can upload tracked files to S3 and compatible cloud storages with their exact version and retrieve them later. This lets you delete files from the project to save space when they are not needed and get them back when they are. The same facility can be used for sharing these files: others can clone the Git repository and get only the Xvc-tracked files they need.

Xvc tracks files, directories and other elements by calculating their digests. These digests are used as addresses to store their content and find it in the storages. When you make a change to a file, it gets a new digest and the changed version has a new address. This ensures that all versions can be retrieved on demand.
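The idea of using a digest as an address can be sketched in a few lines of Python. This is only an illustration, not Xvc's implementation: the `store` and `retrieve` helpers and the two-character subdirectory layout are assumptions made for the example, and it uses BLAKE2s from the standard library because BLAKE3 is not available there.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(data: bytes) -> str:
    """Return a 256-bit digest of the content as 64 hex characters."""
    return hashlib.blake2s(data).hexdigest()


def store(cache_dir: Path, data: bytes) -> str:
    """Save data under a path derived from its digest and return the digest."""
    digest = content_digest(data)
    # Hypothetical layout: the first two hex characters form a subdirectory.
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        target.write_bytes(data)
    return digest


def retrieve(cache_dir: Path, digest: str) -> bytes:
    """Read back the content stored under the given digest."""
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        v1 = store(cache, b"first version")
        v2 = store(cache, b"second version")
        # Each version has its own address, so both stay retrievable.
        print(v1 != v2)
        print(retrieve(cache, v1))
```

Because the address depends only on the content, saving a changed file never overwrites an older version.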

Xvc can be used as a make replacement to build multi-file projects with complex dependencies. Unlike make, which detects file changes with timestamps, Xvc checks the files via their content. This reduces false positives in invalidation.
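The difference between the two checks can be shown with a small Python sketch. It is not how Xvc is implemented; the function names are hypothetical and SHA-256 stands in for whichever digest is configured. A file is rewritten with identical bytes and a newer modification time, the way `touch` or a fresh checkout might leave it.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def changed_by_timestamp(path: Path, recorded_mtime_ns: int) -> bool:
    """Timestamp check: any newer mtime counts as a change."""
    return path.stat().st_mtime_ns != recorded_mtime_ns


def changed_by_content(path: Path, recorded_digest: str) -> bool:
    """Content check: only different bytes count as a change."""
    return hashlib.sha256(path.read_bytes()).hexdigest() != recorded_digest


def touch_demo() -> tuple[bool, bool]:
    """Return (timestamp says changed, content says changed) for a touched file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.txt"
        path.write_bytes(b"same content")
        mtime = path.stat().st_mtime_ns
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        # Rewrite identical bytes and move the mtime one second forward.
        path.write_bytes(b"same content")
        later = mtime + 1_000_000_000
        os.utime(path, ns=(later, later))
        return changed_by_timestamp(path, mtime), changed_by_content(path, digest)


if __name__ == "__main__":
    print(touch_demo())
```

The timestamp check reports a change and would trigger a rebuild, while the content check does not.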

Xvc pipelines are used to define steps to reach a set of outputs. These steps have commands to run and may (or may not) produce intermediate outputs that other steps depend on. As of now, Xvc pipelines allow steps to depend on other steps, other pipelines, text and binary files, directories, globs that select a subset of files, certain lines in a file, certain regular expression results, URLs, and (hyper)parameter definitions in YAML, JSON or TOML files. More dependency types like environment variables, database tables and queries, S3 buckets, REST query results, generic CLI command results, Bitcoin wallets, and Jupyter notebook cells are in the plans.

For example, Xvc can be used to create a pipeline that depends on certain files in a directory via a glob, and a parameter in a YAML file to update a machine learning model. The same feature can be used to build software when the code or artifacts used in the software change. This allows binary outputs (as well as code inputs) to be tracked in Xvc. Instead of building everything from scratch in a new Git clone, a software project can rebuild only the portions that require it and reuse the rest. Binary distributions become much simpler.

This book is used as the documentation of the project. Like Xvc itself, it is a work in progress and may contain outdated information. Please report any errors and bugs at https://github.com/iesahin/xvc, as for the rest of the project.

Comparison with other tools

There are many similar tools for managing large files on Git and for managing machine learning pipelines and experiments. Most ML-oriented tools are provided as SaaS and work in a different vein than Xvc.

Similar tools for file management on Git are the following:

  • dvc: See Xvc for DVC Users and Benchmarks against DVC documents for a detailed comparison.
  • git-annex: One of the earliest and most successful projects to manage large files on Git. It supports a large number of remote storage types, as well as adding other utilities as backends, similar to xvc storage new generic. It features an assistant aimed at making common use cases easier. It uses SHA-256 as the single digest option and uses symlinks as a recheck method. It doesn't have data pipeline features.
  • git-lfs: It uses Git internals to track binary files. It requires server support for remote storages and allows only Git remotes to be used for binary file storage. It uses the same digest function Git uses (SHA-1 by default). It uses the .gitattributes mechanism to track certain files by default. It doesn't have data pipeline features.

Installation

Rust

Linux

macOS

Windows

Compiling Xvc without default features

You may want to customize the feature set when you want a smaller binary. Not everyone needs all storage options, and turning them off may result in a smaller binary.

When you turn off all remote storage features, the async runtime (tokio) is also excluded from the binary.

cargo build --no-default-features --release
[..]
    Finished `release` profile [optimized] target(s) in 4.65s

The reflink crate may cause compilation errors on platforms where it's not supported.

Xvc adds a reflink feature flag that's turned on by default. When reflink causes errors, you can turn off default features and select only those you'll use.

cargo build --no-default-features --features "reflink" --release
[..]
    Finished `release` profile [optimized + debuginfo] target(s) in 56.40s

Note that when you supply --no-default-features, all other default features such as s3 are also turned off. You'll have to specify which features you want in the features list. Otherwise Xvc cannot connect to your storages.

cargo build --no-default-features --features "s3,wasabi" --release
[..]
    Finished `release` profile [optimized + debuginfo] target(s) in 56.40s

Configuration

Configuration Files

Configure with Environment Variables

Changing configuration for a command

Get Started to Xvc

Xvc is a multipurpose tool. Its features can be used by professionals with various roles. If you're working with data, you can benefit from Xvc data management features.

Xvc for Everyone

Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).

🐇 Hello tortoise. How are you? Let's take a selfie. Do you take selfies? I have lots of them. Terabytes of them.

🐢 I don't have many selfies, you know. I don't change quickly and the scenery changes even less often.

🐇 I see. I have terabytes of them, but can't find a good solution to store them. How do you store your documents? I know you have documents, lots and lots of them.

🐢 I track them with Git to track my evolving thoughts on text files. Images are different. I think it's not a good idea to keep images on Git, but there is a tool for that.

🐇 What kind of tool? Not Git, but something different?

🐢 It's called Xvc. You can keep track of your selfies with it. You can back them up and get them as needed.

🐇 Tell me more about it. I have a directory in my home, ~/Selfies and I have thousands of them. How will I start?

🐢 Xvc can be used as a standalone tool, but it works better with Git. You can just type

$ git init
$ xvc init

to start working with Xvc.

🐇 It looks easy but I heard that Git is complicated. Will I need to learn it?

🐢 Ah, no. If you're not willing to learn Git, you can just let Xvc handle that. By default, it handles all Git operations about the changes it makes. If you want to share your files with someone, you may need to learn how to manage a repository.

🐇 How do I track my files?

🐢 You use the xvc file track command. Do you have directories in ~/Selfies?

🐇 Yep. I have. Lots of them.

🐢 Do you want to track all of them?

🐇 Almost all. Some of them are so private that I want to hide even from Xvc.

🐢 You can use a .xvcignore file to list them. Xvc ignores the files you list in .xvcignore.

🐇 How do I add others? Could you give an example?

🐢 If you have a folder for today's selfies, type this in ~/Selfies

$ xvc file track today/

and Xvc will track everything in that directory.

🐇 Oh, that's easy. If I want to track everything not ignored, I can type xvc file track then.

🐢 You're a quick learner.

After a brief period, 🐇 went home and added files.

🐇 Now, I want to learn how to share my selfies.

🐢 Xvc can store file contents in another location. First you must set up a storage. Do you use AWS S3?

🐇 Yes. I have buckets there. I want to keep my selfies in my rabbit-hole.

🐢 You can configure Xvc to use it with the xvc storage new s3 command. You'll specify the region and bucket, and Xvc will prepare it.

🐇 types

$ xvc storage new s3 --name selfies --region eu-lepus-1 --bucket rabbit-hole

🐢 Now, you can send your files there with xvc file send --to selfies.

🐇 Is that all?

🐢 You will also need to push your Git files to another place. Do you have a GitHub account?

🐇 Ah, yeah, I have.

🐢 Now create a repository for your selfies. We will configure Git to use it as origin.

$ git remote add origin https://github.com/🐇/selfies
$ git push --set-upstream origin main

Now, you can share your selfies with your friends.

🐇 Cool, but how does Xvc know my AWS password? Does it share my passwords?

🐢 No, never. You must allow your friends to read that bucket of yours. Xvc reads the credentials from AWS configuration, either from the file or the environment variables.

🐇 How will they get my files?

🐢 First, they must clone the repository.

$ git clone https://github.com/🐇/selfies

Then, they can get all files with:

$ cd selfies
$ xvc file get .

🐇 Oh, cool, they don't have to xvc init again? Right?

🐢 No, they don't. Xvc should be initialized only once per repository. When you have new selfies, you can share them with:

$ xvc file track
$ git push

and your friends can receive the changes with

$ git pull
$ xvc file get

🐇 The order of these commands is important, it seems.

🐢 Yep. You add to Xvc first. Xvc automatically commits the changes to Git. Then you push Git changes to remote. Your friends first pull these changes, then get the actual files.

🐇 Thank you tortoise. Let me get back to my hole.

Xvc for Data

Xvc for Machine Learning

🐇 Ah, hello tortoise. How are you? I began to work as a machine learning engineer, you know? I'll be the fastest.

🐢 You're quick as always, hare. How is your job going so far?

🐇 It's good. We have lots and lots of data. We have models. We have scripts to create those models. We have notebooks full of experiments. That's all good stuff. We'll solve the hare intelligence problem.

🐢 Sounds cool. Aren't you losing yourself in all these, though?

🐇 From time to time we have those moments. Some models work with some data, some experiments require some kind of preprocessing, some data changed since we started to work with it and now we have multiple versions.

🐢 I see. I began to use a tool called Xvc. It may be of use to you.

🐇 What does it do?

🐢 It keeps track of all the stuff you mentioned. Data, models, scripts. It can also detect when data changes and run the scripts associated with that data.

🐇 That sounds like something we need. My boss wanted me to build a pipeline for cat pictures. He runs a contest for cat pictures. Every time he finds a new cat picture he likes, we have to update the model.

🐢 He must have lots of cat pictures.

🐇 He has. He sometimes finds higher resolution versions and replaces older pictures. He has terabytes of cat pictures.

🐢 How do you keep track of those versions?

🐇 We don't. We have a disk for cat pictures. He puts everything there and we train models with it.

🐢 You can use Xvc to version those files. You can go back and forth in time, or have different branches. It's based on Git.

🐇 I know, but Git is for code files, right? I never found a good way to store image files in Git. It stores everything.

🐢 Yep. Git keeps all history in each repository. Better to keep those terabytes of images away from Git. Otherwise, you'll have terabytes of cat pictures in each clone you use. Xvc helps there. It tracks contents of data files separately from Git. Image files are not put into Git objects, and they are not duplicated in all repositories.

🐇 You know, I'm not interested in details. Tell me how this works.

🐢 Ok. When you go back to the cat picture directory, create a Git repository and initialize Xvc immediately.

$ git init
...
$ xvc init
? 0

🐇 No messages?

🐢 Xvc is the silent type of Unix command. It follows the "no news is good news" principle. We use ? 0 to indicate the command return code. 0 means success. If you want more output, you can add -v as a flag. Increase the number of -vs to increase the detail.

🐇 So -vvvvvvvvvvvvvvv will show which atoms interact on the disk while running Xvc?

🐢 It may work, try that next time. Now, you can add your cat pictures to Xvc. Xvc makes copies of tracked files by default. I assume you have a large collection. Better to make everything symlinks for now. We can change how specific files are linked to cache later.

$ xvc -v file track --as symlink .

🐇 Does it track everything that way?

🐢 Yes. If you want to track only particular files or directories, you can replace . with their names.

🐇 What's the best recheck method for me?

🐢 If your file system supports it, reflink seems the best way to me. It's like a symlink but makes a copy when your file changes. Most of the widely used file systems don't support it though. If your files are read only and you don't have many links to the same files, you can use hardlink. If they are likely to change, you can use copy. If there are many links to the same files, better to use symlink.

🐇 So, symlinks are not the best? Why did you select it?

🐢 I suspect most of the files in your cat pictures are duplicates. Xvc stores only one copy of these in the cache and links all occurrences in the workspace to this copy. This is called deduplication. There are limits to the number of hardlinks, so I recommended symlinks. They are also more visible. You can see they are links. Hardlinks are harder to detect.

🐇 Ah, when I type ls -l, they all show the cache location now.

🐢 If you have a models/ directory and want to track them as copies, you can tell Xvc:

$ xvc file track --recheck-method copy models/

It replaces previous symlinks with the copies of the files only in models/.

🐇 Can I have my data read only and models writable?

🐢 You can. Xvc keeps track of each file's recheck-method separately. Data can stay in read-only symlinks, and models can be copied so they can be updated and stored as different versions.
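What the tortoise describes can be pictured with a short Python sketch. It only illustrates the idea of choosing a link type per file; the `recheck` function and the paths are hypothetical and this is not Xvc's code. Reflink is left out because the Python standard library has no portable call for it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, workspace: Path, method: str) -> None:
    """Place a cached file into the workspace with the given method."""
    if workspace.exists() or workspace.is_symlink():
        workspace.unlink()  # replace whatever was checked out before
    workspace.parent.mkdir(parents=True, exist_ok=True)
    if method == "copy":
        shutil.copy2(cached, workspace)  # independent, writable copy
    elif method == "hardlink":
        os.link(cached, workspace)  # second name for the same file
    elif method == "symlink":
        os.symlink(cached, workspace)  # visible pointer to the cache
    else:
        raise ValueError(f"unsupported method: {method}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cached = root / "cache" / "abc123"  # hypothetical cache entry
        cached.parent.mkdir()
        cached.write_bytes(b"cat picture")
        # The same cached content is linked twice and copied once.
        recheck(cached, root / "data" / "cat-1.jpg", "symlink")
        recheck(cached, root / "data" / "cat-1-again.jpg", "symlink")
        recheck(cached, root / "models" / "model.bin", "copy")
        print((root / "data" / "cat-1.jpg").is_symlink())
        print((root / "models" / "model.bin").is_symlink())
```

The two symlinks share one cached copy, which is the deduplication mentioned above, while the copied file can be modified without touching the cache.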

🐇 I also have scripts, what should I do with them?

🐢 Are you using Git for them?

🐇 Yep. They are in a separate repository. I think I can use the same repository now.

🐢 You can. Better to keep them in the same repository. They can be versioned with the data they use and models they produce. You can use standard Git commands to track them. If you track a file with Git, Xvc doesn't track it. It stays away from it.

🐇 You said we can create pipelines with Xvc as well. I created a multi-stage pipeline for cat picture models. It's like this:

graph LR
    cats["data/cats/"] --> pp-train["preprocess.py --train data/pp-train/"]
    pp-train --> train["train.py"]
    params["params.yaml"] --> train
    cat-ratings["cat-ratings.txt"] --> train
    train --> model["models/model.bin"]
    cats --> pp-test["preprocess.py --test data/pp-test/"]
    model --> test["test.py"]
    pp-test --> test
    test --> metrics["metrics.json"]
    test --> best-model["best-model.json"]
    best-model --> deploy["deploy.sh"]

🐢 It looks like a fairly complex pipeline. You can create a pipeline definition for it. For each separate command we'll have a step. How many different commands do you have?

🐇 A preprocess --train command, a preprocess --test command, a train command, a test command and a deploy command. Five.

🐢 Do you need more than one pipeline? Maybe you would like to put deployment to another pipeline?

🐇 No, I don't think so. I may in the future.

🐢 Xvc has a default pipeline. We'll use it for now. If you need more pipelines, you can create them with xvc pipeline new.

🐇 How do I create steps for commands?

🐢 Let's create the steps at once. Each step requires a name and a command.

$ xvc pipeline step new --step-name preprocess-train --command 'python3 src/preprocess.py --train data/cats data/pp-train/'

$ xvc pipeline step new --step-name preprocess-test --command 'python3 src/preprocess.py --test data/cats data/pp-test/'

$ xvc pipeline step new --step-name train --command 'python3 src/train.py data/pp-train/'

$ xvc pipeline step new --step-name test --command 'python3 src/test.py data/pp-test/ metrics.json'

$ xvc pipeline step new --step-name deploy --command 'python3 deploy.py models/model.bin /var/server/files/model.bin'

🐇 How do we define dependencies?

🐢 You can have many different types of dependencies. All are defined by the xvc pipeline step dependency command. You can set up direct dependencies between steps: if one is invalidated, its dependents also run. You can set up file dependencies: if the file changes, the step is invalidated and needs to run. There are other, more detailed dependencies like parameter dependencies, which take a file in JSON or YAML format and check whether a value has changed. There are also regular expression dependencies. For example, if you have a piece of code in your training script that you change to update the parameters, you can define a regex dependency.

🐇 It looks like I can use this for CSV files as well.

🐢 Yes. If your step depends not on the whole CSV file but only on specific rows, you can use regex dependencies. You can also specify line numbers of a file to depend on.
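To see why such dependencies ignore unrelated edits, here is a rough Python sketch. It assumes that a regex or line dependency boils down to a digest over the selected lines only; the function names are hypothetical and Xvc's internals may differ.

```python
import hashlib
import re


def regex_digest(text: str, pattern: str) -> str:
    """Digest only the lines that match the pattern."""
    regex = re.compile(pattern)
    matching = [line for line in text.splitlines() if regex.search(line)]
    return hashlib.sha256("\n".join(matching).encode()).hexdigest()


def lines_digest(text: str, start: int, end: int) -> str:
    """Digest lines start..end (1-based, inclusive)."""
    selected = text.splitlines()[start - 1:end]
    return hashlib.sha256("\n".join(selected).encode()).hexdigest()


if __name__ == "__main__":
    before = "5,tom\n3,felix\n"
    edited = "5,tom\n4,felix\n"        # a non-5-star row changed
    extended = "5,tom\n3,felix\n5,luna\n"  # a new 5-star row arrived
    five_star = r"^5,"
    print(regex_digest(before, five_star) == regex_digest(edited, five_star))
    print(regex_digest(before, five_star) == regex_digest(extended, five_star))
```

Editing a row that doesn't match leaves the digest unchanged, so the step stays valid. A new matching row changes the digest and invalidates the step.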

🐇 My preprocess.py script depends on the data/cats directory. My train.py script depends on params.yaml for some hyperparameters, and reads 5-star ratings from cat-contest.txt. I want to deploy when the newly produced model is better than the older one by checking best-model.json. My deployment script doesn't update the deployment if the new model is not the best.

🐢 Let's see. For each step, you can use a single command to define its dependencies. For preprocess.py you'll depend on the data directory and the script itself. We want to run the step when the script changes. It's like this:

$ xvc pipeline step dependency --step-name preprocess-train --glob 'data/cats/*' --file src/preprocess.py

$ xvc pipeline step dependency --step-name preprocess-test --glob 'data/cats/*' --file src/preprocess.py

$ xvc pipeline step dependency --step-name train --glob 'data/pp-train/*' --file src/train.py --param 'params.yaml::learning_rate' --regex 'cat-contest.csv:/^5,.*'

$ xvc pipeline step dependency --step-name test --glob 'models/*' --glob 'data/pp-test/*'

$ xvc pipeline step dependency --step-name deploy --file best-model.json

You must also define the outputs these steps produce, so that when an output is missing or a dependency is newer than the output, the step needs to rerun.

$ xvc pipeline step output --step-name preprocess-train --directory data/pp-train
? 2
error: unexpected argument '--directory' found

Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>

For more information, try '--help'.

$ xvc pipeline step output --step-name preprocess-test --directory data/pp-test
? 2
error: unexpected argument '--directory' found

Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>

For more information, try '--help'.

$ xvc pipeline step output --step-name train --output-file models/model.bin

$ xvc pipeline step output --step-name test --output-file metrics.json --output-file best-model.json

$ xvc pipeline step output --step-name deploy --output-file /var/server/files/model.bin

🐇 These commands become too long to type. You know, I'm a lazy hare and don't like to type much. Is there an easier way?

🐢 You can try source <(xvc aliases) in your Bash or Zsh, and get a bunch of aliases for these commands. xvc pipeline step output becomes xvcpso, xvc pipeline step dependency becomes xvcpsd, etc. You can see the whole list:

$ xvc aliases

alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfco='xvc file checkout'
alias xvcfr='xvc file recheck'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'

🐇 Oh, there are many more commands.

🐢 Yep, and more to come. To edit a pipeline, you can use xvc pipeline export and, after making the changes, xvc pipeline import.

🐇 I don't need to delete the pipeline to rewrite everything, then?

🐢 You can export a pipeline, edit and import with a different name to test. When you want to run them, you specify their names.

🐇 Ah, yeah, that's the most important part. How do I run?

🐢 xvc pipeline run, or xvcpr. It takes the name of the pipeline and runs it. It sorts the steps and checks whether there are any cycles. The steps mustn't have cycles, otherwise it's an infinite loop, and computers don't like infinite loops like turtles do. Xvc runs steps in parallel if they don't depend on each other.

🐇 So, if I have multiple preprocessing steps that don't depend on each other, they can run in parallel?

🐢 Yeah, they run in parallel. For example in your pipeline preprocess-train and preprocess-test can run in parallel, because they don't depend on each other.
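The sorting and cycle check the tortoise mentions can be sketched with Python's standard graphlib module. This is a simplified illustration, not Xvc's scheduler: it only encodes the step-to-step order implied by the hare's pipeline and groups steps into batches whose members don't depend on each other.

```python
from graphlib import CycleError, TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches; steps in one batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the steps form a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Each step maps to the steps it must wait for (assumed from the diagram).
STEPS = {
    "preprocess-train": set(),
    "preprocess-test": set(),
    "train": {"preprocess-train"},
    "test": {"train", "preprocess-test"},
    "deploy": {"test"},
}

if __name__ == "__main__":
    for batch in parallel_batches(STEPS):
        print(batch)
    try:
        parallel_batches({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected")
```

The first batch holds both preprocessing steps, and the remaining steps follow one at a time because each waits for the previous one.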

🐇 Cool. I want to see the pipeline we created.

🐢 You can see it with xvc pipeline dag (xvcpd). It prints a mermaid.js diagram that you can paste into your files.

🐇 Better to have an image of this, maybe.

🐢 I'll inform the developer about it. Please tell him anything you'd like to see in the tool on GitHub or via email. He's extremely introverted but tries to be a nice guy.

🐇 Ah, ok, I'll write to him about this.

Xvc for Software Development

Xvc for DVC Users

DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.

Note that this document refers mostly to Xvc v0.6 and DVC 2.30. Both commands are in development, and similarities and differences may change in time.

Similarities

The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage data, pipelines and experiments of an ML project.

Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC.

Xvc has the same optional and recommended reliance on Git, but all features are available without Git. Xvc uses Git through its CLI like a user, without any reliance on a particular library.

Both of these commands hash file content to detect changes in files.

Both of these use DAGs to represent pipelines.

Conceptual Differences

stage vs. step: What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word in a different meaning increases the mental effort to describe and understand.

remote vs. storage: What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.

pipeline definitions: In DVC, there is a 1-1 correspondence between dvc.yaml files in a repository and the pipelines. When you want to create a new pipeline, you create a new file in DVC.

In Xvc, pipelines are abstract. They are defined with xvc pipeline family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.

Files in the user workspace: DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with dvc add, DVC creates a .dvc file next to it. Xvc only creates a .xvc/ directory in the repository root and only updates .gitignore files to hide tracked files from Git. You won't see any files added next to your data files.

cache-type vs. recheck-method: In DVC, the cache type (what Xvc calls the recheck method), that is, whether a file in the repository is linked to its cached version by copy, reflink, symlink or hardlink, is determined repository-wide. You can either have all your cache links as symlinks, or all as hardlinks, etc. Xvc tracks this per file: you can have one file symlinked to the cache, another file copied from the cache, etc.

Command Differences

Warning

Some of the Xvc commands described here are still in development.

While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both git push and dvc push commands may look beneficial for understanding at first, as these two are analogous. However, giving the same name also hides important details that are more difficult to emphasize later. (e.g. DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes.)

dvc add can be replaced by xvc file track. dvc add creates a .dvc file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.

Instead of deleting .dvc files to remove a file from DVC, you can use xvc file untrack. It can also restore all versions of an untracked file to a directory.

dvc check-ignore can be replaced by xvc check-ignore. The Xvc version can be used against any other ignore filename (.gitignore, .ignore, .fooignore...).

dvc checkout is replaced by xvc file recheck. There is a --recheck-method (shortened as --as) option in several Xvc commands to tell whether to check out as symlink, hardlink, reflink or copy.

dvc commit is replaced by xvc file carry-in. They both cache the files if they are changed.

There is no command similar to dvc config. You can either edit the configuration files, or modify configuration with -c options in each run. You can also supply all configuration from the environment. See Configuration.

dvc dag is replaced by xvc pipeline dag. DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art, instead provides either a Graphviz representation or mermaid diagram.

dvc data status and dvc status can be replaced by xvc file list. Xvc version doesn't provide information about the pipelines, or remote storages.

There is no command similar to dvc destroy in Xvc. There will be an xvc deinit command at some point. Until then, you can just delete the .xvc/ directory and all .xvcignore files in your repository to destroy it.

There is no command similar to dvc diff in Xvc.

There is no command similar to dvc doctor or dvc version. Version information should be visible in the help text. Unless compiled from source with feature flags, Xvc binaries don't have feature differences.

Currently, there are no commands corresponding to dvc exp set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.

dvc fetch is replaced by xvc file bring --no-recheck.

Instead of freezing "pipeline stages" as in dvc freeze, and unfreezing with dvc unfreeze, xvc pipeline step update --changed [never|always|by_dependencies] can be used to specify if/when to run a pipeline step.

Instead of dvc gc to "garbage-collect" files, you can use xvc file remove with various options.

There is no corresponding command for dvc get-url in Xvc. You can use wget or curl instead.

Currently there is no command to replace dvc get and dvc import, and dvc import-url. URL dependencies are supported in the pipeline with xvc pipeline step dependency --url.

Instead of hooks like those installed by dvc install, Xvc issues Git commands itself if the git.auto_commit, git.auto_stage configuration options are set.

There is no corresponding command for dvc list-url.

dvc list is replaced by xvc file list for local paths. Its remote capabilities are not implemented but are on the roadmap.

Xvc doesn't mix files from different repositories in the same storage. There is an ID for each Xvc repo that's also used in remote storage paths.

Currently, there is no params/metrics tracking/diff similar to dvc params, dvc metrics or dvc plots commands in Xvc.

dvc move is replaced by xvc file move.

dvc push is replaced by xvc file send.

dvc pull is replaced by xvc file bring.

There are no commands similar to dvc queue for experiments in Xvc. Experiment tracking will probably be handled differently.

dvc remote set of commands are replaced by xvc storage set of commands. You can use xvc storage new for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of dvc remote modify, you can use xvc storage remove and xvc storage new.

There is no single command to replace dvc remove. For files, you can use xvc file delete. For pipeline steps, you can use xvc pipeline step remove.

Instead of dvc repro, Xvc has xvc pipeline run. If you want to reproduce a pipeline, you can use xvc pipeline run again.

xvc root is for the same purpose as dvc root.

dvc run (that defines a stage in DVC pipeline and immediately runs it) can be replaced by xvc pipeline set of commands. xvc pipeline new for a new pipeline, xvc pipeline step new for a new step in the pipeline, xvc pipeline step dependency to specify dependencies of a step, xvc pipeline step output to specify outputs of a step and xvc pipeline run to run this pipeline.

Instead of dvc stage add, we have xvc pipeline step new. For dvc stage list, we have xvc pipeline step list.

There is no need for dvc protect or dvc unprotect commands in Xvc. The "cache type" of Xvc is called "recheck method" and is not a repository-wide option. If you want to track a certain directory as symlink, and another as hardlink, you can do so with xvc file recheck --as. If you want identical files copied to one directory and linked in another, xvc file copy can help.

DVC needs dvc update for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading and automatically invalidates the step if the URL or file has changed.

DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.

Extra Features of Xvc

Xvc can use multiple hashing functions, like BLAKE3, BLAKE2s, SHA2-256 and SHA3-256. More can be added upon request. The only requirement for hashes is having 32 bytes (256 bits) of output, that is, 64 hex digits.
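The common output size is easy to check with Python's standard hashlib, which is used here only for illustration. BLAKE3 is missing from the table because hashlib doesn't include it; the `digest_sizes` helper is a name made up for this sketch.

```python
import hashlib

# Three of the listed functions as provided by the standard library.
ALGORITHMS = {
    "BLAKE2s": hashlib.blake2s,
    "SHA2-256": hashlib.sha256,
    "SHA3-256": hashlib.sha3_256,
}


def digest_sizes(data: bytes) -> dict[str, int]:
    """Return the digest length in bytes for each algorithm."""
    return {name: len(make(data).digest()) for name, make in ALGORITHMS.items()}


if __name__ == "__main__":
    for name, size in digest_sizes(b"cat picture").items():
        print(f"{name}: {size} bytes, {size * 8} bits")
```

Every entry reports 32 bytes, so digests from any of these functions fit the same fixed-size slot.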

In its pipelines, Xvc has more flexibility in defining dependencies. DVC supports files, directories and hyperparameters. Xvc additionally supports:

  • globs,
  • text file lines defined by line numbers,
  • text file lines defined by regular expressions,
  • URLs,
  • SQLite queries.

Technical Differences

  • DVC is written in Python. Xvc is written in Rust.

  • DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.

  • DVC tracks file/directory changes in separate .dvc files. Xvc tracks them in .json files in .xvc/store. There is no 1-1 correspondence between these files and the directory structure.

  • DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (xvc-ecs) in its core.

  • DVC remotes are identical to their cache in structure, and multiple DVC repositories use the same remote by mixing files. This provides inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated, but when you want to delete all files associated with a repository, you can do so without the risk of deleting files used in other repositories.

  • DVC considers directories as file-equivalent entities to track with .dvc files pointing to .json files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.

  • DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.

Benchmarking Xvc vs DVC

In this section, we'll write a few tests to see how Xvc and DVC perform in common tasks. This document is meant to be reproducible, so that the differences in performance can be checked. I'll update it from time to time to follow these differences, and I'll also add more tests.

This is mostly to satisfy my personal curiosity. I don't claim these are scientific experiments that describe the performance in all conditions.

We'll test the tools in the following scenarios:

  • Checking in small files: We'll unzip 15,000 images from the Chinese-MNIST dataset and measure the time for dvc add and xvc file track
  • Checking out small files: We'll delete the files we track and recheck / checkout them using dvc checkout and xvc file recheck
  • Pushing/sending the small files we added to S3
  • Pulling/bringing the small files we pushed from S3
  • Checking in and out large files: We'll create 100 large files using xvc-test-helper and repeat the above tests.
  • Running small pipelines: We'll create a pipeline with 10 steps to run simple commands.
  • Running medium sized pipelines: We'll create a pipeline with 100 steps to run simple commands.
  • Running large pipelines: We'll create a pipeline with 1000 steps to run simple commands.

Setup

This document uses the most recent versions of Xvc and DVC. DVC is installed via Homebrew.

$ dvc --version
3.30.3

$ xvc --version
xvc v0.6.4-alpha.0-300-g08c034a-modified

Init Repositories

Let's start by measuring the performance of initializing repositories.

$ git init
Initialized empty Git repository in [CWD]/.git/

$ hyperfine -r 1 'xvc init'
Benchmark 1: xvc init
  Time (abs ≡):         48.6 ms               [User: 11.0 ms, System: 21.3 ms]
 

$ hyperfine -r 1 'dvc init ; git add .dvc/ .dvcignore ; git commit -m "Init DVC"'
Benchmark 1: dvc init ; git add .dvc/ .dvcignore ; git commit -m "Init DVC"
  Time (abs ≡):        425.3 ms               [User: 205.7 ms, System: 86.3 ms]
 

$ git status -s
?? chinese_mnist.zip

Unzip the images

$ unzip -q chinese_mnist.zip
$ zsh -cl 'cp -r data/data xvc-data'
$ zsh -cl 'cp -r data/data dvc-data'
$ tree -d
.
├── data
│   └── data
├── dvc-data
└── xvc-data

5 directories

15K Small Files Performance

Xvc commits the changed metafiles automatically unless otherwise specified in the options. In the DVC command below, we also commit *.dvc files.

$ hyperfine -r 1 'xvc file track xvc-data/'
Benchmark 1: xvc file track xvc-data/
  Time (abs ≡):         3.655 s               [User: 0.931 s, System: 12.339 s]
 

$ hyperfine -r 1 --show-output 'dvc add dvc-data/ '
Benchmark 1: dvc add dvc-data/ 

To track the changes with git, run:

	git add .gitignore dvc-data.dvc

To enable auto staging, run:

	dvc config core.autostage true
  Time (abs ≡):        13.027 s               [User: 4.740 s, System: 6.765 s]
 

$ lsd -l

$ git status -s
 M .gitignore
?? chinese_mnist.zip
?? data/
?? dvc-data.dvc

Checkout a directory with 15K files

$ rm -rf xvc-data

$ hyperfine -r 1 'xvc file recheck xvc-data/'
Benchmark 1: xvc file recheck xvc-data/
  Time (abs ≡):         2.378 s               [User: 0.438 s, System: 2.152 s]
 

$ rm -rf dvc-data/

$ ls 
chinese_mnist.zip
data
dvc-data.dvc
xvc-data

$ hyperfine -r 1 --show-output 'dvc checkout dvc-data.dvc'
Benchmark 1: dvc checkout dvc-data.dvc
A       dvc-data/
  Time (abs ≡):         4.102 s               [User: 1.399 s, System: 2.155 s]
 

Large File Performance

$ zsh -cl 'dd if=/dev/urandom of=xvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.669660 secs (628017680 bytes/sec)

$ hyperfine -r 1 'xvc file track xvc-large-file'
Benchmark 1: xvc file track xvc-large-file
  Time (abs ≡):         1.499 s               [User: 0.816 s, System: 0.805 s]
 

$ zsh -cl 'dd if=/dev/urandom of=dvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.446919 secs (724695716 bytes/sec)

$ hyperfine -r 1 --show-output 'dvc add dvc-large-file ; git add dvc-large-file.dvc .gitignore ; git commit -m "Added dvc-large-file to DVC"'
Benchmark 1: dvc add dvc-large-file ; git add dvc-large-file.dvc .gitignore ; git commit -m "Added dvc-large-file to DVC"

To track the changes with git, run:

	git add dvc-large-file.dvc .gitignore

To enable auto staging, run:

	dvc config core.autostage true
[main 72fd199] Added dvc-large-file to DVC
 2 files changed, 6 insertions(+)
 create mode 100644 dvc-large-file.dvc
  Time (abs ≡):         2.153 s               [User: 1.906 s, System: 0.203 s]
 

Commit/Carry-in Large Files

$ zsh -cl 'dd if=/dev/urandom of=xvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.550065 secs (676472277 bytes/sec)

$ hyperfine -r 1 'xvc file carry-in xvc-large-file'
Benchmark 1: xvc file carry-in xvc-large-file
  Time (abs ≡):         1.024 s               [User: 0.629 s, System: 0.393 s]
 

$ zsh -cl 'dd if=/dev/urandom of=dvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.550363 secs (676342250 bytes/sec)

$ hyperfine -r 1 --show-output 'dvc add dvc-large-file ; git add dvc-large-file.dvc ; git commit -m "Added dvc-large-file to DVC"'
Benchmark 1: dvc add dvc-large-file ; git add dvc-large-file.dvc ; git commit -m "Added dvc-large-file to DVC"

To track the changes with git, run:

	git add dvc-large-file.dvc

To enable auto staging, run:

	dvc config core.autostage true
[main c74d783] Added dvc-large-file to DVC
 1 file changed, 1 insertion(+), 1 deletion(-)
  Time (abs ≡):         2.098 s               [User: 1.903 s, System: 0.189 s]
 

Pipeline with 10 Steps

Pipeline steps will depend on the following files.

$ xvc-test-helper create-directory-tree --directories 1 --files 10  --root pipeline-10

$ tree pipeline-10
pipeline-10
└── dir-0001
    ├── file-0001.bin
    ├── file-0002.bin
    ├── file-0003.bin
    ├── file-0004.bin
    ├── file-0005.bin
    ├── file-0006.bin
    ├── file-0007.bin
    ├── file-0008.bin
    ├── file-0009.bin
    └── file-0010.bin

2 directories, 10 files

Let's create 10 DVC stages to depend on these files:

$ zsh -cl "for f in pipeline-10/dir-0001/* ; do dvc stage add -q -n ${f:r:t} -d ${f} 'sha1sum $f'; done"

$ dvc stage list
file-0001  Depends on pipeline-10/dir-0001/file-0001.bin
file-0002  Depends on pipeline-10/dir-0001/file-0002.bin
file-0003  Depends on pipeline-10/dir-0001/file-0003.bin
file-0004  Depends on pipeline-10/dir-0001/file-0004.bin
file-0005  Depends on pipeline-10/dir-0001/file-0005.bin
file-0006  Depends on pipeline-10/dir-0001/file-0006.bin
file-0007  Depends on pipeline-10/dir-0001/file-0007.bin
file-0008  Depends on pipeline-10/dir-0001/file-0008.bin
file-0009  Depends on pipeline-10/dir-0001/file-0009.bin
file-0010  Depends on pipeline-10/dir-0001/file-0010.bin

Run the DVC pipeline

$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
  Time (abs ≡):        766.8 ms               [User: 482.4 ms, System: 218.7 ms]
 

Running without changing the dependencies

$ hyperfine -M 5 "dvc repro"
Benchmark 1: dvc repro
  Time (mean ± σ):     455.8 ms ±  22.6 ms    [User: 342.3 ms, System: 107.4 ms]
  Range (min … max):   431.0 ms … 492.3 ms    5 runs
 

$ zsh -cl "for f in pipeline-10/dir-0001/* ; do xvc pipeline step new -s ${f:r:t} --command 'sha1sum $f' ; xvc pipeline step dependency -s ${f:r:t} --file ${f} ; done"

$ hyperfine -r 1 "xvc pipeline run"
Benchmark 1: xvc pipeline run
  Time (abs ≡):        229.8 ms               [User: 53.9 ms, System: 227.3 ms]
 

$ hyperfine -M 5 "xvc pipeline run"
Benchmark 1: xvc pipeline run
  Time (mean ± σ):     176.8 ms ±   4.0 ms    [User: 34.6 ms, System: 144.1 ms]
  Range (min … max):   173.0 ms … 183.0 ms    5 runs
 

Pipeline with 100 Steps

Pipeline steps will depend on the following files.

$ xvc-test-helper create-directory-tree --directories 1 --files 100 --root pipeline-100

$ tree -d pipeline-100
pipeline-100
└── dir-0001

2 directories

$ rm -f dvc.yaml

$ zsh -cl "for f in pipeline-100/dir-0001/* ; do dvc stage add -q -n s-${RANDOM} -d ${f} 'sha1sum $f'; done"

$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
  Time (abs ≡):        10.383 s               [User: 8.813 s, System: 1.072 s]
 

$ hyperfine -M 5 "dvc repro"
Benchmark 1: dvc repro
  Time (mean ± σ):     637.3 ms ±   9.8 ms    [User: 467.4 ms, System: 161.1 ms]
  Range (min … max):   630.2 ms … 654.3 ms    5 runs
 

Let's create 100 Xvc steps to depend on the same files.

$ xvc pipeline new --pipeline-name p100

$ zsh -cl "for f in pipeline-100/dir-0001/* ; do xvc pipeline -p p100 step new -s ${f:r:t} --command 'sha1sum $f' ; xvc pipeline -p p100 step dependency -s ${f:r:t} --file ${f} ; done"

$ hyperfine -r 1 --show-output "xvc pipeline -p p100 run" 
Benchmark 1: xvc pipeline -p p100 run
  Time (abs ≡):        201.9 ms               [User: 39.6 ms, System: 168.4 ms]
 

$ hyperfine -M 5 "xvc pipeline -p p100 run"
Benchmark 1: xvc pipeline -p p100 run
  Time (mean ± σ):     198.7 ms ±   3.1 ms    [User: 39.9 ms, System: 163.9 ms]
  Range (min … max):   196.0 ms … 203.8 ms    5 runs
 

Note that the first runs of the commands are drastically different. DVC runs all stages sequentially in around 10.4 seconds, while Xvc runs them in parallel in 0.2 seconds. Let's also measure the average run time of a sha1sum command to see how much of this time is spent in the actual commands.

$ hyperfine 'sha1sum pipeline-100/dir-0001/file-0001.bin'
Benchmark 1: sha1sum pipeline-100/dir-0001/file-0001.bin
  Time (mean ± σ):       1.2 ms ±   0.2 ms    [User: 0.4 ms, System: 0.5 ms]
  Range (min … max):     0.9 ms …   2.7 ms    535 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
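
As a rough back-of-the-envelope check, the sketch below relates the measured sha1sum time to the first-run times of the 100-step pipelines. All numbers are copied from the hyperfine outputs in this section. The comparison is only indicative: these are single runs, and hyperfine itself warns that timings below 5 ms may be inaccurate.

```python
# Numbers copied from the hyperfine outputs in this section (milliseconds).
SHA1SUM_MEAN_MS = 1.2
STEPS = 100
DVC_FIRST_RUN_MS = 10_383.0
XVC_FIRST_RUN_MS = 201.9

def command_time_ms(steps: int, mean_ms: float) -> float:
    """Total time the commands would take if run one after another."""
    return steps * mean_ms

total_cmd = command_time_ms(STEPS, SHA1SUM_MEAN_MS)
dvc_share = total_cmd / DVC_FIRST_RUN_MS
speedup = DVC_FIRST_RUN_MS / XVC_FIRST_RUN_MS

print(f"sequential command time: {total_cmd:.0f} ms")
print(f"share of DVC first run:  {dvc_share:.1%}")
print(f"DVC / Xvc first run:     {speedup:.1f}x")
```

Running the 100 commands back to back would take about 120 ms, so nearly all of the first DVC run is spent outside the commands themselves.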
 

Pipeline with 1000 Steps

In this case we'll just measure the run times of 1000 ls commands.

$ rm -f dvc.yaml

$ zsh -cl "for i in {1..1000}; do dvc stage add -q -n s-${i} 'ls'; done"

$ zsh -cl 'dvc stage list | wc -l'
    1000

$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
  Time (abs ≡):        469.534 s               [User: 449.463 s, System: 17.257 s]
 

$ hyperfine -M 5 "dvc repro"
? interrupted
Benchmark 1: dvc repro

$ xvc pipeline new --pipeline-name p1000

$ zsh -cl "for i in {1..1000} ; do xvc --skip-git pipeline -p p1000 step new -s s-${i} --command 'ls' ; done"

$ zsh -cl 'xvc pipeline step list --names-only | wc -l'
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
      10

$ hyperfine -r 1 --show-output "xvc pipeline -p p1000 run" 
Benchmark 1: xvc pipeline -p p1000 run
  Time (abs ≡):        460.0 ms               [User: 78.7 ms, System: 376.8 ms]
 

$ hyperfine -M 5 "xvc pipeline -p p1000 run"
Benchmark 1: xvc pipeline -p p1000 run
  Time (mean ± σ):     404.5 ms ±  10.6 ms    [User: 79.0 ms, System: 366.7 ms]
  Range (min … max):   397.4 ms … 423.2 ms    5 runs
 

How-To Guides

How to Compile Xvc

Why would you compile?

  • You want to use Xvc on a platform for which we don't distribute a binary.
  • You want a smaller binary size by removing features that you don't use.
  • You like your software compiled.
  • It's easier to use cargo than other means to install for you.
  • You want to fix a bug yourself.
  • Contribute!

Install Rust

You must have Rust installed on your system.

If you have a sensible terminal on your system:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Otherwise, refer to the other installation methods page.

Clone the repository

Clone the repository from Emre's GitHub account.

$ git clone https://github.com/iesahin/xvc -b latest

The latest tag refers to the latest stable release. If you're willing to fight with compilation errors, you can also use the main branch directly.

Compile without default features

Xvc with Git Branches

When you're working with multiple branches in Git, you may ask Xvc to checkout a branch and commit to another branch. These operations are performed at the beginning, and at the end of Xvc operations. You can use --from-ref and --to-branch options to checkout a Git reference before an Xvc operation, and commit the results to a certain Git branch.

Checkout and commit operations sandwich Xvc operations.

graph LR
   checkout["git checkout $REF"] --> xvc
   xvc["xvc operation"] --> stash["git stash --staged"]
   stash --> branch["git checkout -b $TO_BRANCH"]
   branch --> commit["git add .xvc && git commit"]

If --from-ref is not given, initial git checkout is not performed. Xvc operates in the current branch. This is the default behavior.
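
The ordering in the diagram can be sketched as a small function that lists the commands for a given combination of options. This is an illustration of the sequence only, not Xvc's internals. The function name plan_git_steps is hypothetical, and skipping the stash and branch switch when --to-branch is absent is an assumption.

```python
from typing import List, Optional

def plan_git_steps(xvc_args: str,
                   from_ref: Optional[str] = None,
                   to_branch: Optional[str] = None) -> List[str]:
    """List the commands in the order the diagram describes."""
    steps = []
    if from_ref is not None:
        # Only performed when --from-ref is given.
        steps.append(f"git checkout {from_ref}")
    steps.append(f"xvc {xvc_args}")
    if to_branch is not None:
        # Assumption: without --to-branch these two steps are skipped
        # and the commit lands on the current branch.
        steps.append("git stash --staged")
        steps.append(f"git checkout -b {to_branch}")
    steps.append("git add .xvc && git commit")
    return steps

for step in plan_git_steps("pipeline run",
                           from_ref="data-file",
                           to_branch="uppercase"):
    print(step)
```

With both options given, the Xvc operation sits between the initial checkout and the final branch switch and commit.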

$ git init --initial-branch=main
...
$ xvc init
? 0

$ ls
data.txt

$ xvc --to-branch data-file file track data.txt
Switched to a new branch 'data-file'

$ git branch
* data-file
  main

$ git status -s

$ xvc file list data.txt
FC          19 2023-06-08 11:47:18 c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


If you return to main branch, you'll see the file is tracked by neither Git nor Xvc.

$ git checkout main
...
$ xvc file list data.txt
FX          19 2023-06-08 11:47:18          c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:           0


$ git status -s
?? data.txt

Now, we'll add a step to the default pipeline to get an uppercase version of the data. We want this to work only in the data-file branch.

$ xvc --from-ref data-file pipeline step new --step-name to-uppercase --command 'cat data.txt | tr a-z A-Z > uppercase.txt'
Switched to branch 'data-file'

$ xvc pipeline step dependency --step-name to-uppercase --file data.txt

$ xvc pipeline step output --step-name to-uppercase --output-file uppercase.txt

Note that xvc pipeline step dependency and xvc pipeline step output commands don't need --from-ref and --to-branch options, as they run in data-file branch already.

Now, we want to have this new version of data available only in uppercase branch.

$ xvc --from-ref data-file --to-branch uppercase pipeline run
Already on 'data-file'
[DONE] to-uppercase (cat data.txt | tr a-z A-Z > uppercase.txt)
Switched to a new branch 'uppercase'

$ git branch
  data-file
  main
* uppercase

You can use this for experimentation. Whenever you have a pipeline that you want to run and keep the results in another Git branch, you can use --to-branch for experimentation.

$ xvc --from-ref data-file --to-branch another-uppercase pipeline run
$ git branch
* another-uppercase
uppercase
data-file
main

The pipeline always runs, because in data-file branch uppercase.txt is always missing. It's stored only in the resulting branch you give by --to-branch.

Turning off Automated Git Operations

By default Xvc automates all common git operations. When you run an Xvc operation that affects the files under .xvc directory, the changes are committed to the repository automatically.

Git automation runs in Git repositories.

$ git init
Initialized empty Git repository in [CWD]/.git/

$ xvc init

We'll show these examples in the following directory tree.

$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20231012
$ tree
.
└── dir-0001
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin

2 directories, 3 files

When you begin to track a file in the repository, Xvc adds the file to the .gitignore in the directory where the file is found.

$ xvc file track dir-0001/file-0001.bin

$ zsh -cl 'cat dir-0001/.gitignore'
### Following 1 lines are added by xvc on [..]
/file-0001.bin

Xvc also adds a commit for all the changes caused by the command.

$ git log -n 1
commit [..]
Author: [..]
Date:   [..]

    Xvc auto-commit after '[..]xvc file track dir-0001/file-0001.bin'

The commit message includes the command you ran, so you can find the exact change in the history.

If you don't track a file with Xvc, it is not added to .gitignore and you can see it with git status.

$ git status -s
?? dir-0001/file-0002.bin
?? dir-0001/file-0003.bin

If you want to skip these automated Git operations, you can add the --skip-git flag to commands.

$ xvc --skip-git file track dir-0001/file-0002.bin

$ git status -s
 M dir-0001/.gitignore
?? .xvc/ec/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? dir-0001/file-0003.bin

Note that the --skip-git flag doesn't prevent files from being added to .gitignore files.

$ zsh -cl 'cat dir-0001/.gitignore'
### Following 1 lines are added by xvc on [..]
/file-0001.bin
### Following 1 lines are added by xvc on [..]
/file-0002.bin

You can use the usual Git workflow to add and commit the files.

$ git add .xvc dir-0001/.gitignore
$ git commit -m "Began to track dir-0001/file-0002.bin with Xvc"
[main [..]] Began to track dir-0001/file-0002.bin with Xvc
 7 files changed, 8 insertions(+)
 create mode 100644 .xvc/ec/[..]
 create mode 100644 .xvc/store/[..].json
 create mode 100644 .xvc/store/[..].json
 create mode 100644 .xvc/store/[..].json
 create mode 100644 .xvc/store/[..].json
 create mode 100644 .xvc/store/[..].json

If you never want Xvc to handle commits, you can set git.use_git option in .xvc/config file to false or set XVC_git.use_git=false in the environment.

$ XVC_git.use_git=false xvc file track dir-0001/file-0003.bin

$ git status -s
 M dir-0001/.gitignore
?? .xvc/ec/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]

How to create a data pipeline with Xvc

A data pipeline starts from data and ends with models. In between, there are various data transformations and model training steps. We try to make all pieces reproducible and Xvc helps with this goal.

In this document, we'll create the following pipeline for a digit recognition system. Our purpose is to show how Xvc helps in versioning data, so this document doesn't try to achieve a high classification performance.

graph LR
  A[Data Gathering] --> B[Splitting Test and Train Sets]
  B --> C[Preprocessing Images into Numpy Arrays]
  C --> D[Training Model]
  D --> E[Sharing Data and Models]

Info

This document can be more verbose than usual, because all commands in this document are run on a clean directory during tests to check outputs. Some of the idiosyncrasies, e.g., running certain commands with zsh -c are due to this reason.

Although you can do without, most of the time Xvc runs in a Git repository. This allows you to version control both the data and the code together.

$ git init
Initialized empty Git repository in [CWD]/.git/

$ xvc init

In this HOWTO, we use the Chinese MNIST dataset to create an image classification pipeline. We already downloaded it from Kaggle.

$ ls -l
total 21112
-rw-r--r--  1 iex  staff  10792680 Nov 17 19:46 chinese_mnist.zip
-rw-r--r--  1 iex  staff      1124 Nov 28 14:27 image_to_numpy_array.py
-rw-r--r--  1 iex  staff        40 Dec  1 11:59 requirements.txt
-rw-r--r--  1 iex  staff      4436 Dec  1 22:52 train.py

Let's start by tracking the data file with Xvc.

$ xvc file track chinese_mnist.zip --as symlink

The default recheck (checkout) method is copy, which means the file is duplicated in the workspace as a writable file. We don't need to write over this data file, we'll only read from it, so we set the recheck type to symlink.

$ ls -l
total 32
lrwxr-xr-x  1 iex  staff   195 Dec  2 12:10 chinese_mnist.zip -> [CWD]/.xvc/b3/b24/2c9/422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d/0.zip
-rw-r--r--  1 iex  staff  1124 Nov 28 14:27 image_to_numpy_array.py
-rw-r--r--  1 iex  staff    40 Dec  1 11:59 requirements.txt
-rw-r--r--  1 iex  staff  4436 Dec  1 22:52 train.py

The long path under .xvc/b3/ is the BLAKE3 hash of the data file, split into directories.
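
The layout of the symlink target can be sketched as a function that turns a digest into a cache path. The split (an algorithm prefix, two directories of three hex characters, the remaining characters, then 0 plus the extension) is inferred from the single example in the listing. The function name cache_path is hypothetical and this is not Xvc's actual code.

```python
from pathlib import PurePosixPath

def cache_path(hex_digest: str, extension: str,
               algorithm_prefix: str = "b3") -> PurePosixPath:
    """Rebuild the layout seen in the symlink target.

    Layout assumed from that single example:
    .xvc/<algo>/<3 hex>/<3 hex>/<remaining hex>/0.<ext>
    """
    if len(hex_digest) != 64:
        raise ValueError("expected a 256-bit digest as 64 hex characters")
    return PurePosixPath(
        ".xvc", algorithm_prefix,
        hex_digest[:3], hex_digest[3:6], hex_digest[6:],
        f"0.{extension}",
    )

# Digest taken from the ls -l output, with the slashes removed.
digest = "b242c9422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d"
print(cache_path(digest, "zip"))
```

Because the path is derived from the content digest, a changed file gets a different digest and therefore a different location in the cache.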

As we'll work with the file contents, let's unzip the data file.

$ unzip -q chinese_mnist.zip

$ ls -l
total 32
lrwxr-xr-x  1 iex  staff   195 Dec  2 12:10 chinese_mnist.zip -> [CWD]/.xvc/b3/b24/2c9/422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d/0.zip
drwxr-xr-x  4 iex  staff   128 Nov 17 19:45 data
-rw-r--r--  1 iex  staff  1124 Nov 28 14:27 image_to_numpy_array.py
-rw-r--r--  1 iex  staff    40 Dec  1 11:59 requirements.txt
-rw-r--r--  1 iex  staff  4436 Dec  1 22:52 train.py

Now we have the data directory with the following structure:

$ tree -d data
data
└── data

2 directories

Let's track the data directory as well with Xvc.

$ xvc file track data --as symlink

The reason we're tracking the data directory separately is that we'll use different subsets as training, validation, and test data.

Let's list the track status of files first.

$ xvc file list data/data/input_9_9_*
SS         [..] 3a714d65          data/data/input_9_9_9.jpg
SS         [..] 9ffccc4d          data/data/input_9_9_8.jpg
SS         [..] 5d6312a4          data/data/input_9_9_7.jpg
SS         [..] 7a0ddb0e          data/data/input_9_9_6.jpg
SS         [..] 2047d7f3          data/data/input_9_9_5.jpg
SS         [..] 10fcf309          data/data/input_9_9_4.jpg
SS         [..] 0bdcd918          data/data/input_9_9_3.jpg
SS         [..] aebcbc03          data/data/input_9_9_2.jpg
SS         [..] 38abd173          data/data/input_9_9_15.jpg
SS         [..] 7c6a9003          data/data/input_9_9_14.jpg
SS         [..] a9f04ad9          data/data/input_9_9_13.jpg
SS         [..] 2d372f95          data/data/input_9_9_12.jpg
SS         [..] 8fe799b4          data/data/input_9_9_11.jpg
SS         [..] ee35e5d5          data/data/input_9_9_10.jpg
SS         [..] 7576894f          data/data/input_9_9_1.jpg
Total #: 15 Workspace Size:        2925 Cached Size:        8710


The xvc file list command shows the tracking status. The first two characters show the status: SS means the file is tracked as a symlink and is available in the workspace as a symlink. The next columns show the file size, the last modified date, and the BLAKE3 hash of the cached file, followed by an empty column and the file name. The empty column holds the actual hash of the workspace file when it's available. Here it's empty because the workspace file is a link to the file in the cache.

The summary line shows the total size of the files and the size they occupy in the workspace.

Splitting Train, Validation, and Test Sets

The first step of the pipeline is to create subsets of the data.

The data set contains 15 classes. It has 10 samples for each of these classes from 100 different people. As we'll train a Chinese digit recognizer, we'll use volunteers 1-59 and 100 for training, 60-79 for validation, and 80-99 for testing, as selected by the globs below. This ensures that the model is not validated or tested with handwriting from the same people it was trained on.

$ xvc file copy --name-only data/data/input_?_* data/train/
$ xvc file copy --name-only data/data/input_[12345]?_* data/train/
$ xvc file copy --name-only data/data/input_100_* data/train/
$ xvc file copy --name-only data/data/input_[67]?_* data/validate/
$ xvc file copy --name-only data/data/input_[89]?_* data/test/

$ tree -d data/
data/
├── data
├── test
├── train
└── validate

5 directories

If you look at the contents of these directories, you'll see that they are symbolic links to the same files we started to track.

Let's check the number of images in each set.

$ zsh -c 'ls -1 data/train/*.jpg | wc -l'
    9000

$ zsh -c 'ls -1 data/validate/*.jpg | wc -l'
    3000

$ zsh -c 'ls -1 data/test/*.jpg | wc -l'
    3000
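
The image counts can be cross-checked without the dataset by applying the same glob patterns to generated file names. The sketch assumes the naming scheme input_<volunteer>_<sample>_<class>.jpg with 100 volunteers, 10 samples and 15 classes, which matches the description of the data set but is not verified against the archive here.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Same patterns as the xvc file copy commands, without the directory.
SPLIT_PATTERNS = [
    ("train", "input_?_*"),
    ("train", "input_[12345]?_*"),
    ("train", "input_100_*"),
    ("validate", "input_[67]?_*"),
    ("test", "input_[89]?_*"),
]

def split_for(filename: str) -> str:
    """Return the subset whose pattern matches the file name."""
    for subset, pattern in SPLIT_PATTERNS:
        if fnmatchcase(filename, pattern):
            return subset
    raise ValueError(f"no pattern matches {filename}")

# Assumed naming: input_<volunteer>_<sample>_<class>.jpg
names = [
    f"input_{v}_{s}_{c}.jpg"
    for v in range(1, 101) for s in range(1, 11) for c in range(1, 16)
]
counts = Counter(split_for(name) for name in names)
print(dict(counts))
```

Under that assumption every file falls into exactly one subset, and no volunteer appears in more than one of them.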

The first step in the pipeline will be rechecking (checking out) these subsets.

$ xvc pipeline step new -s recheck-data --command 'xvc file recheck data/train/ data/validate/ data/test/'

xvc file recheck is used to reinstate files from the Xvc cache. Let's test the pipeline by first deleting the files we manually created.

$ rm -rf data/train data/validate data/test

We run the steps we created.

$ xvc pipeline run
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)

If we check the contents of the directories, we'll see that they are back.

$ zsh -c 'ls -1 data/train/*.jpg | wc -l'
    9000

Preprocessing Images into Numpy Arrays

graph LR
  A[Data Gathering ✅]  --> B[Splitting Test and Train Sets ✅]
  B --> C[Preprocessing Images into Numpy Arrays]
  C --> D[Training Model]
  D --> E[Sharing Data and Models]

The Python script to train a model runs with Numpy arrays. So we'll convert each of these directories with images into two numpy arrays. One of the arrays will keep $n$ 64x64 images and the other will keep $n$ labels for these images.

$ xvc pipeline step new --step-name create-train-array --command '.venv/bin/python3 image_to_numpy_array.py --dir data/train/'
$ xvc pipeline step new --step-name create-test-array --command '.venv/bin/python3 image_to_numpy_array.py --dir data/test/'
$ xvc pipeline step new --step-name create-validate-array --command '.venv/bin/python3 image_to_numpy_array.py --dir data/validate/'

These commands will run when the image files in those directories change. Xvc can keep track of file groups and invalidate a step when the content of any of these files changes. Moreover, it's possible to track which files have changed if there are too many files. We don't need this feature of tracking individual items in globs, so we'll use a glob dependency.

$ xvc pipeline step dependency --step-name create-train-array --glob 'data/train/*.jpg'
$ xvc pipeline step dependency --step-name create-test-array --glob 'data/test/*.jpg'
$ xvc pipeline step dependency --step-name create-validate-array --glob 'data/validate/*.jpg'
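
The idea behind a glob dependency can be sketched as a single digest computed over all files that match a pattern. If any file's content changes, the digest changes and the step is considered stale. This is a minimal illustration, not Xvc's implementation: it uses SHA-256 from the standard library as a stand-in for BLAKE3, and the function name glob_digest is made up.

```python
import glob
import hashlib
import os
import tempfile

def glob_digest(pattern: str) -> str:
    """Hash the sorted paths and contents of all files matching a glob."""
    hasher = hashlib.sha256()  # stand-in; Xvc defaults to BLAKE3
    for path in sorted(glob.glob(pattern)):
        hasher.update(path.encode())
        with open(path, "rb") as f:
            hasher.update(f.read())
    return hasher.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.jpg"):
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(name.encode())
    pattern = os.path.join(tmp, "*.jpg")
    before = glob_digest(pattern)
    unchanged = glob_digest(pattern)
    # Change the content of one file in the group.
    with open(os.path.join(tmp, "b.jpg"), "wb") as f:
        f.write(b"edited")
    after = glob_digest(pattern)

print("rerun needed after no change:", before != unchanged)
print("rerun needed after an edit:  ", before != after)
```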

Now we have three more steps that depend on changed files. The script depends on OpenCV to read images. Python best practices recommend creating a separate virtual environment for each project. We'll also make sure that the venv is created and the requirements are installed before running the script.

Create a command to initialize the virtual environment. It will run if there is no .venv/bin/activate file.

$ xvc pipeline step new --step-name init-venv --command 'python3 -m venv .venv'
$ xvc pipeline step dependency --step-name init-venv --generic 'echo "$(hostname)/$(pwd)"'

We used a --generic dependency that runs a command and checks its output to see whether the step needs to be run again. We only want to run init-venv once per deployment, so checking the output of hostname and pwd is better than checking the existence of a file. File dependencies must be available before running the pipeline to record their metadata. There is no such restriction for generic dependencies.
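
A generic dependency can be pictured as follows: run the command, hash what it prints, and compare the result with the digest recorded on the previous run. The sketch below is an assumption about the general idea, not Xvc's code. The function names are hypothetical, SHA-256 is a stand-in hash, and the echo commands stand in for the hostname and pwd call.

```python
import hashlib
import subprocess

def command_output_digest(command: str) -> str:
    """Run a shell command and hash what it prints to stdout."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return hashlib.sha256(result.stdout.encode()).hexdigest()

def needs_rerun(command: str, recorded_digest) -> bool:
    """A step is stale when there is no record or the output changed."""
    return recorded_digest != command_output_digest(command)

first = command_output_digest("echo host-a/project")
print(needs_rerun("echo host-a/project", None))   # no record yet
print(needs_rerun("echo host-a/project", first))  # same output as recorded
print(needs_rerun("echo host-b/project", first))  # output differs
```

Nothing has to exist on disk before the first run, which is why a dependency of this kind suits a step that creates the environment itself.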

Then, another step that depends on init-venv and requirements.txt will install the dependencies.

$ xvc pipeline step new --step-name install-requirements --command '.venv/bin/python3 -m pip install -r requirements.txt'
$ xvc pipeline step dependency --step-name install-requirements --step init-venv
$ xvc pipeline step dependency --step-name install-requirements --file requirements.txt

Note that, unlike other tools, you can specify direct dependencies between steps in Xvc. When a pipeline step must wait for another step to finish successfully, a dependency between these two can be defined.

The above create-*-array steps will depend on install-requirements to ensure that the requirements are installed when the scripts are run.

$ xvc pipeline step dependency --step-name create-train-array --step install-requirements

$ xvc pipeline step dependency --step-name create-validate-array --step install-requirements

$ xvc pipeline step dependency --step-name create-test-array --step install-requirements
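
The step dependencies defined so far imply a partial order in which steps can start. The sketch below derives one valid order with Python's graphlib module. It only illustrates the ordering constraint and says nothing about how Xvc schedules or parallelizes steps internally.

```python
from graphlib import TopologicalSorter

# Step -> steps it depends on, as defined with the --step option.
step_dependencies = {
    "recheck-data": set(),
    "init-venv": set(),
    "install-requirements": {"init-venv"},
    "create-train-array": {"install-requirements"},
    "create-validate-array": {"install-requirements"},
    "create-test-array": {"install-requirements"},
}

def run_order(dependencies: dict) -> list:
    """Return one valid order in which the steps could be started."""
    return list(TopologicalSorter(dependencies).static_order())

order = run_order(step_dependencies)
print(order)
```

Steps without a dependency between them, such as the three create-*-array steps, may appear in any relative order.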

Now, as the pipeline grows, it may be nice to see a graph of what we have done so far.

$ xvc pipeline dag --format mermaid
flowchart TD
    n0["recheck-data"]
    n1["create-train-array"]
    n2["data/train/*.jpg"] --> n1
    n3["install-requirements"] --> n1
    n4["create-test-array"]
    n5["data/test/*.jpg"] --> n4
    n3["install-requirements"] --> n4
    n6["create-validate-array"]
    n7["data/validate/*.jpg"] --> n6
    n3["install-requirements"] --> n6
    n8["init-venv"]
    n9["echo "$(hostname)/$(pwd)""] --> n8
    n3["install-requirements"]
    n8["init-venv"] --> n3
    n10["requirements.txt"] --> n3


flowchart TD
    n0["recheck-data"]
    n1["create-train-array"]
    n2["data/train/*.jpg"] --> n1
    n3["install-requirements"] --> n1
    n4["create-test-array"]
    n5["data/test/*.jpg"] --> n4
    n3["install-requirements"] --> n4
    n6["create-validate-array"]
    n7["data/validate/*.jpg"] --> n6
    n3["install-requirements"] --> n6
    n8["init-venv"]
    n9[".venv/bin/activate"] --> n8
    n3["install-requirements"]
    n8["init-venv"] --> n3
    n10["requirements.txt"] --> n3

The dag command can also produce GraphViz DOT output. For larger graphs, it may be more suitable. We'll use DOT to create images in later sections.

Let's run the pipeline at this point to test.

$ xvc -vv pipeline run
[INFO] Found explicit dependency: XvcStep { name: "create-validate-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-train-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-test-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "install-requirements" } -> Step(StepDep { name: "init-venv" })
[INFO][pipeline/src/pipeline/mod.rs::343] Pipeline Graph:
digraph {
    0 [ label = "(30009, 11376621678660215310)" ]
    1 [ label = "(30012, 12907533602545881359)" ]
    2 [ label = "(30010, 8484021102039729264)" ]
    3 [ label = "(30011, 9338166212381570306)" ]
    4 [ label = "(30016, 17450406389616117859)" ]
    5 [ label = "(30018, 2681008057348839262)" ]
    1 -> 5 [ label = "Step" ]
    2 -> 5 [ label = "Step" ]
    3 -> 5 [ label = "Step" ]
    5 -> 4 [ label = "Step" ]
}


[INFO] No dependency steps for step recheck-data
[INFO] Waiting for dependency steps for step create-validate-array
[INFO] No dependency steps for step init-venv
[INFO] [recheck-data] Dependencies has changed
[INFO] Waiting for dependency steps for step install-requirements
[INFO] Waiting for dependency steps for step create-test-array
[INFO] Waiting for dependency steps for step create-train-array
[INFO] [init-venv] Dependencies has changed
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
[DONE] init-venv (python3 -m venv .venv)
[INFO] Dependency steps completed successfully for step install-requirements
[INFO] [install-requirements] Dependencies has changed
[OUT] [install-requirements] Collecting opencv-python (from -r requirements.txt (line 1))
  Using cached opencv_python-4.8.1.78-cp37-abi3-macosx_11_0_arm64.whl.metadata (19 kB)
Collecting torch (from -r requirements.txt (line 2))
  Using cached torch-2.1.1-cp311-none-macosx_11_0_arm64.whl.metadata (25 kB)
Collecting pyyaml (from -r requirements.txt (line 3))
  Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting scikit-learn (from -r requirements.txt (line 4))
  Using cached scikit_learn-1.3.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting numpy>=1.21.2 (from opencv-python->-r requirements.txt (line 1))
  Using cached numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (115 kB)
Collecting filelock (from torch->-r requirements.txt (line 2))
  Using cached filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting typing-extensions (from torch->-r requirements.txt (line 2))
  Using cached typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from torch->-r requirements.txt (line 2))
  Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting networkx (from torch->-r requirements.txt (line 2))
  Using cached networkx-3.2.1-py3-none-any.whl.metadata (5.2 kB)
Collecting jinja2 (from torch->-r requirements.txt (line 2))
  Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting fsspec (from torch->-r requirements.txt (line 2))
  Using cached fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting scipy>=1.5.0 (from scikit-learn->-r requirements.txt (line 4))
  Using cached scipy-1.11.4-cp311-cp311-macosx_12_0_arm64.whl.metadata (165 kB)
Collecting joblib>=1.1.1 (from scikit-learn->-r requirements.txt (line 4))
  Using cached joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn->-r requirements.txt (line 4))
  Using cached threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch->-r requirements.txt (line 2))
  Using cached MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl.metadata (3.0 kB)
Collecting mpmath>=0.19 (from sympy->torch->-r requirements.txt (line 2))
  Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Using cached opencv_python-4.8.1.78-cp37-abi3-macosx_11_0_arm64.whl (33.1 MB)
Using cached torch-2.1.1-cp311-none-macosx_11_0_arm64.whl (59.6 MB)
Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl (167 kB)
Using cached scikit_learn-1.3.2-cp311-cp311-macosx_12_0_arm64.whl (9.4 MB)
Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Using cached numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
Using cached scipy-1.11.4-cp311-cp311-macosx_12_0_arm64.whl (29.7 MB)
Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Using cached filelock-3.13.1-py3-none-any.whl (11 kB)
Using cached fsspec-2023.10.0-py3-none-any.whl (166 kB)
Using cached networkx-3.2.1-py3-none-any.whl (1.6 MB)
Using cached typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Using cached MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl (17 kB)
Installing collected packages: mpmath, typing-extensions, threadpoolctl, sympy, pyyaml, numpy, networkx, MarkupSafe, joblib, fsspec, filelock, scipy, opencv-python, jinja2, torch, scikit-learn
Successfully installed MarkupSafe-2.1.3 filelock-3.13.1 fsspec-2023.10.0 jinja2-3.1.2 joblib-1.3.2 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.2 opencv-python-4.8.1.78 pyyaml-6.0.1 scikit-learn-1.3.2 scipy-1.11.4 sympy-1.12 threadpoolctl-3.2.0 torch-2.1.1 typing-extensions-4.8.0

[DONE] install-requirements (.venv/bin/python3 -m pip install -r requirements.txt)
[INFO] Dependency steps completed successfully for step create-validate-array
[INFO] Dependency steps completed successfully for step create-train-array
[INFO] Dependency steps completed successfully for step create-test-array
[INFO] [create-validate-array] Dependencies has changed
[INFO] [create-train-array] Dependencies has changed
[INFO] [create-test-array] Dependencies has changed
[DONE] create-validate-array (.venv/bin/python3 image_to_numpy_array.py --dir data/validate/)
[DONE] create-test-array (.venv/bin/python3 image_to_numpy_array.py --dir data/test/)
[DONE] create-train-array (.venv/bin/python3 image_to_numpy_array.py --dir data/train/)

Now, when we take a look at the data directories, we find images.npy and classes.npy files.

$ zsh -cl 'ls -l data/train/*.npy'
-rw-r--r--  1 iex  staff      72128 Dec  2 12:11 data/train/classes.npy
-rw-r--r--  1 iex  staff  110592128 Dec  2 12:11 data/train/images.npy

$ zsh -cl 'ls -l data/test/*.npy'
-rw-r--r--  1 iex  staff     24128 Dec  2 12:11 data/test/classes.npy
-rw-r--r--  1 iex  staff  36864128 Dec  2 12:11 data/test/images.npy

$ zsh -cl 'ls -l data/validate/*.npy'
-rw-r--r--  1 iex  staff     24128 Dec  2 12:11 data/validate/classes.npy
-rw-r--r--  1 iex  staff  36864128 Dec  2 12:11 data/validate/images.npy
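The image arrays are large, so it can be handy to check their shape and element type without loading them. The following is a minimal sketch that reads only the header of a .npy file, assuming the standard NumPy file layout (magic string, version bytes, header length, then a Python dict literal). The function name read_npy_header is hypothetical and not part of Xvc or of the tutorial scripts.

```python
import ast
import struct


def read_npy_header(path):
    """Return the header dict (descr, fortran_order, shape) of a .npy file."""
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not a .npy file")
        major, _minor = f.read(2)
        # Version 1.x stores the header length in 2 bytes, later versions in 4.
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))
        else:
            (header_len,) = struct.unpack("<I", f.read(4))
        header = f.read(header_len).decode("latin1")
    # The header is a Python dict literal padded with spaces and a newline.
    return ast.literal_eval(header)
```

For example, read_npy_header("data/train/images.npy")["shape"] would return the array shape without reading the image data itself.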

Train a model

Now that we have built the NumPy arrays, we can train a model. We'll use a simple convolutional neural network as a showcase. This is by no means a state-of-the-art solution, so the results will be less than perfect.

graph LR
  A[Data Gathering ✅]  --> B[Splitting Test and Train Sets ✅]
  B --> C[Preprocessing Images into Numpy Arrays ✅]
  C --> D[Training Model]
  D --> E[Sharing Data and Models]

The script receives the training, validation and testing directories, loads the data from the NumPy arrays we just produced, loads hyperparameters from a file called params.yaml, trains the model, tests it, and writes the results and the model to files. It's a very involved piece produced with the assistance of GPT-4.
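The script itself is not reproduced here. As a rough orientation only, its command line interface and result writing might look like the sketch below. The three directory flags match the command used in this how-to; everything else (function names, the default results.json path, the empty metrics dict) is a hypothetical placeholder, and the data loading and training code is omitted.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the directory flags that the train-model step passes in."""
    parser = argparse.ArgumentParser(description="Train a model on NumPy arrays")
    parser.add_argument("--train_dir", required=True)
    parser.add_argument("--val_dir", required=True)
    parser.add_argument("--test_dir", required=True)
    return parser.parse_args(argv)


def write_results(metrics, path="results.json"):
    """Write a metrics dict as JSON so a pipeline can track it as an output."""
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: load images.npy and classes.npy from args.train_dir,
    # args.val_dir and args.test_dir, read params.yaml, then train and
    # evaluate the model. None of that is shown in this sketch.
    metrics = {}
    write_results(metrics)
```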

We first define the step to run the command:

$ xvc pipeline step new --step-name train-model --command '.venv/bin/python3 train.py  --train_dir data/train/ --val_dir data/validate --test_dir data/test'

The step will depend on the array generation steps through the files they produce. To define a dependency between train-model and the create-train-array step, we must declare that create-train-array outputs a file called images.npy. We can do this with the --output-file option of the step output command.

$ xvc pipeline step output --step-name create-train-array --output-file data/train/images.npy

$ xvc pipeline step output --step-name create-train-array --output-file data/train/classes.npy

$ xvc pipeline step dependency --step-name train-model --file data/train/images.npy
$ xvc pipeline step dependency --step-name train-model --file data/train/classes.npy

Note that this operation is different from creating a direct dependency between steps. There may be multiple steps creating the same outputs and there may be multiple steps depending on the same files. Choosing between direct (--step) dependencies and indirect (--file) dependencies is a matter of taste and use.
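To illustrate the idea of indirect dependencies, the sketch below derives step-to-step edges from declared outputs and file dependencies. It is a conceptual model only, not Xvc's implementation; the function name and the dictionary layout are assumptions made for this example.

```python
def implicit_step_dependencies(outputs, file_deps):
    """Derive step-to-step edges from shared files.

    outputs:   {step_name: set of files the step produces}
    file_deps: {step_name: set of files the step depends on}
    Returns a sorted list of (dependent_step, producing_step, file) tuples.
    """
    # Map each file to the steps that declare it as an output.
    producers = {}
    for step, files in outputs.items():
        for path in files:
            producers.setdefault(path, set()).add(step)

    edges = []
    for step, files in file_deps.items():
        for path in files:
            for producer in producers.get(path, ()):
                if producer != step:
                    edges.append((step, producer, path))
    return sorted(edges)


outputs = {
    "create-train-array": {"data/train/images.npy", "data/train/classes.npy"},
}
file_deps = {
    "train-model": {"data/train/images.npy", "data/train/classes.npy"},
}
for dependent, producer, path in implicit_step_dependencies(outputs, file_deps):
    print(f"{dependent} -> {producer} (via {path})")
```

Because the edges come from file names, several producers or several consumers of the same file are handled without naming the steps explicitly.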

We'll create these dependencies for other files as well.

$ xvc pipeline step output --step-name create-test-array --output-file data/test/images.npy

$ xvc pipeline step output --step-name create-test-array --output-file data/test/classes.npy

$ xvc pipeline step dependency --step-name train-model --file data/test/images.npy

$ xvc pipeline step dependency --step-name train-model --file data/test/classes.npy

$ xvc pipeline step output --step-name create-validate-array --output-file data/validate/images.npy

$ xvc pipeline step output --step-name create-validate-array --output-file data/validate/classes.npy

$ xvc pipeline step dependency --step-name train-model --file data/validate/images.npy

$ xvc pipeline step dependency --step-name train-model --file data/validate/classes.npy

Before running the pipeline, let's see the pipeline DAG once more, this time in DOT format.

$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="recheck-data";];n1[shape=box;label="create-train-array";];n2[shape=folder;label="data/train/*.jpg";];n2->n1;n3[shape=box;label="install-requirements";];n3->n1;n4[shape=note;color=black;label="data/train/images.npy";];n1->n4;n5[shape=note;color=black;label="data/train/classes.npy";];n1->n5;n6[shape=box;label="create-test-array";];n7[shape=folder;label="data/test/*.jpg";];n7->n6;n3[shape=box;label="install-requirements";];n3->n6;n8[shape=note;color=black;label="data/test/images.npy";];n6->n8;n9[shape=note;color=black;label="data/test/classes.npy";];n6->n9;n10[shape=box;label="create-validate-array";];n11[shape=folder;label="data/validate/*.jpg";];n11->n10;n3[shape=box;label="install-requirements";];n3->n10;n12[shape=note;color=black;label="data/validate/images.npy";];n10->n12;n13[shape=note;color=black;label="data/validate/classes.npy";];n10->n13;n14[shape=box;label="init-venv";];n15[shape=trapezium;label="echo /"$(hostname)/$(pwd)/"";];n15->n14;n3[shape=box;label="install-requirements";];n14[shape=box;label="init-venv";];n14->n3;n16[shape=note;label="requirements.txt";];n16->n3;n17[shape=box;label="train-model";];n4[shape=note;label="data/train/images.npy";];n4->n17;n5[shape=note;label="data/train/classes.npy";];n5->n17;n8[shape=note;label="data/test/images.npy";];n8->n17;n9[shape=note;label="data/test/classes.npy";];n9->n17;n12[shape=note;label="data/validate/images.npy";];n12->n17;n13[shape=note;label="data/validate/classes.npy";];n13->n17;}

It's not the most readable graph description, but you can feed the output to the dot command to create an SVG file.

$ zsh -cl 'xvc pipeline dag | dot -Tsvg > pipeline1.svg'

Note that we haven't yet created the params.yaml file that contains the hyperparameters. When a step in the pipeline doesn't run successfully, its dependent steps won't be run. Let's add a params.yaml file and add it as a dependency to the train-model step.

$ zsh -cl 'echo "batch_size: 4" > params.yaml'
$ zsh -cl 'echo "epochs: 2" >> params.yaml'
$ xvc pipeline step  dependency --step-name train-model --param params.yaml::batch_size
$ xvc pipeline step  dependency --step-name train-model --param params.yaml::epochs

With the above commands, the train-model step depends directly on these two values. Even if the file contains other values, changing them won't invalidate the train-model step.
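The effect of depending on individual keys rather than on the whole file can be sketched as follows. This is an illustration of the idea, not Xvc's code; the function name and the extra seed key are hypothetical.

```python
def changed_params(old, new, keys):
    """Return the watched keys whose values differ between two parameter dicts."""
    missing = object()  # sentinel so that an added or removed key counts as a change
    return [k for k in keys if old.get(k, missing) != new.get(k, missing)]


watched = ["batch_size", "epochs"]
before = {"batch_size": 4, "epochs": 2, "seed": 1}
after_seed = {"batch_size": 4, "epochs": 2, "seed": 99}
after_epochs = {"batch_size": 4, "epochs": 5, "seed": 1}

print(changed_params(before, after_seed, watched))    # [] -> step stays valid
print(changed_params(before, after_epochs, watched))  # ['epochs'] -> step must rerun
```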

We can also specify the model and the results as outputs so that the graph shows them.

$ xvc pipeline step output --step-name train-model --output-file model.pth
$ xvc pipeline step output --step-name train-model --output-metric results.json

Let's see the pipeline DAG once more:

$ zsh -cl 'xvc pipeline dag | dot -Tsvg > pipeline2.svg'

We're ready to run the pipeline and train the model.

$ xvc -vv pipeline run
[INFO] Found explicit dependency: XvcStep { name: "create-test-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-train-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-validate-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "install-requirements" } -> Step(StepDep { name: "init-venv" })
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-test-array" } (via XvcPath("data/test/images.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-test-array" } (via XvcPath("data/test/classes.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-train-array" } (via XvcPath("data/train/images.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-train-array" } (via XvcPath("data/train/classes.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-validate-array" } (via XvcPath("data/validate/images.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-validate-array" } (via XvcPath("data/validate/classes.npy"))
[INFO][pipeline/src/pipeline/mod.rs::343] Pipeline Graph:
digraph {
    0 [ label = "(30024, 14850552671149047786)" ]
    1 [ label = "(30009, 11376621678660215310)" ]
    2 [ label = "(30011, 9338166212381570306)" ]
    3 [ label = "(30010, 8484021102039729264)" ]
    4 [ label = "(30012, 12907533602545881359)" ]
    5 [ label = "(30016, 17450406389616117859)" ]
    6 [ label = "(30018, 2681008057348839262)" ]
    2 -> 6 [ label = "Step" ]
    3 -> 6 [ label = "Step" ]
    4 -> 6 [ label = "Step" ]
    6 -> 5 [ label = "Step" ]
    0 -> 2 [ label = "File" ]
    0 -> 3 [ label = "File" ]
    0 -> 4 [ label = "File" ]
}


[INFO] No dependency steps for step init-venv
[INFO] Waiting for dependency steps for step create-validate-array
[INFO] Waiting for dependency steps for step train-model
[INFO] No dependency steps for step recheck-data
[INFO] [recheck-data] Dependencies has changed
[INFO] Waiting for dependency steps for step install-requirements
[INFO] Waiting for dependency steps for step create-train-array
[INFO] Waiting for dependency steps for step create-test-array
[INFO] [init-venv] No changed dependencies. Skipping thorough comparison.
[INFO] [init-venv] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step install-requirements
[INFO] [install-requirements] No changed dependencies. Skipping thorough comparison.
[INFO] [install-requirements] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step create-train-array
[INFO] Dependency steps completed successfully for step create-test-array
[INFO] Dependency steps completed successfully for step create-validate-array
[INFO] [create-test-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-test-array] No missing Outputs and no changed dependencies
[INFO] [create-validate-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-validate-array] No missing Outputs and no changed dependencies
[INFO] [create-train-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-train-array] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step train-model
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
[INFO] [train-model] Dependencies has changed
[OUT] [train-model] [1,  2000] loss: 0.921
Accuracy of the network on the validation images: 72 %
[2,  2000] loss: 0.426
Accuracy of the network on the validation images: 83 %
Confusion Matrix:
[[174   0   0   1   2   0   1   2   0   2   0  14   0   1   3]
 [  1 132  60   0   0   0   1   0   0   0   0   5   1   0   0]
 [  3   1 157  34   0   0   3   0   0   0   1   1   0   0   0]
 [  2   0  34 160   0   2   2   0   0   0   0   0   0   0   0]
 [  1   0   0   0 186   0   0   1   0   2   0   9   0   0   1]
 [  3   0  11  12   0 145   1   0   0   9   1  12   3   2   1]
 [  3   1   1   0   1   0 133   8  16   9   6  10   2  10   0]
 [  0   0   0   0   3   1   5 145   3   8  25   2   1   1   6]
 [  0   0   0   0   0   0   1   1 181   4   1   1   0   4   7]
 [  2   0   0   0   2   1   0   3   7 142   4   3   0   7  29]
 [  0   0   0   0   1   0   1   0   0   1 193   2   2   0   0]
 [  4   0   0   0  21   4   0   5   1   1   4 152   1   4   3]
 [  0   1   1   1   0   1   3   1   0   0  55   4 132   0   1]
 [  5   0   0   0   2   0   0   2   0   0   1  36   0 153   1]
 [  0   0   0   0   8   0   0   1   2   5   0   0   0   7 177]]

[DONE] train-model (.venv/bin/python3 train.py  --train_dir data/train/ --val_dir data/validate --test_dir data/test)
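The confusion matrix can be summarized with a few lines of code. The sketch below assumes that rows are the true classes and columns are the predicted classes, which is the scikit-learn convention; the output above doesn't state this explicitly. The function names are hypothetical.

```python
def accuracy(matrix):
    """Overall accuracy: samples on the diagonal divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]
```

Applied to the matrix printed by the run above, the diagonal sums to 2362 of 3000 samples, an overall accuracy of about 79%.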

We now have a model and a result file. Let's track the model with Xvc as well.

$ xvc file track model.pth results.json

Sharing Data and Models

graph LR
  A[Data Gathering ✅]  --> B[Splitting Test and Train Sets ✅]
  B --> C[Preprocessing Images into Numpy Arrays ✅]
  C --> D[Training Model ✅]
  D --> E[Sharing Data and Models]

Sharing a machine learning project with Xvc means sharing the Git repository and the data and model files that are tracked by Xvc in this repository. For the first, we can use any kind of Git remote, e.g. GitHub. Xvc doesn't require any special setup (like Git-LFS) to share binary files.

In order to share the binary files, we need to specify an Xvc storage. This can be a local folder, an SSH host with rsync, an AWS S3 bucket or any of the supported storage backends. (See xvc storage new documentation for the full list.)

In this example, we'll create a new S3 bucket and share all files there.

$ xvc storage new s3 --name my-s3 --bucket-name xvc-test --region eu-central-1 --storage-prefix how-to-create-a-pipeline
$ xvc file send --remote my-s3

These two commands define a new remote storage and send all tracked files to it. When you want to share the pipeline along with the code and data it uses, others can clone the repository and run the following command to get the files. Don't forget to push the most recent version of your repository.

$ git push
# On another machine
$ git clone git@github.com:my-user/my-ml-pipeline
$ xvc file bring

Note that the second time there is no need to configure the remote storage, but the user must have AWS credentials in their environment. You can also automate this on GitHub and run your pipelines on CI.

In this how-to we created an end-to-end machine learning pipeline. Please ask about any issues that are not clear in the comment box below. Thank you for reading so far.

Command Reference

Synopsis

$ xvc --help
Xvc CLI to manage data and ML pipelines

Usage: xvc [OPTIONS] <COMMAND>

Commands:
  file          File and directory management commands
  init          Initialize an Xvc project
  pipeline      Pipeline management commands
  storage       Storage (cloud) management commands
  root          Find the root directory of a project
  check-ignore  Check whether files are ignored with `.xvcignore`
  aliases       Print command aliases to be sourced in shell files
  help          Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...             Output verbosity. Use multiple times to increase the output detail
      --quiet                  Suppress all output
      --debug                  Turn on all logging to $TMPDIR/xvc.log
  -C <WORKDIR>                 Set working directory for the command. It doesn't create a new shell, or change the directory [default: .]
  -c, --config <CONFIG>        Configuration options set from the command line in the form section.key=value You can use multiple times
      --no-system-config       Ignore system configuration file
      --no-user-config         Ignore user configuration file
      --no-project-config      Ignore project configuration file (.xvc/config)
      --no-local-config        Ignore local (gitignored) configuration file (.xvc/config.local)
      --no-env-config          Ignore configuration options obtained from environment variables
      --skip-git               Don't run automated Git operations for this command. If you want to run git commands yourself all the time, you can set `git.auto_commit` and `git.auto_stage` options in the configuration to False
      --from-ref <FROM_REF>    Checkout the given Git reference (branch, tag, commit etc.) before performing the Xvc operation. This runs `git checkout <given-value>` before running the command
      --to-branch <TO_BRANCH>  If given, create (or checkout) the given branch before committing results of the operation. This runs `git checkout --branch <given-value>` before committing the changes
  -h, --help                   Print help
  -V, --version                Print version

Subcommands

  • file: File and directory management commands
  • init: Initialize an Xvc project
  • pipeline: Pipeline management commands
  • storage: Storage (cloud) management commands
  • root: Find the root directory of a project
  • check-ignore: Check whether files are ignored with .xvcignore
  • aliases: Print command aliases to be sourced in shell files

xvc init

Synopsis

$ xvc init --help
Initialize an Xvc project

Usage: xvc init [OPTIONS]

Options:
      --path <PATH>  Path to the directory to be intialized. (default: current directory)
      --no-git       Don't require Git
      --force        Create the repository even if already initialized. Overwrites the current .xvc directory Resets all data and guid, etc
  -h, --help         Print help
  -V, --version      Print version

Examples

To initialize a blank Xvc repository, initialize Git first and run xvc init.

$ cd my-project-1
$ git init
...
$ xvc init
? 0

The command doesn't print anything upon success.

If you want to initialize

File Management

Synopsis

$ xvc file --help
File and directory management commands

Usage: xvc file [OPTIONS] <COMMAND>

Commands:
  track     Add file and directories to Xvc
  hash      Get digest hash of files with the supported algorithms
  recheck   Get files from cache by copy or *link
  carry-in  Carry (commit) changed files to cache
  copy      Copy from source to another location in the workspace
  move      Move files to another location in the workspace
  list      List tracked and untracked elements in the workspace
  send      Send (push, upload) files to external storages
  bring     Bring (download, pull, fetch) files from external storages
  remove    Remove files from Xvc and possibly storages
  untrack   Untrack (delete) files from Xvc and possibly storages
  share     Share a file from S3 compatible storage for a limited time
  help      Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...         Verbosity level. Use multiple times to increase command output detail
      --quiet              Suppress error messages
  -C <WORKDIR>             Set the working directory to run the command as if it's in that directory [default: .]
  -c, --config <CONFIG>    Configuration options set from the command line in the form section.key=value
      --no-system-config   Ignore system config file
      --no-user-config     Ignore user config file
      --no-project-config  Ignore project config (.xvc/config)
      --no-local-config    Ignore local config (.xvc/config.local)
      --no-env-config      Ignore configuration options from the environment
  -h, --help               Print help
  -V, --version            Print version

Subcommands

  • track: Track (add) files with Xvc
  • recheck: Copy/link files in the cache to the workspace (checkout)
  • carry-in: Carry-in (commit) changed files to cache
  • copy: Copy files to another location in the workspace
  • move: Move files to another location in the workspace
  • list: List tracked files
  • send: Send (push) files to storage
  • bring: Bring (pull) files from storage
  • hash: Calculate hashes with supported algorithms similar to sha256sum, blake2sum, etc.
  • remove: Remove files from Xvc cache or storages
  • untrack: Untrack (delete) files from Xvc

xvc file track

Purpose

xvc file track is used to register any kind of file to Xvc for tracking versions.

Synopsis

$ xvc file track --help
Add file and directories to Xvc

Usage: xvc file track [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to track

Options:
      --recheck-method <RECHECK_METHOD>
          How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink uses copy if the underlying file system doesn't support it.

      --no-commit
          Do not copy/link added files to the file cache

      --text-or-binary <TEXT_OR_BINARY>
          Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)

      --force
          Add targets even if they are already tracked

      --no-parallel
          Don't use parallelism

  -h, --help
          Print help (see a summary with '-h')

Examples

File tracking works only in Xvc repositories.

$ git init
...
$ xvc init

Let's create a directory tree for these examples.

$ xvc-test-helper create-directory-tree --directories 4 --files 3  --seed 20231021
$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
├── dir-0002
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
├── dir-0003
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
└── dir-0004
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin

5 directories, 12 files

By default, the command works similarly to git add followed by git commit.

You can track individual files.

$ xvc file track dir-0001/file-0001.bin

You can track directories with the same command.

$ xvc file track dir-0002/

You can specify more than one target in a single command.

$ xvc file track dir-0001/file-0002.bin dir-0001/file-0003.bin

When you track a file, Xvc moves the file to the cache directory under .xvc/ and connects the workspace file with the cached file. This connection is called rechecking and is analogous to Git checkout. For example, the above commands create a directory tree under .xvc as follows:

$ tree .xvc/b3
.xvc/b3
├── 493
│   └── eeb
│       └── 6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79
│           └── 0.bin
├── ab3
│   └── 619
│       └── 814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0
│           └── 0.bin
└── e51
    └── 7d6
        └── b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70
            └── 0.bin

10 directories, 3 files
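The directory layout above follows directly from the content digest: the first three hexadecimal characters, the next three, and the remainder each become a directory level, and the file keeps its extension under the name 0. The sketch below reproduces that mapping. The function name is hypothetical, and reading b3 as the name of the hash algorithm (BLAKE3) is an assumption.

```python
from pathlib import PurePosixPath


def cache_path(digest_hex, extension, algorithm="b3"):
    """Build the cache location relative to .xvc/ from a content digest."""
    return PurePosixPath(
        algorithm,
        digest_hex[:3],
        digest_hex[3:6],
        digest_hex[6:],
        f"0.{extension}",
    )


digest = "493eeb6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79"
print(cache_path(digest, "bin"))
# b3/493/eeb/6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79/0.bin
```

Since the path depends only on the content, two files with identical content map to the same cache location.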

Xvc can use different recheck (checkout) methods to connect the workspace file to the cache. The default method is copying the file to the workspace. This way a separate copy of the cache file is created in the workspace.

If you want to make this connection with symbolic links, you can specify it with the --recheck-method option.

$ xvc file track --recheck-method symlink dir-0003/file-0001.bin
$ ls -l dir-0003/file-0001.bin
lrwxr-xr-x[..] dir-0003/file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin

You can also use hardlink and reflink as the value of --recheck-method. Please see the xvc file recheck reference for details.

$ xvc file track --recheck-method hardlink dir-0003/file-0002.bin
$ xvc file track --recheck-method reflink dir-0003/file-0003.bin
$ ls -l dir-0003/
total 16
l[..] file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
-[..] file-0002.bin
-[..] file-0003.bin
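The four methods differ only in how the workspace path is connected to the cached file. The following sketch shows the idea with standard library calls. It is not Xvc's implementation and the function name is hypothetical. In particular, this sketch always treats reflink as a plain copy, whereas Xvc falls back to copying only when the file system doesn't support reflinks.

```python
import os
import shutil


def recheck(cache_file, workspace_file, method="copy"):
    """Connect a workspace path to a cached file using the given method."""
    # Replace whatever currently occupies the workspace path.
    if os.path.lexists(workspace_file):
        os.remove(workspace_file)
    if method == "symlink":
        os.symlink(os.path.abspath(cache_file), workspace_file)
    elif method == "hardlink":
        os.link(cache_file, workspace_file)
    elif method in ("copy", "reflink"):
        # Simplification: no reflink support here, always copy.
        shutil.copy2(cache_file, workspace_file)
    else:
        raise ValueError(f"unknown recheck method: {method}")
```

Copies use extra disk space but can be edited freely, while links share the storage of the cached file.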

Info

Note that unlike DVC, which specifies the checkout/recheck option repository-wide, Xvc lets you specify it per file. You can recheck data files as symbolic links (which are non-writable) to save space, and recheck model files as copies of the cached originals so that you can commit (carry in) them every time they change.

When you track a file in Xvc, it's automatically committed (carried in) to the cache directory. If you want to postpone this operation and don't need a cached copy for a file, you can use the --no-commit option. You can later use the xvc file carry-in command to move these files to the repository cache.

$ xvc file track --no-commit --recheck-method symlink dir-0004/
$ ls -l dir-0004/
total 24
-rw-r--r--[..] file-0001.bin
-rw-r--r--[..] file-0002.bin
-rw-r--r--[..] file-0003.bin

$ xvc file list dir-0004/
FS        [..] ab361981 ab361981 dir-0004/file-0003.bin
FS        [..] 493eeb65 493eeb65 dir-0004/file-0002.bin
FS        [..] e517d6b9 e517d6b9 dir-0004/file-0001.bin
Total #: 3 Workspace Size:        6006 Cached Size:        6006


You can carry in (commit) these files to the cache with the xvc file carry-in command. Note that because the files are deduplicated, we need to use --force with the carry-in command. This behavior may change in the future.

$ xvc file carry-in --force dir-0004/

$ ls -l dir-0004/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/493/eeb/6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/ab3/619/814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0/0.bin

Xvc deduplicates files in the cache. If you track a file whose content is already in the cache, it won't be moved to the cache again. Instead, it will be copied or linked from the existing cached copy.

$ tree .xvc/b3
.xvc/b3
├── 493
│   └── eeb
│       └── 6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79
│           └── 0.bin
├── ab3
│   └── 619
│       └── 814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0
│           └── 0.bin
└── e51
    └── 7d6
        └── b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70
            └── 0.bin

10 directories, 3 files

Caveats

  • This command doesn't distinguish between symbolic links and hardlinks. Links are followed and any broken links may cause errors.

  • Under the hood, Xvc tracks only the files, not directories. Directories are considered as path collections. It doesn't matter if you track a directory or files in it separately.

Technical Details

  • Detecting changes in files and directories employs different kinds of associated digests. If a file has a different metadata digest, its content digest is calculated. If the file's content digest has changed, the file is considered changed. A directory that contains a different set of files, or files with changed content, is considered changed.
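The two-stage check can be sketched as below. This is a simplified illustration rather than Xvc's code: the metadata fingerprint is assumed to be file size plus modification time, BLAKE2 from the Python standard library stands in for the actual hash algorithm, and all function names are hypothetical.

```python
import hashlib
import os


def metadata_digest(path):
    """Cheap fingerprint from size and modification time."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def content_digest(path):
    """Expensive fingerprint of the file content (BLAKE2 as a stand-in hash)."""
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def has_changed(path, recorded_metadata, recorded_content):
    """Return True only if the content differs from the recorded digest."""
    if metadata_digest(path) == recorded_metadata:
        return False  # same metadata: skip reading the file
    return content_digest(path) != recorded_content
```

The cheap metadata comparison avoids hashing large files on every run, and the content comparison avoids treating a file as changed when only its timestamp differs.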

xvc file untrack

Synopsis

$ xvc file untrack --help
Untrack (delete) files from Xvc and possibly storages

Usage: xvc file untrack [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...  Files/directories to untrack

Options:
      --restore-versions <RESTORE_VERSIONS>
          Restore all versions to a directory before deleting the cache files
  -h, --help
          Print help

Examples

This command removes a file from Xvc tracking and optionally deletes it from the local filesystem, cache, and the storages.

It only works if the file is tracked by Xvc.

$ git init
...

$ xvc init

$ xvc file track 'd*.txt'

$ xvc file list
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


Without any options, it removes the file from Xvc tracking and the cache.

Warning

xvc file untrack doesn't modify the .gitignore files to remove the previously tracked files. You must do it manually if you want to track the file with Git.

$ xvc file untrack data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3

$ git status
On branch [..]
nothing to commit, working tree clean

If you have rechecked the file as a symlink or reflink, untracking replaces the link with a copy of the file in the workspace.

$ xvc file track data.txt --as symlink

$ lsd -l
lrwxr-xr-x [..] data.txt ⇒ [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt

$ xvc file untrack data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3

$ lsd -l
.rw-rw-rw- [..] data.txt

If there are multiple versions of the file, it removes them all and restores the latest version.

If you want to restore all versions of the file, you can specify a directory to restore them.

$ xvc file track data.txt

$ perl -pi -e 's/a/e/g' data.txt

$ xvc file carry-in data.txt

$ xvc file untrack data.txt --restore-versions data-versions/
[COPY] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt -> [CWD]/data-versions/data-b3-660-2cf-f6a4.txt
[COPY] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt -> [CWD]/data-versions/data-b3-c85-f3e-8108.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
[DELETE] [CWD]/.xvc/b3/660/2cf
[DELETE] [CWD]/.xvc/b3/660
[DELETE] [CWD]/.xvc/b3

$ lsd -l data-versions/
.r--r--r-- [..] data-b3-660-2cf-f6a4.txt
.r--r--r-- [..] data-b3-c85-f3e-8108.txt
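
The paths printed above hint at how a digest maps to a cache location and to a restored file name. The following is a minimal Python sketch of that mapping, inferred only from the transcript: a b3 prefix, then the first 3, next 3 and remaining 58 hex characters of the digest, then 0 plus the original extension. The function names cache_path and restored_name are hypothetical and not part of Xvc, and the real implementation may differ.

```python
from pathlib import PurePosixPath


def cache_path(digest_hex: str, extension: str, algo_prefix: str = "b3") -> PurePosixPath:
    """Build a cache path shaped like the ones in the transcript (assumed layout)."""
    if len(digest_hex) != 64:
        raise ValueError("expected a 64-character hex digest")
    return PurePosixPath(
        ".xvc", algo_prefix, digest_hex[:3], digest_hex[3:6], digest_hex[6:], f"0{extension}"
    )


def restored_name(original: str, digest_hex: str, algo_prefix: str = "b3") -> str:
    """Build a per-version file name like data-b3-c85-f3e-8108.txt (assumed scheme)."""
    path = PurePosixPath(original)
    short = "-".join([algo_prefix, digest_hex[:3], digest_hex[3:6], digest_hex[6:10]])
    return f"{path.stem}-{short}{path.suffix}"


# Digest taken from the transcript above.
digest = "c85f3e8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496"
print(cache_path(digest, ".txt"))
print(restored_name("data.txt", digest))
```

Because the address is derived from the content, two versions of data.txt never collide in the cache, which is why both can be restored side by side.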

If multiple paths are pointing to the same cache file (with deduplication), the cache file will not be deleted. In this case, untrack reports other paths pointing to the same cache file. You must untrack all of them to delete the cache file.

$ xvc file track data.txt

$ xvc file copy data.txt data2.txt --as symlink

$ xvc file untrack data.txt
Not deleting b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt (for data.txt) because it's also used by data2.txt

$ tree .xvc/b3/
.xvc/b3/
└── 660
    └── 2cf
        └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
            └── 0.txt

4 directories, 1 file

$ xvc file untrack data2.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
[DELETE] [CWD]/.xvc/b3/660/2cf
[DELETE] [CWD]/.xvc/b3/660
[DELETE] [CWD]/.xvc/b3
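
The deduplication behavior shown above can be thought of as reference counting: a cache file is deleted only when the last path that uses it is untracked. Here is a small Python sketch of that idea. The CacheIndex class and its methods are hypothetical illustrations and say nothing about how Xvc stores this information internally.

```python
from collections import defaultdict


class CacheIndex:
    """Toy index that remembers which workspace paths share a cached digest."""

    def __init__(self):
        self._paths_by_digest = defaultdict(set)
        self._digest_by_path = {}

    def track(self, path: str, digest: str) -> None:
        self._paths_by_digest[digest].add(path)
        self._digest_by_path[path] = digest

    def untrack(self, path: str):
        """Return (cache_file_deleted, other_paths_still_using_it)."""
        digest = self._digest_by_path.pop(path)
        users = self._paths_by_digest[digest]
        users.discard(path)
        if users:
            # Another path still points to the same content: keep the cache file.
            return False, sorted(users)
        del self._paths_by_digest[digest]
        return True, []


index = CacheIndex()
index.track("data.txt", "6602cff6")
index.track("data2.txt", "6602cff6")
print(index.untrack("data.txt"))   # cache file is kept, data2.txt is reported
print(index.untrack("data2.txt"))  # last user is gone, cache file can be deleted
```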

xvc file list

Synopsis

$ xvc file list --help
List tracked and untracked elements in the workspace

Usage: xvc file list [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to list.
          
          If not supplied, lists all files under the current directory.

Options:
  -f, --format <FORMAT>
          A string for each row of the output table
          
          The following are the keys for each row:
          
          - {{acd8}}:  actual content digest from the workspace file. First 8 digits.
          - {{acd64}}:  actual content digest. All 64 digits.
          - {{aft}}:  actual file type. Whether the entry is a file (F), directory (D),
            symlink (S), hardlink (H) or reflink (R).
          - {{asz}}:  actual size. The size of the workspace file in bytes. It uses MB,
            GB and TB to represent sizes larger than 1MB.
          - {{ats}}:  actual timestamp. The timestamp of the workspace file.
          - {{name}}: The name of the file or directory.
          - {{cst}}:  cache status. One of "=", ">", "<", "X", or "?" to show
            whether the file timestamp is the same as the cached timestamp, newer,
            older, not cached or not tracked.
          - {{rcd8}}:  recorded content digest stored in the cache. First 8 digits.
          - {{rcd64}}:  recorded content digest stored in the cache. All 64 digits.
          - {{rrm}}:  recorded recheck method. Whether the entry is linked to the workspace
            as a copy (C), symlink (S), hardlink (H) or reflink (R).
          - {{rsz}}:  recorded size. The size of the cached content in bytes. It uses
            MB, GB and TB to represent sizes larged than 1MB.
          - {{rts}}:  recorded timestamp. The timestamp of the cached content.
          
          The default format can be set with file.list.format in the config file.

  -s, --sort <SORT>
          Sort criteria.
          
          It can be one of none (default), name-asc, name-desc, size-asc, size-desc, ts-asc, ts-desc.
          
          The default option can be set with file.list.sort in the config file.

      --no-summary
          Don't show total number and size of the listed files.
          
          The default option can be set with file.list.no_summary in the config file.

  -a, --show-dot-files
          Don't hide dot files
          
          If not supplied, hides dot files like .gitignore and .xvcignore

  -h, --help
          Print help (see a summary with '-h')

Examples

For these examples, we'll create a directory tree with five directories, each having five files.

$ xvc-test-helper create-directory-tree --directories 5 --files 5 --seed 20230213

$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
├── dir-0002
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
├── dir-0003
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
├── dir-0004
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
└── dir-0005
    ├── file-0001.bin
    ├── file-0002.bin
    ├── file-0003.bin
    ├── file-0004.bin
    └── file-0005.bin

[..] directories, 25 files

The xvc file list command works only in Xvc repositories. As we haven't initialized a repository yet, it reports an error.

$ xvc file list
? 1
[ERROR] File Error: [E2004] Requires xvc repository.
Error: FileError { source: RequiresXvcRepository }

Let's initialize the repository.

$ git init
...

$ xvc init

Now it lists all files and directories.

$ xvc file list --sort name-asc
DX         224 [..]                   dir-0001
FX        2001 [..]          1953f05d dir-0001/file-0001.bin
FX        2002 [..]          7e807161 dir-0001/file-0002.bin
FX        2003 [..]          d2432259 dir-0001/file-0003.bin
FX        2004 [..]          63535612 dir-0001/file-0004.bin
FX        2005 [..]          447933dc dir-0001/file-0005.bin
DX         224 [..]                   dir-0002
FX        2001 [..]          1953f05d dir-0002/file-0001.bin
FX        2002 [..]          7e807161 dir-0002/file-0002.bin
FX        2003 [..]          d2432259 dir-0002/file-0003.bin
FX        2004 [..]          63535612 dir-0002/file-0004.bin
FX        2005 [..]          447933dc dir-0002/file-0005.bin
DX         224 [..]                   dir-0003
FX        2001 [..]          1953f05d dir-0003/file-0001.bin
FX        2002 [..]          7e807161 dir-0003/file-0002.bin
FX        2003 [..]          d2432259 dir-0003/file-0003.bin
FX        2004 [..]          63535612 dir-0003/file-0004.bin
FX        2005 [..]          447933dc dir-0003/file-0005.bin
DX         224 [..]                   dir-0004
FX        2001 [..]          1953f05d dir-0004/file-0001.bin
FX        2002 [..]          7e807161 dir-0004/file-0002.bin
FX        2003 [..]          d2432259 dir-0004/file-0003.bin
FX        2004 [..]          63535612 dir-0004/file-0004.bin
FX        2005 [..]          447933dc dir-0004/file-0005.bin
DX         224 [..]                   dir-0005
FX        2001 [..]          1953f05d dir-0005/file-0001.bin
FX        2002 [..]          7e807161 dir-0005/file-0002.bin
FX        2003 [..]          d2432259 dir-0005/file-0003.bin
FX        2004 [..]          63535612 dir-0005/file-0004.bin
FX        2005 [..]          447933dc dir-0005/file-0005.bin
Total #: 30 Workspace Size:       51195 Cached Size:           0


By default, the command hides dotfiles. If you also want to show them, use the --show-dot-files (-a) flag.

$ xvc file list --sort name-asc --show-dot-files
FX        [..] [..]          [..] .gitignore
FX        [..] [..]          [..] .xvcignore
DX         224 [..]                   dir-0001
FX        2001 [..]          1953f05d dir-0001/file-0001.bin
FX        2002 [..]          7e807161 dir-0001/file-0002.bin
FX        2003 [..]          d2432259 dir-0001/file-0003.bin
FX        2004 [..]          63535612 dir-0001/file-0004.bin
FX        2005 [..]          447933dc dir-0001/file-0005.bin
DX         224 [..]                   dir-0002
FX        2001 [..]          1953f05d dir-0002/file-0001.bin
FX        2002 [..]          7e807161 dir-0002/file-0002.bin
FX        2003 [..]          d2432259 dir-0002/file-0003.bin
FX        2004 [..]          63535612 dir-0002/file-0004.bin
FX        2005 [..]          447933dc dir-0002/file-0005.bin
DX         224 [..]                   dir-0003
FX        2001 [..]          1953f05d dir-0003/file-0001.bin
FX        2002 [..]          7e807161 dir-0003/file-0002.bin
FX        2003 [..]          d2432259 dir-0003/file-0003.bin
FX        2004 [..]          63535612 dir-0003/file-0004.bin
FX        2005 [..]          447933dc dir-0003/file-0005.bin
DX         224 [..]                   dir-0004
FX        2001 [..]          1953f05d dir-0004/file-0001.bin
FX        2002 [..]          7e807161 dir-0004/file-0002.bin
FX        2003 [..]          d2432259 dir-0004/file-0003.bin
FX        2004 [..]          63535612 dir-0004/file-0004.bin
FX        2005 [..]          447933dc dir-0004/file-0005.bin
DX         224 [..]                   dir-0005
FX        2001 [..]          1953f05d dir-0005/file-0001.bin
FX        2002 [..]          7e807161 dir-0005/file-0002.bin
FX        2003 [..]          d2432259 dir-0005/file-0003.bin
FX        2004 [..]          63535612 dir-0005/file-0004.bin
FX        2005 [..]          447933dc dir-0005/file-0005.bin
Total #: 32 Workspace Size:       51443 Cached Size:           0


You can also hide the summary below the list to get only the list of files.

$ xvc file list  --sort name-asc --no-summary
DX         224 [..]                   dir-0001
FX        2001 [..]          1953f05d dir-0001/file-0001.bin
FX        2002 [..]          7e807161 dir-0001/file-0002.bin
FX        2003 [..]          d2432259 dir-0001/file-0003.bin
FX        2004 [..]          63535612 dir-0001/file-0004.bin
FX        2005 [..]          447933dc dir-0001/file-0005.bin
DX         224 [..]                   dir-0002
FX        2001 [..]          1953f05d dir-0002/file-0001.bin
FX        2002 [..]          7e807161 dir-0002/file-0002.bin
FX        2003 [..]          d2432259 dir-0002/file-0003.bin
FX        2004 [..]          63535612 dir-0002/file-0004.bin
FX        2005 [..]          447933dc dir-0002/file-0005.bin
DX         224 [..]                   dir-0003
FX        2001 [..]          1953f05d dir-0003/file-0001.bin
FX        2002 [..]          7e807161 dir-0003/file-0002.bin
FX        2003 [..]          d2432259 dir-0003/file-0003.bin
FX        2004 [..]          63535612 dir-0003/file-0004.bin
FX        2005 [..]          447933dc dir-0003/file-0005.bin
DX         224 [..]                   dir-0004
FX        2001 [..]          1953f05d dir-0004/file-0001.bin
FX        2002 [..]          7e807161 dir-0004/file-0002.bin
FX        2003 [..]          d2432259 dir-0004/file-0003.bin
FX        2004 [..]          63535612 dir-0004/file-0004.bin
FX        2005 [..]          447933dc dir-0004/file-0005.bin
DX         224 [..]                   dir-0005
FX        2001 [..]          1953f05d dir-0005/file-0001.bin
FX        2002 [..]          7e807161 dir-0005/file-0002.bin
FX        2003 [..]          d2432259 dir-0005/file-0003.bin
FX        2004 [..]          63535612 dir-0005/file-0004.bin
FX        2005 [..]          447933dc dir-0005/file-0005.bin


Output Format

With the default output format, the first two letters show the path type and recheck method, respectively.

For example, if you track dir-0001 as copy, the first letter is F for the files and D for the directories. The second letter is C for files, meaning the file is a copy of the cached file, and X for directories, meaning they are not in the cache. Similar to Git, Xvc tracks only files; directories are considered collections of files.

$ xvc file track dir-0001/

$ xvc file list dir-0001/
FC        2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
FC        2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC        2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC        2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC        2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size:       10015 Cached Size:       10015


If you add another set of files as hardlinks to the cached copies, it will print the second letter as H.

$ xvc file track dir-0002/ --recheck-method hardlink

$ xvc file list dir-0002
FH        2005 [..] 447933dc 447933dc dir-0002/file-0005.bin
FH        2004 [..] 63535612 63535612 dir-0002/file-0004.bin
FH        2003 [..] d2432259 d2432259 dir-0002/file-0003.bin
FH        2002 [..] 7e807161 7e807161 dir-0002/file-0002.bin
FH        2001 [..] 1953f05d 1953f05d dir-0002/file-0001.bin
Total #: 5 Workspace Size:       10015 Cached Size:       10015


Note that hardlinks are regular files that share an inode under alternative paths, so they are detected as F.

Symbolic links are typically reported as SS in the first letters. It means they are symbolic links on the file system and their recheck method is also symbolic links.

$ xvc file track dir-0003 --recheck-method symlink

$ xvc file list dir-0003
SS         [..] 447933dc          dir-0003/file-0005.bin
SS         [..] 63535612          dir-0003/file-0004.bin
SS         [..] d2432259          dir-0003/file-0003.bin
SS         [..] 7e807161          dir-0003/file-0002.bin
SS         [..] 1953f05d          dir-0003/file-0001.bin
Total #: 5 Workspace Size:         [..] Cached Size:       10015


Although not all filesystems support it, R represents reflinks.

Globs

You may use globs to list files.

$ xvc file list 'dir-*/*-0001.bin'
FX        2001 [..]          1953f05d dir-0005/file-0001.bin
FX        2001 [..]          1953f05d dir-0004/file-0001.bin
SS         [..] 1953f05d          dir-0003/file-0001.bin
FH        2[..] 1953f05d 1953f05d dir-0002/file-0001.bin
FC        2[..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size:        [..] Cached Size:        2001


Note that all these files are identical. They are cached once, and only one of them takes space in the cache.

You can also use multiple targets as globs.

$ xvc file list '*/*-0001.bin' '*/*-0002.bin'
FX        2002 [..]          7e807161 dir-0005/file-0002.bin
FX        2001 [..]          1953f05d dir-0005/file-0001.bin
FX        2002 [..]          7e807161 dir-0004/file-0002.bin
FX        2001 [..]          1953f05d dir-0004/file-0001.bin
SS        [..] 7e807161          dir-0003/file-0002.bin
SS        [..] 1953f05d          dir-0003/file-0001.bin
FH        [..] 7e807161 7e807161 dir-0002/file-0002.bin
FH        [..] 1953f05d 1953f05d dir-0002/file-0001.bin
FC        [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC        [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 10 Workspace Size:       [..] Cached Size:        4003


Sorting

You may sort xvc file list output by name, by modification time and by file size.

Use the --sort option to specify the sort criteria.

$ xvc file list --sort name-desc dir-0001/
FC        2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
FC        2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC        2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC        2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC        2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size:       10015 Cached Size:       10015


$ xvc file list --sort name-asc dir-0001/
FC        2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
FC        2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC        2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC        2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC        2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
Total #: 5 Workspace Size:       10015 Cached Size:       10015
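
The sort criteria follow a field-direction naming pattern (name-asc, size-desc, ts-asc and so on). As a rough illustration of how such a criterion string can be turned into an ordering, here is a short Python sketch. The sort_entries function and the dictionary keys name, size and ts are assumptions made for this example, not Xvc's internal data model.

```python
def sort_entries(entries, criterion="none"):
    """Sort a list of dicts by a criterion such as 'name-asc' or 'size-desc'."""
    if criterion == "none":
        return list(entries)  # keep the incoming order
    field, _, direction = criterion.partition("-")
    if field not in ("name", "size", "ts") or direction not in ("asc", "desc"):
        raise ValueError(f"unknown sort criterion: {criterion}")
    return sorted(entries, key=lambda entry: entry[field], reverse=(direction == "desc"))


entries = [
    {"name": "dir-0001/file-0002.bin", "size": 2002, "ts": 20},
    {"name": "dir-0001/file-0001.bin", "size": 2001, "ts": 10},
]
for entry in sort_entries(entries, "name-asc"):
    print(entry["size"], entry["name"])
```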


Column Format

You can specify the columns that the command prints.

For example, if you only want to see the file names, use {{name}} as the format string.

The following command sorts the files in dir-0001 by their workspace size in descending order, and prints their sizes and names.

$ xvc file list --format '{{asz}} {{name}}' --sort size-desc dir-0001/
       2005 dir-0001/file-0005.bin
       2004 dir-0001/file-0004.bin
       2003 dir-0001/file-0003.bin
       2002 dir-0001/file-0002.bin
       2001 dir-0001/file-0001.bin
Total #: 5 Workspace Size:       10015 Cached Size:       10015


If you want to compare the recorded (cached) hashes and actual hashes in the workspace, you can use the {{acd8}} {{rcd8}} {{name}} format string.

$ xvc file list --format '{{acd8}} {{rcd8}} {{name}}' --sort ts-asc dir-0001
1953f05d 1953f05d dir-0001/file-0001.bin
7e807161 7e807161 dir-0001/file-0002.bin
d2432259 d2432259 dir-0001/file-0003.bin
63535612 63535612 dir-0001/file-0004.bin
447933dc 447933dc dir-0001/file-0005.bin
Total #: 5 Workspace Size:       10015 Cached Size:       10015


Info

If {{acd8}} or {{acd64}} is not present in the format string, Xvc doesn't calculate these hashes. If you have a large number of files and the default format (which includes actual content hashes) runs slowly, you can customize it to leave out these columns.

If you want to get a quick glimpse of what needs to be carried in or rechecked, you can use the cache status {{cst}} column.

$ xvc-test-helper generate-random-file --size 100 dir-0001/a-new-file.bin

$ xvc file list --format '{{cst}} {{name}}' dir-0001/
= dir-0001/file-0005.bin
= dir-0001/file-0004.bin
= dir-0001/file-0003.bin
= dir-0001/file-0002.bin
= dir-0001/file-0001.bin
X dir-0001/a-new-file.bin
Total #: 6 Workspace Size:       10115 Cached Size:       10015


The cache status column shows = for files that are unchanged with respect to the cache, X for untracked files, > for files with a newer version in the cache, and < for files with a newer version in the workspace. The comparison is made between the recorded timestamp and the actual timestamp with an accuracy of 1 second.
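
The comparison described above can be sketched in a few lines of Python. This is only an illustration: cache_status is a hypothetical function, the direction of > and < follows the paragraph above, and "accuracy of 1 second" is interpreted here as truncating both timestamps to whole seconds, which is an assumption.

```python
from typing import Optional


def cache_status(actual_ts: float, recorded_ts: Optional[float]) -> str:
    """Return a one-character status from two timestamps given in seconds."""
    if recorded_ts is None:
        return "X"  # nothing recorded for this path
    actual, recorded = int(actual_ts), int(recorded_ts)  # 1-second granularity
    if actual == recorded:
        return "="
    # ">" means the recorded (cached) version is newer, "<" the workspace file.
    return ">" if recorded > actual else "<"


print(cache_status(1700000000.2, 1700000000.9))
print(cache_status(1700000000.0, None))
```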

xvc file hash

Synopsis

$ xvc file hash --help
Get digest hash of files with the supported algorithms

Usage: xvc file hash [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...  Files to process

Options:
  -a, --algorithm <ALGORITHM>
          Algorithm to calculate the hash. One of blake3, blake2, sha2, sha3. All algorithm variants produce 32-bytes digest
      --text-or-binary <TEXT_OR_BINARY>
          For "text" remove line endings before calculating the digest. Keep line endings if "binary". "auto" (default) detects the type by checking 0s in the first 8Kbytes, similar to Git [default: auto]
  -h, --help
          Print help
  -V, --version
          Print version
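
The --text-or-binary option can be illustrated with a short Python sketch. It assumes that "auto" treats a file as binary when a zero byte occurs in the first 8192 bytes, and that "text" drops CR and LF bytes before hashing; both are readings of the help text above, not confirmed details. The helper names are hypothetical, and only algorithms available in Python's standard library are shown (SHA-256, SHA3-256 and BLAKE2s as 32-byte stand-ins for sha2, sha3 and blake2; blake3 is left out because it is not in the standard library).

```python
import hashlib

HASHERS = {
    "sha2": hashlib.sha256,
    "sha3": hashlib.sha3_256,
    "blake2": hashlib.blake2s,  # 32-byte digest by default
}


def looks_binary(data: bytes, probe: int = 8192) -> bool:
    """Heuristic similar to Git: a NUL byte near the start means binary."""
    return b"\x00" in data[:probe]


def digest(data: bytes, mode: str = "auto", algorithm: str = "sha2") -> bytes:
    """Hash data as text (line endings removed) or as binary (bytes untouched)."""
    if mode == "auto":
        mode = "binary" if looks_binary(data) else "text"
    if mode == "text":
        data = data.replace(b"\r", b"").replace(b"\n", b"")
    return HASHERS[algorithm](data).digest()


unix = b"line one\nline two\n"
windows = b"line one\r\nline two\r\n"
print(digest(unix, "text") == digest(windows, "text"))
print(digest(unix, "binary") == digest(windows, "binary"))
```

Hashing as text makes the digest independent of the platform's line endings, so the same text file checked out on different systems keeps the same address.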

xvc file recheck

Synopsis

$ xvc file recheck --help
Get files from cache by copy or *link

Usage: xvc file recheck [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to recheck

Options:
      --recheck-method <RECHECK_METHOD>
          How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink support requires "reflink" feature to be enabled and uses copy if the underlying file system doesn't support it.

      --no-parallel
          Don't use parallelism

      --force
          Force even if target exists

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

This command has an alias xvc file checkout if you feel more at home with Git terminology.

Examples

Rechecking is analogous to git checkout. It copies or links a cached file to the workspace.

Let's create an example directory hierarchy as a showcase.

$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 231123
$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
└── dir-0002
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin

3 directories, 6 files

Start by tracking files.

$ git init
...
$ xvc init

$ xvc file track dir-*

Once you have added the file to the cache, you can delete the workspace copy.

$ rm dir-0001/file-0001.bin
$ lsd -l dir-0001/file-*
total[..]
drwxr-xr-x [..] dir-0001
drwxr-xr-x [..] dir-0002

Then, recheck the file. By default, it makes a copy of the file.

$ xvc file recheck dir-0001/file-0001.bin

$ lsd -l
.rw-rw-rw- [..] data.txt

You can track and recheck complete directories.

$ xvc file track dir-0002/
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/
$ ls -l dir-0002/
total 24
-rw-rw-rw-[..] file-0001.bin
-rw-rw-rw-[..] file-0002.bin
-rw-rw-rw-[..] file-0003.bin

You can use glob patterns to recheck files.

$ xvc file track 'dir-*'

You can update the recheck method of a file. Otherwise, the file keeps the method it had before.

$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/ --as symlink
$ ls -l dir-0002/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin

$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/

$ ls -l dir-0002/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin

Symlinks and hardlinks are read-only. Recheck the file as a copy if you want to modify it.

$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'
? 1
zsh:1: permission denied: dir-0002/file-0001.bin

$ xvc file recheck dir-0002/file-0001.bin --as copy

$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'

Note that, as files in the cache are kept read-only, hardlinks and symlinks are also read-only. Files rechecked as copy are made read-write explicitly.

$ xvc -vv file recheck data.txt --as hardlink

$ ls -l
total[..]
drwxr-xr-x[..] dir-0001
drwxr-xr-x[..] dir-0002

Reflinks are supported by Xvc, but the underlying file system must also support them. Otherwise, Xvc falls back to copy.

$ rm -f data.txt
$ xvc file recheck data.txt --as reflink

The above command will create a read-only link in macOS APFS and a copy in ext4 or NTFS file systems.
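
The difference between the recheck methods can be reproduced with plain file system calls. The Python sketch below is an illustration only: the recheck function is hypothetical, it always replaces an existing target (roughly what --force does), and it leaves out reflinks because the Python standard library has no portable call for them.

```python
import os
import shutil
import stat
import tempfile


def recheck(cache_file: str, workspace_file: str, method: str = "copy") -> None:
    """Place cache_file at workspace_file as a copy, symlink or hardlink."""
    if os.path.lexists(workspace_file):
        os.remove(workspace_file)  # this sketch always overwrites
    if method == "copy":
        shutil.copyfile(cache_file, workspace_file)
        # Copies are made writable explicitly; links inherit the read-only cache file.
        mode = os.stat(workspace_file).st_mode
        os.chmod(workspace_file, mode | stat.S_IWUSR)
    elif method == "symlink":
        os.symlink(os.path.abspath(cache_file), workspace_file)
    elif method == "hardlink":
        os.link(cache_file, workspace_file)
    else:
        raise ValueError(f"unsupported method: {method}")


with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cached.bin")
    with open(cached, "wb") as handle:
        handle.write(b"content")
    os.chmod(cached, 0o444)  # cache files are kept read-only
    target = os.path.join(tmp, "file.bin")
    for method in ("copy", "symlink", "hardlink"):
        recheck(cached, target, method)
        print(method, "is link:", os.path.islink(target),
              "same file as cache:", os.path.samefile(cached, target))
```

A copy costs extra disk space but can be edited freely, while both link types share the cached bytes and therefore share its read-only permissions.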

xvc file carry-in

Copies the file changes to cache.

Synopsis

$ xvc file carry-in --help
Carry (commit) changed files to cache

Usage: xvc file carry-in [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to add

Options:
      --text-or-binary <TEXT_OR_BINARY>
          Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)

      --force
          Carry in targets even their content digests are not changed.
          
          This removes the file in cache and re-adds it.

      --no-parallel
          Don't use parallelism

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Examples

The carry-in command works in Xvc repositories.

$ git init
...
$ xvc init

We first track a file.

$ xvc file track data.txt

$ xvc file list data.txt
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


We update the file with a command.

$ perl -i -pe 's/a/ee/g' data.txt

$ cat data.txt
Oh, deetee, my, deetee

$ xvc file list data.txt
FC          23 [..] c85f3e81 e37c686a data.txt
Total #: 1 Workspace Size:          23 Cached Size:          19


Note that the size of the file has increased, as we replaced each a with ee.

$ xvc file carry-in data.txt

$ xvc file list data.txt
FC          23 [..] e37c686a e37c686a data.txt
Total #: 1 Workspace Size:          23 Cached Size:          23
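
The steps above, where the actual digest first differs from the recorded one and then matches again after carry-in, can be mimicked with a toy content-addressed cache. This Python sketch is a simplified model under stated assumptions: TinyCache is a hypothetical class, SHA-256 stands in for the real digest, and cache files are stored flat under their digest rather than in Xvc's directory layout.

```python
import hashlib
import tempfile
from pathlib import Path


class TinyCache:
    """Toy cache that stores every carried-in version under its content digest."""

    def __init__(self, root: Path):
        self.root = root
        self.recorded = {}  # workspace path -> recorded digest

    def carry_in(self, path: Path, force: bool = False) -> bool:
        """Store the current content; return False if nothing needed to be done."""
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if self.recorded.get(path) == digest and not force:
            return False  # actual and recorded digests already match
        (self.root / digest).write_bytes(data)
        self.recorded[path] = digest
        return True


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp, "cache")
    cache_dir.mkdir()
    cache = TinyCache(cache_dir)
    data_file = Path(tmp, "data.txt")

    data_file.write_bytes(b"first version\n")
    cache.carry_in(data_file)
    data_file.write_bytes(b"second version\n")
    cache.carry_in(data_file)
    # Both versions remain addressable after the second carry-in.
    print(len(list(cache_dir.iterdir())), "versions in cache")
```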


xvc file send

Synopsis

$ xvc file send --help
Send (push, upload) files to external storages

Usage: xvc file send [OPTIONS] --storage <STORAGE> [TARGETS]...

Arguments:
  [TARGETS]...  Targets to send/push/upload to storage

Options:
  -s, --storage <STORAGE>  Storage name or guid to send the files
      --force              Force even if the files are already present in the storage
  -h, --help               Print help

xvc file bring

Synopsis

$ xvc file bring --help
Bring (download, pull, fetch) files from external storages

Usage: xvc file bring [OPTIONS] --storage <STORAGE> [TARGETS]...

Arguments:
  [TARGETS]...
          Targets to bring from the storage

Options:
  -s, --storage <STORAGE>
          Storage name or guid to send the files

      --force
          Force even if the files are already present in the workspace

      --no-recheck
          Don't recheck (checkout) after bringing the file to cache.
          
          This makes the command similar to `git fetch` in Git. It just updates the cache, and doesn't copy/link the file to workspace.

      --recheck-as <RECHECK_AS>
          Recheck (checkout) the file in one of the four alternative ways. (See `xvc file recheck`) and [RecheckMethod]

  -h, --help
          Print help (see a summary with '-h')

xvc file move

Synopsis

$ xvc file move --help
Move files to another location in the workspace

Usage: xvc file move [OPTIONS] <SOURCE> <DESTINATION>

Arguments:
  <SOURCE>
          Source file, glob or directory within the workspace.
          
          If the source ends with a slash, it's considered a directory and all files in that directory are copied.
          
          If there are multiple source files, the destination must be a directory.

  <DESTINATION>
          Location we move file(s) to within the workspace.
          
          If this ends with a slash, it's considered a directory and created if it doesn't exist.
          
          If the number of source files is more than one, the destination must be a directory.

Options:
      --recheck-method <RECHECK_METHOD>
          How the destination should be rechecked: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink uses copy if the underlying file system doesn't support it.

      --no-recheck
          Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Examples

This command is used to move a set of files to another location in the workspace.

By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.

xvc file move works only with tracked files.

$ git init
...
$ xvc init

$ xvc file track data.txt

$ lsd -l
.rw-rw-rw- [..] data.txt

Once you add the file to the cache, you can move the file to another location.

$ xvc file move data.txt data2.txt

$ ls
data2.txt

Xvc can change the destination file's recheck method.

$ xvc file move data2.txt data3.txt --as symlink

$ ls -l
total[..]
lrwxr-xr-x[..] data3.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt

You can move files without them being in the workspace if they are in the cache.

$ rm -f data3.txt

$ xvc file move data3.txt data4.txt

$ ls -l
total 0
lrwxr-xr-x[..] data4.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt

You can use glob patterns to move multiple files. In this case, the destination must be a directory.

$ xvc file copy data4.txt data5.txt

$ xvc file move d*.txt another-set/ --as hardlink

$ xvc file list another-set/
FH          [..] c85f3e81 c85f3e81 another-set/data5.txt
FH          [..] c85f3e81 c85f3e81 another-set/data4.txt
Total #: 2 Workspace Size:          38 Cached Size:          19


You can also skip rechecking. In this case, Xvc won't create any copies in the workspace, and the files don't need to be available in the cache. They will still be listed by the xvc file list command.

$ xvc file move another-set/data5.txt data6.txt --no-recheck

$ xvc file list
XH                                 c85f3e81          data6.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data4.txt
DX          96 [..]                   another-set
Total #: 3 Workspace Size:         115 Cached Size:          19


Later, you can recheck them in the workspace.

$ xvc file recheck data6.txt

$ lsd -l data6.txt
.rw-rw-rw- [..] data6.txt

xvc file copy

Synopsis

$ xvc file copy --help
Copy from source to another location in the workspace

Usage: xvc file copy [OPTIONS] <SOURCE> <DESTINATION>

Arguments:
  <SOURCE>
          Source file, glob or directory within the workspace.
          
          If the source ends with a slash, it's considered a directory and all files in that directory are copied.
          
          If the number of source files is more than one, the destination must be a directory.

  <DESTINATION>
          Location we copy file(s) to within the workspace.
          
          If the target ends with a slash, it's considered a directory and created if it doesn't exist.
          
          If the number of source files is more than one, the destination must be a directory.

Options:
      --recheck-method <RECHECK_METHOD>
          How the targets should be rechecked: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink uses copy if the underlying file system doesn't support it.

      --force
          Force even if target exists

      --no-recheck
          Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace

      --name-only
          When copying multiple files, by default whole path is copied to the destination. This option sets the destination to be created with the file name only

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
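
The rules for SOURCE and DESTINATION in the help text can be summarized in a small Python sketch. It is an approximation under stated assumptions: resolve_destinations is a hypothetical function, it treats a destination as a directory only when it ends with a slash (it never looks at the file system), and it models --name-only as keeping just the file name of each source.

```python
from pathlib import PurePosixPath


def resolve_destinations(sources, destination, name_only=False):
    """Map each source path to the path it would get under destination."""
    dest_is_dir = destination.endswith("/")
    if len(sources) > 1 and not dest_is_dir:
        raise ValueError("destination must be a directory for multiple sources")
    if not dest_is_dir:
        return {sources[0]: destination}
    base = PurePosixPath(destination)
    mapping = {}
    for source in sources:
        # By default the whole source path is kept below the destination.
        relative = PurePosixPath(source).name if name_only else source
        mapping[source] = str(base / relative)
    return mapping


print(resolve_destinations(["data.txt"], "data2.txt"))
print(resolve_destinations(["dir-0001/a.bin", "dir-0002/b.bin"], "another-set/"))
print(resolve_destinations(["dir-0001/a.bin", "dir-0002/b.bin"], "another-set/", name_only=True))
```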

Examples

This command is used to copy a set of files to another location in the workspace.

By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.

xvc file copy works only with tracked files.

$ git init
...
$ xvc init

$ xvc file track data.txt

$ lsd -l
.rw-rw-rw- [..] data.txt

Once you add the file to the cache, you can copy the file to another location.

$ xvc file copy data.txt data2.txt

$ ls
data.txt
data2.txt

Note that multiple copies of the same content don't add to the cache size.

$ xvc file list data.txt
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


$ xvc file list 'data*'
FC          19 [..] c85f3e81 c85f3e81 data2.txt
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 2 Workspace Size:          38 Cached Size:          19


Xvc can change the destination file's recheck method.

$ xvc file copy data.txt data3.txt --as symlink

$ lsd -l
.rw-rw-rw- [..] data.txt
.rw-rw-rw- [..] data2.txt
lrwxr-xr-x [..] data3.txt ⇒ [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt

You can create views of your data by copying it to another location.

$ xvc file copy 'd*' another-set/ --as hardlink

$ xvc file list another-set/
FH          19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data.txt
Total #: 3 Workspace Size:          57 Cached Size:          19


If the source files you specify are changed, Xvc cancels the copy operation. Please either recheck old versions or carry in new versions.

$ perl -i -pe 's/a/ee/g' data.txt

$ xvc file copy data.txt data5.txt
? 1
[ERROR] File Error: Sources have changed, please carry-in or recheck following files before copying:
data.txt
Error: FileError { source: AnyhowError { source: Sources have changed, please carry-in or recheck following files before copying:
data.txt } }

You can copy files without them being in the workspace if they are in the cache.

$ rm -f data.txt

$ xvc file copy data.txt data6.txt

$ lsd -l data6.txt
.rw-rw-rw- [..] data6.txt

You can also skip rechecking. In this case, Xvc won't create any copies in the workspace, and the files don't need to be available in the cache. They will still be listed by the xvc file list command.

$ xvc file copy data.txt data7.txt --no-recheck

$ ls
another-set
data2.txt
data3.txt
data6.txt

$ xvc file list
XC             [..] c85f3e81          data7.txt
FC          19 [..] c85f3e81 c85f3e81 data6.txt
SS        [..] [..] c85f3e81          data3.txt
FC          19 [..] c85f3e81 c85f3e81 data2.txt
XC             [..] c85f3e81          data.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data.txt
DX         160 [..]                   another-set
Total #: 9 Workspace Size:         [..] Cached Size:          19


Later, you can recheck them to work in the workspace.

$ xvc file recheck data7.txt

$ lsd -l data7.txt
.rw-rw-rw- [..] data7.txt

xvc file remove

Synopsis

$ xvc file remove --help
Remove files from Xvc and possibly storages

Usage: xvc file remove [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to remove

Options:
      --from-cache
          Remove files from cache

      --from-storage <FROM_STORAGE>
          Remove files from storage

      --all-versions
          Remove all versions of the file

      --only-version <ONLY_VERSION>
          Remove only the specified version of the file
          
          Versions are specified with the content hash 123-456-789abcd. Dashes are optional. Prefix must be unique. If the prefix is not unique, the command will fail.

      --force
          Remove the targets even if they are used by other targets (via deduplication)

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Examples

This command deletes files from the Xvc cache or storage. It doesn't remove the file from Xvc tracking.

Tip

If you want to remove a workspace file or link, you can use the usual rm command. If the file is tracked and carried in to the cache, you can always recheck it.

This command only works if the file is tracked by Xvc.

$ git init
...

$ xvc init

$ xvc file track 'd*.txt'

$ xvc file list
FC        [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


$ tree .xvc/b3/
.xvc/b3/
└── c85
    └── f3e
        └── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
            └── 0.txt

4 directories, 1 file
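The tree above shows how a content digest maps to a cache location: a b3 prefix, the first three hex characters, the next three, the remainder, and a 0.txt file. A minimal sketch of that mapping, inferred only from the paths printed in this document (cache_path is a hypothetical helper, and the meaning of the leading 0 in the file name is not modeled):

```python
from pathlib import PurePosixPath


def cache_path(digest_hex: str, extension: str = "txt", prefix: str = "b3") -> PurePosixPath:
    """Split a hex digest into the 3/3/rest directory layout seen under .xvc/."""
    d = digest_hex.lower()
    return PurePosixPath(prefix, d[:3], d[3:6], d[6:], f"0.{extension}")


p = cache_path("c85f3e8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496")
print(p)
```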

If you don't specify either --from-cache or --from-storage, this command fails with an error.

$ xvc file remove data.txt
? failed
error: the following required arguments were not provided:
  --from-cache
  --from-storage <FROM_STORAGE>

Usage: xvc file remove --from-cache --from-storage <FROM_STORAGE> <TARGETS>...

For more information, try '--help'.

You can remove the file from the cache. The file is still tracked by Xvc and available in the workspace.

$ xvc file remove --from-cache data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3

$ ls
data.txt

$ ls .xvc/
config.local.toml
config.toml
ec
store

You can carry the missing file from the workspace to the cache. Use --force to overwrite the cache as carry-in doesn't overwrite the cache by default.

$ xvc file carry-in --force data.txt

$ xvc file list
FC         [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


$ tree .xvc/b3/
.xvc/b3/
└── c85
    └── f3e
        └── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
            └── 0.txt

4 directories, 1 file

You can specify a version of a file to delete from the cache. The versions can be specified like 123-456-789abcd. Dashes are optional. The prefix must be unique.

$ perl -pi -e 's/a/e/g' data.txt

$ xvc file carry-in data.txt

$ tree .xvc/b3/
.xvc/b3/
├── 660
│   └── 2cf
│       └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
│           └── 0.txt
└── c85
    └── f3e
        └── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
            └── 0.txt

7 directories, 2 files

$ xvc file list
FC         [..] 6602cff6 6602cff6 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


$ xvc file remove --from-cache --only-version c85-f3e data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85

$ tree .xvc/b3/
.xvc/b3/
└── 660
    └── 2cf
        └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
            └── 0.txt

4 directories, 1 file
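The version matching rule described above (dashes are optional, the prefix must be unique) can be sketched as follows. The resolve_version helper is hypothetical; the two digests are the ones shown in the cache tree of this example.

```python
from typing import List


def resolve_version(prefix: str, cached_digests: List[str]) -> str:
    """Return the single cached digest that starts with the given prefix."""
    wanted = prefix.replace("-", "").lower()  # dashes are only for readability
    matches = [d for d in cached_digests if d.startswith(wanted)]
    if len(matches) != 1:
        raise ValueError(
            f"prefix {prefix!r} matches {len(matches)} versions; it must match exactly one"
        )
    return matches[0]


versions = [
    "c85f3e8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496",
    "6602cff6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367",
]
print(resolve_version("c85-f3e", versions))
```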

You can also remove all versions of a file from the cache.

$ xvc-test-helper generate-random-file --seed 0 data.txt

$ xvc file carry-in data.txt

$ rm data.txt

$ xvc-test-helper generate-random-file --seed 1 data.txt

$ xvc file carry-in data.txt

$ tree .xvc/b3/
.xvc/b3/
├── 017
│   └── ad8
│       └── 6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0
│           └── 0.txt
├── 660
│   └── 2cf
│       └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
│           └── 0.txt
└── fef
    └── e16
        └── d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152
            └── 0.txt

10 directories, 3 files

$ xvc file remove --from-cache --all-versions data.txt
[DELETE] [CWD]/.xvc/b3/017/ad8/6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0/0.txt
[DELETE] [CWD]/.xvc/b3/017/ad8/6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0
[DELETE] [CWD]/.xvc/b3/017/ad8
[DELETE] [CWD]/.xvc/b3/017
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
[DELETE] [CWD]/.xvc/b3/660/2cf
[DELETE] [CWD]/.xvc/b3/660
[DELETE] [CWD]/.xvc/b3/fef/e16/d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152/0.txt
[DELETE] [CWD]/.xvc/b3/fef/e16/d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152
[DELETE] [CWD]/.xvc/b3/fef/e16
[DELETE] [CWD]/.xvc/b3/fef
[DELETE] [CWD]/.xvc/b3

$ ls .xvc/
config.local.toml
config.toml
ec
store

You can use this command to remove cached files from (remote) storages as well.

$ xvc-test-helper generate-random-file --seed 2 data.txt
$ xvc file carry-in data.txt

$ xvc storage new local --name local-storage --path '../local-storage'
$ xvc file send data.txt --to local-storage

$ tree ../local-storage/
../local-storage/
└── [..]
    └── b3
        └── 218
            └── 2b7
                └── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
                    └── 0.txt

6 directories, 1 file

$ xvc file remove data.txt --from-storage local-storage

$ tree ../local-storage/
../local-storage/
└── [..]
    └── b3
        └── 218
            └── 2b7
                └── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881

6 directories, 0 files

Note that storage delete implementations differ slightly from cache deletion: they don't remove the emptied directories. This avoids unnecessary round-trip existence checks.

If multiple paths point to the same cache file (deduplication), the cache file will not be deleted. In this case, remove reports the other paths pointing to the same cache file. You must use --force to delete the cache file.

$ xvc-test-helper generate-random-file --seed 3 data.txt

$ xvc file carry-in data.txt

$ xvc file copy data.txt data2.txt --as symlink
$ xvc file list
SS        [..] [..] 4a2e9d7c          data2.txt
FC        1024 [..] 4a2e9d7c 4a2e9d7c data.txt
Total #: 2 Workspace Size:        [..] Cached Size:        1024


$ xvc file remove --from-cache data.txt
Not deleting b3/4a2/e9d/7c40d2cf892c41351a2465b54b85f62a0052e25a63950c8ab4ac48b2ee/0.txt (for data.txt) because it's also used by data2.txt

$ tree .xvc/b3/
.xvc/b3/
├── 218
│   └── 2b7
│       └── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
│           └── 0.txt
└── 4a2
    └── e9d
        └── 7c40d2cf892c41351a2465b54b85f62a0052e25a63950c8ab4ac48b2ee
            └── 0.txt

7 directories, 2 files
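The guard shown above amounts to checking whether any other tracked path shares the digest before deleting. A minimal sketch under that assumption, using an in-memory dictionary of path to digest in place of Xvc's store (remove_from_cache and its data structures are hypothetical):

```python
from typing import Dict, Set


def remove_from_cache(target: str, tracked: Dict[str, str], cache: Set[str], force: bool = False) -> str:
    """Delete the target's cached content unless other paths still use it."""
    digest = tracked[target]
    others = sorted(p for p, d in tracked.items() if d == digest and p != target)
    if others and not force:
        return f"Not deleting {digest} (for {target}) because it's also used by {', '.join(others)}"
    cache.discard(digest)
    return f"[DELETE] {digest}"


tracked = {"data.txt": "4a2e9d7c", "data2.txt": "4a2e9d7c"}
cache = {"4a2e9d7c"}
print(remove_from_cache("data.txt", tracked, cache))
```

Passing force=True skips the check, which is the role --force plays in the command.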

Data-Model Pipelines

Synopsis

$ xvc pipeline --help
Pipeline management commands

Usage: xvc pipeline [OPTIONS] <COMMAND>

Commands:
  new     Create a new pipeline
  update  Update the name and other attributes of a pipeline
  delete  Delete a pipeline
  run     Run a pipeline
  list    List all pipelines
  dag     Generate a dot or mermaid diagram for the pipeline
  export  Export the pipeline to a YAML or JSON file to edit
  import  Import the pipeline from a file
  step    Step creation, dependency, output commands
  help    Print this message or the help of the given subcommand(s)

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name of the pipeline this command applies to
  -h, --help                           Print help

xvc pipeline new

Synopsis

$ xvc pipeline new --help
Create a new pipeline

Usage: xvc pipeline new [OPTIONS] --pipeline-name <PIPELINE_NAME>

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name of the pipeline this command applies to
  -w, --workdir <WORKDIR>              Default working directory
  -h, --help                           Print help

Examples

This command works only in Xvc repositories.

$ git init
...
$ xvc init

You can create a new pipeline with a name.

$ xvc pipeline new --pipeline-name my-pipeline

By default it will run the commands in the repository root.

$ xvc pipeline list
+-------------+---------+
| Name        | Run Dir |
+=======================+
| default     |         |
|-------------+---------|
| my-pipeline |         |
+-------------+---------+

If you want to define a pipeline specific to a directory, you can set the working directory.

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230215
$ xvc pipeline new --pipeline-name another-pipeline --workdir dir-0001

The pipeline will run the commands in the specified directory.

$ xvc pipeline list
+------------------+----------+
| Name             | Run Dir  |
+=============================+
| default          |          |
|------------------+----------|
| my-pipeline      |          |
|------------------+----------|
| another-pipeline | dir-0001 |
+------------------+----------+

xvc pipeline list

Synopsis

$ xvc pipeline list --help
List all pipelines

Usage: xvc pipeline list

Options:
  -h, --help  Print help

Examples

Please see xvc pipeline new for examples.

xvc pipeline step

Synopsis

$ xvc pipeline step --help
Step creation, dependency, output commands

Usage: xvc pipeline step <COMMAND>

Commands:
  list        List steps in a pipeline
  new         Add a new step
  remove      Remove a step from a pipeline
  update      Update step options
  dependency  Add a dependency to a step
  output      Add an output to a step
  show        Print step configuration
  help        Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

xvc pipeline step new

Purpose

Create a new step in the pipeline.

Synopsis

$ xvc pipeline step new --help
Add a new step

Usage: xvc pipeline step new [OPTIONS] --step-name <STEP_NAME> --command <COMMAND>

Options:
  -s, --step-name <STEP_NAME>  Name of the new step
  -c, --command <COMMAND>      Step command to run
      --when <WHEN>            When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
  -h, --help                   Print help

Examples

This command works only in Xvc repositories.

$ git init
...
$ xvc init

You can create a new step with a name and a command.

$ xvc pipeline step new --step-name hello --command "echo hello"

By default, a step runs only if its dependencies have changed (--when by_dependencies).

If you want to run the command always, regardless of the changes in dependencies, you can set --when to always.

$ xvc pipeline step new --step-name world --command "echo world" --when always

If you want a step to never run, you can set --when to never.

$ xvc pipeline step new --step-name never --command "echo never" --when never

You can update when the step will run with xvc pipeline step update.

You can get the list of steps in the pipeline with export or dag.

$ xvc pipeline export
{
  "name": "default",
  "steps": [
    {
      "command": "echo hello",
      "dependencies": [],
      "invalidate": "ByDependencies",
      "name": "hello",
      "outputs": []
    },
    {
      "command": "echo world",
      "dependencies": [],
      "invalidate": "Always",
      "name": "world",
      "outputs": []
    },
    {
      "command": "echo never",
      "dependencies": [],
      "invalidate": "Never",
      "name": "never",
      "outputs": []
    }
  ],
  "version": 1,
  "workdir": ""
}
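The three --when values described above reduce to a small decision function. This sketch only restates the documented behavior; should_run is a hypothetical name, and how Xvc decides that dependencies have changed is not modeled here.

```python
def should_run(when: str, dependencies_changed: bool) -> bool:
    """Decide whether a step runs, given its --when setting."""
    if when == "always":
        return True  # run regardless of dependency state
    if when == "never":
        return False  # frozen step
    if when == "by_dependencies":
        return dependencies_changed  # the default
    raise ValueError(f"unknown --when value: {when}")


for when in ("always", "never", "by_dependencies"):
    print(when, should_run(when, dependencies_changed=False))
```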

xvc pipeline step list

Purpose

List the steps and their commands in a pipeline.

Synopsis

$ xvc pipeline step list --help
List steps in a pipeline

Usage: xvc pipeline step list [OPTIONS]

Options:
      --names-only  Show only the names, otherwise print commands as well
  -h, --help        Print help

Examples

This command works only in Xvc repositories.

$ git init
...
$ xvc init

You may want to list the steps of a pipeline and their commands.

$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline step new --step-name world --command "echo world" --when always
$ xvc pipeline step list
hello: echo hello (by_dependencies)
world: echo world (always)

By default, it lists the commands and when they will run (always, never, by_dependencies). If you only need the names of the steps, you can use the --names-only flag.

$ xvc pipeline step list --names-only
hello
world

xvc pipeline step dependency

Purpose

Define a dependency to an existing step in the pipeline.

Synopsis

$ xvc pipeline step dependency --help
Add a dependency to a step

Usage: xvc pipeline step dependency [OPTIONS] --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>
          Name of the step to add the dependency to

      --generic <GENERICS>
          Add a generic command output as a dependency. Can be used multiple times. Please delimit the command with ' ' to avoid shell expansion

      --url <URLS>
          Add a URL dependency to the step. Can be used multiple times

      --file <FILES>
          Add a file dependency to the step. Can be used multiple times

      --step <STEPS>
          Add a step dependency to a step. Can be used multiple times. Steps are referred with their names

      --glob_items <GLOB_ITEMS>
          Add a glob items dependency to the step.
          
          You can depend on multiple files and directories with this dependency.
          
          The difference between this and the glob option is that this option keeps track of all matching files, but glob only keeps track of the matched files' digest. When you want to use ${XVC_GLOB_ITEMS}, ${XVC_ADDED_GLOB_ITEMS}, or ${XVC_REMOVED_GLOB_ITEMS} environment variables in the step command, use the glob-items dependency. Otherwise, you can use the glob option to save disk space.

      --glob <GLOBS>
          Add a glob dependency to the step. Can be used multiple times.
          
          You can depend on multiple files and directories with this dependency.
          
          The difference between this and the glob-items option is that the glob-items option keeps track of all matching files individually, but this option only keeps track of the matched files' digest. This dependency uses considerably less disk space.

      --param <PARAMS>
          Add a parameter dependency to the step in the form filename.yaml::model.units
          
          The file can be a JSON, TOML, or YAML file. You can specify hierarchical keys like my.dict.key

      --regex_items <REGEX_ITEMS>
          Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times.
          
          The difference between this and the regex option is that the regex-items option keeps track of all matching lines, but regex only keeps track of the matched lines' digest. When you want to use ${XVC_REGEX_ITEMS}, ${XVC_ADDED_REGEX_ITEMS}, ${XVC_REMOVED_REGEX_ITEMS} environment variables in the step command, use the regex option. Otherwise, you can use the regex-digest option to save disk space.

      --regex <REGEXES>
          Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times.
          
          The difference between this and the regex option is that the regex option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest.

      --line_items <LINE_ITEMS>
          Add a line dependency in the form filename.txt::123-234
          
          The difference between this and the lines option is that the line-items option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest. When you want to use ${XVC_ALL_LINE_ITEMS}, ${XVC_ADDED_LINE_ITEMS}, ${XVC_CHANGED_LINE_ITEMS} options in the step command, use the line option. Otherwise, you can use the lines option to save disk space.

      --lines <LINES>
          Add a line digest dependency in the form filename.txt::123-234
          
          The difference between this and the line-items dependency is that the line option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest. If you don't need individual lines to be kept, use this option to save space.

      --sqlite-query <SQLITE_FILE> <SQLITE_QUERY>
          Add a sqlite query dependency to the step with the file and the query. Can be used once.
          
          The step is invalidated when the query run and the result is different from previous runs, e.g. when an aggregate changed or a new row added to a table.

  -h, --help
          Print help (see a summary with '-h')

File Dependencies

This command works only in Xvc repositories.

$ git init
...
$ xvc init

Begin by adding a new step.

$ xvc pipeline step new --step-name file-dependency --command "echo data.txt has changed"

Add a file dependency to the step.

$ xvc pipeline step dependency --step-name file-dependency --file data.txt

When you run the pipeline, it will print data.txt has changed if the file data.txt has changed.

$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed

[DONE] [file-dependency] (echo data.txt has changed)


You can add multiple dependencies to a step with multiple invocations.

$ xvc pipeline step dependency --step-name file-dependency --file data2.txt

A step will run if any of its dependencies have changed.

$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed

[DONE] [file-dependency] (echo data.txt has changed)


By default, steps are not run if none of their dependencies have changed.

$ xvc pipeline run

However, if you want to run the step even if none of the dependencies have changed, you can set the --when option to always.

$ xvc pipeline step update --step-name file-dependency --when always

Now the step will run even if none of the dependencies have changed.

$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed

[DONE] [file-dependency] (echo data.txt has changed)
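Since Xvc detects changes by content rather than by timestamps, a file dependency check can be pictured as comparing recorded digests with current ones. The following is a sketch under that assumption, with SHA-256 as a stand-in digest and hypothetical helper names (file_digest, check_step):

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Optional, Tuple


def file_digest(path: Path) -> Optional[str]:
    """Digest of the file content, or None if the file is missing."""
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else None


def check_step(
    paths: Iterable[Path], recorded: Dict[str, Optional[str]]
) -> Tuple[bool, Dict[str, Optional[str]]]:
    """Return (invalidated, new_record); any differing digest invalidates the step."""
    current = {str(p): file_digest(p) for p in paths}
    return current != recorded, current


results = []
with tempfile.TemporaryDirectory() as tmp:
    deps = [Path(tmp) / "data.txt", Path(tmp) / "data2.txt"]
    for dep in deps:
        dep.write_text("some data\n")
    recorded: Dict[str, Optional[str]] = {}
    for action in ("first run", "no change", "touch", "edit"):
        if action == "touch":
            os.utime(deps[0], (0, 0))  # new timestamp, same content
        elif action == "edit":
            deps[1].write_text("other data\n")
        invalidated, recorded = check_step(deps, recorded)
        results.append(invalidated)

print(results)
```

Touching a file without changing its content does not invalidate the step, while editing either dependency does.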


Glob Dependencies

A step can depend on multiple files specified with globs. The difference between this and the glob-items dependency is that this one doesn't track the individual files and doesn't pass the list of files to the command in environment variables.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

Let's create a set of files:

$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 2023

$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
└── dir-0002
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin

3 directories, 6 files

Add a step that reports when the files have changed.

$ xvc pipeline step new --step-name files-changed --command "echo 'Files have changed.'"

$ xvc pipeline step dependency --step-name files-changed --glob 'dir-*/*'

The step is invalidated when a file described by the glob is added, removed or changed.

$ xvc pipeline run
[OUT] [files-changed] Files have changed.

[DONE] [files-changed] (echo 'Files have changed.')


$ xvc pipeline run

When a file is removed from the files described by the glob, the step is invalidated.

$ rm dir-0001/file-0001.bin

$ xvc pipeline run
[OUT] [files-changed] Files have changed.

[DONE] [files-changed] (echo 'Files have changed.')
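One way to picture a glob dependency is a single digest computed over all matching paths and their contents, so that adding, removing or changing any matched file yields a different value. The sketch below is an assumption about the general technique, not Xvc's actual algorithm; it uses an in-memory mapping of path to content, and fnmatch from the standard library, whose * also matches path separators.

```python
import fnmatch
import hashlib
from typing import Dict


def glob_digest(files: Dict[str, bytes], pattern: str) -> str:
    """Combine the paths and content digests of all matching files into one digest."""
    combined = hashlib.sha256()
    for path in sorted(files):  # sorted so the result does not depend on insertion order
        if fnmatch.fnmatch(path, pattern):
            combined.update(path.encode())
            combined.update(b"\0")
            combined.update(hashlib.sha256(files[path]).digest())
    return combined.hexdigest()


files = {
    f"dir-000{d}/file-000{f}.bin": bytes([d, f]) for d in (1, 2) for f in (1, 2, 3)
}
before = glob_digest(files, "dir-*/*")
del files["dir-0001/file-0001.bin"]
after_removal = glob_digest(files, "dir-*/*")
files["README.md"] = b"not matched by the glob"
after_unrelated = glob_digest(files, "dir-*/*")
print(before != after_removal, after_removal == after_unrelated)
```

Only one digest has to be stored per glob, which is why this dependency type cannot tell which file changed.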


Regex Dependencies

You can specify a regular expression matched against the lines of a file as a dependency. The step is invalidated when the matched results change.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

We'll use a sample CSV file in this example:

$ cat people.csv
"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Kate",       "F",   47,       69,      139
"Luke",       "M",   34,       72,      163
"Myra",       "F",   23,       62,       98
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Page",       "F",   31,       67,      135
"Quin",       "M",   29,       71,      176
"Ruth",       "F",   28,       65,      131


Now, let's add a step to the pipeline to count females in the file:

$ xvc pipeline step new --step-name count-females --command "grep -c '\"F\",' people.csv"

This command is run when the regex dependency changes.

$ xvc pipeline step dependency --step-name count-females --regex 'people.csv:/^.*"F",.*$'

When you run the pipeline initially, the step is run.

$ xvc pipeline run
[OUT] [count-females] 7

[DONE] [count-females] (grep -c '"F",' people.csv)


When you run the pipeline again, the step is not run because the regex result didn't change.

$ xvc pipeline run

When you add a new female record to the file, the step is run and the command prints the new count.

$ zsh -c "echo '\"Asude\",      \"F\",   12,       55,      110' >> people.csv"

$ cat people.csv
"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Kate",       "F",   47,       69,      139
"Luke",       "M",   34,       72,      163
"Myra",       "F",   23,       62,       98
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Page",       "F",   31,       67,      135
"Quin",       "M",   29,       71,      176
"Ruth",       "F",   28,       65,      131

"Asude",      "F",   12,       55,      110

$ xvc pipeline run
[OUT] [count-females] 8

[DONE] [count-females] (grep -c '"F",' people.csv)
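The behavior above can be modeled as hashing only the lines that match the regular expression, so edits to non-matching lines leave the dependency unchanged. This is a sketch of that idea with SHA-256 as a stand-in digest; regex_digest is a hypothetical helper and the CSV is a shortened version of the sample file.

```python
import hashlib
import re


def regex_digest(text: str, pattern: str) -> str:
    """Digest of the lines that match the pattern, in file order."""
    rx = re.compile(pattern)
    matched = [line for line in text.splitlines() if rx.search(line)]
    return hashlib.sha256("\n".join(matched).encode()).hexdigest()


pattern = r'^.*"F",.*$'
csv = '"Alex", "M", 41\n"Elly", "F", 30\n"Fran", "F", 33\n'

base = regex_digest(csv, pattern)
with_new_male = regex_digest(csv + '"Bert", "M", 42\n', pattern)
with_new_female = regex_digest(csv + '"Asude", "F", 12\n', pattern)
print(base == with_new_male, base == with_new_female)
```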


Line Dependencies

You can make your steps depend on lines of text files. The lines are defined by starting and ending indices.

When the text in those lines changes, the step is invalidated.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

We'll use a sample CSV file in this example:

$ cat people.csv
"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Kate",       "F",   47,       69,      139
"Luke",       "M",   34,       72,      163
"Myra",       "F",   23,       62,       98
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Page",       "F",   31,       67,      135
"Quin",       "M",   29,       71,      176
"Ruth",       "F",   28,       65,      131


Let's add a step to show the first 10 lines of the file:

$ xvc pipeline step new --step-name print-top-10 --command "head people.csv"

The command is run only when those lines change.

$ xvc pipeline step dependency --step-name print-top-10 --lines 'people.csv::1-10'

When you run the pipeline initially, the step is run.

$ xvc pipeline run
[OUT] [print-top-10] "Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175

[DONE] [print-top-10] (head people.csv)


When you run the pipeline again, the step is not run because the specified lines didn't change.

$ xvc pipeline run

When you change one of those lines in the file, the step is invalidated.

$ perl -i -pe 's/Hank/Ferzan/g' people.csv

Now, when you run the pipeline, it will print the first 10 lines again.

$ xvc pipeline run
[OUT] [print-top-10] "Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Ferzan",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175

[DONE] [print-top-10] (head people.csv)
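A line dependency can be thought of as a digest over a slice of the file. The sketch below assumes 1-based, inclusive line numbers, which this document does not specify; lines_digest and with_edit are hypothetical helpers and SHA-256 is a stand-in digest.

```python
import hashlib
from typing import List


def lines_digest(text: str, first: int, last: int) -> str:
    """Digest of lines first..last (assumed 1-based and inclusive)."""
    selected = text.splitlines()[first - 1 : last]
    return hashlib.sha256("\n".join(selected).encode()).hexdigest()


rows: List[str] = [f"row {n}" for n in range(1, 21)]


def with_edit(line_number: int) -> str:
    """Return the file content with one line replaced."""
    edited = list(rows)
    edited[line_number - 1] = "edited"
    return "\n".join(edited)


original = lines_digest("\n".join(rows), 1, 10)
inside = lines_digest(with_edit(9), 1, 10)
outside = lines_digest(with_edit(15), 1, 10)
print(original != inside, original == outside)
```

An edit on line 9 changes the digest, while an edit on line 15 is outside the range and leaves it untouched.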


Glob Items Dependency

A step can depend on multiple files specified with globs. When any of the files change, or a new file is added or removed from the files specified by glob, the step is invalidated.

Unlike the glob dependency, the glob items dependency keeps track of the individual files that match a glob. If your command runs with the list of files from a glob and you want to track added and removed files, use this. Otherwise, if your command runs for all the files in a glob and you don't need to track which files have changed, use the glob dependency.

This one injects ${XVC_ADDED_GLOB_ITEMS}, ${XVC_REMOVED_GLOB_ITEMS}, ${XVC_CHANGED_GLOB_ITEMS} and ${XVC_ALL_GLOB_ITEMS} into the command environment.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

Let's create a set of files:

$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 2023

$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
└── dir-0002
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin

3 directories, 6 files

Add a step to list the added, removed and changed files.

$ xvc pipeline step new --step-name files-changed --command 'echo "### Added Files:\n${XVC_ADDED_GLOB_ITEMS}\n### Removed Files:\n${XVC_REMOVED_GLOB_ITEMS}\n### Changed Files:\n${XVC_CHANGED_GLOB_ITEMS}"'

$ xvc pipeline step dependency --step-name files-changed --glob-items 'dir-*/*'

The step is invalidated when a file described by the glob is added, removed or changed.

$ xvc pipeline run
[OUT] [files-changed] ### Added Files:
dir-0001/file-0001.bin
dir-0001/file-0002.bin
dir-0001/file-0003.bin
dir-0002/file-0001.bin
dir-0002/file-0002.bin
dir-0002/file-0003.bin
### Removed Files:

### Changed Files:


[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")


$ xvc pipeline run

If you add or remove a file from the files specified by the glob, they are printed.

$ rm dir-0001/file-0001.bin

$ xvc pipeline run
[OUT] [files-changed] ### Added Files:

### Removed Files:
dir-0001/file-0001.bin
### Changed Files:


[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")


When you change a file, it's printed under changed files:

$ xvc-test-helper generate-filled-file dir-0001/file-0002.bin

$ xvc pipeline run
[OUT] [files-changed] ### Added Files:

### Removed Files:

### Changed Files:
dir-0001/file-0002.bin

[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")
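Tracking items individually makes it possible to compute the added, removed and changed sets by comparing two snapshots of path to digest. The sketch below shows that comparison and how the results could be joined into the environment variables named above; diff_items and the short digest strings are hypothetical.

```python
from typing import Dict


def diff_items(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, str]:
    """Compare two path -> digest snapshots and build newline-separated item lists."""
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(p for p in before.keys() & after.keys() if before[p] != after[p])
    return {
        "XVC_ADDED_GLOB_ITEMS": "\n".join(added),
        "XVC_REMOVED_GLOB_ITEMS": "\n".join(removed),
        "XVC_CHANGED_GLOB_ITEMS": "\n".join(changed),
        "XVC_ALL_GLOB_ITEMS": "\n".join(sorted(after)),
    }


before = {"dir-0001/file-0001.bin": "aa", "dir-0001/file-0002.bin": "bb"}
after = {"dir-0001/file-0002.bin": "cc", "dir-0002/file-0001.bin": "dd"}
env = diff_items(before, after)
for name, value in env.items():
    print(f"{name}={value!r}")
```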


Regex Item Dependencies

You can specify a regular expression matched against the lines of a file as a dependency. The step is invalidated when the matched results change.

Unlike regex dependencies, regex item dependencies keep track of the matched items. You can access them with ${XVC_ALL_REGEX_ITEMS}, ${XVC_ADDED_REGEX_ITEMS}, and ${XVC_REMOVED_REGEX_ITEMS} environment variables.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

We'll use a sample CSV file in this example:

$ cat people.csv
"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Kate",       "F",   47,       69,      139
"Luke",       "M",   34,       72,      163
"Myra",       "F",   23,       62,       98
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Page",       "F",   31,       67,      135
"Quin",       "M",   29,       71,      176
"Ruth",       "F",   28,       65,      131


Now, let's add steps to the pipeline to list the new male and female records in the file:

$ xvc pipeline step new --step-name new-males --command 'echo "New Males:\n ${XVC_ADDED_REGEX_ITEMS}"'
$ xvc pipeline step new --step-name new-females --command 'echo "New Females:\n ${XVC_ADDED_REGEX_ITEMS}"'
$ xvc pipeline step dependency --step-name new-females --step new-males

We also added a step dependency so that the steps always run in the same order.

These commands are run when the following regexes change.

$ xvc pipeline step dependency --step-name new-males --regex-items 'people.csv:/^.*"M",.*$'

$ xvc pipeline step dependency --step-name new-females --regex-items 'people.csv:/^.*"F",.*$'

When you run the pipeline initially, the steps are run.

$ xvc pipeline run
[OUT] [new-males] New Males:
 "Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Luke",       "M",   34,       72,      163
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Quin",       "M",   29,       71,      176

[DONE] [new-males] (echo "New Males:/n ${XVC_ADDED_REGEX_ITEMS}")

[OUT] [new-females] New Females:
 "Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Kate",       "F",   47,       69,      139
"Myra",       "F",   23,       62,       98
"Page",       "F",   31,       67,      135
"Ruth",       "F",   28,       65,      131

[DONE] [new-females] (echo "New Females:/n ${XVC_ADDED_REGEX_ITEMS}")


When you run the pipeline again, the steps are not run because the lines matched by the regexes didn't change.

$ xvc pipeline run

When you add a new female record to the file, only the new-females step is run.

$ zsh -c "echo '\"Asude\",      \"F\",   12,       55,      110' >> people.csv"

$ cat people.csv
"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Kate",       "F",   47,       69,      139
"Luke",       "M",   34,       72,      163
"Myra",       "F",   23,       62,       98
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Page",       "F",   31,       67,      135
"Quin",       "M",   29,       71,      176
"Ruth",       "F",   28,       65,      131

"Asude",      "F",   12,       55,      110

$ xvc pipeline run
[OUT] [new-females] New Females:
 "Asude",      "F",   12,       55,      110

[DONE] [new-females] (echo "New Females:/n ${XVC_ADDED_REGEX_ITEMS}")
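
The difference between the initial run and the later run can be pictured with standard tools. The following is a minimal sketch, not Xvc's implementation: matching_lines and added_items are hypothetical helper names, and grep, sort and comm stand in for whatever Xvc does internally to fill ${XVC_ADDED_REGEX_ITEMS}.

```shell
# Sketch only: emulate "added regex items" with grep, sort and comm.
# matching_lines and added_items are hypothetical helpers, not Xvc commands.

# matching_lines REGEX FILE: print the lines of FILE that match REGEX,
# sorted so that comm can compare them.
matching_lines() {
    grep -E -e "$1" "$2" | sort
}

# added_items REGEX OLD NEW: print the lines that match in NEW but not in OLD.
added_items() {
    old_matches=$(mktemp)
    new_matches=$(mktemp)
    matching_lines "$1" "$2" > "$old_matches"
    matching_lines "$1" "$3" > "$new_matches"
    comm -13 "$old_matches" "$new_matches"   # keep lines unique to NEW
    rm -f "$old_matches" "$new_matches"
}
```

Given a copy of people.csv taken before the append and the current file, added_items '"F",' old.csv people.csv should print only the Asude record, while the same call with '"M",' should print nothing. This mirrors why only new-females ran in the last pipeline run.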


Line Item Dependencies

You can make your steps depend on lines of text files. The lines are defined by starting and ending indices.

When the text in those lines changes, the step is invalidated.

Unlike the --lines dependency, this dependency type keeps track of the individual lines in the file. You can use the ${XVC_ALL_LINE_ITEMS}, ${XVC_ADDED_LINE_ITEMS}, and ${XVC_REMOVED_LINE_ITEMS} environment variables in the command. Please be aware that for a large set of lines, this dependency can take up considerable space because it stores every line. If you don't need to know which lines changed, use the --lines dependency instead.
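
The space trade-off described above can be illustrated with standard tools. This is a sketch under assumptions, not Xvc's implementation: range_lines and range_digest are hypothetical helpers, the indices here are one-based and inclusive (sed's convention, which may differ from Xvc's), and cksum stands in for Xvc's digest algorithm.

```shell
# Sketch only: contrast per-line tracking with a digest-only check.
# range_lines and range_digest are hypothetical helpers, not Xvc commands.

# range_lines FILE FIRST LAST: print lines FIRST..LAST of FILE
# (one-based, inclusive). Storing this output grows with the range size,
# but it lets you tell which lines were added or removed.
range_lines() {
    sed -n "${2},${3}p" "$1"
}

# range_digest FILE FIRST LAST: print one checksum for the same range.
# The stored value has constant size, but it can only say that something
# changed, not which line.
range_digest() {
    range_lines "$1" "$2" "$3" | cksum
}
```

Comparing range_digest before and after an edit is enough to decide whether a step is stale. Reporting the changed lines themselves requires keeping the output of range_lines, which is the extra cost of line items.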

This command works only in Xvc repositories.

$ git init
...
$ xvc init

We'll use a sample CSV file in this example:

$ cat people.csv
"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Kate",       "F",   47,       69,      139
"Luke",       "M",   34,       72,      163
"Myra",       "F",   23,       62,       98
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Page",       "F",   31,       67,      135
"Quin",       "M",   29,       71,      176
"Ruth",       "F",   28,       65,      131


Let's add a step to show the first 10 lines of the file:

$ xvc pipeline step new --step-name print-top-10 --command 'echo "Added Lines:\n ${XVC_ADDED_LINE_ITEMS}\nRemoved Lines:\n${XVC_REMOVED_LINE_ITEMS}"'

The command is run only when those lines change.

$ xvc pipeline step dependency --step-name print-top-10 --line-items 'people.csv::1-10'

When you run the pipeline initially, the step is run.

$ xvc pipeline run
[OUT] [print-top-10] Added Lines:
 "Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
Removed Lines:


[DONE] [print-top-10] (echo "Added Lines:/n ${XVC_ADDED_LINE_ITEMS}/nRemoved Lines:/n${XVC_REMOVED_LINE_ITEMS}")


When you run the pipeline again, the step is not run because the specified lines didn't change.

$ xvc pipeline run

When you change a line in the file, the step is invalidated.

$ perl -i -pe 's/Hank/Ferzan/g' people.csv

Now, when you run the pipeline, it will print the changed line, with its new and old versions.

$ xvc pipeline run
[OUT] [print-top-10] Added Lines:
 "Ferzan",       "M",   30,       71,      158
Removed Lines:
"Hank",       "M",   30,       71,      158

[DONE] [print-top-10] (echo "Added Lines:/n ${XVC_ADDED_LINE_ITEMS}/nRemoved Lines:/n${XVC_REMOVED_LINE_ITEMS}")


SQLite Query Dependency

You can create a step dependency with an SQLite query. When the query results change, the step is invalidated.

SQLite dependencies don't keep the results of the query. They only check whether the query results have changed.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

Suppose we have an SQLite database people.db with the following schema and data:

CREATE TABLE People (
    Name TEXT,
    Sex TEXT,
    Age INTEGER,
    Height_in INTEGER,
    Weight_lbs INTEGER
);

INSERT INTO People (Name, Sex, Age, Height_in, Weight_lbs) VALUES
('Alex', 'M', 41, 74, 170),
('Bert', 'M', 42, 68, 166),
('Carl', 'M', 32, 70, 155),
('Dave', 'M', 39, 72, 167),
('Elly', 'F', 30, 66, 124),
('Fran', 'F', 33, 66, 115),
('Gwen', 'F', 26, 64, 121),
('Hank', 'M', 30, 71, 158),
('Ivan', 'M', 53, 72, 175),
('Jake', 'M', 32, 69, 143),
('Kate', 'F', 47, 69, 139),
('Luke', 'M', 34, 72, 163),
('Myra', 'F', 23, 62, 98),
('Neil', 'M', 36, 75, 160),
('Omar', 'M', 38, 70, 145),
('Page', 'F', 31, 67, 135),
('Quin', 'M', 29, 71, 176),
('Ruth', 'F', 28, 65, 131);

Now, we'll add a step to the pipeline to calculate the average age of these people.

$ xvc pipeline step new --step-name average-age --command "sqlite3 people.db 'SELECT AVG(Age) FROM People;'"

Let's run the step without a dependency first.

$ xvc pipeline run
[OUT] [average-age] 34.6666666666667

[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')


Now, we'll add a dependency to this step so that it runs only when the results of that query change.

$ xvc pipeline step dependency --step-name average-age --sqlite-query people.db 'SELECT count(*) FROM People;'

The dependency query is run every time the pipeline runs, so it should be lightweight to avoid performance issues.

So, when the number of people in the table changes, the step will run. Because no query result has been recorded yet, the step also runs once right after the dependency is added.

$ xvc pipeline run
[OUT] [average-age] 34.6666666666667

[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')


But it won't run the step a second time, as the table didn't change.

$ xvc pipeline run

Let's add another row to the table:

$ sqlite3 people.db "INSERT INTO People (Name, Sex, Age, Height_in, Weight_lbs) VALUES ('Asude', 'F', 10, 74, 170);"

This time, the step will run again because the result of the dependency query (SELECT count(*) FROM People) changed.

$ xvc pipeline run
[OUT] [average-age] 33.3684210526316

[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')


Xvc opens the database in read-only mode to avoid locking.

(Hyper-)Parameter Dependencies

You may be keeping pipeline-wide parameters in structured text files. You can specify such parameters found in JSON, TOML and YAML files as dependencies.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

Suppose we have a YAML file, myparams.yaml, that specifies various parameters for the whole pipeline.

param: value
database:
  server: example.com
  port: 5432
  connection:
    timeout: 5000
numeric_param: 13

Now, we create two steps to read different variables from the file, and add a dependency between them so that they always run in the same order.

$ xvc pipeline step new --step-name read-database-config --command 'echo "Updated Database Configuration"'

$ xvc pipeline step new --step-name read-hyperparams --command 'echo "Update Hyperparameters"'

$ xvc pipeline step dependency --step-name read-database-config --step read-hyperparams

Let's make each step depend on different parts of this parameters file:

$ xvc pipeline step dependency --step-name read-database-config --param 'myparams.yaml::database.port' --param 'myparams.yaml::database.server' --param 'myparams.yaml::database.connection'

$ xvc pipeline step dependency --step-name read-hyperparams --param 'myparams.yaml::param' --param 'myparams.yaml::numeric_param'

On the first run, both steps run because all dependencies are initially invalid:

$ xvc pipeline run
[OUT] [read-hyperparams] Update Hyperparameters

[DONE] [read-hyperparams] (echo "Update Hyperparameters")

[OUT] [read-database-config] Updated Database Configuration

[DONE] [read-database-config] (echo "Updated Database Configuration")


On the second run, no step runs because nothing has changed:

$ xvc pipeline run

When you update a value in this file, only the steps that depend on that value are invalidated, not the steps that depend on other values in the same file.

Let's update the database port:

$ perl -pi -e 's/5432/9876/g' myparams.yaml

$ xvc pipeline run
[OUT] [read-database-config] Updated Database Configuration

[DONE] [read-database-config] (echo "Updated Database Configuration")


Note that read-hyperparams is not invalidated, although its values are in the same file.

Step Dependencies

This command works only in Xvc repositories.

$ git init
...
$ xvc init

You can add a step dependency to a step. These dependencies specify relationships between steps explicitly, without relying on changed files or directories.

$ xvc pipeline step new --step-name world --command "echo world"
$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline step dependency --step-name world --step hello

When the pipeline runs, the dependency is run first and the dependent step is run after it.

$ xvc pipeline run
[OUT] [hello] hello

[DONE] [hello] (echo hello)

[OUT] [world] world

[DONE] [world] (echo world)


If the dependency is not run, the dependent step won't run either.

$ xvc pipeline step update --step-name hello --when never
$ xvc pipeline run

If you want the dependent step to run regardless, you can explicitly set it to always run.

$ xvc pipeline step update --step-name world --when always
$ xvc pipeline run
[OUT] [world] world

[DONE] [world] (echo world)


URL Dependencies

This command works only in Xvc repositories.

$ git init
...
$ xvc init

You can use a web URL as a dependency of a step. When Xvc fetches the URL, it saves a hash of the response for comparison and invalidates the step when the content at the URL changes.

You can use this with any URL.

$ xvc pipeline step new --step-name xvc-docs-update --command "echo 'Xvc docs updated!'"

$ xvc pipeline step dependency --step-name xvc-docs-update --url https://docs.xvc.dev/

The step is invalidated when the page is updated.

$ xvc pipeline run
[OUT] [xvc-docs-update] Xvc docs updated!

[DONE] [xvc-docs-update] (echo 'Xvc docs updated!')


The step won't run again until a new version of the page is published.

$ xvc pipeline run

Note that Xvc doesn't download the page every time. It checks the Last-Modified and ETag headers and only downloads the page if it has changed.

If there are more complex requirements than just the URL changing, you can use a generic dependency to get the output of a command and use that as a dependency.

Generic Command Dependencies

This command works only in Xvc repositories.

$ git init
...
$ xvc init

You can use the output of a shell command as a dependency of a step. When the command is run, Xvc saves a hash of its output for comparison and invalidates the step when the output changes.

You can use this for any command that outputs a string.
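
The idea behind this dependency type can be sketched with a few lines of shell. This is only an illustration under assumptions, not how Xvc stores its state: run_if_changed is a hypothetical helper, the state file is a made-up location for the saved digest, and cksum stands in for Xvc's hash.

```shell
# Sketch only: rerun a step when the output of a dependency command changes.
# run_if_changed is a hypothetical helper, not an Xvc command.

# run_if_changed STATE_FILE DEP_COMMAND STEP_COMMAND
run_if_changed() {
    state_file=$1
    # Checksum of the dependency command's current output.
    new_digest=$(sh -c "$2" | cksum)
    # Checksum saved by the previous run; empty when there was none.
    old_digest=$(cat "$state_file" 2>/dev/null || true)
    if [ "$new_digest" != "$old_digest" ]; then
        sh -c "$3"                                # the step is stale, run it
        printf '%s\n' "$new_digest" > "$state_file"
    fi
}
```

Calling run_if_changed .morning-state 'date +%F' "echo 'Good Morning!'" twice on the same day would print the message only once in this sketch, because the second call finds an identical checksum in the state file.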

$ xvc pipeline step new --step-name morning-message --command "echo 'Good Morning!'"

$ xvc  pipeline step dependency --step-name morning-message --generic 'date +%F'

The step is invalidated when the date changes and the step is run again.

$ xvc pipeline run
[OUT] [morning-message] Good Morning!
 
[DONE] morning-message (echo 'Good Morning!')

The step won't run until tomorrow, when date +%F changes.

$ xvc pipeline run
[OUT] [morning-message] Good Morning!

[DONE] [morning-message] (echo 'Good Morning!')


You can mimic all kinds of pipeline behavior with this generic dependency.

For example, to run a command when files are added to or removed from a directory, you can depend on the output of ls (use ls -lR to also catch changes in sizes, timestamps and subdirectories):

$ xvc pipeline step new --step-name directory-contents --command "echo 'Files changed'"
$ xvc pipeline step dependency --step-name directory-contents --generic 'ls'

$ xvc pipeline run
[OUT] [directory-contents] Files changed

[DONE] [directory-contents] (echo 'Files changed')


When you add a file to the directory, the step is invalidated and run again:

$ xvc pipeline run

$ xvc-test-helper generate-random-file new-file.txt
$ xvc pipeline run
[OUT] [directory-contents] Files changed

[DONE] [directory-contents] (echo 'Files changed')


Caveats

Tips

Most shells support editing longer commands with an editor. For bash, you can use Ctrl+X Ctrl+E.

Pipeline commands can get longer quickly. You can use xvc aliases for shorter versions. Type source $(xvc aliases) to load the aliases into your shell.

xvc pipeline step output

Purpose

Define an output (file, metric or image) for an existing step in the pipeline.

Synopsis

$ xvc pipeline step output --help
Add an output to a step

Usage: xvc pipeline step output [OPTIONS] --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>    Name of the step to add the output to
      --output-file <FILES>      Add a file output to the step. Can be used multiple times
      --output-metric <METRICS>  Add a metric output to the step. Can be used multiple times
      --output-image <IMAGES>    Add an image output to the step. Can be used multiple times
  -h, --help                     Print help

Examples

Caveats

xvc pipeline step show

Purpose

Print the configuration of a step in a pipeline.

Synopsis

$ xvc pipeline step show --help
Print step configuration

Usage: xvc pipeline step show --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>  Name of the step to show
  -h, --help                   Print help

Examples

Caveats

xvc pipeline step update

Purpose

Update the command or running condition of a step.

Synopsis

$ xvc pipeline step update --help
Update step options

Usage: xvc pipeline step update [OPTIONS] --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>  Name of the step to update. The step should already be defined
  -c, --command <COMMAND>      Step command to run
      --when <WHEN>            When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
  -h, --help                   Print help

Examples

Caveats

xvc pipeline step remove

Purpose

Remove a step and all its dependencies and outputs from the pipeline.

Synopsis

$ xvc pipeline step remove --help
Remove a step from a pipeline

Usage: xvc pipeline step remove --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>  Name of the step to remove
  -h, --help                   Print help

Examples

This command works only in Xvc repositories.

$ git init
...
$ xvc init

Let's create a few steps and make them depend on each other.

$ xvc pipeline step new --step-name hello --command 'echo hello >> hello.txt'

$ xvc pipeline step new --step-name world --command 'echo world >> world.txt'

$ xvc pipeline step new --step-name from --command 'echo from >> from.txt'

$ xvc pipeline step new --step-name xvc --command 'echo xvc >> xvc.txt'

Let's specify the outputs as well.

$ xvc pipeline step output --step-name hello --output-file hello.txt

$ xvc pipeline step output --step-name world --output-file world.txt

$ xvc pipeline step output --step-name from --output-file from.txt

$ xvc pipeline step output --step-name xvc --output-file xvc.txt

Now we can add dependencies between them.

$ xvc pipeline step dependency --step-name xvc --step from

$ xvc pipeline step dependency --step-name from --file world.txt

$ xvc pipeline step dependency --step-name world --step hello

Now the pipeline looks like this:

$ xvc pipeline step list
hello: echo hello >> hello.txt (by_dependencies)
world: echo world >> world.txt (by_dependencies)
from: echo from >> from.txt (by_dependencies)
xvc: echo xvc >> xvc.txt (by_dependencies)

$ xvc pipeline dag --format mermaid
flowchart TD
    n0["hello"]
    n1["hello.txt"] --> n0
    n2["world"]
    n0["hello"] --> n2
    n3["world.txt"] --> n2
    n4["from"]
    n3["world.txt"] --> n4
    n5["from.txt"] --> n4
    n6["xvc"]
    n4["from"] --> n6
    n7["xvc.txt"] --> n6


When we remove a step, all its dependencies and outputs are removed as well.

$ xvc -vv pipeline step remove --step-name from
[INFO] Removing dep: file(world.txt)
[INFO] Removing dep step(from) from xvc
[INFO] Removing output: File
[INFO] Removing step: from

$ xvc pipeline step list
hello: echo hello >> hello.txt (by_dependencies)
world: echo world >> world.txt (by_dependencies)
xvc: echo xvc >> xvc.txt (by_dependencies)

$ xvc pipeline dag --format mermaid
flowchart TD
    n0["hello"]
    n1["hello.txt"] --> n0
    n2["world"]
    n0["hello"] --> n2
    n3["world.txt"] --> n2
    n4["xvc"]
    n5["xvc.txt"] --> n4


xvc pipeline run

Synopsis

$ xvc pipeline run --help
Run a pipeline

Usage: xvc pipeline run [OPTIONS]

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name of the pipeline to run
  -h, --help                           Print help

Examples

Pipelines require Xvc to be initialized before running.

$ git init
...
$ xvc init

Xvc defines a default pipeline and any steps added without specifying the pipeline will be added to it.

$ xvc pipeline list
+---------+---------+
| Name    | Run Dir |
+===================+
| default |         |
+---------+---------+

Create a new step in this pipeline with the xvc pipeline step new command.

$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline dag --format=mermaid
flowchart TD
    n0["hello"]


You can run the default pipeline without specifying its name.

$ xvc pipeline run
[OUT] [hello] hello

[DONE] [hello] (echo hello)


Note that a step with no dependencies always runs unless it's explicitly set to never run.

$ xvc pipeline step update --step-name hello --when never

$ xvc pipeline run

Run a specific pipeline

You can run a specific pipeline by specifying its name with the --pipeline-name option.

$ xvc pipeline new --pipeline-name my-pipeline

$ xvc pipeline --pipeline-name my-pipeline step new --step-name my-hello --command "echo 'hello from my-pipeline'"

$ xvc pipeline run --pipeline-name my-pipeline
[OUT] [my-hello] hello from my-pipeline

[DONE] [my-hello] (echo 'hello from my-pipeline')


xvc pipeline delete

Synopsis

$ xvc pipeline delete --help
Delete a pipeline

Usage: xvc pipeline delete --pipeline-name <PIPELINE_NAME>

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name or GUID of the pipeline to be deleted
  -h, --help                           Print help

xvc pipeline export

Synopsis

$ xvc pipeline export --help
Export the pipeline to a YAML or JSON file to edit

Usage: xvc pipeline export [OPTIONS]

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name of the pipeline to export
      --file <FILE>                    File to write the pipeline. Writes to stdout if not set
      --format <FORMAT>                Output format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
  -h, --help                           Print help

Examples

You can export the pipeline you created to a JSON or YAML file, edit it, and restore it using xvc pipeline import. This allows you to fix typos and update commands in place, and to see pipeline internals for debugging.

Warning

Xvc doesn't guarantee that the format of these files will be compatible across versions. You can use these files to share pipeline definitions but it may not be a good way to store pipeline definitions for longer periods.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

Let's start by defining a few steps in the pipeline.

$ xvc pipeline step new --step-name step1 --command 'touch abc.txt'
$ xvc pipeline step new --step-name step2 --command 'touch def.txt'

Now add a few dependencies and outputs.

$ xvc pipeline step dependency -s step2 --step step1
$ xvc pipeline step dependency -s step2 --glob '*.txt'
$ xvc pipeline step dependency -s step2 --glob-items '*.txt'

$ xvc pipeline step dependency -s step2 --param model.conv_units
$ xvc pipeline step dependency -s step2 --regex requirements.txt:/^tensorflow
$ xvc pipeline step dependency -s step2 --regex-items requirements.txt:/^tensorflow
$ xvc pipeline step dependency -s step2 --line-items params.yaml::1-20
$ xvc pipeline step dependency -s step2 --lines params.yaml::1-20
$ xvc pipeline step dependency -s step2 --url 'https://example.com'
$ xvc pipeline step dependency -s step2 --generic 'ping -c 2 example.com'
$ xvc pipeline step output -s step2 --output-metric metrics.json
$ xvc pipeline step output -s step2 --output-file def.txt
$ xvc pipeline step output -s step2 --output-image plots/confusion.png

If you don't specify a filename, the default format is JSON and the output will be sent to stdout.

$ xvc pipeline export
{
  "name": "default",
  "steps": [
    {
      "command": "touch abc.txt",
      "dependencies": [],
      "invalidate": "ByDependencies",
      "name": "step1",
      "outputs": []
    },
    {
      "command": "touch def.txt",
      "dependencies": [
        {
          "Step": {
            "name": "step1"
          }
        },
        {
          "Generic": {
            "generic_command": "ping -c 2 example.com",
            "output_digest": null
          }
        },
        {
          "GlobItems": {
            "glob": "*.txt",
            "xvc_path_content_digest_map": {},
            "xvc_path_metadata_map": {}
          }
        },
        {
          "Glob": {
            "content_digest": null,
            "glob": "*.txt",
            "xvc_metadata_digest": null,
            "xvc_paths_digest": null
          }
        },
        {
          "RegexItems": {
            "lines": [],
            "path": "requirements.txt",
            "regex": "^tensorflow",
            "xvc_metadata": null
          }
        },
        {
          "Regex": {
            "lines_digest": null,
            "path": "requirements.txt",
            "regex": "^tensorflow",
            "xvc_metadata": null
          }
        },
        {
          "Param": {
            "format": "YAML",
            "key": "model.conv_units",
            "path": "params.yaml",
            "value": null,
            "xvc_metadata": null
          }
        },
        {
          "LineItems": {
            "begin": 1,
            "end": 20,
            "lines": [],
            "path": "params.yaml",
            "xvc_metadata": null
          }
        },
        {
          "Lines": {
            "begin": 1,
            "digest": null,
            "end": 20,
            "path": "params.yaml",
            "xvc_metadata": null
          }
        },
        {
          "UrlDigest": {
            "etag": null,
            "last_modified": null,
            "url": "https://example.com/",
            "url_content_digest": null
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "step2",
      "outputs": [
        {
          "File": {
            "path": "def.txt"
          }
        },
        {
          "Metric": {
            "format": "JSON",
            "path": "metrics.json"
          }
        },
        {
          "Image": {
            "path": "plots/confusion.png"
          }
        }
      ]
    }
  ],
  "version": 1,
  "workdir": ""
}

If you want to set the format, you can specify the --format option.

$ xvc pipeline export --format yaml
version: 1
name: default
workdir: ''
steps:
- name: step1
  command: touch abc.txt
  invalidate: ByDependencies
  dependencies: []
  outputs: []
- name: step2
  command: touch def.txt
  invalidate: ByDependencies
  dependencies:
  - !Step
    name: step1
  - !Generic
    generic_command: ping -c 2 example.com
    output_digest: null
  - !GlobItems
    glob: '*.txt'
    xvc_path_metadata_map: {}
    xvc_path_content_digest_map: {}
  - !Glob
    glob: '*.txt'
    xvc_paths_digest: null
    xvc_metadata_digest: null
    content_digest: null
  - !RegexItems
    path: requirements.txt
    regex: ^tensorflow
    lines: []
    xvc_metadata: null
  - !Regex
    path: requirements.txt
    regex: ^tensorflow
    lines_digest: null
    xvc_metadata: null
  - !Param
    format: YAML
    path: params.yaml
    key: model.conv_units
    value: null
    xvc_metadata: null
  - !LineItems
    path: params.yaml
    begin: 1
    end: 20
    xvc_metadata: null
    lines: []
  - !Lines
    path: params.yaml
    begin: 1
    end: 20
    xvc_metadata: null
    digest: null
  - !UrlDigest
    url: https://example.com/
    etag: null
    last_modified: null
    url_content_digest: null
  outputs:
  - !File
    path: def.txt
  - !Metric
    path: metrics.json
    format: JSON
  - !Image
    path: plots/confusion.png


When you specify a file name, the output format is inferred from the extension.

$ xvc pipeline export --file pipeline.yaml

$ cat pipeline.yaml
version: 1
name: default
workdir: ''
steps:
- name: step1
  command: touch abc.txt
  invalidate: ByDependencies
  dependencies: []
  outputs: []
- name: step2
  command: touch def.txt
  invalidate: ByDependencies
  dependencies:
  - !Step
    name: step1
  - !Generic
    generic_command: ping -c 2 example.com
    output_digest: null
  - !GlobItems
    glob: '*.txt'
    xvc_path_metadata_map: {}
    xvc_path_content_digest_map: {}
  - !Glob
    glob: '*.txt'
    xvc_paths_digest: null
    xvc_metadata_digest: null
    content_digest: null
  - !RegexItems
    path: requirements.txt
    regex: ^tensorflow
    lines: []
    xvc_metadata: null
  - !Regex
    path: requirements.txt
    regex: ^tensorflow
    lines_digest: null
    xvc_metadata: null
  - !Param
    format: YAML
    path: params.yaml
    key: model.conv_units
    value: null
    xvc_metadata: null
  - !LineItems
    path: params.yaml
    begin: 1
    end: 20
    xvc_metadata: null
    lines: []
  - !Lines
    path: params.yaml
    begin: 1
    end: 20
    xvc_metadata: null
    digest: null
  - !UrlDigest
    url: https://example.com/
    etag: null
    last_modified: null
    url_content_digest: null
  outputs:
  - !File
    path: def.txt
  - !Metric
    path: metrics.json
    format: JSON
  - !Image
    path: plots/confusion.png

xvc pipeline import

Synopsis

$ xvc pipeline import --help
Import the pipeline from a file

Usage: xvc pipeline import [OPTIONS]

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name of the pipeline to import. If not set, the name from the file is used
      --file <FILE>                    File to read the pipeline. Use stdin if not specified
      --format <FORMAT>                Input format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
      --overwrite                      Overwrite the pipeline even if the name already exists
  -h, --help                           Print help

Examples

This command imports pipelines exported with xvc pipeline export. You can edit the exported file before importing it.

Warning

Xvc doesn't guarantee that the format of these files will be compatible across versions. You can use these files to share pipeline definitions but it may not be a good way to store pipeline definitions for longer periods.

This command works only in Xvc repositories.

$ git init
...
$ xvc init

The following file was generated with xvc pipeline export.

$ cat pipeline.yaml
version: 1
name: default
workdir: ''
steps:
- name: step1
  command: touch abc.txt
  invalidate: ByDependencies
  dependencies: []
  outputs: []
- name: step2
  command: touch def.txt
  invalidate: ByDependencies
  dependencies:
  - !Step
    name: step1
  - !Generic
    generic_command: ping -c 2 example.com
    output_digest: null
  - !GlobItems
    glob: '*.txt'
    xvc_path_metadata_map: {}
    xvc_path_content_digest_map: {}
  - !Glob
    glob: '*.txt'
    xvc_paths_digest: null
    xvc_metadata_digest: null
    content_digest: null
  - !RegexItems
    path: requirements.txt
    regex: ^tensorflow
    lines: []
    xvc_metadata: null
  - !Regex
    path: requirements.txt
    regex: ^tensorflow
    lines_digest: null
    xvc_metadata: null
  - !Param
    format: YAML
    path: params.yaml
    key: model.conv_units
    value: null
    xvc_metadata: null
  - !LineItems
    path: params.yaml
    begin: 1
    end: 20
    xvc_metadata: null
    lines: []
  - !Lines
    path: params.yaml
    begin: 1
    end: 20
    xvc_metadata: null
    digest: null
  - !UrlDigest
    url: https://example.com/
    etag: null
    last_modified: null
    url_content_digest: null
  outputs:
  - !File
    path: def.txt
  - !Metric
    path: metrics.json
    format: JSON
  - !Image
    path: plots/confusion.png

You can import this file to construct the pipeline at once. Note that the export command outputs JSON by default.

$ xvc pipeline import --file pipeline.yaml --overwrite

$ xvc pipeline export
{
  "name": "default",
  "steps": [
    {
      "command": "touch abc.txt",
      "dependencies": [],
      "invalidate": "ByDependencies",
      "name": "step1",
      "outputs": []
    },
    {
      "command": "touch def.txt",
      "dependencies": [
        {
          "Step": {
            "name": "step1"
          }
        },
        {
          "Generic": {
            "generic_command": "ping -c 2 example.com",
            "output_digest": null
          }
        },
        {
          "GlobItems": {
            "glob": "*.txt",
            "xvc_path_content_digest_map": {},
            "xvc_path_metadata_map": {}
          }
        },
        {
          "Glob": {
            "content_digest": null,
            "glob": "*.txt",
            "xvc_metadata_digest": null,
            "xvc_paths_digest": null
          }
        },
        {
          "RegexItems": {
            "lines": [],
            "path": "requirements.txt",
            "regex": "^tensorflow",
            "xvc_metadata": null
          }
        },
        {
          "Regex": {
            "lines_digest": null,
            "path": "requirements.txt",
            "regex": "^tensorflow",
            "xvc_metadata": null
          }
        },
        {
          "Param": {
            "format": "YAML",
            "key": "model.conv_units",
            "path": "params.yaml",
            "value": null,
            "xvc_metadata": null
          }
        },
        {
          "LineItems": {
            "begin": 1,
            "end": 20,
            "lines": [],
            "path": "params.yaml",
            "xvc_metadata": null
          }
        },
        {
          "Lines": {
            "begin": 1,
            "digest": null,
            "end": 20,
            "path": "params.yaml",
            "xvc_metadata": null
          }
        },
        {
          "UrlDigest": {
            "etag": null,
            "last_modified": null,
            "url": "https://example.com/",
            "url_content_digest": null
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "step2",
      "outputs": [
        {
          "File": {
            "path": "def.txt"
          }
        },
        {
          "Metric": {
            "format": "JSON",
            "path": "metrics.json"
          }
        },
        {
          "Image": {
            "path": "plots/confusion.png"
          }
        }
      ]
    }
  ],
  "version": 1,
  "workdir": ""
}

If you don't supply the --overwrite option, Xvc will report an error and quit.

$ xvc pipeline import --file pipeline.yaml
? 1
[ERROR] Pipeline Error: Pipeline default already found
Error: PipelineError { source: PipelineAlreadyFound { name: "default" } }

You can specify a new name for the pipeline and it will override the name set in the file. This way you can edit and import similar pipelines with minor differences.

$ xvc pipeline import --pipeline-name another-pipeline --file pipeline.yaml

You can also use stdin to import a pipeline but you must specify the input format.

xvc pipeline update

Synopsis

$ xvc pipeline update --help
Update the name and other attributes of a pipeline

Usage: xvc pipeline update [OPTIONS]

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name of the pipeline this command applies to
      --rename <RENAME>                Rename the pipeline to
      --workdir <WORKDIR>              Set the working directory
      --set-default                    set this pipeline default
  -h, --help                           Print help

xvc pipeline dag

Synopsis

$ xvc pipeline dag --help
Generate a dot or mermaid diagram for the pipeline

Usage: xvc pipeline dag [OPTIONS]

Options:
  -p, --pipeline-name <PIPELINE_NAME>  Name of the pipeline to generate the diagram
      --file <FILE>                    Output file. Writes to stdout if not set
      --format <FORMAT>                Format for graph. Either dot or mermaid [default: dot]
  -h, --help                           Print help

You can visualize a pipeline defined with the xvc pipeline set of commands by using the xvc pipeline dag command. It generates a dot or mermaid diagram for the pipeline.

Examples

Like all other pipeline commands, this requires an Xvc repository.

$ git init --initial-branch=main
Initialized empty Git repository in [CWD]/.git/

$ xvc init

All steps of the pipeline are shown as nodes in the graph.

We create a dependency between the two steps with the --step option of xvc pipeline step dependency to make them run sequentially.

$ xvc pipeline step new --step-name preprocess --command "echo 'preprocess'"

$ xvc pipeline step new --step-name train --command "echo 'train'"

$ xvc pipeline step dependency --step-name train --step preprocess

The raw output is not very readable, but you can pass it directly to dot to get a more useful result.

$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n1;}

The output after dot -Tsvg is:

pipeline-1
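
As a sketch of the rendering step, the DOT text can be saved to a file and passed to Graphviz. The file names below are arbitrary, the DOT text is copied from the output above rather than generated, and the snippet assumes Graphviz may or may not be installed.

```shell
# DOT text copied verbatim from the `xvc pipeline dag` output above.
# In a repository you could instead run: xvc pipeline dag | dot -Tsvg > pipeline.svg
dag='digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n1;}'

# File names are arbitrary choices for this example.
printf '%s\n' "$dag" > pipeline.dot

# Render to SVG only when Graphviz is available on this machine.
if command -v dot >/dev/null 2>&1; then
    dot -Tsvg pipeline.dot -o pipeline.svg
else
    echo "Graphviz (dot) not found; pipeline.dot was written but not rendered" >&2
fi
```

Keeping the intermediate pipeline.dot file is optional; it is only useful when you want to render the same graph in several formats.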

When you add a dependency to a step, the graph shows the dependency as a node as well. For example,

$ xvc pipeline step dependency --step-name preprocess --glob 'data/*'

$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=folder;label="data/*";];n1->n0;n2[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n2;}

pipeline-2

You can use the --format=mermaid option to get a mermaid.js diagram.

$ xvc pipeline dag --format=mermaid
flowchart TD
    n0["preprocess"]
    n1["data/*"] --> n0
    n2["train"]
    n0["preprocess"] --> n2


The output can be used in Mermaid Live Editor or any web page that supports the format.

flowchart TD
    n0["train"]
    n1["preprocess"] --> n0
    n1["preprocess"]
    n2["data/*"] --> n1
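
One way to embed such a diagram in a web page is to place the text inside a pre element with the mermaid class and load Mermaid as a module. The sketch below writes a minimal page of that kind. The file name dag.html and the CDN URL are assumptions, and the diagram text is copied from the command output above.

```shell
# Write a minimal HTML page that renders the diagram with Mermaid.
# The diagram text is the `xvc pipeline dag --format=mermaid` output shown above.
cat > dag.html <<'EOF'
<!DOCTYPE html>
<html>
<body>
<pre class="mermaid">
flowchart TD
    n0["preprocess"]
    n1["data/*"] --> n0
    n2["train"]
    n0["preprocess"] --> n2
</pre>
<script type="module">
  // The CDN location is an assumption; pin the version you actually use.
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true });
</script>
</body>
</html>
EOF
```

Opening dag.html in a browser with network access should show the flowchart.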

Storage management commands (xvc storage)

Purpose

Xvc lets you keep tracked content in storages. These can be on the local file system or in the cloud. The xvc storage set of commands lets you configure, list and delete these storages.

Synopsis

$ xvc storage --help
Storage (cloud) management commands

Usage: xvc storage <COMMAND>

Commands:
  list    List all configured storages
  remove  Remove a storage configuration
  new     Configure a new storage
  help    Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

xvc storage list

Purpose

List all configured storages with their names and GUIDs.

Synopsis

$ xvc storage list --help
List all configured storages

Usage: xvc storage list

Options:
  -h, --help  Print help

Examples

List all storage configurations in the repository:

$ xvc storage list

Caveats

This command uses the local configuration and doesn't try to connect to the storages. A storage listed by the command is not guaranteed to be available for pulling or pushing.
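
Because listing doesn't verify connectivity, a quick manual check can help for storages on the local file system. The sketch below assumes the storage was created with xvc storage new local, which writes a .xvc-guid file into the storage directory. The function name and the path are hypothetical.

```shell
# Return success when DIR looks like a reachable Xvc local storage,
# i.e. it is a writable directory containing the .xvc-guid marker file.
# (Hypothetical helper; Xvc itself does not ship this function.)
is_local_storage_reachable() {
    dir=$1
    [ -d "$dir" ] && [ -w "$dir" ] && [ -f "$dir/.xvc-guid" ]
}

# Example: the path is a placeholder for your own storage directory.
if is_local_storage_reachable "../my-local-storage"; then
    echo "storage directory is reachable"
else
    echo "storage directory is missing or not an Xvc storage"
fi
```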

xvc storage remove

Purpose

Remove unused or inaccessible storages from the configuration.

Synopsis

$ xvc storage remove --help
Remove a storage configuration.

This doesn't delete any files in the storage.

Usage: xvc storage remove --name <NAME>

Options:
      --name <NAME>
          Name of the storage to be deleted

  -h, --help
          Print help (see a summary with '-h')

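Caveats

The command only removes the storage configuration from the repository. It doesn't delete any files in the storage.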

xvc storage new

Synopsis

$ xvc storage new --help 
Configure a new storage

Usage: xvc storage new <COMMAND>

Commands:
  local          Add a new local storage
  generic        Add a new generic storage
  rsync          Add a new rsync storages
  s3             Add a new S3 storage
  minio          Add a new Minio storage
  digital-ocean  Add a new Digital Ocean storage
  r2             Add a new R2 storage
  gcs            Add a new Google Cloud Storage storage
  wasabi         Add a new Wasabi storage
  help           Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

xvc storage new local

Purpose

Create a new storage reachable from the local filesystem. It lets you keep tracked file contents in a different directory for backup or sharing purposes.

Synopsis

$ xvc storage new local --help
Add a new local storage

A local storage is a directory accessible from the local file system. Xvc will use common file operations for this directory without accessing the network.

Usage: xvc storage new local --path <PATH> --name <NAME>

Options:
      --path <PATH>
          Directory (outside the repository) to be set as a storage

  -n, --name <NAME>
          Name of the storage.
          
          Recommended to keep this name unique to refer easily.

  -h, --help
          Print help (see a summary with '-h')

Examples

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

Now, you can define a local directory as storage and begin to use it.

$ xvc storage new local --name backup --path '../my-local-storage'

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa
[DELETE] [CWD]/.xvc/b3/3c6/70f
[DELETE] [CWD]/.xvc/b3/3c6
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d/0.bin
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d
[DELETE] [CWD]/.xvc/b3/7aa/354
[DELETE] [CWD]/.xvc/b3/7aa
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb/0.bin
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb
[DELETE] [CWD]/.xvc/b3/d7d/629
[DELETE] [CWD]/.xvc/b3/d7d
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

Caveats

--name NAME is not checked to be unique, but you should use unique storage names to refer to them later.

--path PATH should be accessible for writing and shouldn't already exist.

Technical Details

The command creates the PATH and a new file under PATH called .xvc-guid. The file contains the unique identifier for this storage. The same identifier is also recorded to the project.

A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to PATH/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}. {{REPO_ID}} is the unique identifier for the repository created during xvc init. Hence if you use a common storage for different Xvc projects, their files are kept under different directories. There is no inter-project deduplication yet.
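
To make the mapping concrete, the following sketch derives the storage location from a cache path. The function name, the storage directory and the repository ID value are made up for illustration; only the PATH/{{REPO_ID}}/... layout comes from the description above.

```shell
# Map a cache path under .xvc/ to its location in a local storage.
# Arguments: storage directory, repository ID, cache path relative to the repo root.
storage_path_for() {
    storage_dir=$1
    repo_id=$2
    cache_path=$3
    # Drop the leading ".xvc/" and prepend the storage directory and repo ID.
    printf '%s/%s/%s\n' "$storage_dir" "$repo_id" "${cache_path#.xvc/}"
}

# The repository ID below is a made-up placeholder.
storage_path_for ../my-local-storage 0123456789abcdef \
    .xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
```

Two projects that share a storage directory differ only in the repository ID component, which is why their files never collide.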

In the future, there may be an option to have a common storage for multiple projects at the same location. Please comment below if this is a common use case.

xvc storage new generic

Purpose

Create a new storage that uses shell commands to send and retrieve cache files. It lets you keep tracked files in any kind of service that can be used from the command line.

Synopsis

$ xvc storage new generic --help
Add a new generic storage.

⚠️ Please note that this is an advanced method to configure storages. You may damage your repository and local and storage files with incorrect configurations.

Please see https://docs.xvc.dev/ref/xvc-storage-new-generic.html for examples and make necessary backups.

Usage: xvc storage new generic [OPTIONS] --name <NAME> --init <INIT_COMMAND> --list <LIST_COMMAND> --download <DOWNLOAD_COMMAND> --upload <UPLOAD_COMMAND> --delete <DELETE_COMMAND>

Options:
  -n, --name <NAME>
          Name of the storage.
          
          Recommended to keep this name unique to refer easily.

  -i, --init <INIT_COMMAND>
          Command to initialize the storage. This command is run once after defining the storage.
          
          You can use {URL} and {STORAGE_DIR}  as shortcuts.

  -l, --list <LIST_COMMAND>
          Command to list the files in storage
          
          You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.

  -d, --download <DOWNLOAD_COMMAND>
          Command to download a file from storage.
          
          You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.

  -u, --upload <UPLOAD_COMMAND>
          Command to upload a file to storage.
          
          You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.

  -D, --delete <DELETE_COMMAND>
          The delete command to remove a file from storage You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options

  -M, --processes <MAX_PROCESSES>
          Number of maximum processes to run simultaneously
          
          [default: 1]

      --url <URL>
          You can set a string to replace {URL} placeholder in commands

      --storage-dir <STORAGE_DIR>
          You can set a string to replace {STORAGE_DIR} placeholder in commands

  -h, --help
          Print help (see a summary with '-h')

You can use the following placeholders in your commands. They are replaced with actual values at runtime, and the commands are run with concrete paths.

  • {URL}: The content of the --url option. (default "")
  • {STORAGE_DIR}: The content of the --storage-dir option. (default "")
  • {RELATIVE_CACHE_PATH}: The portion of the cache path after .xvc/.
  • {ABSOLUTE_CACHE_PATH}: The absolute local path for the cache element.
  • {RELATIVE_CACHE_DIR}: The portion of the directory that contains the file after .xvc/.
  • {ABSOLUTE_CACHE_DIR}: The absolute local path of the directory that contains the cache element.
  • {XVC_GUID}: The repository GUID, used in storages to distinguish the elements of different repositories.
  • {FULL_STORAGE_PATH}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_PATH}
  • {FULL_STORAGE_DIR}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_DIR}
  • {LOCAL_GUID_FILE_PATH}: The local path of the file that contains the GUID of the storage. Used only in the --init option.
  • {STORAGE_GUID_FILE_PATH}: The path in the storage where the GUID file should be placed. Used only in the --init option.
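
The substitution can be imitated with a few lines of shell to see what a command template turns into. This is only a sketch: Xvc performs the real expansion internally, the function name and all argument values are hypothetical, and {FULL_STORAGE_PATH} is built by literal concatenation as listed above.

```shell
# Imitate the placeholder substitution for a handful of placeholders.
# Values must not contain "|", "&" or backslashes because they are
# passed to sed as replacement text.
expand_placeholders() {
    template=$1
    url=$2
    storage_dir=$3
    guid=$4
    rel_cache_path=$5
    # Literal concatenation, as given in the list above.
    full_storage_path="${url}${storage_dir}${guid}/${rel_cache_path}"
    printf '%s\n' "$template" | sed \
        -e "s|{FULL_STORAGE_PATH}|${full_storage_path}|g" \
        -e "s|{URL}|${url}|g" \
        -e "s|{STORAGE_DIR}|${storage_dir}|g" \
        -e "s|{XVC_GUID}|${guid}|g" \
        -e "s|{RELATIVE_CACHE_PATH}|${rel_cache_path}|g"
}

# All argument values below are made up for the example.
expand_placeholders 'rm -f {FULL_STORAGE_PATH}' '' '/tmp/my-xvc-storage/' 'my-repo-guid' 'b3/abc/def/0123/0.bin'
```

Printing the expanded command like this before defining a storage is a cheap way to catch a missing slash or a misplaced placeholder.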

Examples

Create a generic storage in the same filesystem

You can create a storage that uses shell commands to send and receive files to another location in the file system.

There are two variables that you can define and use in the commands: --url and --storage-dir. For a storage in the same file system, --url can be blank and --storage-dir can be the location you want to use.

$ xvc storage new generic
    --url ""
    --storage-dir $HOME/my-xvc-storage
    ...

You need to specify the commands for the following operations:

  • init: The command that's used to create the directory that will be used as a storage. It should also copy XVC_STORAGE_GUID_FILENAME (currently .xvc-guid) to that location. This file is used to identify the location as an Xvc storage.

$ xvc storage new generic
      ...
      --init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}'
      ...

Note that if the command doesn't contain the {LOCAL_GUID_FILE_PATH} and {STORAGE_GUID_FILE_PATH} variables, it won't be run and Xvc will report an error.

  • list: This operation should list all files under {URL}{STORAGE_DIR}. The list is filtered through a regex that matches the format of the paths. Hence, even if the command lists all files in the storage, Xvc will consider only the relevant paths.

All paths should be listed on separate lines.

$ xvc storage new generic
        ...
        --list 'ls -1 {URL}{STORAGE_DIR}'
        ...

  • upload: The command that will copy a file from the local cache to the storage. Normally, it uses the {ABSOLUTE_CACHE_PATH} variable. For the local file system, we also need to create a directory before copying.

$ xvc storage new generic
     ...
     --upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}'
     ...

  • download: This command will be used to copy from the storage to the local cache. It must create the local cache directory as well.

$ xvc storage new generic
    ...
    --download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}'
    ...

  • delete: This operation is used to delete the storage file. It shouldn't touch the local file in any way, otherwise you may lose data.

$ xvc storage new generic
    ...
    --delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
    ...
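
The requirement that the init command mentions both GUID placeholders can be checked before the storage is defined. The helper below is a hypothetical pre-flight check written for this example; Xvc performs its own validation and reports its own error.

```shell
# Check that an --init command template mentions both GUID placeholders.
init_has_guid_placeholders() {
    case $1 in
        *'{LOCAL_GUID_FILE_PATH}'*) ;;
        *) return 1 ;;
    esac
    case $1 in
        *'{STORAGE_GUID_FILE_PATH}'*) return 0 ;;
        *) return 1 ;;
    esac
}

init_cmd='mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}'
if init_has_guid_placeholders "$init_cmd"; then
    echo "init command looks complete"
else
    echo "init command is missing a GUID placeholder" >&2
fi
```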

In total, the command you write is the following. It defines all operations of this storage. The required --name option is included with an example value.

$ xvc storage new generic \
    --name my-generic-storage \
    --url "" \
    --storage-dir "$HOME/my-xvc-storage" \
    --init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}' \
    --list 'ls -1 {URL}{STORAGE_DIR}' \
    --upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}' \
    --download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}' \
    --delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'

Create a storage using rsync

Rsync is available on all popular platforms for copying file contents. Xvc can use it to maintain a storage if you already have a working rsync setup.

We need to define operations for init, upload, download, list and delete with rsync or ssh. Some of the commands need ssh to perform operations, like creating a directory. We'll use placeholders for paths.

As the rsync URL format is slightly different from the SSH one, we will define the commands verbosely.

Suppose you want to use your account at user@example.com to store your Xvc files. You want to store the files under /home/user/my-xvc-storage.

We assume you have configured public key authentication for your account. Xvc doesn't receive user input during storage operations, and can't receive your password during runs.

We first define these as our --url and --storage-dir options.

$ xvc storage new generic
        --url user@example.com
        --storage-dir '/home/user/my-xvc-storage'
        ...

Initialization command must create this directory and copy the storage GUID file to its respective location.

$ xvc storage new generic
  ...
  --init "ssh {URL} 'mkdir -p {STORAGE_DIR}' ; rsync -av '{LOCAL_GUID_FILE_PATH}' '{URL}:{STORAGE_GUID_FILE_PATH}'"

Note the use of : in the rsync command. As rsync doesn't currently support ssh:// URLs, we are using a form that's compatible with both ssh and rsync as the URL. It may be possible to use && between the ssh and rsync commands, but with ; the guid file is still copied even if the first command reports a failure.
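
The difference between the two separators can be seen in isolation. In this sketch first_step is a stand-in for a failing ssh command and the echo stands in for the rsync copy; neither is part of Xvc.

```shell
first_step() { return 1; }   # stands in for a failing ssh command

# ";" runs the second command regardless of the first one's exit status.
with_semicolon=$(first_step ; echo "guid copied")

# "&&" skips the second command when the first one fails.
with_and=$(first_step && echo "guid copied") || true

echo "with ';'  : ${with_semicolon:-nothing ran}"
echo "with '&&' : ${with_and:-nothing ran}"
```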

Caveats

Technical Details

The paths in list commands are filtered through a regex. They are matched against the {REPO_GUID}/{RELATIVE_CACHE_DIR}/0 pattern and only the {RELATIVE_CACHE_DIR} portion is reported. Any line that doesn't conform to this pattern is ignored. You can use any listing command that returns a recursive file list, and only the pattern-matching elements are considered.
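
The filtering can be approximated with sed to check what a listing command would contribute. The exact regex Xvc uses is not given here, so the pattern below is an assumption modelled on the cache paths in the examples: an algorithm prefix followed by 3, 3 and 58 hexadecimal characters. The function name and the GUID value are hypothetical.

```shell
# Keep only lines of the form {REPO_GUID}/{RELATIVE_CACHE_DIR}/0[.ext] and
# print the {RELATIVE_CACHE_DIR} part. The GUID must not contain regex
# metacharacters or "#" in this simplified version.
filter_listing() {
    repo_guid=$1
    sed -n -E "s#^${repo_guid}/([a-z0-9]+/[0-9a-f]{3}/[0-9a-f]{3}/[0-9a-f]{58})/0(\\..*)?\$#\\1#p"
}

# Three sample lines; only the first one matches the pattern.
printf '%s\n' \
    'my-guid/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin' \
    '.xvc-guid' \
    'notes.txt' | filter_listing my-guid
```

Lines such as the GUID file or unrelated files are dropped silently, which is why a broad recursive listing is acceptable as the list command.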

xvc storage new rsync

Purpose

Configure a directory on a host reachable with rsync over SSH as an Xvc storage.

Synopsis

$ xvc storage new rsync --help
Add a new rsync storages

Uses rsync in separate processes to communicate. This can be used when you already have an SSH/Rsync connection. It doesn't prompt for any passwords. The connection must be set up with ssh keys beforehand.

Usage: xvc storage new rsync [OPTIONS] --name <NAME> --host <HOST> --storage-dir <STORAGE_DIR>

Options:
  -n, --name <NAME>
          Name of the storage.
          
          Recommended to keep this name unique to refer easily.

      --host <HOST>
          Hostname for the connection in the form host.example.com  (without @, : or protocol)

      --port <PORT>
          Port number for the connection in the form 22. Doesn't add port number to connection string if not given

      --user <USER>
          User name for the connection, the part before @ in user@example.com (without @, hostname). User name isn't included in connection strings if not given

      --storage-dir <STORAGE_DIR>
          storage directory in the host to store the files

  -h, --help
          Print help (see a summary with '-h')

Examples

You must set up an SSH connection with key-based authentication before using this command.
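
Key-based login can be verified before the storage is defined. The sketch below builds the SSH target from an optional user and a host, mirroring the --user and --host options; the helper name is hypothetical and the actual ssh call is left commented out so that nothing connects to a remote host by accident.

```shell
# Build the SSH target from an optional user and a host.
# (Hypothetical helper, not part of Xvc.)
ssh_target() {
    user=$1
    host=$2
    if [ -n "$user" ]; then
        printf '%s@%s\n' "$user" "$host"
    else
        printf '%s\n' "$host"
    fi
}

# BatchMode=yes makes ssh fail instead of prompting for a password, so a
# successful run indicates that key-based login works. Uncomment to try it
# against your own host:
# ssh -o BatchMode=yes "$(ssh_target iex e1.xvc.dev)" true
ssh_target iex e1.xvc.dev
```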

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

You can define a directory on the remote host as a storage and begin to use it.

$ xvc storage new rsync --name backup --host e1.xvc.dev --user iex --storage-dir /tmp/xvc-backup/

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa
[DELETE] [CWD]/.xvc/b3/3c6/70f
[DELETE] [CWD]/.xvc/b3/3c6
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d/0.bin
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d
[DELETE] [CWD]/.xvc/b3/7aa/354
[DELETE] [CWD]/.xvc/b3/7aa
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb/0.bin
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb
[DELETE] [CWD]/.xvc/b3/d7d/629
[DELETE] [CWD]/.xvc/b3/d7d
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

xvc storage new s3

Purpose

Configure an S3 (or a compatible) service as an Xvc storage.

Synopsis

$ xvc storage new s3 --help
Add a new S3 storage

Reads credentials from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.

Usage: xvc storage new s3 [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage

          This must be unique among all storages of the project

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix

          [default: ]

      --bucket-name <BUCKET_NAME>
          S3 bucket name

      --region <REGION>
          AWS region

  -h, --help
          Print help (see a summary with '-h')

Examples

Before calling any commands that use this storage, you must set the following environment variables.

  • AWS_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Amazon Web Services account. The second form is used when you have multiple accounts and you want to use a specific one.
  • AWS_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Amazon Web Services account. The second form is used when you have multiple accounts and you want to use a specific one.
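
A small pre-flight check can confirm that one of the two variable pairs is set before sending files. This is a sketch under the assumption that the storage name is appended to the variable name verbatim; the function name and the key values are placeholders.

```shell
# Succeed when credentials for the named storage are available, either in the
# storage-specific XVC_STORAGE_* variables or in the generic AWS variables.
# The storage name must be usable inside a shell variable name.
have_s3_credentials() {
    name=$1
    eval "specific_id=\${XVC_STORAGE_ACCESS_KEY_ID_${name}:-}"
    eval "specific_secret=\${XVC_STORAGE_SECRET_ACCESS_KEY_${name}:-}"
    if [ -n "$specific_id" ] && [ -n "$specific_secret" ]; then
        return 0
    fi
    [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]
}

# Placeholder values; never commit real keys to the repository.
export AWS_ACCESS_KEY_ID="my-access-key-id"
export AWS_SECRET_ACCESS_KEY="my-secret-access-key"

if have_s3_credentials backup; then
    echo "credentials found for storage 'backup'"
else
    echo "set the credentials before running xvc file send" >&2
fi
```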

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

You can define a bucket as a storage and begin to use it.

$ xvc storage new s3 --name backup --bucket-name xvc-test --region eu-central-1 --storage-prefix xvc-storage

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

xvc storage new gcs

Purpose

Configure a Google Cloud Storage service as an Xvc storage.

Synopsis

$ xvc storage new gcs --help
Add a new Google Cloud Storage storage

Reads credentials from `GCS_ACCESS_KEY_ID` and `GCS_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.

Usage: xvc storage new gcs [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage

          This must be unique among all storages of the project

      --bucket-name <BUCKET_NAME>
          Bucket name

      --region <REGION>
          Region of the server, e.g., europe-west3

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix

          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

Please configure the S3-compatible interface of your Google Cloud Storage account before using this command.

Before calling any commands that use this storage, you must set the following environment variables.

  • GCS_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Google Cloud Storage account. The second form is used when you have multiple storages with different access keys.
  • GCS_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Google Cloud Storage account. The second form is used when you have multiple storages with different access keys.

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

You can define a bucket as a storage and begin to use it.

$ xvc storage new gcs --name backup --bucket-name xvc-test --region europe-west3 --storage-prefix xvc-storage

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

xvc storage new minio

Purpose

Create a new Xvc storage on a MinIO instance. It lets you store tracked file contents on a MinIO server.

Synopsis

$ xvc storage new minio --help
Add a new Minio storage

Reads credentials from `MINIO_ACCESS_KEY` and `MINIO_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.

Usage: xvc storage new minio [OPTIONS] --name <NAME> --endpoint <ENDPOINT> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage

          This must be unique among all storages of the project

      --endpoint <ENDPOINT>
          Minio server url in the form https://myserver.example.com:9090

      --bucket-name <BUCKET_NAME>
          Bucket name

      --region <REGION>
          Region of the server

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix

          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

Before calling any commands that use this storage, you must set the following environment variables.

  • MINIO_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the MinIO account. The second form is used when you have multiple MinIO accounts and you want to use a specific one.
  • MINIO_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the MinIO account. The second form is used when you have multiple MinIO accounts and you want to use a specific one.

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

You can define a bucket as a storage and begin to use it.

$ xvc storage new minio --name backup --endpoint http://e1.xvc.dev:9000 --bucket-name xvc-tests --region us-east-1 --storage-prefix xvc

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

Caveats

--name NAME is not verified to be unique, but you should use unique storage names to refer to them later. You can also use the storage GUIDs listed by xvc storage list to refer to storages.

You must have a valid connection to the server.

Xvc uses the MinIO API port (9001, by default) to connect to the server. Ensure that it's accessible.

Because of the underlying library, Xvc tries to connect to http://xvc-bucket.example.com:9001 if you give http://example.com:9001 as the endpoint and xvc-bucket as the bucket name. You may need to take this into account when your servers run at exact URLs. If your MinIO server is at http://minio.example.com:9001, you may want to supply http://example.com:9001 as the endpoint and minio as the bucket name to form the correct URL. This behavior may change in the future.
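
The URL that results from a given endpoint and bucket name can be previewed with a short function. This is an illustration of the behavior described above, not the code the underlying library runs, and the function name is hypothetical.

```shell
# Show the URL formed by prepending the bucket name to the endpoint host.
effective_minio_url() {
    endpoint=$1
    bucket=$2
    scheme=${endpoint%%://*}        # e.g. "http"
    host_and_port=${endpoint#*://}  # e.g. "example.com:9001"
    printf '%s://%s.%s\n' "$scheme" "$bucket" "$host_and_port"
}

effective_minio_url http://example.com:9001 xvc-bucket
effective_minio_url http://example.com:9001 minio
```

The second call shows how the endpoint and bucket name from the paragraph above combine into the address of a server at minio.example.com.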

Technical Details

This command requires Xvc to be compiled with the minio feature, which is on by default. It uses Rust async features via the rust-s3 crate, which may add some bulk to the binary. If you want to compile Xvc without these features, please refer to the How to Compile Xvc document.

The command creates a .xvc-guid file at http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/.xvc-guid. The file contains the unique identifier for this storage. The same identifier is also recorded in the project.

A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}. {{REPO_ID}} is the unique identifier for the repository created during xvc init. Hence if you use a common storage for different Xvc projects, their files are kept under different directories. There is no inter-project deduplication.

xvc storage new r2

Purpose

Use Cloudflare R2 as an Xvc storage.

Synopsis

$ xvc storage new r2 --help
Add a new R2 storage

Reads credentials from `R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.

Usage: xvc storage new r2 [OPTIONS] --name <NAME> --account-id <ACCOUNT_ID> --bucket-name <BUCKET_NAME>

Options:
  -n, --name <NAME>
          Name of the storage

          This must be unique among all storages of the project

      --account-id <ACCOUNT_ID>
          R2 account ID

      --bucket-name <BUCKET_NAME>
          Bucket name

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix

          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

Before calling any commands that use this storage, you must set the following environment variables.

  • R2_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Cloudflare R2 account. The second form is used when you have multiple accounts and you want to use a specific one.
  • R2_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Cloudflare R2 account. The second form is used when you have multiple accounts and you want to use a specific one.

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

You can define a storage bucket as storage and begin to use it.

$ xvc storage new r2 --name backup --bucket-name xvc-test --account-id e5dcca29209558eb9de6c07ae53b0a6f --storage-prefix xvc-storage

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

xvc storage new wasabi

Purpose

Configure a Wasabi service as an Xvc storage.

Synopsis

$ xvc storage new wasabi --help
Add a new Wasabi storage

Reads credentials from `WASABI_ACCESS_KEY_ID` and `WASABI_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.

Usage: xvc storage new wasabi [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME>

Options:
  -n, --name <NAME>
          Name of the storage

          This must be unique among all storages of the project

      --bucket-name <BUCKET_NAME>
          Bucket name

      --endpoint <ENDPOINT>
          Endpoint for the server, complete with the region if there is

          e.g. for eu-central-1 region, use s3.eu-central-1.wasabisys.com as the endpoint.

          [default: s3.wasabisys.com]

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix

          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

Before calling any commands that use this storage, you must set the following environment variables.

  • WASABI_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Wasabi account. The second form is used when you have multiple storage accounts with different access keys.
  • WASABI_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Wasabi account. The second form is used when you have multiple storage accounts with different access keys.

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

You can define a storage bucket as storage and begin to use it.

$ xvc storage new wasabi --name backup --bucket-name xvc-test --endpoint s3.wasabisys.com --storage-prefix xvc-storage

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

xvc storage new digital-ocean

Purpose

Configure a Digital Ocean Spaces service as an Xvc storage.

Synopsis

$ xvc storage new digital-ocean --help
Add a new Digital Ocean storage

Reads credentials from `DIGITAL_OCEAN_ACCESS_KEY_ID` and `DIGITAL_OCEAN_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.

Usage: xvc storage new digital-ocean [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage
          
          This must be unique among all storages of the project

      --bucket-name <BUCKET_NAME>
          Bucket name

      --region <REGION>
          Region of the server

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix
          
          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

Before calling any commands that use this storage, you must set the following environment variables.

  • DIGITAL_OCEAN_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Digital Ocean account. The second form is used when you have multiple Digital Ocean accounts and you want to use a specific one.
  • DIGITAL_OCEAN_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Digital Ocean account. The second form is used when you have multiple Digital Ocean accounts and you want to use a specific one.

The command works only in Xvc repositories.

$ git init
...
$ xvc init

$ xvc-test-helper create-directory-tree --directories 1 --files 3  --seed 20230211

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

Xvc only sends and receives tracked files.

$ xvc file track dir-0001

You can define a storage bucket as storage and begin to use it.

$ xvc storage new digital-ocean --name backup --bucket-name xvc --region fra1 --storage-prefix xvc

Send files to this storage.

$ xvc file send dir-0001 --to backup

You can remove the files you sent from your cache and workspace.

$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3

$ rm -rf dir-0001/

Then get them back from the storage.

$ xvc file bring --from backup dir-0001

$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin

1 directory, 3 files

If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.

$ xvc file remove --from-storage backup dir-0001/

Utilities

xvc root

Purpose

Shows the Xvc root project directory where .xvc/ resides.

Synopsis

$ xvc root --help
Find the root directory of a project

Usage: xvc root [OPTIONS]

Options:
      --absolute  Show absolute path instead of relative
  -h, --help      Print help

Examples

xvc root can be used in scripts to make paths relative to the Xvc project root.

By default, it shows the relative path.

$ xvc root
..

When you supply --absolute, it prints the absolute path.

$ xvc root --absolute
/home/user/my-xvc-project/

xvc check-ignore

Purpose

Check whether a path is ignored or whitelisted by Xvc.

Synopsis

$ xvc check-ignore --help
Check whether files are ignored with `.xvcignore`

Usage: xvc check-ignore [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Targets to check. If no targets are provided, they are read from stdin

Options:
      --ignore-filename <IGNORE_FILENAME>
          Filename that contains ignore rules
          
          This can be set to .gitignore to test whether Git and Xvc work the same way.
          
          [default: .xvcignore]

  -h, --help
          Print help (see a summary with '-h')

Examples

$ git init
...
$ xvc init

You can add files and directories to be ignored by Xvc to .xvcignore files.

$ zsh -cl "echo 'my-dir/my-file' >> .xvcignore"

By default it checks the files supplied from stdin.

$ zsh -cl 'echo my-dir/my-file | xvc check-ignore'
[IGNORE] [CWD]/my-dir/my-file

The .xvcignore file format is identical to .gitignore file format.

$ cat .xvcignore

# Add patterns of files xvc should ignore, which could improve
# the performance.
# It's in the same format as .gitignore files.

.DS_Store
my-dir/my-file

If you supply paths from the CLI, they are checked against the ignore rules in .xvcignore.

$ xvc check-ignore my-dir/my-file another-dir/another-file
[IGNORE] [CWD]/my-dir/my-file
[NO MATCH] [CWD]/another-dir/another-file

You can also add whitelist patterns to .xvcignore files.

$ zsh -cl "echo '!another-dir/*' >> .xvcignore"
$ xvc check-ignore my-dir/my-file another-dir/another-file
[IGNORE] [CWD]/my-dir/my-file
[WHITELIST] [CWD]/another-dir/another-file

This utility can also be used to check ignore rules in other files. You can specify an alternative ignore filename with the --ignore-filename option. The command below is equivalent to git check-ignore and should give the same results.

$ xvc check-ignore --ignore-filename .gitignore

xvc aliases

Synopsis

$ xvc aliases --help
Print command aliases to be sourced in shell files

Usage: xvc aliases

Options:
  -h, --help  Print help

Examples

You can include aliases in interactive shells.

$ . $(xvc aliases)
$ pvc --help
Pipeline management commands

Usage: xvc pipeline [OPTIONS] <COMMAND>

Commands:
  new     Add a new pipeline
  update  Rename, change dir or set a pipeline default
  delete  Delete a pipeline
  run     Run a pipeline
  list    List all pipelines
  dag     Generate mermaid diagram for the pipeline
  export  Export the pipeline to a YAML, TOML or JSON file
  import  Import the pipeline from a file
  step    Step management commands
  help    Print this message or the help of the given subcommand(s)

Options:
  -n, --name <NAME>  Name of the pipeline this command applies to
  -h, --help         Print help information

If you add the above line to your .bashrc or .zshrc, these aliases will always be available.

You can get a list of aliases.

$ xvc aliases

alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfco='xvc file checkout'
alias xvcfr='xvc file recheck'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'

If there are aliases that you'd rather not use with Xvc, you can unalias them.

This command is not implemented yet. Please see https://github.com/iesahin/xvc/issues/176 for its progress.

Rust API

xvc

See https://docs.rs/xvc/ for latest version of the Xvc API

xvc-config

See https://docs.rs/xvc-config/ for latest version of the Xvc API

xvc-core

See https://docs.rs/xvc-core/ for latest version of the Xvc API

xvc-ecs

See https://docs.rs/xvc-ecs/ for latest version of the Xvc API

xvc-file

See https://docs.rs/xvc-file/ for latest version of the Xvc API

xvc-logging

See https://docs.rs/xvc-logging/ for latest version of the Xvc API

xvc-pipeline

See https://docs.rs/xvc-pipeline/ for latest version of the Xvc API

xvc-storage

See https://docs.rs/xvc-storage/ for latest version of the Xvc API

xvc-walker

See https://docs.rs/xvc-walker/ for latest version of the Xvc API

Xvc Architecture

The malleability of the material (bits and bytes) we're working with leads to difficulties in architecting software. Unlike real architecture, bits and bytes don't bring natural restrictions. It's not possible to build skyscrapers with mud bricks, and our material is much more malleable. There are so many options and so many ways to solve problems that it's easy to sink into technical mud with the decisions we make.

Software developers created a set of architectural principles to overcome this lack of limits. Most of these principles are bogus. They are not tested in the field. We seldom have software that's still perfectly maintainable after ten years. Usually, reading and understanding the code is more difficult than coming up with a new solution and rewriting it.

In this chapter, we describe the problems, assumptions, and solutions in Xvc's intended domain. It's a work in progress but should give you ideas about the intentions behind decisions.

After two decades, I (un)learned a few basic principles regarding software development.

  • Object Oriented Programming doesn't work. Mixing data and functions (methods) isn't a good way to write programs. It leads to artificial layers and structures that become burdensome in the long run. It forces the developer to think about both the data and functionality at the same time. This makes reasoning and solving the problem harder than it should be.

  • Data structures are more important than algorithms. Using a few distinct, well-thought-out data structures is more important than creating the best algorithm. Algorithms are replaceable locally without much peripheral impact. Modifying data structures usually requires updates to all related elements.

  • DRY is overrated. It may be a good principle after you write the first version. However, during the actual development phase, it's not a good idea to try not to repeat yourself. What parts of the program repeat, what parts rhyme, and what should be abstracted can be seen after we write the whole. Trying to apply abstract principles to exploratory development hinders the ability to solve problems as plainly as possible.

  • More errors are made in the name of abstraction than the reverse. Abstractions don't always help. They usually distribute a single functionality across arbitrary layers. In the age of LSP, it's easier to find repeating functionality and merge or rewrite it than to fix incorrect assumptions about abstractions. Problems with repeating code are obvious and easier to fix than problems with abstractions.

  • Vertical architecture is more important than horizontal architecture. Vertical architecture means the lower the number of layers between the user and their intention, the better. If the user wants to copy a file, creating a layer of abstract classes to make this more modular doesn't result in more resilient software. If you want to detect whether you're in a Git repository, checking the presence of the .git directory is simpler than creating a few abstract classes that work for more than one SCM, and implementing abstract methods for them. The architecture shouldn't try to satisfy abstract patterns; it should make the path between the user's action and its effect as direct as possible.

Xvc Modules (Crates)

Xvc is composed of modules that can be tested and used independently. The core module is in the middle of the architecture. Lower-level crates interface with the OS and convert what they read into data structures. Higher levels use these data structures to implement functionality.

For example, the xvc-walker crate interfaces with directories, paths and ignore rules, and serves a set of paths with their metadata. The xvc-file crate uses these to check whether a file has changed.

  • logging: Logger definitions and debugging macros.
  • walker: A file system directory walker that checks ignore files. It can also notify the changes in the directory via channels after the initial traversal.
  • config: Configuration framework that loads configuration from various levels (Default, System, User, Project, Environment) and merges these with command line options for each module.
  • ecs: The entity-component system responsible for saving and loading state of all data structures, along with their associations and queries.
  • storage: Commands and functionality to configure external (local or cloud) locations to store file content.
  • core: Xvc specific data structures and utilities. All user level modules use this module for shared functionality.

  • file: Commands to track files and utilities around file management.
  • pipeline: Commands to define data pipelines as DAGs and run them.

The current dependency graph where lower-level modules are used directly is this:

graph TD 

xvc --> xvc-file
xvc --> xvc-pipeline

xvc-file --> xvc-config
xvc-file --> xvc-core
xvc-file --> xvc-ecs
xvc-file --> xvc-logging
xvc-file --> xvc-walker
xvc-file --> xvc-storage

xvc-pipeline --> xvc-config 
xvc-pipeline --> xvc-core
xvc-pipeline --> xvc-ecs
xvc-pipeline --> xvc-logging
xvc-pipeline --> xvc-walker

xvc-config --> xvc-walker
xvc-config --> xvc-logging

xvc-ecs --> xvc-logging

xvc-core --> xvc-config
xvc-core --> xvc-logging
xvc-core --> xvc-walker
xvc-core --> xvc-ecs

xvc-walker --> xvc-logging

After the crate interfaces are stabilized, all lower-level functions will be reused from xvc-core. It will provide the basic Xvc API. In this case, the graph will be simplified.

graph TD 

xvc --> xvc-file
xvc --> xvc-pipeline

xvc-file --> xvc-core

xvc-pipeline --> xvc-core

xvc-config --> xvc-walker
xvc-config --> xvc-logging

xvc-ecs --> xvc-logging

xvc-core --> xvc-config
xvc-core --> xvc-logging
xvc-core --> xvc-walker
xvc-core --> xvc-ecs
xvc-core --> xvc-storage

xvc-walker --> xvc-logging

Any improvement in user-level API will be done higher than xvc-core levels. Any improvement in lower-level modules will be done in dependencies of xvc-core.

Goals

Xvc is a CLI MLOps tool to track file, data, pipeline, experiment and model versions.

It has the following goals:

  • Enable tracking any kind of file, including large binaries, data and models, in Git.
  • Enable getting a subset of these files.
  • Enable removing files from the workspace temporarily and retrieving them from the cache.
  • Enable uploading and downloading these files to/from a central server.
  • Enable users to run pipelines composed of commands.
  • Be able to invalidate pipelines partially.
  • Enable running a pipeline or arbitrary commands as experiments, and storing and retrieving them.

Xvc users are data and machine learning professionals that need to track large amounts of data. They also want to run arbitrary commands on this data when it changes. Their goal is to produce better machine learning models and better suited data for their problems.

We have three quality goals:

  • Robustness: The system should be robust for basic operations.
  • Performance: The overall system performance must be within the ballpark of usual commands like b3sum or cp.
  • Availability: The system must run on all major operating systems.

Xvc users work with large amounts of data. They want to depend on Xvc for basic operations like tracking file versions, and uploading these to a central location.

They don't want to wait too long for these operations on common hardware.

They would like to download their data to any system running various operating systems.

Xvc Cache

The cache is where Xvc copies the files it tracks.

It's located under the .xvc directory.

Instead of the file tree that's normally used to address files, it uses the content digest of files to organize them.

In a standard file hierarchy, we have files in paths like /home/iesahin/Photos/my-photo.png. Xvc doesn't use such a tree in its cache. It uses paths like .xvc/b3/a12/b45/d789a...f54/0.png to refer to files.

Producing the cache path from its content causes cache paths to change when the files are updated. For example, in a standard file system, if you save another photo on top of my-photo.png, the first version will be lost. Xvc stores these two versions in different locations in the cache, so they are not lost.

There are 4 parts of this cache path.

.xvc part is the standard directory xvc init command creates. It resides in the root folder of your project.

b3/ denotes the digest type of the content digest. Xvc supports more than one algorithm to calculate content digests. The HashAlgorithm enum (https://docs.rs/xvc-core/latest/xvc_core/types/hashalgorithm/enum.HashAlgorithm.html) shows which algorithms are supported. Each of these algorithms has a 2-letter prefix.

  • b3: BLAKE3
  • b2: BLAKE2s
  • s3: SHA2-256
  • s2: SHA3-256

Note that all these digest algorithms produce 256-bit (32-byte) digests. This digest is converted to 64 hexadecimal digits. To keep the total path length shorter, Xvc requires digests to be 32 bytes in length.

The third part of the cache path is these 64 hexadecimal digits in the form a12/b45/d789...f54/. The 64 digits are split into directories to keep the number of entries in any one directory low. Had Xvc put all cache elements in a single directory, it could lead to degraded performance in some file systems. With this arrangement, b3/ can contain at most 4096 (16³) directories, each of which contains at most 4096 directories. With usual distribution and good hash algorithms, there won't be more than 4000 elements per directory until 64 billion (4000³) files are in the cache.
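The numbers in this paragraph can be checked with a few lines of arithmetic. The function names below are made up for this illustration; the capacity estimate assumes three nested levels that each hold the same number of entries.

```rust
/// Number of distinct directory names made of `digits` hexadecimal digits.
fn hex_dir_count(digits: u32) -> u64 {
    16u64.pow(digits)
}

/// Rough capacity estimate: three nested levels, each directory holding
/// at most `per_dir` entries.
fn capacity(per_dir: u64) -> u64 {
    per_dir.pow(3)
}
```

Here hex_dir_count(3) is 4096, the upper bound for the three-digit directory levels, and capacity(4000) is 64,000,000,000.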

The fourth part is 0.png, the file itself with the same extension but with 0 as the basename. Xvc uses the digest as a directory name instead of the file name. There may be times when the file in the cache should be used manually, on cloud storage for example. The extension is kept for this reason, to make sure that the OS recognizes the file type correctly.

The rename to 0 means that this is the whole file. In the future, when Xvc supports splitting large files to transfer to remotes, all parts of the file will be put into this directory.

Storages use the same cache structure, with an added GUID part so that a single storage can be used for multiple projects.
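The four parts of the cache path can be put together in a short sketch. The function cache_path is a hypothetical helper written for this illustration, not Xvc's actual implementation; it takes the 2-letter prefix, the 64 hexadecimal digits and the file extension as plain strings.

```rust
use std::path::PathBuf;

/// Hypothetical helper: build a cache path of the form
/// `.xvc/<prefix>/<3 digits>/<3 digits>/<58 digits>/0.<extension>`.
fn cache_path(prefix: &str, hex_digest: &str, extension: &str) -> Option<PathBuf> {
    // Digests are 32 bytes, so exactly 64 hexadecimal digits are expected.
    if hex_digest.len() != 64 || !hex_digest.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let mut path = PathBuf::from(".xvc");
    path.push(prefix);
    path.push(&hex_digest[0..3]);
    path.push(&hex_digest[3..6]);
    path.push(&hex_digest[6..]);
    // The file keeps its extension, but its basename becomes 0.
    path.push(format!("0.{}", extension));
    Some(path)
}
```

Returning None for anything that is not 64 hexadecimal digits is a choice made for this sketch; it keeps the string slicing safe.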

The Architecture of Xvc Entity Component System

Xvc uses an entity component system (ECS) at its core. ECS architecture is popular in game development but hasn't found popularity in other areas. It's an alternative to Object-Oriented Programming.

There are a few basic notions of ECS architecture. Although it may differ in other frameworks, Xvc assumes the following:

  • An entity is a neutral way of tracking components and their relationships. It doesn't contain any semantics other than being an entity. An entity in Xvc is an atomic integer tuple. (XvcEntity)

  • A component is a bundle of associated data about these entities. All semantics of entities are described through components. Xvc uses components to keep track of different aspects of file system objects, dependencies, storages, etc.

  • A system is where the components are created and modified. Xvc considers all modules that interact with components as separate systems.

Suppose you want to track a new file in Xvc. Xvc creates a new entity for this file and associates the path (XvcPath) with this entity. It creates an instance of XvcMetadata that represents the file size and timestamp, and associates it with this entity. An XvcDigest struct is also associated with the entity to show the file's content digest.
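A minimal sketch of this idea follows. The types here are simplified stand-ins chosen for illustration (a tuple for the entity, a String for the path, a byte array for the digest); the real XvcEntity, XvcPath, XvcMetadata and XvcDigest definitions differ.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for an entity id.
type Entity = (u64, u64);

#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    size: u64,
    modified: u64, // seconds since the epoch
}

/// One map per component type, all keyed by the same entity.
#[derive(Default)]
struct Components {
    paths: BTreeMap<Entity, String>,
    metadata: BTreeMap<Entity, Metadata>,
    digests: BTreeMap<Entity, [u8; 32]>,
}

impl Components {
    /// Attach a path, its metadata and its digest to an entity.
    fn track(&mut self, entity: Entity, path: &str, metadata: Metadata, digest: [u8; 32]) {
        self.paths.insert(entity, path.to_string());
        self.metadata.insert(entity, metadata);
        self.digests.insert(entity, digest);
    }
}
```

No struct in this sketch owns the others. The only thing the three maps share is the entity key.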

The difference from OOP is that there is no basic or main object. There is no file object that contains a digest, or a directory object that is inherited from files.

If you want to work only with digests and want to find the workspace paths associated with them, you can write a function (a system in Entity-Component-System) that starts from XvcDigest records and collects the associated paths. If you want to get only the files larger than a certain size, you can work with XvcMetadata, filter them and get the paths later. In contrast, in an OOP setting, these data are associated with paths, and when you want to do such operations, you need to load the paths and their associations first. The OOP way of doing things is usually against the principle of locality.

The whole idea is to be flexible for further changes. For example, these days Xvc doesn't have notions of data and models. Files are just files. It doesn't have different functionality for files that are models or data. When this distinction is added, an XvcModel component will be created and associated with the same entity as an XvcPath, and a set of XvcFeatures will be associated in the same way XvcMetadata is associated with XvcPath. It will allow working with some paths as model files, but it won't require paths to be known beforehand. There may be other metadata, like features or versions associated with models, that are more important. There may be some models without a file system path, maybe living only in memory or in the cloud.
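The size filter mentioned above could look like the following sketch. It assumes two plain maps keyed by a simplified entity tuple; paths_larger_than is a hypothetical function, not an Xvc API.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for an entity id.
type Entity = (u64, u64);

/// Return the paths of files larger than `min_size`. The query starts from
/// the size component and looks up paths only for the matching entities.
fn paths_larger_than(
    sizes: &BTreeMap<Entity, u64>,
    paths: &BTreeMap<Entity, String>,
    min_size: u64,
) -> Vec<String> {
    sizes
        .iter()
        .filter(|(_, size)| **size > min_size)
        .filter_map(|(entity, _)| paths.get(entity).cloned())
        .collect()
}
```

The path map is touched only after the filter has run, so entities that don't match never need their paths loaded.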

In contrast, OOP would define this either by inheritance (a model is a path) or containment (a model has a path). When you select any of these, it becomes a relationship that must be maintained indefinitely. When you only have an integer that identifies these components, it's much easier to describe models without a path later. There is no predefined relationship between paths and models. You can have paths without models, or models without paths.

The architecture is approximately similar to database modeling. Components are in-memory tables, albeit they are small and mostly contain a few fields. Entities are numeric primary keys. Systems are insert, query and update mechanisms.

Stores

An XvcStore in its basic definition is a map structure between XvcEntity and a component type T. It has facilities for persistence, iteration, search and filtering. It can be considered a system in the usual ECS sense.

Loading and Saving Stores

As our goal is to track data files with Git, stores save and load binary files' metadata to text files. Instead of storing the binary data itself in Git, Xvc stores information about these files to track whether they are changed. By default, these metadata are persisted to JSON, so component types must be serializable. As they are almost always composed of basic types that [serde] supports, this doesn't pose a difficulty in usage. The JSON files are then committed to Git.

Note that, there are usually multiple branches in Git repositories. Also multiple users may work on the same branch.

When these text files are reused by the stores, they are modified and this may lead to merge conflicts. We don't want our users to deal with merge conflicts with entities and components in text files. This also makes it possible to use binary formats like MessagePack in the future.

Suppose user A made a change in XvcStore<XvcPath> by adding a few files. Another user B made another change to the project, by adding another set of files in another copy of the project. This will lead to merge conflicts:

  • XvcEntity counter will have different values in A and B's repositories.
  • XvcStore<XvcPath> will have different records in A and B's repositories.

Instead of saving and loading monolithic files, XvcStore saves and loads event logs. There are two kinds of events in a store:

  • Add(XvcEntity, T): Adds an element T to a store.
  • Remove(XvcEntity): Removes the element with entity id.

These events are saved into files. When the store is loaded, all files after the last full snapshot are loaded and replayed.

When you add an item to a store, it saves the Add event to a log. These events are then put into a vector. A BTreeMap is also created from this vector.

When an item is deleted, a Remove event is added to the event vector. While loading, stores remove the elements with Remove events from the BTreeMap. So the final set of elements doesn't contain the removed item.
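The replay step can be sketched as follows. The Event enum follows the two event kinds listed above; the Entity alias and the replay function are simplified assumptions made for this illustration and are not Xvc's actual code.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for an entity id.
type Entity = (u64, u64);

/// The two event kinds of a store.
#[derive(Debug, Clone, PartialEq)]
enum Event<T> {
    Add(Entity, T),
    Remove(Entity),
}

/// Replay an event log in order to rebuild the entity -> value map.
fn replay<T: Clone>(events: &[Event<T>]) -> BTreeMap<Entity, T> {
    let mut map = BTreeMap::new();
    for event in events {
        match event {
            Event::Add(entity, value) => {
                map.insert(*entity, value.clone());
            }
            Event::Remove(entity) => {
                map.remove(entity);
            }
        }
    }
    map
}
```

Replaying Add for two entities followed by Remove for the first one leaves a map that contains only the second.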

The second problem with multiple branches is duplicate entities in separate branches. Xvc uses a counter to generate unique entity ids. When a store is loaded, it checks the last entity id in the event log and uses it as the starting point for the counter. But using this counter as is causes duplicate values in different branches. Xvc solves this by adding a random value to these counter values.

Since v0.5, XvcEntity is a tuple of 64-bit integers. The first is loaded from the disk and is an atomic counter. The second is a random value that is renewed at every command invocation. Therefore we have a unique entity id for every run, that's also sortable by the first value. Easy sorting with integers is sometimes required for stable lists.
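The following sketch shows why the random part keeps ids from different branches apart. EntityGenerator is a hypothetical type for this example, and the random value is passed in as a parameter so that the behavior is easy to follow; how Xvc draws and stores these values is not shown here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified stand-in for an entity id: (counter, random part).
type Entity = (u64, u64);

/// Hypothetical generator that hands out ids during one command invocation.
struct EntityGenerator {
    counter: AtomicU64,
    random: u64,
}

impl EntityGenerator {
    /// `last_counter` would come from the event log; `random` is drawn once per run.
    fn new(last_counter: u64, random: u64) -> Self {
        Self { counter: AtomicU64::new(last_counter), random }
    }

    /// The counter advances on every call; the random part stays fixed for the run.
    fn next_entity(&self) -> Entity {
        let next = self.counter.fetch_add(1, Ordering::SeqCst) + 1;
        (next, self.random)
    }
}
```

Two generators that start from the same counter value but hold different random parts produce different ids, and ids from the same generator still sort by the counter.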

Inverted Index

Stores also have an inverted index for quick lookup. It stores the value of T as the key and a list of entities that correspond to this key. For example, when we have a path that we stored, it's a single operation to get the corresponding XvcEntity, and after this, all recorded metadata about this path is available.

All search, iteration and filtering functionality is performed using these two internal maps.

In summary, a store has four components.

  • An immutable log of previous events: Vec<Event<T>>
  • A mutable log of current events: Vec<Event<T>>
  • A mutable map of the current data: BTreeMap<XvcEntity, T>
  • A mutable map of the entities from values: BTreeMap<T, Vec<XvcEntity>>
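As an illustration of the last two components, the inverted index can be derived from the entity map like this. It's a sketch under assumptions: Entity is a plain integer standing in for XvcEntity, and build_inverted_index is a name made up for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for XvcEntity.
type Entity = u64;

/// Build the value -> entities map from the entity -> value map.
fn build_inverted_index<T: Ord + Clone>(
    map: &BTreeMap<Entity, T>,
) -> BTreeMap<T, Vec<Entity>> {
    let mut index: BTreeMap<T, Vec<Entity>> = BTreeMap::new();
    for (entity, value) in map {
        // Several entities may carry the same value, hence the Vec.
        index.entry(value.clone()).or_default().push(*entity);
    }
    index
}
```

When a value maps to more than one entity, as it may after merging branches, the Vec holds all of them in ascending entity order.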

Note that when two branches perform the same operation, the event logs will be different, as the random part of XvcEntity is different. When two branches merge, the inverted index may contain conflicting values. In this case, a fsck command is used to merge the store files and resolve conflicting entity ids.

Insert, update and delete operations affect the mutable log and the maps. Queries, iteration and other non-destructive operations are done with the maps. When loading, all log files are merged into the immutable log. No standard operation touches the saved event logs. All log modifications are done outside of the normal workflow. When saving, only the mutable log is saved. Note that events can only be added to the log; they are never removed. (See xvc fsck --merge-stores for merging store files.)

Relationship Stores

XvcStore keeps component-per-entity. Each component is a flat structure that doesn't refer to other components.

Xvc also has relation stores that represent relationships between entities and components. Similar to the database Entity-Relationship model, there are three kinds of relationship stores:

R11Store<T, U> keeps two sets of components associated with the same entity. It represents a 1-1 relationship between T and U. It contains two XvcStores, one for each component type. These two stores are indexed with the same XvcEntity values. For example, an R11Store<XvcPath, XvcMetadata> keeps track of path metadata for the identical XvcEntity keys.

R1NStore<T, U> keeps parent-child relationships. It represents a 1-N relationship between T and U. On top of two XvcStores, this one keeps track of relationships with a third XvcStore<XvcEntity>. It lists which Us are children of which Ts. For example, a value of XvcPipeline can have multiple XvcSteps. These are represented with R1NStore<XvcPipeline, XvcStep>. This struct has parent-to-child and child-to-parent functions that can be used to get the children of a parent, or the parent of a child element.

The third type is RMNStore<T, U>. This one keeps an arbitrary number of relationships between T and U. Any number of Ts may correspond to any number of Us. This type of store keeps the relationships in two XvcStore<XvcEntity>s.
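To make the 1-N case concrete, here is a simplified sketch. It doesn't use Xvc's types: plain BTreeMaps stand in for the XvcStores, Entity is an integer, and the R1N struct and its method names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for XvcEntity.
type Entity = u64;

/// Simplified 1-N relation: two component maps plus a child -> parent map.
struct R1N<T, U> {
    parents: BTreeMap<Entity, T>,
    children: BTreeMap<Entity, U>,
    child_parent: BTreeMap<Entity, Entity>,
}

impl<T, U> R1N<T, U> {
    fn new() -> Self {
        Self {
            parents: BTreeMap::new(),
            children: BTreeMap::new(),
            child_parent: BTreeMap::new(),
        }
    }

    /// Record `child` as a child of `parent`, adding the parent if needed.
    fn insert(&mut self, parent: (Entity, T), child: (Entity, U)) {
        self.parents.entry(parent.0).or_insert(parent.1);
        self.children.insert(child.0, child.1);
        self.child_parent.insert(child.0, parent.0);
    }

    /// Parent-to-child lookup: all children registered under `parent`.
    fn children_of(&self, parent: Entity) -> Vec<(Entity, &U)> {
        self.child_parent
            .iter()
            .filter(|(_, p)| **p == parent)
            .filter_map(|(c, _)| self.children.get(c).map(|u| (*c, u)))
            .collect()
    }

    /// Child-to-parent lookup: the single parent of `child`, if any.
    fn parent_of(&self, child: Entity) -> Option<(Entity, &T)> {
        let p = self.child_parent.get(&child)?;
        self.parents.get(p).map(|t| (*p, t))
    }
}
```

Keying the relation map by the child entity enforces the 1-N constraint: a child can have only one parent, while a parent can appear as the value for many children.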

Xvc Pipelines State Machine

Xvc pipelines use a state machine to track the progress of each step. Each step has a state that is updated as the pipeline is executed.

stateDiagram-v2
    [*] --> Begin
    Begin --> DoneWithoutRunning: RunNever
    Begin --> WaitingDependencySteps: RunConditional
    WaitingDependencySteps --> WaitingDependencySteps: DependencyStepsRunning
    WaitingDependencySteps --> CheckingMissingDependencies: DependencyStepsFinishedSuccessfully
    WaitingDependencySteps --> Broken: DependencyStepsFinishedBroken
    WaitingDependencySteps --> CheckingMissingDependencies: DependencyStepsFinishedBrokenIgnored
    CheckingMissingDependencies --> CheckingMissingDependencies: MissingDependenciesIgnored
    CheckingMissingDependencies --> Broken: HasMissingDependencies
    CheckingMissingDependencies --> CheckingMissingOutputs: NoMissingDependencies
    CheckingMissingOutputs --> CheckingMissingOutputs: MissingOutputsIgnored
    CheckingMissingOutputs --> CheckingTimestamps: NoMissingOutputs
    CheckingMissingOutputs --> WaitingToRun: HasMissingOutputs
    CheckingTimestamps --> CheckingTimestamps: TimestampsIgnored
    CheckingTimestamps --> CheckingDependencyContentDigest: HasNoNewerDependencies
    CheckingTimestamps --> WaitingToRun: HasNewerDependencies
    CheckingDependencyContentDigest --> WaitingToRun: ContentDigestIgnored
    CheckingDependencyContentDigest --> DoneWithoutRunning: ContentDigestNotChanged
    CheckingDependencyContentDigest --> WaitingToRun: ContentDigestChanged
    DoneWithoutRunning --> Done: CompletedWithoutRunningStep
    WaitingToRun --> WaitingToRun: ProcessPoolFull
    WaitingToRun --> Running: StartProcess
    WaitingToRun --> Broken: CannotStartProcess
    Running --> Running: WaitProcess
    Running --> Broken: ProcessTimeout
    Running --> Done: ProcessCompletedSuccessfully
    Running --> Broken: ProcessReturnedNonZero
    Broken --> Broken: HasBroken
    Done --> Done: HasDone
    Done --> [*]
    Broken --> [*]

A step starts in the Begin state. It must wait for all its dependency steps if --when is set to by_dependencies (the default) in xvc pipeline step new or xvc pipeline step update. If this option is set to never, the step never runs and moves to the DoneWithoutRunning state right after Begin. If this option is set to always, the step runs regardless of changes in the dependencies and moves to the WaitingDependencySteps state even if dependencies are missing, broken, or unchanged.

If --when option is set to by_dependencies, the steps check the following conditions before running:

  • All dependency steps must be in the Done state.
  • There should be no missing dependency files.
  • There should be no broken dependency processes.
  • Dependency files should be newer, or the content digest should be different from the step outputs.

If all of these conditions are met, the step moves to the WaitingToRun state, as shown in the diagram above.
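The diagram above can be read as a transition table. The sketch below encodes it as a function, using the state and event names from the diagram. It's an illustration only: the self-loops (for example ProcessPoolFull or WaitProcess) are left out to keep it short, and the next_state function is a name made up for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Begin,
    WaitingDependencySteps,
    CheckingMissingDependencies,
    CheckingMissingOutputs,
    CheckingTimestamps,
    CheckingDependencyContentDigest,
    DoneWithoutRunning,
    WaitingToRun,
    Running,
    Broken,
    Done,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    RunNever,
    RunConditional,
    DependencyStepsFinishedSuccessfully,
    DependencyStepsFinishedBroken,
    DependencyStepsFinishedBrokenIgnored,
    HasMissingDependencies,
    NoMissingDependencies,
    NoMissingOutputs,
    HasMissingOutputs,
    HasNoNewerDependencies,
    HasNewerDependencies,
    ContentDigestIgnored,
    ContentDigestNotChanged,
    ContentDigestChanged,
    CompletedWithoutRunningStep,
    StartProcess,
    CannotStartProcess,
    ProcessTimeout,
    ProcessCompletedSuccessfully,
    ProcessReturnedNonZero,
}

/// Return the next state, or None if the diagram has no such transition.
fn next_state(state: State, event: Event) -> Option<State> {
    let next = match (state, event) {
        (State::Begin, Event::RunNever) => State::DoneWithoutRunning,
        (State::Begin, Event::RunConditional) => State::WaitingDependencySteps,
        (State::WaitingDependencySteps, Event::DependencyStepsFinishedSuccessfully) => State::CheckingMissingDependencies,
        (State::WaitingDependencySteps, Event::DependencyStepsFinishedBroken) => State::Broken,
        (State::WaitingDependencySteps, Event::DependencyStepsFinishedBrokenIgnored) => State::CheckingMissingDependencies,
        (State::CheckingMissingDependencies, Event::HasMissingDependencies) => State::Broken,
        (State::CheckingMissingDependencies, Event::NoMissingDependencies) => State::CheckingMissingOutputs,
        (State::CheckingMissingOutputs, Event::NoMissingOutputs) => State::CheckingTimestamps,
        (State::CheckingMissingOutputs, Event::HasMissingOutputs) => State::WaitingToRun,
        (State::CheckingTimestamps, Event::HasNoNewerDependencies) => State::CheckingDependencyContentDigest,
        (State::CheckingTimestamps, Event::HasNewerDependencies) => State::WaitingToRun,
        (State::CheckingDependencyContentDigest, Event::ContentDigestIgnored) => State::WaitingToRun,
        (State::CheckingDependencyContentDigest, Event::ContentDigestNotChanged) => State::DoneWithoutRunning,
        (State::CheckingDependencyContentDigest, Event::ContentDigestChanged) => State::WaitingToRun,
        (State::DoneWithoutRunning, Event::CompletedWithoutRunningStep) => State::Done,
        (State::WaitingToRun, Event::StartProcess) => State::Running,
        (State::WaitingToRun, Event::CannotStartProcess) => State::Broken,
        (State::Running, Event::ProcessTimeout) => State::Broken,
        (State::Running, Event::ProcessCompletedSuccessfully) => State::Done,
        (State::Running, Event::ProcessReturnedNonZero) => State::Broken,
        _ => return None,
    };
    Some(next)
}
```

Returning None for pairs that are not in the table makes an unexpected event visible to the caller instead of silently keeping the old state.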

Comparisons

To avoid unnecessary work, we need to find differences across versions. What has changed between the previous version and this version of type T?

Xvc is built bottom up, with vertical, long functions that do one thing. For example, xvc file track is written separately from xvc file recheck, and the commonalities have arisen after these implementations.

I didn't start from traits and try to fit everything to a model. Instead, I began from concrete enums and structs. Then I saw that some of these share common functionality, and decided to group it into a trait after implementing and refactoring the concrete functions.

I saw the diff pattern across all comparison functions. In xvc pipeline, dependencies need to detect changes to decide whether to invalidate them. In xvc file, files and directories need to detect changes to decide whether they should be carried into the cache.

It's easy to make comparison/subtraction when the data types are numeric. For a signed integer, you can get a single numeric value as diff with diff = a - b. For complex data structures, representing the change is not straightforward.

We keep track of everything in the repository in stores. These serialize a type T to a file, and get it back when needed. The diff pattern works with these types. Sometimes, there is no record of something we have in the repository. Sometimes, we have only the record, and not the actual thing on disk. The diff should also handle these cases.

Instead of trying to come up with wizardry, we decided to represent this with five conditions.

  • Identical: When two things of the same type T are equal. Nothing has changed between the actual version and its record.

  • RecordMissing { actual: T }: We have something in the workspace, but can't find the respective record. For example, a new file is added to the workspace, and xvc file track detects it for the first time.

  • ActualMissing { record: T }: We found a record in the store, but the corresponding file in the workspace is not where it should be. For example, a tracked file is deleted by the user, but the record is still there.

  • Difference { record: T, actual: T }: There is a record, but the actual file in workspace isn't identical with it. When a tracked file is changed, and its content now returns a different value, this can be reflected with Difference.

  • Skipped: When the comparison seems unnecessary or irrelevant. For example, if we know a file hasn't changed by checking its metadata. In this case, we don't calculate its content digest and set it to Skipped.

These five conditions are represented in the Diff type.

As an entity may have more than one component, a comparison may require multiple Diffs. For example, we may want to compare an XvcPath, to see whether it has changed. This requires comparing its XvcMetadata, its ContentDigest if it's a file, its CollectionDigest if it's a directory, etc.
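The five conditions map naturally to an enum. The following is a sketch that mirrors the variant names listed above; the diff function and its treatment of the "neither record nor actual" case as Skipped are assumptions made for the example, not a description of Xvc's code.

```rust
/// The five outcomes of comparing a stored record with the actual value.
#[derive(Debug, Clone, PartialEq)]
enum Diff<T> {
    Identical,
    RecordMissing { actual: T },
    ActualMissing { record: T },
    Difference { record: T, actual: T },
    Skipped,
}

/// Compare an optional record with an optional actual value.
fn diff<T: PartialEq>(record: Option<T>, actual: Option<T>) -> Diff<T> {
    match (record, actual) {
        // Assumption: with nothing on either side there is nothing to compare.
        (None, None) => Diff::Skipped,
        (None, Some(actual)) => Diff::RecordMissing { actual },
        (Some(record), None) => Diff::ActualMissing { record },
        (Some(record), Some(actual)) if record == actual => Diff::Identical,
        (Some(record), Some(actual)) => Diff::Difference { record, actual },
    }
}
```

For an entity with several components, one such Diff can be computed per component, for example one for the metadata and one for the content digest.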

Storages

Xvc uses storages to store content of the files. These storages are different from Git remotes. They don't contain Git history of a repository, but they can store contents of the files tracked by Xvc.

A storage uses the same content-addresses used in Xvc cache to store the files. For example, if there is a file in Xvc repository that points to /b3/1886572424...defa/0.png in local cache, this path will be used to identify the content in storage as well.

Additionally, Xvc stores storage event logs that list which operations are performed on that storage. By using these event logs, it's possible to identify what has happened in a storage without checking the file lists. These event logs are also shared with other users, so a user can identify which files are present in a storage even without a connection.

Basic Operations

All storages should support the following operations:

  • Init to initialize a storage.
  • List to list the files available in the storage.
  • Send to upload files from local cache to a storage.
  • Receive to download files from a storage to local cache.
  • Delete to delete files from a storage.

All these operations record a distinct event to the event log.

Each record contains the event type, the GUID of the storage and the event content.

Event contents are like the following:

  • Init creates the necessary directories and the GUID file in a storage.
  • List includes the listing retrieved from the storage. Once a list is retrieved from the storage, it's available for local operations. The most recent list is the starting point to determine the files available in a storage.
  • Send event contains the affected paths. These paths are added to the storage file list.
  • Receive event contains the affected paths. These paths are added to the storage file list.
  • Delete event contains the deleted paths; multiple files can be deleted at once. These paths are removed from the storage file list.
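The way a file list can be derived from these events, without contacting the storage, can be sketched as follows. The StorageEvent enum and the storage_files function are hypothetical names for this illustration, the storage GUID is omitted for brevity, and the paths in any usage are made up.

```rust
use std::collections::BTreeSet;

/// Simplified storage events; each variant carries the affected paths.
#[derive(Debug, Clone)]
enum StorageEvent {
    Init,
    List(Vec<String>),
    Send(Vec<String>),
    Receive(Vec<String>),
    Delete(Vec<String>),
}

/// Replay the event log to find which files the storage is known to hold.
fn storage_files(events: &[StorageEvent]) -> BTreeSet<String> {
    let mut files = BTreeSet::new();
    for event in events {
        match event {
            // A freshly initialized storage has no files.
            StorageEvent::Init => files.clear(),
            // A listing replaces what we knew before.
            StorageEvent::List(paths) => {
                files = paths.iter().cloned().collect();
            }
            // Sent and received paths are known to be in the storage.
            StorageEvent::Send(paths) | StorageEvent::Receive(paths) => {
                files.extend(paths.iter().cloned());
            }
            StorageEvent::Delete(paths) => {
                for path in paths {
                    files.remove(path);
                }
            }
        }
    }
    files
}
```

Since the event logs are shared along with the repository, any user can run this kind of replay locally to see what a storage should contain.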

Storage types

Local Storages

A local storage is a directory in the local file system. It may be a mount point shared with others, or another disk that you use for backups and sharing.

  • Init uses std::fs::copy to copy the GUID file to the appropriate directory
  • List uses std::fs::read_dir.
  • Send uses std::fs::copy with rayon.
  • Receive uses std::fs::copy with rayon.
  • Delete uses std::fs::remove_file with rayon.

Generic Storages

These storages define commands for each of the operations listed above. They allow running external programs such as rsync, rclone, or s5cmd. For such storages, commands for the above operations must be defined, and they will be run in separate processes.

This storage type offloads the responsibility of exact operations to the user.

The user is expected to supply values for the following variables:

  • {URL}: The URL for the storage. This can be anything the commands to send/receive/list will accept. It's used to build the paths with fewer repetitions.

  • {STORAGE_DIR}: You can separate the storage directory.

  • {PATH}: This is set by Xvc for each individual command. It's a relative path to the local cache directory.

  • {PROCESS_POOL_SIZE}: This value is used to set the number of processes to perform operations. Setting this to 1 makes all operations sequential.

  • List Command: A command to list the {URL}. For example, rsync --list-only {URL}{STORAGE_DIR}

  • Send Command: A command to send a file to {URL}{STORAGE_DIR}. It can use {URL} and should use {PATH} in the command. An example may be rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}

  • Receive Command: A command to receive a file from a storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: rsync -a {URL}{STORAGE_DIR}{PATH} {PATH}

  • Delete Command: A command to delete a file from the storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: ssh {URL} "rm {STORAGE_DIR}{PATH}"

Generic storages use these commands to create multiple processes to send/receive/delete files. It's not as fast as using other types because of the overhead involved, but its flexibility is useful.
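Filling the variables into a command can be pictured as plain string substitution. This is a minimal sketch: fill_template is a hypothetical helper, and the URL, directory and path values in the usage below are made up.

```rust
/// Replace the {URL}, {STORAGE_DIR} and {PATH} placeholders in a command
/// template. The substitutions run one after another, so a value that itself
/// contains a later placeholder would be expanded again; real values are
/// assumed not to do that.
fn fill_template(template: &str, url: &str, storage_dir: &str, path: &str) -> String {
    template
        .replace("{URL}", url)
        .replace("{STORAGE_DIR}", storage_dir)
        .replace("{PATH}", path)
}
```

For example, with the send command above, a URL of user@example.com:, a storage directory of /srv/xvc/ and a path of b3/aa/0.png, the template rsync -a {PATH} {URL}{STORAGE_DIR}{PATH} becomes rsync -a b3/aa/0.png user@example.com:/srv/xvc/b3/aa/0.png.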

Git and Xvc

Xvc aims to fill the gap Git leaves for certain workflows. These workflows involve large binary data that shouldn't be replicated in each repository.

Xvc tracks all its metadata on top of Git. In most cases, Xvc assumes the presence of a Git repository where the user tracks the history, text files, and metadata. However, the relationship between these should be clear and separate.

Xvc doesn't (and shouldn't) use Git more than a user could use manually. Our aim is not to replace Git operations with Xvc operations or tamper with the internal structure of the Git repository. When Xvc uses Git to track ECS or other metadata, the operations must be separate and sandwich Xvc operations.

  • Any Git operation that involves checking out commits, branches, tags, or other references must come before any Xvc operation. As Xvc relies on the files tracked by Git, restoring the state needed for Xvc operations should be complete before these operations start.

  • Xvc helps to stage and commit certain files in .xvc/ to Git. By default, any state-changing operation in Xvc adds a commit to Git.

  • Xvc also helps to store this changed metadata in a new or existing branch. In this case, a checkout must be done before Xvc records the files.

sequenceDiagram
    User ->> Xvc: xvc --from-ref my-branch --to-branch another-branch file track large-dir/
    Xvc ->> Git: git checkout my-branch
    Git ->> Xvc: branch = my-branch
    Xvc->> xvc-file: track large-dir/
    xvc-file ->> Xvc: Ok. Saved the stores and metadata.
    Xvc ->> Git: Do we have user staged files?
    Git ->> Xvc: Yes. This and this.
    Xvc ->> Git: Stash them. 
    Git ->> Xvc: Stashed user staged files. 
    Xvc ->> Git: git checkout -b another-branch
    Git ->> Xvc: branch = another-branch
    Xvc ->> Git: git add .xvc/
    Git ->> Xvc: added .xvc/
    Xvc ->> Git: git commit -m "Commit after xvc file track"
    Xvc ->> Git: Unstash files that we have stashed

Note that if the user has some already staged files, these are stashed and unstashed to the requested branch. This is a side effect of doing xvc commit operations on behalf of the user. The other option is to report an error and quit if the user has the --to-branch option set. The behavior may change in the future. For the time being, we will keep this stash-unstash operation for the user files.

One other issue is the library that we're going to use. I checked several options when I was writing auto-commit functionality.

At that time, I found that the number of Git operations for each Xvc operation is less than five. These can be done by creating a Git process. The libraries are not 100% identical to Git in features. Even libgit2, the most widely used one, didn't provide shallow clones or an equivalent of git stash --staged.

The second reason for this is explainability. Instead of trying to explain to the user what we are doing with Git, we can report the commands we are running. The library interfaces are different from Git CLI. They need to be learned before reading the code. Using Git CLI is more dependable, observable, and understandable than trying to come up with a set of library calls.
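The approach of reporting and then running Git commands can be sketched with the standard library alone. This is an illustration of the idea, not Xvc's code: describe_git and run_git are hypothetical helper names, and the rendered command line is for display only (arguments with spaces are not quoted).

```rust
use std::io;
use std::process::{Command, Output};

/// Render the Git command line as it would be reported to the user.
/// Display only: arguments containing spaces are not quoted here.
fn describe_git(args: &[&str]) -> String {
    format!("git {}", args.join(" "))
}

/// Report the command on stderr, then run it in a separate Git process
/// inside `repo_dir` and return its captured output.
fn run_git(repo_dir: &str, args: &[&str]) -> io::Result<Output> {
    eprintln!("running: {}", describe_git(args));
    Command::new("git").current_dir(repo_dir).args(args).output()
}
```

With helpers like these, every Git operation in the sequence above is a line the user could copy and run by hand, which is the observability argument made here.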

Concepts

  • Digest: A digest is a 32-byte numeric sequence to identify a file, content or any other data. Xvc uses different algorithms to generate this sequence.
  • Associated Digest: This is a specific kind of digest associated with an entity. An entity can have more than one digest, like a content digest or a metadata digest. Xvc uses these different kinds of digests to avoid unnecessary digest calculations.
  • Recheck: Recheck is the process of linking a file to its copy in Xvc cache. Xvc uses different methods to recheck a file, like copy, symlink, hardlink or reflink.
  • Workspace: A project is broadly divided into 3 different types of directories. .xvc/ contains the cache and metadata of the tracked files and pipelines, .git/ contains the git repository and the workspace contains the files that are tracked by either Xvc or git. It's the place where you do your work.
  • Carry-In: Carry-in is the process of adding a new version of a file to Xvc cache. It's analogous to git commit.

Digest

A numerical summary of an entity. In Xvc, digests are 32 bytes long and are produced by BLAKE3 by default.

See Associated Digest for different types of digests.

Associated Digest

There may be multiple digests associated with an entity like path, directory or dependency. An associated digest is all digests associated with an entity.

Metadata Digest

Files and directories have metadata. Metadata shows information about the creation, modification and access times of a file, or its size. Metadata is OS dependent in most cases. Xvc abstracts file and directory metadata with the XvcMetadata struct. The metadata digest represents this abstraction in 32 bytes to compare changes in files and directories.

Content Digest

The content digest of a file is calculated from the data it contains. It's a 32-byte value derived from the content. When the content changes, the digest also changes.

Collection Digest

Some entities in Xvc are composed of multiple elements. Examples are directories (composed of files), file lines, regex filter results, SQL query results, etc. Instead of trying to compare all elements, Xvc creates a 32-byte digest of the collection with the same conditions. For example, when a new file is added to a directory, its collection digest also changes. This makes it easier to keep track of changed directories than comparing their members one by one.

Development

Code and Documentation Conventions

  • Xvc is spelled capitalized in documentation. It's Xvc, not XVC, not xvc.