Introduction to Xvc
Xvc is a command line utility to track large files with Git, define dependencies between files to run commands when only these dependencies change, and run experiments by making small changes in these files for later comparison.
It's used mostly in Machine Learning scenarios where data and model files are large, code files depend on these and experiments must be compared via various metrics.
Xvc can use S3 and compatible cloud storages to upload tracked files with their exact version and can retrieve these later. This allows you to delete them from the project when they are not needed to save space, and get them back when needed. This facility can also be used for sharing these files. You can just clone the Git repository and get only the necessary Xvc-tracked files.
Xvc tracks files, directories and other elements by calculating their digests. These digests are used as addresses to store and locate the files in the cache and storages. When you make a change to a file, it gets a new digest and the changed version has a new address. This makes sure that all versions can be retrieved on demand.
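The idea of using a digest as an address can be sketched in a few lines of Python. This is only an illustration, not Xvc's implementation: BLAKE2b stands in for BLAKE3 (which is not in the Python standard library), and the directory layout and the `store` helper are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_address(data: bytes) -> str:
    """Return a hex digest that serves as the address of the content.

    Assumption: BLAKE2b is a stand-in for BLAKE3 here.
    """
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def store(cache_root: Path, data: bytes) -> Path:
    """Write data under a path derived from its digest and return the path."""
    digest = content_address(data)
    # Hypothetical layout: the first 3 hex characters form a subdirectory.
    target = cache_root / digest[:3] / digest[3:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content maps to the same address
        target.write_bytes(data)
    return target


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    v1 = store(cache, b"label,score\ncat,0.9\n")
    v2 = store(cache, b"label,score\ncat,0.95\n")  # edited file: new address
    v1_again = store(cache, b"label,score\ncat,0.9\n")

    same_address = v1 == v1_again
    different_address = v1 != v2
    both_kept = v1.exists() and v2.exists()
    print(same_address, different_address, both_kept)
```

Because the edited file lands at a new address, the earlier version stays in place and can still be retrieved.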
Xvc can be used as a make replacement to build multi-file projects with complex dependencies. Unlike make, which detects file changes with timestamps, Xvc checks the files via their content. This reduces false positives when detecting changes.
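The difference between the two strategies shows up when a file is touched without being edited. The sketch below is a generic illustration, not Xvc's code. SHA-256 and the file name `main.c` are arbitrary choices for the example.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """Digest of the file's bytes (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "main.c"  # hypothetical source file
    source.write_text("int main(void) { return 0; }\n")

    old_mtime = source.stat().st_mtime
    old_digest = content_digest(source)

    # Simulate `touch`: bump the modification time, leave the bytes alone.
    os.utime(source, (old_mtime + 60, old_mtime + 60))

    changed_by_timestamp = source.stat().st_mtime != old_mtime
    changed_by_content = content_digest(source) != old_digest

    print(changed_by_timestamp, changed_by_content)
```

A timestamp-based check would schedule a rebuild here, while a content-based check sees that nothing changed.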
Xvc pipelines are used to define steps to reach a set of outputs. These steps have commands to run and may (or may not) produce intermediate outputs that other steps depend on. Xvc pipelines allow steps to depend on other steps, other pipelines, text and binary files, directories, globs that select a subset of files, certain lines in a file, certain regular expression results, URLs, and (hyper)parameter definitions in YAML, JSON or TOML files as of now. More dependency types like environment variables, database tables and queries, S3 buckets, REST query results, generic CLI command results, Bitcoin wallets and Jupyter notebook cells are in the plans.
For example, Xvc can be used to create a pipeline that depends on certain files in a directory via a glob, and a parameter in a YAML file to update a machine learning model. The same feature can be used to build software when the code or artifacts used in the software change. This allows binary outputs (as well as code inputs) to be tracked in Xvc. Instead of building everything from scratch in a new Git clone, a software project can reuse only the portions that require a rebuild. Binary distributions become much simpler.
This book is used as the documentation of the project. It is a work in progress, as Xvc is, and may contain outdated information. Please report any errors and bugs, as with the rest of the project.
Common Usage Examples
🔽 Installation
You can get the binary files for Linux, macOS, and Windows from the releases page. Extract the archive and copy the file to a directory in your $PATH.
Alternatively, if you have Rust installed, you can build xvc:
$ cargo install xvc
If you want to use Xvc with Python console and Jupyter notebooks, you can also install it with pip:
$ pip install xvc
Note that pip installation doesn't make xvc available as a shell command.
Please see for details.
Xvc supports dynamic completions for bash, zsh, elvish, fish and powershell. For example, run the following to add completions for bash:
echo "source <(COMPLETE=bash xvc)" >> ~/.bashrc
See the completions section in the docs for others.
🚀 Initialize a directory for Xvc
$ xvc init
This command initializes the .xvc/ directory and adds a .xvcignore file for specifying paths you wish to hide from Xvc.
💡 Git is not required to run Xvc. However, running Xvc with Git is usually a good idea. Xvc can stage/commit metadata files (under .xvc/) used to track binary files, and you can use branches for versioning as well. By default, you won't have to deal with Git commands to commit these metadata files. Xvc can manage the files it updates and hides your binary files from Git by default. If you don't want to use Xvc with Git, use the corresponding option when initializing (see xvc init --help).
👣 Track binary files
Add your data files and directories for tracking:
$ xvc file track my-data/
This command calculates content hashes for data (using BLAKE-3, by default) and records them. Files are moved to content-addressed directories under .xvc/b3. Then they are copied to the workspace.
💡Tip: You can specify different recheck (checkout) methods for files and directories depending on your use case. Symlinks and hardlinks to the files under Xvc cache don't consume additional space but they are read-only. You can also use (copy-on-write) reflinks if your file system supports it and Xvc is built with the reflink feature.
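The trade-offs between recheck methods follow from how the operating system represents each kind of link. The following Python sketch uses a hypothetical `recheck` helper with standard library calls to show the difference. It is not how Xvc implements recheck, and reflinks are left out because the standard library has no portable call for them.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, workspace: Path, method: str) -> None:
    """Make a cached file visible at a workspace path (illustrative only)."""
    if method == "copy":
        shutil.copy2(cached, workspace)  # independent copy, uses extra space
    elif method == "symlink":
        os.symlink(cached, workspace)    # pointer to the cached file
    elif method == "hardlink":
        os.link(cached, workspace)       # second name for the same data
    else:
        raise ValueError(f"unknown method: {method}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cache-entry.bin"
    cached.write_bytes(b"x" * 1024)

    for method in ("copy", "symlink", "hardlink"):
        recheck(cached, root / f"data-{method}.bin", method)

    is_link = (root / "data-symlink.bin").is_symlink()
    shares_data = os.path.samefile(cached, root / "data-hardlink.bin")
    copy_is_separate = not os.path.samefile(cached, root / "data-copy.bin")
    print(is_link, shares_data, copy_is_separate)
```

Since a symlink or hardlink points at the cached data itself, writing through it would alter the cache, which is why such links are kept read-only.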
🫧 Checkout a subset of files as symlinks
You can copy and recheck (checkout) subsets of files from Xvc cache as symlinks to create multiple views. This is useful when you need read-only access that won't consume additional space.
$ xvc file copy my-data/ another-view-to-my-data/
$ xvc file recheck another-view-to-my-data/ --as symlink
xvc file copy and xvc file move don't require file contents to be available. Xvc works only with their metadata, so you can organize files without their content being copied to the workspace or cache.
💡 If you installed completions to your shell, Xvc completes file names even if they are not available in your local paths.
🌁 Send files to the cloud services
Configure a cloud storage to share the files you track with Xvc.
$ xvc storage new s3 --name my-storage --region us-east-1 --bucket-name xvc
You can send the files to this storage.
$ xvc file send --to my-storage
You can also send a subset of the files.
$ xvc file send 'my-data/training/*' --to my-storage
Xvc supports external directories, Rsync, AWS S3, Google Cloud Storage, MinIO, Cloudflare R2, Wasabi, Digital Ocean Spaces. Please create an issue if you want Xvc to support another cloud storage service.
💡 Xvc also supports any command to upload/download files. If your favorite service is not listed or you want to use another tool (s5cmd, rclone, etc.), you can specify a generic storage by supplying shell commands to upload and download.
📌 Important: Xvc never stores credentials to your connections and expects them to be available in the environment. It never makes network requests (for tracking, statistics, etc.) without your knowledge. You can compile without cloud connection support in case you want to make sure that it makes no connections to outside services.
🪣 Get files from cloud services
When you (or someone else) want to access these files later, you can clone the Git repository and get the files from the storage.
$ git clone
Cloning into 'my-machine-learning-project'...
$ cd my-machine-learning-project
$ xvc file bring my-data/ --from my-storage
This approach ensures convenient access to files from the shared storage when needed.
💡Tip: You don't have to reconfigure the storage after cloning, but you need to have valid credentials as environment variables to access the storage. Xvc never stores any credentials.
🫖 Share files from cloud storages for a limited time
You can share Xvc tracked files from S3 compatible storages for a specified period.
$ xvc file share --storage my-storage dir-0001/file-0001.bin --duration 1h
You can share the link with others and they will be able to access the file for an hour. The default period is 24 hours.
🥤Create a data pipeline
Suppose you have a script to preprocess files in a directory and you want to run this when the files in the my-data/train directory change. We first define a step in the pipeline that will run the script.
$ xvc pipeline step new --step-name preprocess --command 'python3 src/'
Each command is associated with a step and each step has a command.
🔗 Add a dependency to a pipeline step
When we want to create a dependency for a command, we use the xvc pipeline step dependency command with various parameters.
We want to define two dependencies for the preprocess step we created previously.
We'll make the preprocess step depend on:
- The source file itself, so when we change the script, we'll run the step again:
$ xvc pipeline step dependency --step-name preprocess --file src/
- The files that the script works on:
$ xvc pipeline step dependency -s preprocess --glob 'data/raw/*jpg'
⚠️ Most shells expand globs before running the command, so you need to quote globs to pass them as strings without expansion. Xvc expands these globs itself.
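Quoting matters because the tool, not the shell, is meant to receive the pattern and match it against the paths it knows about. A rough Python illustration follows. The path list and the `expand_glob` helper are hypothetical, and `fnmatch` is a simplification whose `*` also matches `/`, unlike most shell globs.

```python
from fnmatch import fnmatch


def expand_glob(pattern: str, tracked_paths: list[str]) -> list[str]:
    """Match a quoted glob pattern against a list of known paths."""
    return sorted(path for path in tracked_paths if fnmatch(path, pattern))


# Hypothetical list of paths known to the tool. They don't have to exist
# on the local disk for the pattern to match them.
tracked_paths = [
    "data/raw/cat-001.jpg",
    "data/raw/cat-002.jpg",
    "data/raw/notes.txt",
    "data/processed/cat-001.jpg",
]

matches = expand_glob("data/raw/*jpg", tracked_paths)
print(matches)
```

If the shell had expanded the pattern first, the command would have received whatever file names happen to exist locally instead of the pattern itself.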
🛝 Run pipeline
After you define the pipeline, you can run it by:
$ xvc pipeline run
[DONE] preprocess (python3 src/
[OUT] [preprocess]
[DONE] preprocess (python3 src/
💡 Xvc runs pipeline steps in parallel if they are not interdependent. You can specify the maximum number of parallel processes.
🪡 Add fine grained dependencies to steps
Xvc allows many kinds of dependencies:
- Steps can explicitly depend on other steps when they are required to run serially.
- Steps can depend on single files or groups of files defined by globs. For globs, you can also find out which files are added, deleted or updated with glob-items.
💡 Similar to Git, Xvc doesn't track directories per se. You can define a glob dependency that describes the files in a directory when you want to track all files in it.
- You can specify steps to depend only on a subset of lines in a file with line ranges or regular expressions. You can also find out which lines are added, deleted or updated with the more granular line-items or regex-items dependencies.
- If you track (hyper)parameters for the building/model training process in JSON or YAML files, you can specify steps to depend on these parameters.
- If you want your steps to run when an HTTP(S) URL's content changes, you can specify this with URL dependencies.
- If you want your step to run when the output from an SQLite query changes, you can specify it with SQLite dependencies.
- If none of the dependency types fit your needs, you can also specify a command that will be run to check if a step is invalidated.
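To make the "added, deleted or updated" distinction for glob items concrete, here is a minimal Python sketch that compares two snapshots of the files matched by a glob. The snapshot format (a path to digest mapping) and the `diff_glob_items` helper are assumptions for the example, not Xvc's internal representation.

```python
def diff_glob_items(previous: dict[str, str], current: dict[str, str]):
    """Compare two {path: digest} snapshots of the files a glob matched."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    updated = sorted(
        path
        for path in current.keys() & previous.keys()
        if current[path] != previous[path]
    )
    return added, deleted, updated


# Hypothetical snapshots taken at the last run and now.
last_run = {"data/a.jpg": "d1", "data/b.jpg": "d2", "data/c.jpg": "d3"}
this_run = {"data/a.jpg": "d1", "data/b.jpg": "d9", "data/d.jpg": "d4"}

added, deleted, updated = diff_glob_items(last_run, this_run)
print(added, deleted, updated)
```

A step command can then process only the files in the added or updated sets instead of everything the glob matches.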
🖇️ Example to add a dependency when only certain lines in a file change
Suppose you have a list of IQ scores in a file.
Ada Harris,128
Alan Thompson,125
Brian Shaffer,122
Brian Wilson,94
Dr. Brittany Chang,103
Brittany Smith,104
David Brown,113
Emily Davis,97
Grace White,130
James Taylor,101
Dr. Jane Doe,105
Jessica Lee,102
John Smith,110
Laura Martinez,110
Dr. Linus Martin,118
Mallory Johnson,105
Mallory Payne MD,99
Margaret Clark,122
Michael Johnson,92
Robert Anderson,105
Sarah Wilson,104
Sherry Brown,115
Sherry Leonard,117
Susan Davis,107
Dr. Susan Swanson,132
We're only interested in the IQ scores of those with Dr. in front of their names. Let's create a regex search dependency to run a command when only a line with a Dr. title is added to the file.
Our command will be collecting all lines with an initial Dr. to another file.
$ xvc pipeline step new --step-name dr-iq --command 'echo "${XVC_ADDED_REGEX_ITEMS}" >> dr-iq-scores.csv '
$ xvc pipeline step dependency --step-name dr-iq --regex-items 'iq-scores.csv:/^Dr\..*'
The first line specifies a command that, when run, appends the ${XVC_ADDED_REGEX_ITEMS} environment variable to dr-iq-scores.csv. The second line specifies the dependency, which will also populate the environment variable in the command.
Some dependency types like regex items, line items and glob items inject environment variables to the shells running the step commands. If you have thousands of files specified by a glob, but want to run a script only on the added files after the last run, you can use these environment variables.
When you run the pipeline, a file named dr-iq-scores.csv will be created.
$ xvc pipeline run
[DONE] dr-iq (echo "${XVC_ADDED_REGEX_ITEMS}" >> dr-iq-scores.csv )
$ cat dr-iq-scores.csv
Dr. Brittany Chang,103
Dr. Jane Doe,105
Dr. Linus Martin,118
Dr. Susan Swanson,132
When the file changes, e.g. when another line matching the dependency regex is added to the iq-scores.csv file, the command appends the new line to dr-iq-scores.csv.
$ zsh -cl 'echo "Dr. John Doe,123" >> iq-scores.csv'
$ xvc pipeline run
[DONE] dr-iq (echo "${XVC_ADDED_REGEX_ITEMS}" >> dr-iq-scores.csv )
$ cat dr-iq-scores.csv
Dr. Brittany Chang,103
Dr. Jane Doe,105
Dr. Linus Martin,118
Dr. Susan Swanson,132
Dr. John Doe,123
Note that ${XVC_ADDED_REGEX_ITEMS} has only the added lines, not all of the lines the regex matches. So, we can just work on the added elements, without rerunning the commands for all matching elements.
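The behavior described in this example can be approximated in Python as follows. The `added_regex_items` helper is a hypothetical sketch of the idea, not Xvc's implementation. It assumes the matching lines from the previous run are remembered and compared with the current ones.

```python
import re


def added_regex_items(pattern, previous_lines, current_lines):
    """Return matching lines in current_lines that did not match before."""
    regex = re.compile(pattern)
    seen = {line for line in previous_lines if regex.search(line)}
    return [
        line
        for line in current_lines
        if regex.search(line) and line not in seen
    ]


previous = ["Ada Harris,128", "Dr. Brittany Chang,103", "Dr. Jane Doe,105"]
current = previous + ["Dr. John Doe,123", "Emily Davis,97"]

# First run: nothing was recorded before, so every match counts as added.
first_run = added_regex_items(r"^Dr\..*", [], previous)
# Second run: only the newly appended matching line is reported.
second_run = added_regex_items(r"^Dr\..*", previous, current)
print(first_run, second_run)
```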
🛃 Export, edit and import a pipeline with YAML or JSON files
Unlike some other tools, Xvc doesn't require (or allow) you to specify pipelines in YAML files. Nevertheless, you can export and import the pipeline to JSON or YAML to edit in your editor. You can fix typos in commands, remove steps completely, or duplicate the pipeline with a new name this way.
$ xvc pipeline export --file my-pipeline.json
$ cat my-pipeline.json
{
  "name": "default",
  "steps": [
    {
      "command": "python3 -m pip install --quiet --user -r requirements.txt",
      "dependencies": [
        {
          "File": {
            "content_digest": {
              "algorithm": "Blake3",
              "digest": [..]
            },
            "path": "requirements.txt",
            "xvc_metadata": {
              "file_type": "File",
              "modified": {
                "nanos_since_epoch": [..],
                "secs_since_epoch": [..]
              },
              "size": 14
            }
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "install-deps",
      "outputs": []
    },
    {
      "command": "python3",
      "dependencies": [
        {
          "Step": {
            "name": "install-deps"
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "generate-data",
      "outputs": []
    },
    {
      "command": "echo \"${XVC_ADDED_REGEX_ITEMS}\" >> dr-iq-scores.csv ",
      "dependencies": [
        {
          "RegexItems": {
            "lines": [
              "Dr. Brian Shaffer,122",
              "Dr. Susan Swanson,81",
              "Dr. Brittany Chang,82",
              "Dr. Mallory Payne MD,70",
              "Dr. Sherry Leonard,93",
              "Dr. Albert Einstein,144"
            ],
            "path": "iq-scores.csv",
            "regex": "^Dr\\..*",
            "xvc_metadata": {
              "file_type": "File",
              "modified": {
                "nanos_since_epoch": [..],
                "secs_since_epoch": [..]
              },
              "size": 19021
            }
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "dr-iq",
      "outputs": [
        {
          "File": {
            "path": "dr-iq-scores.csv"
          }
        }
      ]
    },
    {
      "command": "python3",
      "dependencies": [
        {
          "File": {
            "content_digest": null,
            "path": "dr-iq-scores.csv",
            "xvc_metadata": null
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "visualize",
      "outputs": []
    }
  ],
  "version": 1,
  "workdir": ""
}
After you edit the file with changes, you can import the file to check its consistency and update the pipeline definition.
$ xvc pipeline import --file my-pipeline.json --overwrite
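Before importing an edited file, it can be handy to run a quick sanity check of your own. The sketch below is a hypothetical pre-check written for this example, not what xvc pipeline import does internally. It only assumes the "steps", "name", "dependencies" and "Step" keys visible in the export above, and uses a shortened inline pipeline.

```python
import json

# A shortened pipeline in the shape of the export shown above.
PIPELINE_JSON = """
{
  "name": "default",
  "steps": [
    {"name": "install-deps",
     "command": "python3 -m pip install --quiet --user -r requirements.txt",
     "dependencies": [], "outputs": []},
    {"name": "generate-data",
     "command": "python3",
     "dependencies": [{"Step": {"name": "install-deps"}}], "outputs": []}
  ]
}
"""


def find_problems(pipeline: dict) -> list[str]:
    """Report duplicate step names and step dependencies that don't exist."""
    problems = []
    names = [step["name"] for step in pipeline["steps"]]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate step name: {name}")
    for step in pipeline["steps"]:
        for dependency in step["dependencies"]:
            target = dependency.get("Step", {}).get("name")
            if target is not None and target not in names:
                problems.append(
                    f"{step['name']} depends on unknown step: {target}"
                )
    return problems


pipeline = json.loads(PIPELINE_JSON)
problems = find_problems(pipeline)

# Simulate a typo made while editing the exported file.
pipeline["steps"][1]["dependencies"][0]["Step"]["name"] = "instal-deps"
problems_after_typo = find_problems(pipeline)
print(problems, problems_after_typo)
```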
🎋 Visualize a pipeline in Graphviz or Mermaid
You can get the pipeline in Graphviz DOT format to convert to an image.
$ zsh -cl 'xvc pipeline dag --format graphviz | dot -Tpng -opipeline.png'
You can also ask for a mermaid diagram:
$ xvc pipeline dag --format mermaid
flowchart TD
n1["data/*"] --> n0
n0["preprocess"] --> n2
You can embed this output in Markdown files, GitHub PRs or Jupyter notebooks.
Comparison with other tools
There are many similar tools for managing large files on Git, and for managing machine learning pipelines and experiments. Most ML-oriented tools are provided as SaaS and are in a different vein than Xvc.
Similar tools for file management on Git are the following:
- DVC: See the Xvc for DVC Users and Benchmarks against DVC documents for a detailed comparison.
- git-annex: One of the earliest and most successful projects to manage large files with Git. It supports a large number of remote storage types, as well as adding other utilities as backends, similar to xvc storage new generic. It features an assistant aimed to make it easier for common use cases. It uses SHA-256 as the single digest option and uses symlinks as a recheck method. It doesn't have data pipeline features.
- git-lfs: It uses Git internals to track binary files. It requires server support for remote storages and allows only Git remotes to be used for binary file storage. It identifies file contents with SHA-256 digests. It uses the .gitattributes mechanism to track certain files by default. It doesn't have data pipeline features.
Adding completions to your shell
Xvc supports dynamic completions for bash, zsh, elvish, fish and powershell. This means, when you hit TAB in your shell, it calls Xvc to complete the command. Even paths that are not visible in your filesystem or pipeline and step names are completed this way.
In order to activate completions, run the following commands for your shell once:
Bash
echo "source <(COMPLETE=bash xvc)" >> ~/.bashrc
Elvish
echo "eval (E:COMPLETE=elvish xvc | slurp)" >> ~/.elvish/rc.elv
Fish
echo "source (COMPLETE=fish xvc | psub)" >> ~/.config/fish/
PowerShell
$env:COMPLETE = "powershell"
echo "xvc | Out-String | Invoke-Expression" >> $PROFILE
Remove-Item Env:\COMPLETE
Zsh
echo "source <(COMPLETE=zsh xvc)" >> ~/.zshrc
Nushell (without dynamic completions)
Until clap_complete_nushell supports dynamic completions similar to the above, you can create a completion script with xvc and use it in your shell.
$ xvc _comp generate-nushell | save ($nu.config-path | path dirname | path join "")
$ use ($nu.config-path | path dirname | path join "") *
This will provide completions for commands and options. It won't work for dynamic completions like pipeline names, storage IDs, etc.
Compiling Xvc without default features
You may want to customize the feature set when you want a smaller binary size. Not everyone needs all storage options, and turning them off may result in smaller binary sizes.
When you turn off all remote storage features, the async runtime (tokio) is also excluded from the binary.
$ cargo build --no-default-features --release
Finished `release` profile [optimized] target(s) in 4.65s
Compiling Xvc without Reflink support
The reflink crate may cause compilation errors on platforms where it's not supported. Xvc adds a reflink feature flag that's turned on by default. When reflink causes errors, you can turn off default features and select only those you'll need.
cargo build --no-default-features --features "reflink" --release
Finished `release` profile [optimized + debuginfo] target(s) in 56.40s
Note that when you supply --no-default-features, all other default features like s3 etc. are also turned off. You'll have to specify which features you want in the features list. Otherwise Xvc cannot connect to your storages.
cargo build --no-default-features --features "s3,wasabi" --release
Finished `release` profile [optimized + debuginfo] target(s) in 56.40s
Configuration Files
Configure with Environment Variables
Changing configuration for a command
Get Started to Xvc
Xvc is a multipurpose tool. Its features can be used by professionals with various roles. If you're working with data, you can benefit from Xvc data management features.
Xvc for Everyone
Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).
🐇 Hello tortoise. How are you? Let's take a selfie. Do you take selfies? I have lots of them. Terabytes of them.
🐢 I don't have many selfies, you know. I don't change quickly and scenery is changing less often.
🐇 I see. I have terabytes of them, but can't find a good solution to store them. How do you store your documents? I know you have documents, lots and lots of them.
🐢 I track them with Git to track my evolving thoughts on text files. Images are different. I think it's not a good idea to keep images on Git, but there is a tool for that.
🐇 What kind of tool? Not Git, but something different?
🐢 It's called Xvc. You can keep track of your selfies with it. You can back them up, and get them as needed.
🐇 Tell me more about it. I have a directory in my home, ~/Selfies, and I have thousands of them. How will I start?
🐢 Xvc can be used as a standalone tool but better when used with Git. You can just type
$ git init
$ xvc init
to start working with Xvc.
🐇 It looks easy but I heard that Git is complicated. Will I need to learn it?
🐢 Ah, no. If you're not willing to learn Git, you can just let Xvc handle that. By default, it handles all Git operations about the changes it makes. If you want to share your files with someone, you may need to learn how to manage a repository.
🐇 How do I track my files?
🐢 You use the xvc file track command. Do you have directories in ~/Selfies?
🐇 Yep. I have. Lots of them.
🐢 Do you want to track all of them?
🐇 Almost all. Some of them are so private that I want to hide even from Xvc.
🐢 You can use a .xvcignore file to list them. Xvc ignores the files you list in .xvcignore.
🐇 How do I add others? Could you give an example?
🐢 If you have a folder for today's selfies, type this in ~/Selfies
$ xvc file track today/
and Xvc will track everything in that directory.
🐇 Oh, that's easy. If I want to track everything not ignored, I can type xvc file track
🐢 You're a quick learner.
After a brief period, 🐇 went home and added the files.
🐇 Now, I want to learn how to share my selfies.
🐢 Xvc can store file contents in another location. First you must set up a storage. Do you use AWS S3?
🐇 Yes. I have buckets there. I want to keep my selfies in my rabbit-hole.
🐢 You can configure Xvc to use it with the xvc storage new s3 command. You'll specify the region and bucket, and Xvc will prepare it.
🐇 types
$ xvc storage new s3 --name selfies --region eu-lepus-1 --bucket-name rabbit-hole
🐢 Now, you can send your files there with xvc file send --to selfies
🐇 Is that all?
🐢 You will also need to push your Git files to another place. Do you have a GitHub account?
🐇 Ah, yeah, I have.
🐢 Now create a repository for your selfies. We will configure Git to use it as origin:
$ git remote add origin🐇/selfies
$ git push --set-upstream origin main
Now, you can share your selfies with your friends.
🐇 Cool, but how does Xvc know my AWS password? Does it share my passwords?
🐢 No, never. You must allow your friends to read that bucket of yours. Xvc reads the credentials from AWS configuration, either from the file or the environment variables.
🐇 How will they get my files?
🐢 First, they must clone the repository.
$ git clone🐇/selfies
Then, they can get all files with:
$ cd selfies
$ xvc file bring . --from selfies
🐇 Oh, cool, they don't have to xvc init again? Right?
🐢 No, they don't. Xvc should be initialized only once per repository. When you have new selfies, you can share them with:
$ xvc file track
$ git push
and your friends can receive the changes with
$ git pull
$ xvc file bring --from selfies
🐇 The order of these commands is important, it seems.
🐢 Yep. You add to Xvc first. Xvc automatically commits the changes to Git. Then you push Git changes to remote. Your friends first pull these changes, then get the actual files.
🐇 Thank you tortoise. Let me get back to my hole.
Xvc for Data
Xvc for Machine Learning
🐇 Ah, hello tortoise. How are you? I began to work as a machine learning engineer, you know? I'll be the fastest.
🐢 You're quick as always, hare. How is your job going so far?
🐇 It's good. We have lots and lots of data. We have models. We have scripts to create those models. We have notebooks full of experiments. That's all good stuff. We'll solve the hare intelligence problem.
🐢 Sounds cool. Aren't you losing yourself in all these, though?
🐇 From time to time we have those moments. Some models work with some data, some experiments require some kind of preprocessing, some data changed since we started to work with it and now we have multiple versions.
🐢 I see. I began to use a tool called Xvc. It may be of use to you.
🐇 What does it do?
🐢 It keeps track of all the stuff you mentioned. Data, models, scripts. It can also detect when data changes and run the scripts associated with that data.
🐇 That sounds like something we need. My boss wanted me to build a pipeline for cat pictures. He runs a contest for cat pictures. Every time he finds a new cat picture he likes, we have to update the model.
🐢 He must have lots of cat pictures.
🐇 He has. He sometimes finds higher resolution versions and replaces older pictures. He has terabytes of cat pictures.
🐢 How do you keep track of those versions?
🐇 We don't. We have a disk for cat pictures. He puts everything there and we train models with it.
🐢 You can use Xvc to version those files. You can go back and forth in time, or have different branches. It's based on Git.
🐇 I know, but Git is for code files, right? I never found a good way to store image files in Git. It stores everything.
🐢 Yep. Git keeps all history in each repository. Better to keep those terabytes of images away from Git. Otherwise, you'll have terabytes of cat pictures in each clone you use. Xvc helps there. It tracks contents of data files separately from Git. Image files are not put into Git objects, and they are not duplicated in all repositories.
🐇 You know, I'm not interested in details. Tell me how this works.
🐢 Ok. When you go back to the cat picture directory, create a Git repository, and initialize Xvc immediately.
$ git init
$ xvc init
? 0
🐇 No messages?
🐢 Xvc is of the silent type of Unix commands. It follows the "no news is good news" principle. We use ? 0 to indicate the command return code. 0 means success. If you want more output, you can add -v as a flag. Increase the number of -vs to increase the details.
🐇 So -vvvvvvvvvvvvvvv will show which atoms interact on the disk while running Xvc?
🐢 It may work, try that next time. Now, you can add your cat pictures to Xvc. Xvc makes copies of tracked files by default. I assume you have a large collection. Better to make everything symlinks for now. We can change how specific files are linked to cache later.
$ xvc -v file track --recheck-method symlink .
🐇 Does it track everything that way?
🐢 Yes. If you want to track only particular files or directories, you can replace . with their names.
🐇 What's the best recheck method for me?
🐢 If your file system supports it, the best way seems to be reflink to me. It's like a symlink but makes a copy when your file changes. Most of the widely used file systems don't support it, though. If your files are read-only and you don't have many links to the same files, you can use hardlink. If they are likely to change, you can use copy. If there are many links to the same files, better to use symlink.
🐢 I suspect most of the files in your cat pictures are duplicates. Xvc stores only one copy of these in cache and links all occurrences in the workspace to this copy. This is called deduplication. There are limits to number of hardlinks, so I recommended you to use symlinks. They are more visible. You can see they are links. Hardlinks are harder to detect.
🐇 Ah, when I type ls -l, they all show the cache location now.
🐢 If you have a models/ directory and want to track them as copies, you can tell Xvc:
$ xvc file track --recheck-method copy models/
It replaces previous symlinks with the copies of the files only in models/.
🐇 Can I have my data read only and models writable?
🐢 You can. Xvc keeps track of each file's recheck-method separately. Data can stay in read-only symlinks, and models can be copied so they can be updated and stored as different versions.
🐇 I also have scripts, what should I do with them?
🐢 Are you using Git for them?
🐇 Yep. They are in a separate repository. I think I can use the same repository now.
🐢 You can. Better to keep them in the same repository. They can be versioned with the data they use and models they produce. You can use standard Git commands to track them. If you track a file with Git, Xvc doesn't track it. It stays away from it.
🐇 You said we can create pipelines with Xvc as well. I created a multi-stage pipeline for cat picture models. It's like this:
🐢 It looks like a fairly complex pipeline. You can create a pipeline definition for it. For each separate command we'll have a step. How many different commands do you have?
🐇 A preprocess --train command, a preprocess --test command, a train command, a test command and a deploy command. Five.
🐢 Do you need more than one pipeline? Maybe you would like to put deployment to another pipeline?
🐇 No, I don't think so. I may have in the future.
🐢 Xvc has a default pipeline. We'll use it for now. If you need more pipelines, you can create them with xvc pipeline new.
🐇 How do I create steps for commands?
🐢 Let's create the steps at once. Each step requires a name and a command.
$ xvc pipeline step new --step-name preprocess-train --command 'python3 src/ --train data/cats data/pp-train/'
$ xvc pipeline step new --step-name preprocess-test --command 'python3 src/ --test data/cats data/pp-test/'
$ xvc pipeline step new --step-name train --command 'python3 src/ data/pp-train/'
$ xvc pipeline step new --step-name test --command 'python3 src/ data/pp-test/ metrics.json'
$ xvc pipeline step new --step-name deploy --command 'python3 models/model.bin /var/server/files/model.bin'
🐇 How do we define dependencies?
🐢 You can have many different types of dependencies. All are defined by the xvc pipeline step dependency command. You can set up direct dependencies between steps: if one is invalidated, its dependents also run. You can set up file dependencies: if the file changes, the step is invalidated and needs to run again. There are other, more detailed dependencies, like parameter dependencies, which take a file in JSON or YAML format and check whether a value has changed. There are also regular expression dependencies. For example, if you have a piece of code in your training script that you change to update the parameters, you can define a regex dependency.
🐇 It looks I can use this for CSV files as well.
🐢 Yes. If your step depends not on the whole CSV file, but only on specific rows, you can use regex dependencies. You can also specify line numbers of a file to depend on.
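A parameter dependency, as the tortoise describes it, boils down to reading one value from a structured file and comparing it with the value seen last time. The following Python sketch illustrates that with JSON from the standard library. The `read_param` helper, the in-memory "files" and the dotted key path are assumptions for the example. Only the `file::key` shape is borrowed from the commands in this page.

```python
import json


def read_param(spec: str, files: dict[str, str]):
    """Resolve a 'file::key.path' spec against in-memory file contents."""
    filename, key_path = spec.split("::", 1)
    value = json.loads(files[filename])
    for key in key_path.split("."):
        value = value[key]
    return value


# Hypothetical versions of a parameters file.
before = {"params.json": '{"learning_rate": 0.01, "epochs": 10}'}
after_epochs = {"params.json": '{"learning_rate": 0.01, "epochs": 20}'}
after_lr = {"params.json": '{"learning_rate": 0.001, "epochs": 20}'}

spec = "params.json::learning_rate"
# Another key changed: the step depending on learning_rate stays valid.
unchanged = read_param(spec, before) == read_param(spec, after_epochs)
# The watched key changed: the step would be invalidated.
changed = read_param(spec, before) != read_param(spec, after_lr)
print(unchanged, changed)
```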
🐇 My preprocessing script depends on the data/cats directory. My training script depends on params.yaml for some hyperparameters, and reads 5 Star ratings from cat-contest.csv. I want to deploy when the newly produced model is better than the older one by checking best-model.json. My deployment script doesn't update the deployment if the new model is not the best.
🐢 Let's see. For each step, you can use a single command to define its dependencies. For the preprocessing steps, you'll depend on the data directory and the script itself. We want to run the step when the script changes. It's like this:
$ xvc pipeline step dependency --step-name preprocess-train --glob 'data/cats/*' --file src/
$ xvc pipeline step dependency --step-name preprocess-test --glob 'data/cats/*' --file src/
$ xvc pipeline step dependency --step-name train --glob 'data/pp-train/*' --file src/ --param 'params.yaml::learning_rate' --regex 'cat-contest.csv:/^5,.*'
$ xvc pipeline step dependency --step-name test --glob 'models/*' --glob 'data/pp-test/*'
$ xvc pipeline step dependency --step-name deploy --file best-model.json
You must also define the outputs these steps produce, so when an output is missing or a dependency is newer than the output, the step will need to rerun.
$ xvc pipeline step output --step-name preprocess-train --directory data/pp-train
? 2
error: unexpected argument '--directory' found
Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>
For more information, try '--help'.
$ xvc pipeline step output --step-name preprocess-test --directory data/pp-test
? 2
error: unexpected argument '--directory' found
Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>
For more information, try '--help'.
$ xvc pipeline step output --step-name train --directory models/
? 2
error: unexpected argument '--directory' found
Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>
For more information, try '--help'.
$ xvc pipeline step output --step-name test --output-file metrics.json --output-file best-model.json
$ xvc pipeline step output --step-name deploy --file /var/server/files/model.bin
? 2
error: unexpected argument '--file' found
Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>
For more information, try '--help'.
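The rerun rule above can be sketched in a few lines of Python. This is only an illustration, not Xvc's implementation: since Xvc compares file contents rather than timestamps, the sketch treats a dependency as changed when its content digest differs from a recorded one. The function names, the `recorded` mapping, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path: Path) -> str:
    """Content digest of a file (SHA-256 here; any stable hash would do)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_rerun(deps, outputs, recorded):
    """Return True if an output is missing or a dependency's content changed.

    `recorded` is a hypothetical mapping from dependency paths to the
    digests saved after the last successful run.
    """
    if any(not Path(o).exists() for o in outputs):
        return True
    return any(recorded.get(str(d)) != digest(Path(d)) for d in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "data.txt"
    out = Path(tmp) / "model.bin"
    dep.write_text("v1")

    # The output doesn't exist yet, so the step has to run.
    missing_output = needs_rerun([dep], [out], {})

    # After a run, the output exists and the dependency digest is recorded.
    out.write_text("model")
    recorded = {str(dep): digest(dep)}
    unchanged = needs_rerun([dep], [out], recorded)

    # Changing the dependency content invalidates the step again.
    dep.write_text("v2")
    changed = needs_rerun([dep], [out], recorded)

print(missing_output, unchanged, changed)
```

Comparing digests instead of modification times means that touching a file without changing its content doesn't trigger a rerun.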
🐇 These commands become too long to type. You know, I'm a lazy hare and don't like to type much. Is there an easier way?
🐢 You can try source $(xvc aliases) in your Bash or Zsh, and get a bunch of aliases for these commands. xvc pipeline step output becomes xvcpso, xvc pipeline step dependency becomes xvcpsd, etc. You can see the whole list:
$ xvc aliases
? 2
error: unrecognized subcommand 'aliases'
Usage: xvc [OPTIONS] <COMMAND>
For more information, try '--help'.
🐇 Oh, there are many more commands.
🐢 Yep, and more to come. To edit a pipeline, you can use xvc pipeline export and, after making the changes, xvc pipeline import.
🐇 I don't need to delete the pipeline to rewrite everything, then?
🐢 You can export a pipeline, edit and import with a different name to test. When you want to run them, you specify their names.
🐇 Ah, yeah, that's the most important part. How do I run?
🐢 xvc pipeline run, or xvcpr. It takes the name of the pipeline and runs it. It sorts the steps and checks if there are any cycles. The steps mustn't have cycles, otherwise it's an infinite loop and computers don't like infinite loops like turtles do. Xvc runs steps in parallel if they don't depend on each other.
🐇 So, if I have multiple preprocessing steps that don't depend on each other, they can run in parallel?
🐢 Yeah, they run in parallel. For example, in your pipeline preprocess-train and preprocess-test can run in parallel, because they don't depend on each other.
🐇 Cool. I want to see the pipeline we created.
🐢 You can see it with xvc pipeline dag. It prints a mermaid.js diagram that you can paste to your files.
🐇 Better to have an image of this, maybe.
🐢 I'll inform the developer about it. Please tell him anything you'd like to see in the tool on GitHub or via email. He's extremely introverted but tries to be a nice guy.
🐇 Ah, ok, I'll write to him about this.
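The sorting, cycle check, and parallel grouping that the turtle describes can be illustrated with Python's standard graphlib module. This is a sketch of the general technique, not Xvc's scheduler. The step names come from the example pipeline, and the edges are assumptions read off the dependencies and outputs defined above.

```python
from graphlib import CycleError, TopologicalSorter

# Step -> steps it depends on. The edges are assumptions for illustration.
pipeline = {
    "preprocess-train": set(),
    "preprocess-test": set(),
    "train": {"preprocess-train"},
    "test": {"train", "preprocess-test"},
    "deploy": {"test"},
}


def run_in_batches(graph):
    """Group steps into batches; steps in one batch could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = run_in_batches(pipeline)
for batch in batches:
    print(batch)

# A pipeline whose steps depend on each other is rejected.
try:
    run_in_batches({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

The first batch contains both preprocessing steps, because neither waits for the other.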
Xvc for Software Development
Xvc for DVC Users
DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.
Note that this document refers mostly to Xvc v0.6 and DVC 2.30. Both commands are in development, and similarities and differences may change in time.
The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage the data, pipelines and experiments of an ML project.
Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC.
Xvc has the same optional and recommended reliance on Git but all features are available without Git. Xvc uses Git with its CLI interface like a user, without any reliance on a particular library.
Both of these commands hash file content to detect changes in files.
Both of these use DAGs to represent pipelines.
Conceptual Differences
stage vs. step: What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word in a different meaning increases the mental effort to describe and understand.
remote vs. storage: What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
pipeline definitions: In DVC, there is a 1-1 correspondence between files in a repository and the pipelines. When you want to create a new pipeline, you create a new file in DVC. In Xvc, pipelines are abstract. They are defined with the xvc pipeline family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
Files in the user workspace: DVC is more liberal in creating files among user files in the repository. When you add a file with dvc add, DVC creates a .dvc file next to it. Xvc only creates a .xvc/ directory in the repository root and only updates .gitignore files to hide tracked files from Git. You won't see any files added next to your data files.
cache-type vs recheck-method: The cache type (or rather, recheck method) determines whether a file in the repository is linked to its cached version by copy, reflink, symlink or hardlink. In DVC, it is set repository-wide: you can have all your cache links as symlinks, or all as hardlinks, etc. Xvc tracks it per file: you can have one file symlinked to the cache, another file copied from the cache, etc.
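To make the difference between recheck methods concrete, here is a small Python sketch that places the same cached file into a workspace in three ways. It is not how Xvc is implemented. The recheck function and the file names are hypothetical, and reflink is left out because it needs filesystem support that the standard library doesn't expose.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, target: Path, method: str) -> None:
    """Make `target` available in the workspace from `cached` (hypothetical helper)."""
    if method == "copy":
        shutil.copy2(cached, target)  # independent, writable duplicate
    elif method == "symlink":
        os.symlink(cached, target)    # pointer to the cached file
    elif method == "hardlink":
        os.link(cached, target)       # second name for the same inode
    else:
        raise ValueError(f"unsupported method: {method}")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache-entry"
    cache.write_text("payload")
    results = {}
    # Each file gets its own method, as in a per-file recheck setting.
    for method in ("copy", "symlink", "hardlink"):
        target = Path(tmp) / f"file-{method}"
        recheck(cache, target, method)
        shares_storage = target.stat().st_ino == cache.stat().st_ino
        results[method] = (target.is_symlink(), shares_storage)

print(results)
```

Only the copy occupies additional space. The symlink and the hardlink both resolve to the cached data, which is why they suit read-only data files.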
Command Differences
While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both git push and dvc push commands may look beneficial for understanding at first, as these two are analogous. However, giving the same name also hides important details that are more difficult to emphasize later (e.g., DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes).
dvc add can be replaced by xvc file track. dvc add creates a .dvc file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
Instead of deleting .dvc files to remove a file from DVC, you can use xvc file untrack. It can also restore all versions of an untracked file to a directory.
dvc check-ignore can be replaced by xvc check-ignore. The Xvc version can also be used against other ignore files, such as .gitignore.
dvc checkout is replaced by xvc file recheck. There is a --recheck-method (shortened as --as) option in several Xvc commands to tell whether to check out as symlink, hardlink, reflink or copy.
dvc commit is replaced by xvc file carry-in. They both cache the files if they are changed.
There is no command similar to dvc config. You can either edit the configuration files, or modify configuration with options in each run. You can also supply all configuration from the environment. See Configuration.
dvc dag is replaced by xvc pipeline dag. The DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art; instead, it provides either a Graphviz representation or a mermaid diagram.
dvc data status and dvc status can be replaced by xvc file list. The Xvc version doesn't provide information about pipelines or remote storages.
There is no command similar to dvc destroy in Xvc. There will be an xvc deinit command at some point. Until then, you can just delete the .xvc/ directory and all .xvcignore files in your repository to remove Xvc from it.
There is no command similar to dvc diff in Xvc.
There is no command similar to dvc doctor or dvc version. Version information should be visible in the help text. Unless compiled from source with feature flags, Xvc binaries don't have feature differences.
Currently, there are no commands corresponding to the dvc exp set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
dvc fetch is replaced by xvc file bring --no-recheck.
Instead of freezing "pipeline stages" as in dvc freeze, and unfreezing with dvc unfreeze, xvc pipeline step update --changed [never|always|by_dependencies] can be used to specify if/when to run a pipeline step.
Instead of dvc gc to "garbage-collect" files, you can use xvc file remove with various options.
There is no corresponding command for dvc get-url in Xvc. You can use a download tool such as curl instead.
Currently, there is no command to replace dvc get, dvc import, and dvc import-url. URL dependencies are supported in the pipeline with xvc pipeline step dependency --url.
Instead of dvc install-like hooks, Xvc issues Git commands itself if configuration options such as git.auto_stage are set.
There is no corresponding command for dvc list-url.
dvc list is replaced by xvc file list for local paths. Its remote capabilities are not implemented yet but are on the roadmap.
Xvc doesn't mix files from different repositories in the same storage. There is an ID for each Xvc repo that's also used in remote storage paths.
Currently, there is no params/metrics tracking/diff similar to the dvc params, dvc metrics, or dvc plots commands in Xvc.
dvc move is replaced by xvc file move.
dvc push is replaced by xvc file send.
dvc pull is replaced by xvc file bring.
There are no commands similar to dvc queue for experiments in Xvc. Experiment tracking will probably be handled differently.
The dvc remote set of commands is replaced by the xvc storage set of commands. You can use xvc storage new for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of dvc remote modify, you can use xvc storage remove and xvc storage new.
There is no single command to replace dvc remove. For files, you can use xvc file delete. For pipeline steps, you can use xvc pipeline step remove.
Instead of dvc repro, Xvc has xvc pipeline run. If you want to reproduce a pipeline, you can use xvc pipeline run.
xvc root is for the same purpose as dvc root.
dvc run (that defines a stage in a DVC pipeline and immediately runs it) can be replaced by the xvc pipeline set of commands: xvc pipeline new for a new pipeline, xvc pipeline step new for a new step in the pipeline, xvc pipeline step dependency to specify dependencies of a step, xvc pipeline step output to specify outputs of a step, and xvc pipeline run to run this pipeline.
Instead of dvc stage add, we have xvc pipeline step new. For dvc stage list, we have xvc pipeline step list.
There is no need for dvc protect or dvc unprotect commands in Xvc. The "cache type" of Xvc is not a repository-wide option, and is called "recheck method". If you want to track a certain directory as symlink, and another as hardlink, you can do so with xvc file recheck --as. If you want identical files copied to one directory and linked in another, xvc file copy can help.
DVC needs dvc update for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading and automatically invalidates the step if the URL/file has changed.
DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
Extra Features of Xvc
Xvc can use multiple hashing functions, like BLAKE3, BLAKE2s, SHA2-256 and SHA3-256. More can be added upon request. The only requirement for hashes is having 32 bytes (256 bits, 64 hex digits) of output.
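As a quick check of the output size, the following Python sketch computes digests with three of the listed algorithms using the standard hashlib module. BLAKE3 is not in the Python standard library, so it is left out here. The input bytes are arbitrary. A 256-bit digest is 32 bytes, which prints as 64 hexadecimal digits.

```python
import hashlib

data = b"Xvc tracks files by their content"

# All three produce 256-bit digests; blake2s defaults to a 32-byte digest.
digests = {
    "BLAKE2s": hashlib.blake2s(data).hexdigest(),
    "SHA2-256": hashlib.sha256(data).hexdigest(),
    "SHA3-256": hashlib.sha3_256(data).hexdigest(),
}

for name, value in digests.items():
    print(f"{name}: {len(value)} hex digits = {len(value) * 4} bits")
```

Because all of these digests have the same length, the same addressing scheme can work with any of them.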
In its pipelines, Xvc has more flexibility in defining dependencies. DVC supports files, directories and hyperparameters. Xvc additionally supports:
- globs,
- text file lines defined by line numbers,
- text file lines defined by regular expressions,
- URLs,
- SQLite queries.
Technical Differences
DVC is written in Python. Xvc is written in Rust.
DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
DVC tracks file/directory changes in separate files. Xvc tracks them in .json files in .xvc/store. There is no 1-1 correspondence between these files and the directory structure.
DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System in its core.
DVC remotes are identical to their cache in structure, and multiple DVC repositories use the same remote by mixing files. This provides inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated and when you want to delete all files associated with a repository, you can do so without the risk of deleting files used in other repositories.
DVC considers directories as file-equivalent entities to track with files pointing to .json files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
Benchmarking Xvc vs DVC
In this section, we'll write a few tests to see how Xvc and DVC perform in common tasks. This document is planned to be reproducible to see the differences in performance. I'll update it from time to time to see the differences, and I'll also add more tests.
This is mostly to satisfy my personal curiosity. I don't claim these are scientific experiments that describe the performance in all conditions.
We'll test the tools in the following scenarios:
- Checking in small files: We'll unzip 15,000 images from the Chinese-MNIST dataset and measure the time for dvc add and xvc file track.
- Checking out small files: We'll delete the files we track and recheck / checkout them using dvc checkout and xvc file recheck.
- Pushing/sending the small files we added to S3.
- Pulling/bringing the small files we pushed from S3.
- Checking in and out large files: We'll create 100 large files and repeat the above tests.
- Running small pipelines: We'll create a pipeline with 10 steps to run simple commands.
- Running medium sized pipelines: We'll create a pipeline with 100 steps to run simple commands.
- Running large pipelines: We'll create a pipeline with 1000 steps to run simple commands.
This document uses the most recent versions of Xvc and DVC. DVC is installed via Homebrew.
$ dvc --version
$ xvc --version
xvc v0.6.4-alpha.0-300-g08c034a-modified
Init Repositories
Let's start by measuring the performance of initializing repositories.
$ git init
Initialized empty Git repository in [CWD]/.git/
$ hyperfine -r 1 'xvc init'
Benchmark 1: xvc init
Time (abs ≡): 48.6 ms [User: 11.0 ms, System: 21.3 ms]
$ hyperfine -r 1 'dvc init ; git add .dvc/ .dvcignore ; git commit -m "Init DVC"'
Benchmark 1: dvc init ; git add .dvc/ .dvcignore ; git commit -m "Init DVC"
Time (abs ≡): 425.3 ms [User: 205.7 ms, System: 86.3 ms]
$ git status -s
Unzip the images
$ unzip -q
$ zsh -cl 'cp -r data/data xvc-data'
$ zsh -cl 'cp -r data/data dvc-data'
$ tree -d
├── data
│ └── data
├── dvc-data
└── xvc-data
5 directories
15K Small Files Performance
Xvc commits the changed metafiles automatically unless otherwise specified in the options. In the DVC command below, we also commit *.dvc
$ hyperfine -r 1 'xvc file track xvc-data/'
Benchmark 1: xvc file track xvc-data/
Time (abs ≡): 3.655 s [User: 0.931 s, System: 12.339 s]
$ hyperfine -r 1 --show-output 'dvc add dvc-data/ '
Benchmark 1: dvc add dvc-data/
To track the changes with git, run:
git add .gitignore dvc-data.dvc
To enable auto staging, run:
dvc config core.autostage true
Time (abs ≡): 13.027 s [User: 4.740 s, System: 6.765 s]
$ lsd -l
$ git status -s
M .gitignore
?? data/
?? dvc-data.dvc
Checkout a directory with 15K files
$ rm -rf xvc-data
$ hyperfine -r 1 'xvc file recheck xvc-data/'
Benchmark 1: xvc file recheck xvc-data/
Time (abs ≡): 2.378 s [User: 0.438 s, System: 2.152 s]
$ rm -rf dvc-data/
$ ls
$ hyperfine -r 1 --show-output 'dvc checkout dvc-data.dvc'
Benchmark 1: dvc checkout dvc-data.dvc
A dvc-data/
Time (abs ≡): 4.102 s [User: 1.399 s, System: 2.155 s]
Large File Performance
$ zsh -cl 'dd if=/dev/urandom of=xvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.669660 secs (628017680 bytes/sec)
$ hyperfine -r 1 'xvc file track xvc-large-file'
Benchmark 1: xvc file track xvc-large-file
Time (abs ≡): 1.499 s [User: 0.816 s, System: 0.805 s]
$ zsh -cl 'dd if=/dev/urandom of=dvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.446919 secs (724695716 bytes/sec)
$ hyperfine -r 1 --show-output 'dvc add dvc-large-file ; git add dvc-large-file.dvc .gitignore ; git commit -m "Added dvc-large-file to DVC"'
Benchmark 1: dvc add dvc-large-file ; git add dvc-large-file.dvc .gitignore ; git commit -m "Added dvc-large-file to DVC"
To track the changes with git, run:
git add dvc-large-file.dvc .gitignore
To enable auto staging, run:
dvc config core.autostage true
[main 72fd199] Added dvc-large-file to DVC
2 files changed, 6 insertions(+)
create mode 100644 dvc-large-file.dvc
Time (abs ≡): 2.153 s [User: 1.906 s, System: 0.203 s]
Commit/Carry-in Large Files
$ zsh -cl 'dd if=/dev/urandom of=xvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.550065 secs (676472277 bytes/sec)
$ hyperfine -r 1 'xvc file carry-in xvc-large-file'
Benchmark 1: xvc file carry-in xvc-large-file
Time (abs ≡): 1.024 s [User: 0.629 s, System: 0.393 s]
$ zsh -cl 'dd if=/dev/urandom of=dvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.550363 secs (676342250 bytes/sec)
$ hyperfine -r 1 --show-output 'dvc add dvc-large-file ; git add dvc-large-file.dvc ; git commit -m "Added dvc-large-file to DVC"'
Benchmark 1: dvc add dvc-large-file ; git add dvc-large-file.dvc ; git commit -m "Added dvc-large-file to DVC"
To track the changes with git, run:
git add dvc-large-file.dvc
To enable auto staging, run:
dvc config core.autostage true
[main c74d783] Added dvc-large-file to DVC
1 file changed, 1 insertion(+), 1 deletion(-)
Time (abs ≡): 2.098 s [User: 1.903 s, System: 0.189 s]
Pipeline with 10 Steps
Pipeline steps will depend on the following files.
$ xvc-test-helper create-directory-tree --directories 1 --files 10 --root pipeline-10
$ tree pipeline-10
└── dir-0001
├── file-0001.bin
├── file-0002.bin
├── file-0003.bin
├── file-0004.bin
├── file-0005.bin
├── file-0006.bin
├── file-0007.bin
├── file-0008.bin
├── file-0009.bin
└── file-0010.bin
2 directories, 10 files
Let's create 10 DVC stages to depend on these files:
$ zsh -cl "for f in pipeline-10/dir-0001/* ; do dvc stage add -q -n ${f:r:t} -d ${f} 'sha1sum $f'; done"
$ dvc stage list
file-0001 Depends on pipeline-10/dir-0001/file-0001.bin
file-0002 Depends on pipeline-10/dir-0001/file-0002.bin
file-0003 Depends on pipeline-10/dir-0001/file-0003.bin
file-0004 Depends on pipeline-10/dir-0001/file-0004.bin
file-0005 Depends on pipeline-10/dir-0001/file-0005.bin
file-0006 Depends on pipeline-10/dir-0001/file-0006.bin
file-0007 Depends on pipeline-10/dir-0001/file-0007.bin
file-0008 Depends on pipeline-10/dir-0001/file-0008.bin
file-0009 Depends on pipeline-10/dir-0001/file-0009.bin
file-0010 Depends on pipeline-10/dir-0001/file-0010.bin
Run the DVC pipeline
$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
Time (abs ≡): 766.8 ms [User: 482.4 ms, System: 218.7 ms]
Running without changing the dependencies
$ hyperfine -M 5 "dvc repro"
Benchmark 1: dvc repro
Time (mean ± σ): 455.8 ms ± 22.6 ms [User: 342.3 ms, System: 107.4 ms]
Range (min … max): 431.0 ms … 492.3 ms 5 runs
$ zsh -cl "for f in pipeline-10/dir-0001/* ; do xvc pipeline step new -s ${f:r:t} --command 'sha1sum $f' ; xvc pipeline step dependency -s ${f:r:t} --file ${f} ; done"
$ hyperfine -r 1 "xvc pipeline run"
Benchmark 1: xvc pipeline run
Time (abs ≡): 229.8 ms [User: 53.9 ms, System: 227.3 ms]
$ hyperfine -M 5 "xvc pipeline run"
Benchmark 1: xvc pipeline run
Time (mean ± σ): 176.8 ms ± 4.0 ms [User: 34.6 ms, System: 144.1 ms]
Range (min … max): 173.0 ms … 183.0 ms 5 runs
Pipeline with 100 Steps
Pipeline steps will depend on the following files.
$ xvc-test-helper create-directory-tree --directories 1 --files 100 --root pipeline-100
$ tree -d pipeline-100
└── dir-0001
2 directories
$ rm -f dvc.yaml
$ zsh -cl "for f in pipeline-100/dir-0001/* ; do dvc stage add -q -n s-${RANDOM} -d ${f} 'sha1sum $f'; done"
$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
Time (abs ≡): 10.383 s [User: 8.813 s, System: 1.072 s]
$ hyperfine -M 5 "dvc repro"
Benchmark 1: dvc repro
Time (mean ± σ): 637.3 ms ± 9.8 ms [User: 467.4 ms, System: 161.1 ms]
Range (min … max): 630.2 ms … 654.3 ms 5 runs
Let's create 100 Xvc steps to depend on the same files.
$ xvc pipeline new --pipeline-name p100
$ zsh -cl "for f in pipeline-100/dir-0001/* ; do xvc pipeline -p p100 step new -s ${f:r:t} --command 'sha1sum $f' ; xvc pipeline -p p100 step dependency -s ${f:r:t} --file ${f} ; done"
$ hyperfine -r 1 --show-output "xvc pipeline -p p100 run"
Benchmark 1: xvc pipeline -p p100 run
Time (abs ≡): 201.9 ms [User: 39.6 ms, System: 168.4 ms]
$ hyperfine -M 5 "xvc pipeline -p p100 run"
Benchmark 1: xvc pipeline -p p100 run
Time (mean ± σ): 198.7 ms ± 3.1 ms [User: 39.9 ms, System: 163.9 ms]
Range (min … max): 196.0 ms … 203.8 ms 5 runs
Note that the first runs of the commands are drastically different. DVC runs all stages sequentially, in around 10.4 seconds, while Xvc runs them in parallel in 0.2 seconds. Let's also measure the average run time of a sha1sum command to see how much of this time is spent in the actual commands.
$ hyperfine 'sha1sum pipeline-100/dir-0001/file-0001.bin'
Benchmark 1: sha1sum pipeline-100/dir-0001/file-0001.bin
Time (mean ± σ): 1.2 ms ± 0.2 ms [User: 0.4 ms, System: 0.5 ms]
Range (min … max): 0.9 ms … 2.7 ms 535 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Pipeline with 1000 Steps
In this case we'll just measure the run time of 1000 ls commands.
$ rm -f dvc.yaml
$ zsh -cl "for i in {1..1000}; do dvc stage add -q -n s-${i} 'ls'; done"
$ zsh -cl 'dvc stage list | wc -l'
$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
Time (abs ≡): 469.534 s [User: 449.463 s, System: 17.257 s]
$ hyperfine -M 5 "dvc repro"
? interrupted
Benchmark 1: dvc repro
$ xvc pipeline new --pipeline-name p1000
$ zsh -cl "for i in {1..1000} ; do xvc --skip-git pipeline -p p1000 step new -s s-${i} --command 'ls' ; done"
$ zsh -cl 'xvc pipeline step list --names-only | wc -l'
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
$ hyperfine -r 1 --show-output "xvc pipeline -p p1000 run"
Benchmark 1: xvc pipeline -p p1000 run
Time (abs ≡): 460.0 ms [User: 78.7 ms, System: 376.8 ms]
$ hyperfine -M 5 "xvc pipeline -p p1000 run"
Benchmark 1: xvc pipeline -p p1000 run
Time (mean ± σ): 404.5 ms ± 10.6 ms [User: 79.0 ms, System: 366.7 ms]
Range (min … max): 397.4 ms … 423.2 ms 5 runs
How-To Guides
How to Compile Xvc
Why would you compile?
- You want to use Xvc on a platform for which we don't distribute a binary.
- You want a smaller binary size by removing features that you don't use.
- You like your software compiled.
- Building from source is easier for you than other installation methods.
- You want to fix a bug for yourself.
- You want to contribute!
Install Rust
You must have Rust installed on your system.
If you have a sensible terminal on your system:
$ curl --proto '=https' --tlsv1.2 -sSf | sh
Otherwise refer to other installation methods page.
Clone the repository
Clone the repository from Emre's Github repository.
$ git clone -b latest
The latest tag refers to the latest stable release. If you're willing to fight with compilation errors, you can also use the main branch directly.
Compile without default features
Xvc with Git Branches
When you're working with multiple branches in Git, you may ask Xvc to check out a branch and commit to another branch. These operations are performed at the beginning and at the end of Xvc operations. You can use the --from-ref and --to-branch options to check out a Git reference before an Xvc operation, and commit the results to a certain Git branch. Checkout and commit operations sandwich Xvc operations.
If --from-ref is not given, the initial git checkout is not performed. Xvc operates in the current branch. This is the default behavior.
$ git init --initial-branch=main
$ xvc init
? 0
$ ls
$ xvc --to-branch data-file file track data.txt
Switched to a new branch 'data-file'
$ git branch
* data-file
$ git status -s
$ xvc file list data.txt
FC 19 2023-06-08 11:47:18 c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
If you return to the main branch, you'll see the file is tracked by neither Git nor Xvc.
$ git checkout main
$ xvc file list data.txt
FX 19 2023-06-08 11:47:18 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 0
$ git status -s
?? data.txt
Now, we'll add a step to the default pipeline to get an uppercase version of the data. We want this to work only in the data-file branch.
$ xvc --from-ref data-file pipeline step new --step-name to-uppercase --command 'cat data.txt | tr a-z A-Z > uppercase.txt'
Switched to branch 'data-file'
$ xvc pipeline step dependency --step-name to-uppercase --file data.txt
$ xvc pipeline step output --step-name to-uppercase --output-file uppercase.txt
Note that the xvc pipeline step dependency and xvc pipeline step output commands don't need the --from-ref and --to-branch options, as they already run in the data-file branch.
Now, we want to have this new version of data available only in the uppercase branch.
$ xvc --from-ref data-file --to-branch uppercase pipeline run
Already on 'data-file'
[DONE] to-uppercase (cat data.txt | tr a-z A-Z > uppercase.txt)
Switched to a new branch 'uppercase'
$ git branch
* uppercase
You can use this for experimentation.
Whenever you have a pipeline that you want to run and keep the results in another Git branch, you can use --to-branch for experimentation.
$ xvcpr --from-ref data-file --to-branch another-uppercase
$ git branch
* another-uppercase
The pipeline always runs, because in the data-file branch uppercase.txt is always missing. It's stored only in the resulting branch you give by --to-branch.
Turning off Automated Git Operations
By default, Xvc automates all common Git operations. When you run an Xvc operation that affects the files under the .xvc directory, the changes are committed to the repository automatically.
Git automation runs in Git repositories.
$ git init
Initialized empty Git repository in [CWD]/.git/
$ xvc init
We'll show these examples in the following directory tree.
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20231012
$ tree
└── dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
2 directories, 3 files
When you begin to track a file in the repository, Xvc adds the file to .gitignore in the directory the file is found.
$ xvc file track dir-0001/file-0001.bin
$ zsh -cl 'cat dir-0001/.gitignore'
### Following 1 lines are added by xvc on [..]
Xvc also adds a commit for all the changes caused by the command.
$ git log -n 1
commit [..]
Author: [..]
Date: [..]
Xvc auto-commit after '[..]xvc file track dir-0001/file-0001.bin'
The commit message includes the command you ran, so you can find the exact change in history.
If you don't track a file with Xvc, it is not added to .gitignore and you can see it with git status.
$ git status -s
?? dir-0001/file-0002.bin
?? dir-0001/file-0003.bin
If you want to skip these automated Git operations, you can add the --skip-git flag to commands.
$ xvc --skip-git file track dir-0001/file-0002.bin
$ git status -s
M dir-0001/.gitignore
?? .xvc/ec/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? dir-0001/file-0003.bin
Note that the --skip-git flag doesn't affect the files to be added to .gitignore.
$ zsh -cl 'cat dir-0001/.gitignore'
### Following 1 lines are added by xvc on [..]
### Following 1 lines are added by xvc on [..]
You can use usual Git workflow to add and commit the files.
$ git add .xvc dir-0001/.gitignore
$ git commit -m "Began to track dir-0001/file-0002.bin with Xvc"
[main [..]] Began to track dir-0001/file-0002.bin with Xvc
7 files changed, 8 insertions(+)
create mode 100644 .xvc/ec/[..]
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
If you never want Xvc to handle commits, you can set the git.use_git option in the configuration file to false, or set XVC_git.use_git=false in the environment.
$ XVC_git.use_git=false xvc file track dir-0001/file-0003.bin
$ git status -s
M dir-0001/.gitignore
?? .xvc/ec/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
How to create a data pipeline with Xvc
A data pipeline starts from data and ends with models. In between, there are various data transformations and model training steps. We try to make all pieces reproducible, and Xvc helps with this goal.
In this document, we'll create the following pipeline for a digit recognition system. Our purpose is to show how Xvc helps in versioning data, so this document doesn't try to achieve a high classification performance.
This document can be more verbose than usual, because all commands in this document are run on a clean directory during tests to check outputs. Some of the idiosyncrasies, e.g., running certain commands with zsh -c, are due to this reason.
Although you can do without it, most of the time Xvc runs in a Git repository. This allows you to version control both the data and the code together.
$ git init
Initialized empty Git repository in [CWD]/.git/
$ xvc init
In this HOWTO, we use the Chinese MNIST dataset to create an image classification pipeline. We already downloaded it from Kaggle.
$ ls -l
total 21112
-rw-r--r-- 1 iex staff 10792680 Nov 17 19:46
-rw-r--r-- 1 iex staff 1124 Nov 28 14:27
-rw-r--r-- 1 iex staff 40 Dec 1 11:59 requirements.txt
-rw-r--r-- 1 iex staff 4436 Dec 1 22:52
Let's start by tracking the data file with Xvc.
$ xvc file track --as symlink
The default recheck (checkout) method is copy, which means the file is duplicated in the workspace as a writable file. We don't need to write over this data file; we'll only read from it, so we set the recheck method to symlink.
$ ls -l
total 32
lrwxr-xr-x 1 iex staff 195 Dec 2 12:10 -> [CWD]/.xvc/b3/b24/2c9/422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d/
-rw-r--r-- 1 iex staff 1124 Nov 28 14:27
-rw-r--r-- 1 iex staff 40 Dec 1 11:59 requirements.txt
-rw-r--r-- 1 iex staff 4436 Dec 1 22:52
The long directory name is the BLAKE3 hash of the data file.
As we'll work with the file contents, let's unzip the data file.
$ unzip -q
$ ls -l
total 32
lrwxr-xr-x 1 iex staff 195 Dec 2 12:10 -> [CWD]/.xvc/b3/b24/2c9/422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d/
drwxr-xr-x 4 iex staff 128 Nov 17 19:45 data
-rw-r--r-- 1 iex staff 1124 Nov 28 14:27
-rw-r--r-- 1 iex staff 40 Dec 1 11:59 requirements.txt
-rw-r--r-- 1 iex staff 4436 Dec 1 22:52
Now we have the data directory with the following structure:
$ tree -d data
└── data
2 directories
Let's track the data directory as well with Xvc.
$ xvc file track data --as symlink
The reason we're tracking the data directory separately is that we'll use different subsets as training, validation, and test data.
Let's list the track status of files first.
$ xvc file list data/data/input_9_9_*
SS [..] 3a714d65 data/data/input_9_9_9.jpg
SS [..] 9ffccc4d data/data/input_9_9_8.jpg
SS [..] 5d6312a4 data/data/input_9_9_7.jpg
SS [..] 7a0ddb0e data/data/input_9_9_6.jpg
SS [..] 2047d7f3 data/data/input_9_9_5.jpg
SS [..] 10fcf309 data/data/input_9_9_4.jpg
SS [..] 0bdcd918 data/data/input_9_9_3.jpg
SS [..] aebcbc03 data/data/input_9_9_2.jpg
SS [..] 38abd173 data/data/input_9_9_15.jpg
SS [..] 7c6a9003 data/data/input_9_9_14.jpg
SS [..] a9f04ad9 data/data/input_9_9_13.jpg
SS [..] 2d372f95 data/data/input_9_9_12.jpg
SS [..] 8fe799b4 data/data/input_9_9_11.jpg
SS [..] ee35e5d5 data/data/input_9_9_10.jpg
SS [..] 7576894f data/data/input_9_9_1.jpg
Total #: 15 Workspace Size: 2925 Cached Size: 8710
The xvc file list command shows the tracking status. The first two characters show the tracking status: SS means the file is tracked as a symlink and is available in the workspace as a symlink. The next column shows the file size, then the last modified date, then the BLAKE3 hash of the file, and finally the file name. The empty column contains the actual hash of the file if the file is available in the workspace. Here it's empty because the workspace file is a link to the file in cache.
The summary line shows the total size of the files and the size they occupy in the workspace.
Splitting Train, Validation, and Test Sets
The first step of the pipeline is to create subsets of the data.
The data set contains 15 classes. It has 10 samples for each of these classes from 100 different people. As we'll train a Chinese digit recognizer, we'll split the data by volunteer: 60 volunteers for training, 20 for validation, and 20 for testing. With the glob patterns below, volunteers 1-59 and 100 go to training, 60-79 to validation, and 80-99 to testing. This ensures that the model is never evaluated on the handwriting of a person it was trained on.
$ xvc file copy --name-only data/data/input_?_* data/train/
$ xvc file copy --name-only data/data/input_[12345]?_* data/train/
$ xvc file copy --name-only data/data/input_100_* data/train/
$ xvc file copy --name-only data/data/input_[67]?_* data/validate/
$ xvc file copy --name-only data/data/input_[89]?_* data/test/
$ tree -d data/
├── data
├── test
├── train
└── validate
5 directories
If you look at the contents of these directories, you'll see that they are symbolic links to the same files we started to track.
Let's check the number of images in each set.
$ zsh -c 'ls -1 data/train/*.jpg | wc -l'
$ zsh -c 'ls -1 data/validate/*.jpg | wc -l'
$ zsh -c 'ls -1 data/test/*.jpg | wc -l'
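The split produced by the glob patterns above can be checked without touching the file system. The following is a minimal sketch, assuming the files are named input_<volunteer>_<sample>_<class>.jpg (the naming scheme is inferred from the listing, it is not stated explicitly). It uses Python's fnmatch, whose ? and [..] rules behave like the shell globs for these names.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from the `xvc file copy` commands above.
SPLIT_PATTERNS = {
    "train": ["input_?_*", "input_[12345]?_*", "input_100_*"],
    "validate": ["input_[67]?_*"],
    "test": ["input_[89]?_*"],
}


def split_of(file_name: str) -> str:
    """Return the single split whose glob patterns match the file name."""
    matches = [
        split
        for split, patterns in SPLIT_PATTERNS.items()
        if any(fnmatchcase(file_name, pattern) for pattern in patterns)
    ]
    if len(matches) != 1:
        raise ValueError(f"{file_name} matches {len(matches)} splits")
    return matches[0]


def volunteers_per_split() -> dict:
    """Group volunteer numbers 1..100 by the split their files land in."""
    groups = {split: [] for split in SPLIT_PATTERNS}
    for volunteer in range(1, 101):
        # Assumed naming scheme: input_<volunteer>_<sample>_<class>.jpg
        groups[split_of(f"input_{volunteer}_1_1.jpg")].append(volunteer)
    return groups


if __name__ == "__main__":
    for split, volunteers in volunteers_per_split().items():
        # 10 samples x 15 classes per volunteer, as described in the text.
        print(split, len(volunteers), "volunteers", len(volunteers) * 10 * 15, "images")
```

Every volunteer falls into exactly one split, so the three directories never share a person's handwriting.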
The first step in the pipeline will be rechecking (checking out) these subsets.
$ xvc pipeline step new -s recheck-data --command 'xvc file recheck data/train/ data/validate/ data/test/'
xvc file recheck is used to reinstate files from the Xvc cache.
Let's test the pipeline by first deleting the files we manually created.
$ rm -rf data/train data/validate data/test
We run the steps we created.
$ xvc pipeline run
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
If we check the contents of the directories, we'll see that they are back.
$ zsh -c 'ls -1 data/train/*.jpg | wc -l'
Preprocessing Images into Numpy Arrays
The Python script that trains the model works with NumPy arrays, so we'll convert each of these image directories into two NumPy arrays. One of the arrays will keep $n$ 64x64 images and the other will keep $n$ labels for these images.
$ xvc pipeline step new --step-name create-train-array --command '.venv/bin/python3 --dir data/train/'
$ xvc pipeline step new --step-name create-test-array --command '.venv/bin/python3 --dir data/test/'
$ xvc pipeline step new --step-name create-validate-array --command '.venv/bin/python3 --dir data/validate/'
These commands will run when the image files in those directories change. Xvc can keep track of file groups and invalidate a step when the content of any of these files changes. Moreover, when there are too many files, it's possible to track which individual files have changed. We don't need this feature of tracking individual items in globs, so we'll use a plain glob dependency.
$ xvc pipeline step dependency --step-name create-train-array --glob 'data/train/*.jpg'
$ xvc pipeline step dependency --step-name create-test-array --glob 'data/test/*.jpg'
$ xvc pipeline step dependency --step-name create-validate-array --glob 'data/validate/*.jpg'
Now we have three more steps that depend on changed files. The script depends on OpenCV to read images. Python best practices recommend creating a separate virtual environment for each project. We'll also make sure that the venv is created and the requirements are installed before running the script.
Create a command to initialize the virtual environment. It will run if there is no .venv/bin/activate
$ xvc pipeline step new --step-name init-venv --command 'python3 -m venv .venv'
$ xvc pipeline step dependency --step-name init-venv --generic 'echo "$(hostname)/$(pwd)"'
We used a --generic dependency, which runs a command and checks its output to decide whether the step needs to run again. We only want to run init-venv once per deployment, so checking the output of hostname and pwd is better than checking the existence of a file. File dependencies must be available before running the pipeline so that their metadata can be recorded. There is no such restriction for generic dependencies.
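The idea behind a generic dependency can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Xvc's implementation: the function names are hypothetical, and SHA-256 stands in for the digest because it is available in the standard library.

```python
import hashlib
import subprocess
import sys
from typing import List, Optional


def output_digest(command: List[str]) -> str:
    """Run a command and return a digest of what it printed to stdout."""
    result = subprocess.run(command, capture_output=True, check=True)
    # SHA-256 is a stand-in digest chosen for this sketch.
    return hashlib.sha256(result.stdout).hexdigest()


def needs_rerun(command: List[str], recorded_digest: Optional[str]) -> bool:
    """A step is stale if no digest was recorded or the output has changed."""
    if recorded_digest is None:
        return True
    return output_digest(command) != recorded_digest


if __name__ == "__main__":
    # A portable replacement for `echo "$(hostname)/$(pwd)"`.
    command = [sys.executable, "-c", "import os, socket; print(socket.gethostname() + '/' + os.getcwd())"]
    recorded = output_digest(command)
    print("first run needed:", needs_rerun(command, None))
    print("second run needed:", needs_rerun(command, recorded))
```

As long as the host name and working directory stay the same, the digest stays the same and the step is skipped.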
Then, another step that depends on init-venv and requirements.txt will install the dependencies.
$ xvc pipeline step new --step-name install-requirements --command '.venv/bin/python3 -m pip install -r requirements.txt'
$ xvc pipeline step dependency --step-name install-requirements --step init-venv
$ xvc pipeline step dependency --step-name install-requirements --file requirements.txt
Note that, unlike some other tools, Xvc lets you specify direct dependencies between steps. When a pipeline step must wait for another step to finish successfully, a dependency between the two can be defined.
The above create-*-array steps will depend on install-requirements to ensure that the requirements are installed when the scripts are run.
$ xvc pipeline step dependency --step-name create-train-array --step install-requirements
$ xvc pipeline step dependency --step-name create-validate-array --step install-requirements
$ xvc pipeline step dependency --step-name create-test-array --step install-requirements
Now, as the pipeline grows, it may be nice to see a graph of what we have built so far.
$ xvc pipeline dag --format mermaid
flowchart TD
n2["data/train/*.jpg"] --> n1
n3["install-requirements"] --> n1
n5["data/test/*.jpg"] --> n4
n3["install-requirements"] --> n4
n7["data/validate/*.jpg"] --> n6
n3["install-requirements"] --> n6
n9["echo "$(hostname)/$(pwd)""] --> n8
n8["init-venv"] --> n3
n10["requirements.txt"] --> n3
The xvc pipeline dag command can also produce Graphviz DOT output, which may be more suitable for larger graphs. We'll use DOT to create images in later sections.
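The step dependencies declared so far determine a valid execution order. As a rough sketch (not Xvc's scheduler, which also runs independent steps concurrently), the same ordering constraint can be reproduced with Python's standard graphlib module. The dictionary below only lists the --step dependencies defined above.

```python
from graphlib import TopologicalSorter

# Step -> set of steps it waits for, as declared with `--step` above.
STEP_DEPENDENCIES = {
    "recheck-data": set(),
    "init-venv": set(),
    "install-requirements": {"init-venv"},
    "create-train-array": {"install-requirements"},
    "create-validate-array": {"install-requirements"},
    "create-test-array": {"install-requirements"},
}


def run_order(dependencies: dict) -> list:
    """Return one order in which every step comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, step in enumerate(run_order(STEP_DEPENDENCIES), start=1):
        print(position, step)
```

Several orders satisfy the constraints. The only guarantee is that init-venv precedes install-requirements, which in turn precedes the three create-*-array steps.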
Let's run the pipeline at this point to test.
$ xvc -vv pipeline run
[INFO] Found explicit dependency: XvcStep { name: "create-validate-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-train-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-test-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "install-requirements" } -> Step(StepDep { name: "init-venv" })
[INFO][pipeline/src/pipeline/] Pipeline Graph:
digraph {
0 [ label = "(30009, 11376621678660215310)" ]
1 [ label = "(30012, 12907533602545881359)" ]
2 [ label = "(30010, 8484021102039729264)" ]
3 [ label = "(30011, 9338166212381570306)" ]
4 [ label = "(30016, 17450406389616117859)" ]
5 [ label = "(30018, 2681008057348839262)" ]
1 -> 5 [ label = "Step" ]
2 -> 5 [ label = "Step" ]
3 -> 5 [ label = "Step" ]
5 -> 4 [ label = "Step" ]
[INFO] No dependency steps for step recheck-data
[INFO] Waiting for dependency steps for step create-validate-array
[INFO] No dependency steps for step init-venv
[INFO] [recheck-data] Dependencies has changed
[INFO] Waiting for dependency steps for step install-requirements
[INFO] Waiting for dependency steps for step create-test-array
[INFO] Waiting for dependency steps for step create-train-array
[INFO] [init-venv] Dependencies has changed
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
[DONE] init-venv (python3 -m venv .venv)
[INFO] Dependency steps completed successfully for step install-requirements
[INFO] [install-requirements] Dependencies has changed
[OUT] [install-requirements] Collecting opencv-python (from -r requirements.txt (line 1))
Using cached opencv_python- (19 kB)
Collecting torch (from -r requirements.txt (line 2))
Using cached torch-2.1.1-cp311-none-macosx_11_0_arm64.whl.metadata (25 kB)
Collecting pyyaml (from -r requirements.txt (line 3))
Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting scikit-learn (from -r requirements.txt (line 4))
Using cached scikit_learn-1.3.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting numpy>=1.21.2 (from opencv-python->-r requirements.txt (line 1))
Using cached numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (115 kB)
Collecting filelock (from torch->-r requirements.txt (line 2))
Using cached filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting typing-extensions (from torch->-r requirements.txt (line 2))
Using cached typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from torch->-r requirements.txt (line 2))
Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting networkx (from torch->-r requirements.txt (line 2))
Using cached networkx-3.2.1-py3-none-any.whl.metadata (5.2 kB)
Collecting jinja2 (from torch->-r requirements.txt (line 2))
Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting fsspec (from torch->-r requirements.txt (line 2))
Using cached fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting scipy>=1.5.0 (from scikit-learn->-r requirements.txt (line 4))
Using cached scipy-1.11.4-cp311-cp311-macosx_12_0_arm64.whl.metadata (165 kB)
Collecting joblib>=1.1.1 (from scikit-learn->-r requirements.txt (line 4))
Using cached joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn->-r requirements.txt (line 4))
Using cached threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch->-r requirements.txt (line 2))
Using cached MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl.metadata (3.0 kB)
Collecting mpmath>=0.19 (from sympy->torch->-r requirements.txt (line 2))
Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Using cached opencv_python- (33.1 MB)
Using cached torch-2.1.1-cp311-none-macosx_11_0_arm64.whl (59.6 MB)
Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl (167 kB)
Using cached scikit_learn-1.3.2-cp311-cp311-macosx_12_0_arm64.whl (9.4 MB)
Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Using cached numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
Using cached scipy-1.11.4-cp311-cp311-macosx_12_0_arm64.whl (29.7 MB)
Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Using cached filelock-3.13.1-py3-none-any.whl (11 kB)
Using cached fsspec-2023.10.0-py3-none-any.whl (166 kB)
Using cached networkx-3.2.1-py3-none-any.whl (1.6 MB)
Using cached typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Using cached MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl (17 kB)
Installing collected packages: mpmath, typing-extensions, threadpoolctl, sympy, pyyaml, numpy, networkx, MarkupSafe, joblib, fsspec, filelock, scipy, opencv-python, jinja2, torch, scikit-learn
Successfully installed MarkupSafe-2.1.3 filelock-3.13.1 fsspec-2023.10.0 jinja2-3.1.2 joblib-1.3.2 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.2 opencv-python- pyyaml-6.0.1 scikit-learn-1.3.2 scipy-1.11.4 sympy-1.12 threadpoolctl-3.2.0 torch-2.1.1 typing-extensions-4.8.0
[DONE] install-requirements (.venv/bin/python3 -m pip install -r requirements.txt)
[INFO] Dependency steps completed successfully for step create-validate-array
[INFO] Dependency steps completed successfully for step create-train-array
[INFO] Dependency steps completed successfully for step create-test-array
[INFO] [create-validate-array] Dependencies has changed
[INFO] [create-train-array] Dependencies has changed
[INFO] [create-test-array] Dependencies has changed
[DONE] create-validate-array (.venv/bin/python3 --dir data/validate/)
[DONE] create-test-array (.venv/bin/python3 --dir data/test/)
[DONE] create-train-array (.venv/bin/python3 --dir data/train/)
Now, when we take a look at the data directories, we find the images.npy and classes.npy files.
$ zsh -cl 'ls -l data/train/*.npy'
-rw-r--r-- 1 iex staff 72128 Dec 2 12:11 data/train/classes.npy
-rw-r--r-- 1 iex staff 110592128 Dec 2 12:11 data/train/images.npy
$ zsh -cl 'ls -l data/test/*.npy'
-rw-r--r-- 1 iex staff 24128 Dec 2 12:11 data/test/classes.npy
-rw-r--r-- 1 iex staff 36864128 Dec 2 12:11 data/test/images.npy
$ zsh -cl 'ls -l data/validate/*.npy'
-rw-r--r-- 1 iex staff 24128 Dec 2 12:11 data/validate/classes.npy
-rw-r--r-- 1 iex staff 36864128 Dec 2 12:11 data/validate/images.npy
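The file sizes above are consistent with the split sizes. The arithmetic below is a sketch under three assumptions that the listing does not state: each .npy file starts with a 128-byte header, images are stored as 64x64 pixels with 3 one-byte channels, and each label takes 8 bytes.

```python
NPY_HEADER_BYTES = 128      # assumed header size of the .npy files
IMAGE_BYTES = 64 * 64 * 3   # assumed: 64x64 pixels, 3 channels, 1 byte each
LABEL_BYTES = 8             # assumed: one 64-bit integer per label


def item_count(file_size: int, item_bytes: int) -> int:
    """Number of fixed-size items stored after the assumed header."""
    payload = file_size - NPY_HEADER_BYTES
    if payload < 0 or payload % item_bytes != 0:
        raise ValueError("file size does not fit the assumed layout")
    return payload // item_bytes


if __name__ == "__main__":
    print("train images:", item_count(110592128, IMAGE_BYTES))
    print("train labels:", item_count(72128, LABEL_BYTES))
    print("test images:", item_count(36864128, IMAGE_BYTES))
    print("test labels:", item_count(24128, LABEL_BYTES))
```

Under these assumptions the training arrays hold 9000 items (60 volunteers x 10 samples x 15 classes) and the test and validation arrays hold 3000 each (20 volunteers).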
Train a model
Now that we have built the NumPy arrays, we can train a model. We'll use a simple convolutional neural network as a showcase. This is by no means a state-of-the-art solution, so the results will be less than perfect.
The script receives the training, validation and testing directories, loads the data from the NumPy arrays we just produced, loads hyperparameters from a file called params.yaml, trains the model, tests it, and writes the results and the model to files. It's a fairly involved script produced with the assistance of GPT-4.
We first define the step to run the command:
$ xvc pipeline step new --step-name train-model --command '.venv/bin/python3 --train_dir data/train/ --val_dir data/validate --test_dir data/test'
The step will depend on the array generation steps through the files they produce. In order to define a dependency between the train-model and create-train-array steps, we must tell Xvc that create-train-array outputs a file called images.npy. We can do this with the --output-file option of step output.
$ xvc pipeline step output --step-name create-train-array --output-file data/train/images.npy
$ xvc pipeline step output --step-name create-train-array --output-file data/train/classes.npy
$ xvc pipeline step dependency --step-name train-model --file data/train/images.npy
$ xvc pipeline step dependency --step-name train-model --file data/train/classes.npy
Note that this operation is different from creating a direct dependency between steps. There may be multiple steps creating the same outputs and there may be multiple steps depending on the same files. Choosing between direct (--step) and indirect (--file) dependencies is a matter of taste and use case.
We'll create these dependencies for other files as well.
$ xvc pipeline step output --step-name create-test-array --output-file data/test/images.npy
$ xvc pipeline step output --step-name create-test-array --output-file data/test/classes.npy
$ xvc pipeline step dependency --step-name train-model --file data/test/images.npy
$ xvc pipeline step dependency --step-name train-model --file data/test/classes.npy
$ xvc pipeline step output --step-name create-validate-array --output-file data/validate/images.npy
$ xvc pipeline step output --step-name create-validate-array --output-file data/validate/classes.npy
$ xvc pipeline step dependency --step-name train-model --file data/validate/images.npy
$ xvc pipeline step dependency --step-name train-model --file data/validate/classes.npy
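An indirect dependency arises when a file that one step lists as a dependency is declared as an output of another step. The sketch below illustrates that matching with plain Python sets. It is an illustration of the idea, not Xvc's code, and the function name is hypothetical.

```python
# Outputs declared with `xvc pipeline step output --output-file` above.
STEP_OUTPUTS = {
    "create-train-array": {"data/train/images.npy", "data/train/classes.npy"},
    "create-test-array": {"data/test/images.npy", "data/test/classes.npy"},
    "create-validate-array": {"data/validate/images.npy", "data/validate/classes.npy"},
}

# File dependencies declared with `xvc pipeline step dependency --file` above.
STEP_FILE_DEPENDENCIES = {
    "train-model": {
        "data/train/images.npy",
        "data/train/classes.npy",
        "data/test/images.npy",
        "data/test/classes.npy",
        "data/validate/images.npy",
        "data/validate/classes.npy",
    },
}


def implicit_dependencies(file_dependencies: dict, outputs: dict) -> set:
    """Return (dependent step, producing step, path) for every shared file."""
    edges = set()
    for step, needed in file_dependencies.items():
        for producer, produced in outputs.items():
            if producer == step:
                continue
            for path in needed & produced:
                edges.add((step, producer, path))
    return edges


if __name__ == "__main__":
    for step, producer, path in sorted(implicit_dependencies(STEP_FILE_DEPENDENCIES, STEP_OUTPUTS)):
        print(f"{step} -> {producer} (via {path})")
```

With the declarations above, train-model ends up waiting for all three create-*-array steps, two files each.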
Before running the pipeline, let's see the pipeline DAG once more. This time in DOT format.
$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="recheck-data";];n1[shape=box;label="create-train-array";];n2[shape=folder;label="data/train/*.jpg";];n2->n1;n3[shape=box;label="install-requirements";];n3->n1;n4[shape=note;color=black;label="data/train/images.npy";];n1->n4;n5[shape=note;color=black;label="data/train/classes.npy";];n1->n5;n6[shape=box;label="create-test-array";];n7[shape=folder;label="data/test/*.jpg";];n7->n6;n3[shape=box;label="install-requirements";];n3->n6;n8[shape=note;color=black;label="data/test/images.npy";];n6->n8;n9[shape=note;color=black;label="data/test/classes.npy";];n6->n9;n10[shape=box;label="create-validate-array";];n11[shape=folder;label="data/validate/*.jpg";];n11->n10;n3[shape=box;label="install-requirements";];n3->n10;n12[shape=note;color=black;label="data/validate/images.npy";];n10->n12;n13[shape=note;color=black;label="data/validate/classes.npy";];n10->n13;n14[shape=box;label="init-venv";];n15[shape=trapezium;label="echo /"$(hostname)/$(pwd)/"";];n15->n14;n3[shape=box;label="install-requirements";];n14[shape=box;label="init-venv";];n14->n3;n16[shape=note;label="requirements.txt";];n16->n3;n17[shape=box;label="train-model";];n4[shape=note;label="data/train/images.npy";];n4->n17;n5[shape=note;label="data/train/classes.npy";];n5->n17;n8[shape=note;label="data/test/images.npy";];n8->n17;n9[shape=note;label="data/test/classes.npy";];n9->n17;n12[shape=note;label="data/validate/images.npy";];n12->n17;n13[shape=note;label="data/validate/classes.npy";];n13->n17;}
It's not the most readable graph description, but you can feed the output to the dot command to create an SVG file.
$ zsh -cl 'xvc pipeline dag | dot -Tsvg > pipeline1.svg'
Note that we forgot to create the params.yaml file containing the hyperparameters, so the training script cannot load them yet. When a step in the pipeline doesn't run successfully, its dependent steps won't be run. Let's add a params.yaml file and add it as a dependency to the train step.
$ zsh -cl 'echo "batch_size: 4" > params.yaml'
$ zsh -cl 'echo "epochs: 2" >> params.yaml'
$ xvc pipeline step dependency --step-name train-model --param params.yaml::batch_size
$ xvc pipeline step dependency --step-name train-model --param params.yaml::epochs
With the above commands, the train-model step depends directly on these two values. Even if the file contains other values, changing them won't invalidate the step.
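The effect of a parameter dependency can be illustrated by comparing only the tracked keys of two versions of the file. This is a sketch with a deliberately tiny parser that understands flat "key: value" lines only. A real YAML parser would be needed for anything more, and the function names are hypothetical.

```python
def read_flat_params(text: str) -> dict:
    """Parse flat `key: value` lines; a simplification, not a YAML parser."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()
    return params


def changed_params(old_text: str, new_text: str, tracked_keys: set) -> set:
    """Return the tracked keys whose values differ between two versions."""
    old, new = read_flat_params(old_text), read_flat_params(new_text)
    return {key for key in tracked_keys if old.get(key) != new.get(key)}


if __name__ == "__main__":
    tracked = {"batch_size", "epochs"}
    before = "batch_size: 4\nepochs: 2\n"
    # Adding an untracked key leaves the step valid.
    print(changed_params(before, before + "learning_rate: 0.01\n", tracked))
    # Changing a tracked key invalidates it.
    print(changed_params(before, "batch_size: 4\nepochs: 5\n", tracked))
```

Only a change in batch_size or epochs yields a non-empty set, which is the condition for running the step again in this sketch.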
We can also specify the model and the results as output and the graph will show them.
$ xvc pipeline step output --step-name train-model --output-file model.pth
$ xvc pipeline step output --step-name train-model --output-metric results.json
Let's see the pipeline DAG once more:
$ zsh -cl 'xvc pipeline dag | dot -Tsvg > pipeline2.svg'
We're ready to run the pipeline and train the model.
$ xvc -vv pipeline run
[INFO] Found explicit dependency: XvcStep { name: "create-test-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-train-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-validate-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "install-requirements" } -> Step(StepDep { name: "init-venv" })
[INFO][pipeline/src/pipeline/] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-test-array" } (via XvcPath("data/test/images.npy"))
[INFO][pipeline/src/pipeline/] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-test-array" } (via XvcPath("data/test/classes.npy"))
[INFO][pipeline/src/pipeline/] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-train-array" } (via XvcPath("data/train/images.npy"))
[INFO][pipeline/src/pipeline/] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-train-array" } (via XvcPath("data/train/classes.npy"))
[INFO][pipeline/src/pipeline/] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-validate-array" } (via XvcPath("data/validate/images.npy"))
[INFO][pipeline/src/pipeline/] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-validate-array" } (via XvcPath("data/validate/classes.npy"))
[INFO][pipeline/src/pipeline/] Pipeline Graph:
digraph {
0 [ label = "(30024, 14850552671149047786)" ]
1 [ label = "(30009, 11376621678660215310)" ]
2 [ label = "(30011, 9338166212381570306)" ]
3 [ label = "(30010, 8484021102039729264)" ]
4 [ label = "(30012, 12907533602545881359)" ]
5 [ label = "(30016, 17450406389616117859)" ]
6 [ label = "(30018, 2681008057348839262)" ]
2 -> 6 [ label = "Step" ]
3 -> 6 [ label = "Step" ]
4 -> 6 [ label = "Step" ]
6 -> 5 [ label = "Step" ]
0 -> 2 [ label = "File" ]
0 -> 3 [ label = "File" ]
0 -> 4 [ label = "File" ]
[INFO] No dependency steps for step init-venv
[INFO] Waiting for dependency steps for step create-validate-array
[INFO] Waiting for dependency steps for step train-model
[INFO] No dependency steps for step recheck-data
[INFO] [recheck-data] Dependencies has changed
[INFO] Waiting for dependency steps for step install-requirements
[INFO] Waiting for dependency steps for step create-train-array
[INFO] Waiting for dependency steps for step create-test-array
[INFO] [init-venv] No changed dependencies. Skipping thorough comparison.
[INFO] [init-venv] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step install-requirements
[INFO] [install-requirements] No changed dependencies. Skipping thorough comparison.
[INFO] [install-requirements] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step create-train-array
[INFO] Dependency steps completed successfully for step create-test-array
[INFO] Dependency steps completed successfully for step create-validate-array
[INFO] [create-test-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-test-array] No missing Outputs and no changed dependencies
[INFO] [create-validate-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-validate-array] No missing Outputs and no changed dependencies
[INFO] [create-train-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-train-array] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step train-model
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
[INFO] [train-model] Dependencies has changed
[OUT] [train-model] [1, 2000] loss: 0.921
Accuracy of the network on the validation images: 72 %
[2, 2000] loss: 0.426
Accuracy of the network on the validation images: 83 %
Confusion Matrix:
[[174 0 0 1 2 0 1 2 0 2 0 14 0 1 3]
[ 1 132 60 0 0 0 1 0 0 0 0 5 1 0 0]
[ 3 1 157 34 0 0 3 0 0 0 1 1 0 0 0]
[ 2 0 34 160 0 2 2 0 0 0 0 0 0 0 0]
[ 1 0 0 0 186 0 0 1 0 2 0 9 0 0 1]
[ 3 0 11 12 0 145 1 0 0 9 1 12 3 2 1]
[ 3 1 1 0 1 0 133 8 16 9 6 10 2 10 0]
[ 0 0 0 0 3 1 5 145 3 8 25 2 1 1 6]
[ 0 0 0 0 0 0 1 1 181 4 1 1 0 4 7]
[ 2 0 0 0 2 1 0 3 7 142 4 3 0 7 29]
[ 0 0 0 0 1 0 1 0 0 1 193 2 2 0 0]
[ 4 0 0 0 21 4 0 5 1 1 4 152 1 4 3]
[ 0 1 1 1 0 1 3 1 0 0 55 4 132 0 1]
[ 5 0 0 0 2 0 0 2 0 0 1 36 0 153 1]
[ 0 0 0 0 8 0 0 1 2 5 0 0 0 7 177]]
[DONE] train-model (.venv/bin/python3 --train_dir data/train/ --val_dir data/validate --test_dir data/test)
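The confusion matrix printed above can be reduced to a single accuracy figure: the diagonal holds the correctly classified samples. The sketch below copies the matrix from the log. It assumes the matrix was computed on the 3000 images of the test set, which the log does not state explicitly.

```python
# Confusion matrix copied from the train-model output above.
CONFUSION_MATRIX = [
    [174, 0, 0, 1, 2, 0, 1, 2, 0, 2, 0, 14, 0, 1, 3],
    [1, 132, 60, 0, 0, 0, 1, 0, 0, 0, 0, 5, 1, 0, 0],
    [3, 1, 157, 34, 0, 0, 3, 0, 0, 0, 1, 1, 0, 0, 0],
    [2, 0, 34, 160, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 186, 0, 0, 1, 0, 2, 0, 9, 0, 0, 1],
    [3, 0, 11, 12, 0, 145, 1, 0, 0, 9, 1, 12, 3, 2, 1],
    [3, 1, 1, 0, 1, 0, 133, 8, 16, 9, 6, 10, 2, 10, 0],
    [0, 0, 0, 0, 3, 1, 5, 145, 3, 8, 25, 2, 1, 1, 6],
    [0, 0, 0, 0, 0, 0, 1, 1, 181, 4, 1, 1, 0, 4, 7],
    [2, 0, 0, 0, 2, 1, 0, 3, 7, 142, 4, 3, 0, 7, 29],
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 193, 2, 2, 0, 0],
    [4, 0, 0, 0, 21, 4, 0, 5, 1, 1, 4, 152, 1, 4, 3],
    [0, 1, 1, 1, 0, 1, 3, 1, 0, 0, 55, 4, 132, 0, 1],
    [5, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 36, 0, 153, 1],
    [0, 0, 0, 0, 8, 0, 0, 1, 2, 5, 0, 0, 0, 7, 177],
]


def accuracy(matrix: list) -> tuple:
    """Return (correct, total, correct / total) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


if __name__ == "__main__":
    correct, total, ratio = accuracy(CONFUSION_MATRIX)
    print(f"{correct} of {total} samples classified correctly ({ratio:.1%})")
```

The largest off-diagonal entries (60, 55, 36 and 34) show which pairs of classes the model confuses most often.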
We now have a model and a result file. Let's track the model with Xvc as well.
$ xvc file track model.pth results.json
Sharing Data and Models
Sharing a machine learning project with Xvc means sharing the Git repository and the data and model files that are tracked by Xvc in this repository. For the former, we can use any kind of Git remote, e.g. GitHub. Xvc doesn't require any special setup (like Git-LFS) to share binary files.
In order to share the binary files, we need to specify an Xvc storage. This can be a local folder, an SSH host with rsync, an AWS S3 bucket or any of the other supported storage backends. (See the xvc storage new documentation for the full list.)
In this example, we'll create a new S3 bucket and share all files there.
$ xvc storage new s3 --name my-s3 --bucket-name xvc-test --region eu-central-1 --storage-prefix how-to-create-a-pipeline
$ xvc file send --remote my-s3
These two commands define a new remote storage and send all files to it. When you want to share the pipeline and all the code and data it runs with, others can clone the repository and run the following commands to get the files. Don't forget to push the most recent version of your repository.
$ git push
# On another machine
$ git clone
$ xvc file bring
Note that on the second machine there is no need to configure the remote storage again, but the user must have AWS credentials in their environment. You can also automate this on GitHub and train your pipelines on CI.
In this how-to we created an end-to-end machine learning pipeline.
Command Reference
$ xvc --help
Xvc CLI to manage data and ML pipelines
Usage: xvc [OPTIONS] <COMMAND>
file File and directory management commands
init Initialize an Xvc project
pipeline Pipeline management commands
storage Storage (cloud) management commands
root Find the root directory of a project
check-ignore Check whether files are ignored with `.xvcignore`
aliases Print command aliases to be sourced in shell files
help Print this message or the help of the given subcommand(s)
-v, --verbose... Output verbosity. Use multiple times to increase the output detail
--quiet Suppress all output
--debug Turn on all logging to $TMPDIR/xvc.log
-C <WORKDIR> Set working directory for the command. It doesn't create a new shell, or change the directory [default: .]
-c, --config <CONFIG> Configuration options set from the command line in the form section.key=value You can use multiple times
--no-system-config Ignore system configuration file
--no-user-config Ignore user configuration file
--no-project-config Ignore project configuration file (.xvc/config)
--no-local-config Ignore local (gitignored) configuration file (.xvc/config.local)
--no-env-config Ignore configuration options obtained from environment variables
--skip-git Don't run automated Git operations for this command. If you want to run git commands yourself all the time, you can set `git.auto_commit` and `git.auto_stage` options in the configuration to False
--from-ref <FROM_REF> Checkout the given Git reference (branch, tag, commit etc.) before performing the Xvc operation. This runs `git checkout <given-value>` before running the command
--to-branch <TO_BRANCH> If given, create (or checkout) the given branch before committing results of the operation. This runs `git checkout --branch <given-value>` before committing the changes
-h, --help Print help
-V, --version Print version
file: File and directory management commands
init: Initialize an Xvc project
pipeline: Pipeline management commands
storage: Storage (cloud) management commands
root: Find the root directory of a project
check-ignore: Check whether files are ignored with .xvcignore
aliases: Print command aliases to be sourced in shell files
xvc init
$ xvc init --help
Initialize an Xvc project
Usage: xvc init [OPTIONS]
--path <PATH> Path to the directory to be intialized. (default: current directory)
--no-git Don't require Git
--force Create the repository even if already initialized. Overwrites the current .xvc directory Resets all data and guid, etc
-h, --help Print help
-V, --version Print version
To initialize a blank Xvc repository, initialize Git first and then run xvc init.
$ cd my-project-1
$ git init
$ xvc init
? 0
The command doesn't print anything upon success.
If you want to initialize
File Management
$ xvc file --help
File and directory management commands
Usage: xvc file [OPTIONS] <COMMAND>
track Add file and directories to Xvc [aliases: t]
hash Get digest hash of files with the supported algorithms [aliases: h]
recheck Get files from cache by copy or *link [aliases: checkout, r]
carry-in Carry in changed files to cache [aliases: commit, c]
copy Copy from source to another location in the workspace [aliases: C]
move Move files to another location in the workspace [aliases: M]
list List tracked and untracked elements in the workspace [aliases: l]
send Send files to external storages [aliases: s, upload, push]
bring Bring files from external storages [aliases: b, download, pull]
remove Remove files from Xvc cache and storages [aliases: R]
untrack Untrack (delete) files from Xvc and storages [aliases: U]
share Share a file from (S3 compatible) storage for a limited time [aliases: S]
help Print this message or the help of the given subcommand(s)
-v, --verbose... Verbosity level. Use multiple times to increase command output detail
--quiet Suppress error messages
-C <WORKDIR> Set the working directory to run the command as if it's in that directory [default: .]
-c, --config <CONFIG> Configuration options set from the command line in the form section.key=value
--no-system-config Ignore system config file
--no-user-config Ignore user config file
--no-project-config Ignore project config (.xvc/config)
--no-local-config Ignore local config (.xvc/config.local)
--no-env-config Ignore configuration options from the environment
-h, --help Print help
-V, --version Print version
track: Track (add) files with Xvc
recheck: Copy/link files in the cache to the workspace (checkout)
carry-in: Carry-in (commit) changed files to cache
copy: Copy files to another location in the workspace
move: Move files to another location in the workspace
list: List tracked files
send: Send (push) files to storage
bring: Bring (pull) files from storage
hash: Calculate hashes with supported algorithms similar to sha256sum, blake2sum, etc.
remove: Remove files from Xvc cache or storages
untrack: Untrack (delete) files from Xvc
xvc file track
xvc file track is used to register any kind of file with Xvc for tracking versions.
$ xvc file track --help
Add file and directories to Xvc
Usage: xvc file track [OPTIONS] [TARGETS]...
Files/directories to track
--recheck-method <RECHECK_METHOD>
How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
Do not copy/link added files to the file cache
--text-or-binary <TEXT_OR_BINARY>
Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)
Include git tracked files as well. (Default: false)
Xvc doesn't track files that are already tracked by git by default. You can set files.track.include-git to true in the configuration file to change this behavior.
Add targets even if they are already tracked
Don't use parallelism
-h, --help
Print help (see a summary with '-h')
File tracking works only in Xvc repositories.
$ git init
$ xvc init
Let's create a directory tree for these examples.
$ xvc-test-helper create-directory-tree --directories 4 --files 3 --seed 20231021
$ tree
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
├── dir-0002
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
├── dir-0003
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
└── dir-0004
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
5 directories, 12 files
By default, the command works similarly to git add followed by git commit.
You can track individual files.
$ xvc file track dir-0001/file-0001.bin
You can track directories with the same command.
$ xvc file track dir-0002/
You can specify more than one target in a single command.
$ xvc file track dir-0001/file-0002.bin dir-0001/file-0003.bin
Files tracked by Git
By default, Xvc doesn't track files that are already tracked by Git. You need to specify options to track them.
Xvc detects files tracked by Git with the output of git ls-files. In its default configuration, Git encodes non-ASCII characters in UTF-8 file names as octal escapes. As Xvc uses UTF-8 internally to keep track of paths, it cannot identify that such files are tracked by Git.
Please set git config core.quotepath off in your Xvc repository to let Git list files in UTF-8.
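To see why the quoted names don't match, here is a sketch that decodes a path in Git's quoted form back to UTF-8. It is an illustration only: the function name is hypothetical, it handles octal escapes and a few common backslash escapes, and it is not part of Xvc.

```python
import re

# A backslash followed by three octal digits, or by any single character.
_ESCAPE = re.compile(rb"\\([0-7]{3}|.)")
_SIMPLE_ESCAPES = {b"n": b"\n", b"t": b"\t", b"r": b"\r", b'"': b'"', b"\\": b"\\"}


def unquote_git_path(listed: str) -> str:
    """Decode a path as printed by Git with core.quotepath enabled."""
    if not (len(listed) >= 2 and listed.startswith('"') and listed.endswith('"')):
        return listed  # Git only quotes paths that need escaping

    def replace(match: "re.Match") -> bytes:
        token = match.group(1)
        if len(token) == 3:
            return bytes([int(token, 8)])  # octal escape -> one raw byte
        return _SIMPLE_ESCAPES.get(token, token)

    raw = _ESCAPE.sub(replace, listed[1:-1].encode("utf-8"))
    return raw.decode("utf-8")


if __name__ == "__main__":
    # "\303\274" is the octal form of the two UTF-8 bytes of U+00FC.
    print(unquote_git_path('"\\303\\274.txt"'))
    print(unquote_git_path("plain.txt"))
```

With core.quotepath set to off, Git prints such paths unescaped, so no decoding step like this is necessary.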
When you track a file, Xvc moves the file to the cache directory under .xvc/ and connects the workspace file with the cached file. This connection is called rechecking and is analogous to Git checkout. For example, the above commands create a directory tree under .xvc as follows:
$ tree .xvc/b3
├── 493
│ └── eeb
│ └── 6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79
│ └── 0.bin
├── ab3
│ └── 619
│ └── 814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0
│ └── 0.bin
└── e51
└── 7d6
└── b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70
└── 0.bin
10 directories, 3 files
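The directory layout in the listing suggests how a cache path is derived from a digest: the first three hex characters, the next three, and the remaining 58 become nested directories, and the content is stored as 0.<extension>. The sketch below reproduces that layout. It is inferred from the listings in this document rather than from a documented API, and the function name is hypothetical.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def cache_path(digest_hex: str, extension: str, cache_root: str = ".xvc/b3") -> PurePosixPath:
    """Build the cache location for a 256-bit digest given as 64 hex characters."""
    if len(digest_hex) != 64 or not set(digest_hex) <= HEX_DIGITS:
        raise ValueError("expected a 64-character lowercase hex digest")
    return PurePosixPath(
        cache_root,
        digest_hex[:3],    # first directory level
        digest_hex[3:6],   # second directory level
        digest_hex[6:],    # remaining 58 characters
        f"0.{extension}",  # file name as it appears in the listing
    )


if __name__ == "__main__":
    digest = "e517d6b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70"
    print(cache_path(digest, "bin"))
```

Because the path depends only on the content digest, two files with identical content map to the same cache entry.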
Xvc offers different recheck (checkout) methods to connect the workspace file to the cache. The default method is copying the file to the workspace. This way a separate copy of the cache file is created in the workspace.
If you want to make this connection with symbolic links, you can specify it with the --recheck-method option.
$ xvc file track --recheck-method symlink dir-0003/file-0001.bin
$ ls -l dir-0003/file-0001.bin
lrwxr-xr-x[..] dir-0003/file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
You can also use hardlink and reflink as the recheck method. Please see the xvc file recheck reference for details.
$ xvc file track --recheck-method hardlink dir-0003/file-0002.bin
$ xvc file track --recheck-method reflink dir-0003/file-0003.bin
$ ls -l dir-0003/
total 16
l[..] file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
-[..] file-0002.bin
-[..] file-0003.bin
Note that, unlike DVC, which specifies the checkout/recheck option repository-wide, Xvc lets you specify it per file. You can recheck data files as symbolic links (which are non-writable) to save space, and recheck model files as copies of the cached original so that you can commit (carry in) them every time they change.
When you track a file in Xvc, it's automatically committed (carried in) to the cache directory. If you want to postpone this operation and don't need a cached copy of a file yet, you can use the --no-commit option. You can later use the xvc file carry-in command to move these files to the cache.
$ xvc file track --no-commit --recheck-method symlink dir-0004/
$ ls -l dir-0004/
total 24
-rw-r--r--[..] file-0001.bin
-rw-r--r--[..] file-0002.bin
-rw-r--r--[..] file-0003.bin
$ xvc file list dir-0004/
FS [..] ab361981 ab361981 dir-0004/file-0003.bin
FS [..] 493eeb65 493eeb65 dir-0004/file-0002.bin
FS [..] e517d6b9 e517d6b9 dir-0004/file-0001.bin
Total #: 3 Workspace Size: 6006 Cached Size: 6006
You can carry in (commit) these files to the cache with the xvc file carry-in command. Note that, as the files are deduplicated and identical content is already in the cache, we need to pass --force to the carry-in command. This behavior may change in the future.
$ xvc file carry-in --force dir-0004/
$ ls -l dir-0004/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/493/eeb/6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/ab3/619/814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0/0.bin
Xvc deduplicates files in the cache. If you track a file that is already in the cache, it won't be moved to the cache again. It will be copied or linked from the same cached copy.
$ tree .xvc/b3
├── 493
│   └── eeb
│       └── 6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79
│           └── 0.bin
├── ab3
│   └── 619
│       └── 814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0
│           └── 0.bin
└── e51
    └── 7d6
        └── b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70
            └── 0.bin
10 directories, 3 files
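The cache still holds three files although more paths are tracked, because identical content shares one entry. A minimal Python sketch of this idea follows. The class `DedupCache` is hypothetical, it keeps everything in memory, and it uses BLAKE2 from the standard library as a stand-in for Xvc's default BLAKE3 digest.

```python
import hashlib


class DedupCache:
    """Toy content-addressed store; an illustration, not Xvc's cache."""

    def __init__(self):
        self.blobs = {}  # digest -> content, stored once per distinct content
        self.paths = {}  # workspace path -> digest

    def track(self, path, content):
        """Record path; return True only if new content was added to the store."""
        digest = hashlib.blake2b(content, digest_size=32).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.paths[path] = digest
        return is_new


cache = DedupCache()
cache.track("dir-0001/file-0001.bin", b"same bytes")
cache.track("dir-0004/file-0001.bin", b"same bytes")
print(len(cache.paths), len(cache.blobs))
```

Two paths are recorded, but only one blob is stored, which mirrors the unchanged tree listing above.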
This command doesn't discriminate between symbolic links and hardlinks. Links are followed and any broken links may cause errors.
Under the hood, Xvc tracks only the files, not directories. Directories are considered as path collections. It doesn't matter if you track a directory or files in it separately.
Technical Details
- Detecting changes in files and directories employs different kinds of associated digests. If a file has a different metadata digest, its content digest is calculated. If the file's content digest has changed, the file is considered changed. A directory that contains a different set of files, or files with changed content, is considered changed.
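The two-stage check for files can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the metadata digest is modeled as a (size, modification time) pair, BLAKE2 stands in for the content digest, and the function names are hypothetical.

```python
import hashlib
import os


def metadata_digest(path):
    """Cheap fingerprint from file metadata (assumed: size and mtime)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def content_digest(path):
    """Expensive fingerprint that reads the whole file."""
    with open(path, "rb") as f:
        return hashlib.blake2b(f.read(), digest_size=32).hexdigest()


def record(path):
    """Store both digests, as if the file had just been tracked."""
    return {"metadata": metadata_digest(path), "content": content_digest(path)}


def is_changed(path, recorded):
    """Read file content only when the metadata differs from the record."""
    if metadata_digest(path) == recorded["metadata"]:
        return False
    return content_digest(path) != recorded["content"]
```

The benefit of this order is that unchanged files are usually dismissed after a single stat call, and a file whose timestamp changed but whose content did not is still reported as unchanged.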
xvc file untrack
$ xvc file untrack --help
Untrack (delete) files from Xvc and storages
Usage: xvc file untrack [OPTIONS] [TARGETS]...
[TARGETS]... Files/directories to untrack
--restore-versions <RESTORE_VERSIONS>
Restore all versions to a directory before deleting the cache files
-h, --help
Print help
This command removes a file from Xvc tracking and optionally deletes it from the local filesystem, cache, and the storages.
It only works if the file is tracked by Xvc.
$ git init
$ xvc init
$ xvc file track 'd*.txt'
$ xvc file list
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
Without any options, it removes the file from Xvc tracking and the cache.
xvc file untrack doesn't modify the .gitignore files to remove the previously tracked files. You must do it manually if you want to track the file with Git.
$ xvc file untrack data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
$ git status
On branch [..]
nothing to commit, working tree clean
If you have rechecked the file as a symlink or reflink, it will be copied to the workspace.
$ xvc file track data.txt --as symlink
$ lsd -l
lrwxr-xr-x [..] data.txt ⇒ [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
$ xvc file untrack data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
$ lsd -l
.rw-rw-rw- [..] data.txt
If there are multiple versions of the file, it removes them all and restores the latest version.
If you want to restore all versions of the file, you can specify a directory to restore them.
$ xvc file track data.txt
$ perl -pi -e 's/a/e/g' data.txt
$ xvc file carry-in data.txt
$ xvc file untrack data.txt --restore-versions data-versions/
[COPY] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt -> [CWD]/data-versions/data-b3-660-2cf-f6a4.txt
[COPY] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt -> [CWD]/data-versions/data-b3-c85-f3e-8108.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
$ lsd -l data-versions/
.r--r--r-- [..] data-b3-660-2cf-f6a4.txt
.r--r--r-- [..] data-b3-c85-f3e-8108.txt
If multiple paths point to the same cache file (with deduplication), the cache file will not be deleted. In this case, untrack reports the other paths pointing to the same cache file. You must untrack all of them to delete the cache file.
$ xvc file track data.txt
$ xvc file copy data.txt data2.txt --as symlink
$ xvc file untrack data.txt
Not deleting b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt (for data.txt) because it's also used by data2.txt
$ tree .xvc/b3/
└── 660
    └── 2cf
        └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
            └── 0.txt
4 directories, 1 file
$ xvc file untrack data2.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
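The behavior with shared cache files can be pictured as a reference check before deletion. The Python sketch below is a hypothetical model: `tracked` maps workspace paths to digests, `cache` is a set of digests, and the messages only imitate the output shown above.

```python
def untrack(path, tracked, cache):
    """Remove path from tracking; drop its cache entry only if unused.

    tracked: dict of workspace path -> digest (hypothetical structure)
    cache:   set of digests present in the cache (hypothetical structure)
    """
    digest = tracked.pop(path)
    # Any other path with the same digest still needs the cached content.
    others = sorted(p for p, d in tracked.items() if d == digest)
    if others:
        return f"Not deleting {digest} (for {path}) because it's also used by {', '.join(others)}"
    cache.discard(digest)
    return f"[DELETE] {digest}"


tracked = {"data.txt": "6602cff6", "data2.txt": "6602cff6"}
cache = {"6602cff6"}
print(untrack("data.txt", tracked, cache))
print(untrack("data2.txt", tracked, cache))
```

The first call keeps the cache entry because data2.txt still refers to it; the second call removes it.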
xvc file list
$ xvc file list --help
List tracked and untracked elements in the workspace
Usage: xvc file list [OPTIONS] [TARGETS]...
Files/directories to list.
If not supplied, lists all files under the current directory.
-f, --format <FORMAT>
A string for each row of the output table
The following are the keys for each row:
- {{acd8}}: actual content digest from the workspace file. First 8 digits.
- {{acd64}}: actual content digest. All 64 digits.
- {{aft}}: actual file type. Whether the entry is a file (F), directory (D),
symlink (S), hardlink (H) or reflink (R).
- {{asz}}: actual size. The size of the workspace file in bytes. It uses MB,
GB and TB to represent sizes larger than 1MB.
- {{ats}}: actual timestamp. The timestamp of the workspace file.
- {{name}}: The name of the file or directory.
- {{cst}}: cache status. One of "=", ">", "<", or "X" to show
whether the file timestamp is the same as the cached timestamp, newer,
older, and not tracked.
- {{rcd8}}: recorded content digest stored in the cache. First 8 digits.
- {{rcd64}}: recorded content digest stored in the cache. All 64 digits.
- {{rrm}}: recorded recheck method. Whether the entry is linked to the workspace
as a copy (C), symlink (S), hardlink (H) or reflink (R).
- {{rsz}}: recorded size. The size of the cached content in bytes. It uses
MB, GB and TB to represent sizes larger than 1MB.
- {{rts}}: recorded timestamp. The timestamp of the cached content.
The default format can be set with file.list.format in the config file.
-s, --sort <SORT>
Sort criteria.
It can be one of none (default), name-asc, name-desc, size-asc, size-desc, ts-asc, ts-desc.
The default option can be set with file.list.sort in the config file.
--no-summary
Don't show total number and size of the listed files.
The default option can be set with file.list.no_summary in the config file.
-d, --show-directories
Don't hide directories
Directories are not listed by default. This flag lists them.
-D, --show-dot-files
Don't hide dot files
If not supplied, hides dot files like .gitignore and .xvcignore
--include-git-files
List files tracked by Git.
By default, Xvc doesn't list files tracked by Git. Supply this option to list them.
-h, --help
Print help (see a summary with '-h')
For these examples, we'll create a directory tree with five directories, each having 5 files.
$ xvc-test-helper create-directory-tree --directories 5 --files 5 --seed 20230213
$ tree
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0002
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0003
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0004
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
└── dir-0005
  ├── file-0001.bin
  ├── file-0002.bin
  ├── file-0003.bin
  ├── file-0004.bin
  └── file-0005.bin
[..] directories, 25 files
The xvc file list command works only in Xvc repositories. As we haven't initialized a repository yet, it reports an error.
$ xvc file list
? 1
[ERROR] File Error: [E2004] Requires xvc repository.
Error: FileError { source: RequiresXvcRepository }
Let's initialize the repository.
$ git init
$ xvc init
Now it lists all files.
$ xvc file list --sort name-asc
FX [..] 1953f05d dir-0001/file-0001.bin
FX [..] 7e807161 dir-0001/file-0002.bin
FX [..] d2432259 dir-0001/file-0003.bin
FX [..] 63535612 dir-0001/file-0004.bin
FX [..] 447933dc dir-0001/file-0005.bin
FX [..] 1953f05d dir-0002/file-0001.bin
FX [..] 7e807161 dir-0002/file-0002.bin
FX [..] d2432259 dir-0002/file-0003.bin
FX [..] 63535612 dir-0002/file-0004.bin
FX [..] 447933dc dir-0002/file-0005.bin
FX [..] 1953f05d dir-0003/file-0001.bin
FX [..] 7e807161 dir-0003/file-0002.bin
FX [..] d2432259 dir-0003/file-0003.bin
FX [..] 63535612 dir-0003/file-0004.bin
FX [..] 447933dc dir-0003/file-0005.bin
FX [..] 1953f05d dir-0004/file-0001.bin
FX [..] 7e807161 dir-0004/file-0002.bin
FX [..] d2432259 dir-0004/file-0003.bin
FX [..] 63535612 dir-0004/file-0004.bin
FX [..] 447933dc dir-0004/file-0005.bin
FX [..] 1953f05d dir-0005/file-0001.bin
FX [..] 7e807161 dir-0005/file-0002.bin
FX [..] d2432259 dir-0005/file-0003.bin
FX [..] 63535612 dir-0005/file-0004.bin
FX [..] 447933dc dir-0005/file-0005.bin
Total #: 25 Workspace Size: 50075 Cached Size: 0
Listing Directories
xvc file list doesn't list directories by default. If you want to list them, you can use --show-directories.
$ xvc file list --show-directories
FX 2005 [..] 447933dc dir-0005/file-0005.bin
FX 2004 [..] 63535612 dir-0005/file-0004.bin
FX 2003 [..] d2432259 dir-0005/file-0003.bin
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
DX 224 [..] dir-0005
FX 2005 [..] 447933dc dir-0004/file-0005.bin
FX 2004 [..] 63535612 dir-0004/file-0004.bin
FX 2003 [..] d2432259 dir-0004/file-0003.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
DX 224 [..] dir-0004
FX 2005 [..] 447933dc dir-0003/file-0005.bin
FX 2004 [..] 63535612 dir-0003/file-0004.bin
FX 2003 [..] d2432259 dir-0003/file-0003.bin
FX 2002 [..] 7e807161 dir-0003/file-0002.bin
FX 2001 [..] 1953f05d dir-0003/file-0001.bin
DX 224 [..] dir-0003
FX 2005 [..] 447933dc dir-0002/file-0005.bin
FX 2004 [..] 63535612 dir-0002/file-0004.bin
FX 2003 [..] d2432259 dir-0002/file-0003.bin
FX 2002 [..] 7e807161 dir-0002/file-0002.bin
FX 2001 [..] 1953f05d dir-0002/file-0001.bin
DX 224 [..] dir-0002
FX 2005 [..] 447933dc dir-0001/file-0005.bin
FX 2004 [..] 63535612 dir-0001/file-0004.bin
FX 2003 [..] d2432259 dir-0001/file-0003.bin
FX 2002 [..] 7e807161 dir-0001/file-0002.bin
FX 2001 [..] 1953f05d dir-0001/file-0001.bin
DX 224 [..] dir-0001
Total #: 30 Workspace Size: 51195 Cached Size: 0
Files tracked by Git
This command doesn't list Git-tracked files by default. If you want to list them, use --include-git-files
$ zsh -c 'echo "#!/bin/bash" >'
$ git add
$ git commit -m "Added a script"
[main [..]] Added a script
1 file changed, 1 insertion(+)
create mode 100644
$ xvc file list ''
Total #: 0 Workspace Size: 0 Cached Size: 0
$ xvc file list --include-git-files ''
FX 12 [..] 6ecb3ffc
Total #: 1 Workspace Size: 12 Cached Size: 0
Xvc detects files tracked by Git with the output of git ls-files. In the default configuration, Git prints non-ASCII bytes in file names as octal escapes. As Xvc uses UTF-8 internally to keep track of paths, it cannot identify files as tracked by Git if their names contain non-ASCII characters.
Please set
git config core.quotepath off
in your Xvc repository to let Git list files in UTF-8.
By default, the command hides dotfiles too. If you also want to show them, you can use the --show-dot-files flag. If you want to show dotfiles that are also tracked by Git, you may use --show-dot-files and --include-git-files together.
$ xvc file list --sort name-asc --show-dot-files --include-git-files
FX 107 [..] ce9fcf30 .gitignore
FX 141 [..] 3054b812 .xvcignore
FX 2001 [..] 1953f05d dir-0001/file-0001.bin
FX 2002 [..] 7e807161 dir-0001/file-0002.bin
FX 2003 [..] d2432259 dir-0001/file-0003.bin
FX 2004 [..] 63535612 dir-0001/file-0004.bin
FX 2005 [..] 447933dc dir-0001/file-0005.bin
FX 2001 [..] 1953f05d dir-0002/file-0001.bin
FX 2002 [..] 7e807161 dir-0002/file-0002.bin
FX 2003 [..] d2432259 dir-0002/file-0003.bin
FX 2004 [..] 63535612 dir-0002/file-0004.bin
FX 2005 [..] 447933dc dir-0002/file-0005.bin
FX 2001 [..] 1953f05d dir-0003/file-0001.bin
FX 2002 [..] 7e807161 dir-0003/file-0002.bin
FX 2003 [..] d2432259 dir-0003/file-0003.bin
FX 2004 [..] 63535612 dir-0003/file-0004.bin
FX 2005 [..] 447933dc dir-0003/file-0005.bin
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2003 [..] d2432259 dir-0004/file-0003.bin
FX 2004 [..] 63535612 dir-0004/file-0004.bin
FX 2005 [..] 447933dc dir-0004/file-0005.bin
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2003 [..] d2432259 dir-0005/file-0003.bin
FX 2004 [..] 63535612 dir-0005/file-0004.bin
FX 2005 [..] 447933dc dir-0005/file-0005.bin
FX 12 [..] 6ecb3ffc
Total #: 28 Workspace Size: 50335 Cached Size: 0
You can also hide the summary below the list to get only the list of files.
$ xvc file list --no-summary --sort name-asc
FX 2001 [..] 1953f05d dir-0001/file-0001.bin
FX 2002 [..] 7e807161 dir-0001/file-0002.bin
FX 2003 [..] d2432259 dir-0001/file-0003.bin
FX 2004 [..] 63535612 dir-0001/file-0004.bin
FX 2005 [..] 447933dc dir-0001/file-0005.bin
FX 2001 [..] 1953f05d dir-0002/file-0001.bin
FX 2002 [..] 7e807161 dir-0002/file-0002.bin
FX 2003 [..] d2432259 dir-0002/file-0003.bin
FX 2004 [..] 63535612 dir-0002/file-0004.bin
FX 2005 [..] 447933dc dir-0002/file-0005.bin
FX 2001 [..] 1953f05d dir-0003/file-0001.bin
FX 2002 [..] 7e807161 dir-0003/file-0002.bin
FX 2003 [..] d2432259 dir-0003/file-0003.bin
FX 2004 [..] 63535612 dir-0003/file-0004.bin
FX 2005 [..] 447933dc dir-0003/file-0005.bin
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2003 [..] d2432259 dir-0004/file-0003.bin
FX 2004 [..] 63535612 dir-0004/file-0004.bin
FX 2005 [..] 447933dc dir-0004/file-0005.bin
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2003 [..] d2432259 dir-0005/file-0003.bin
FX 2004 [..] 63535612 dir-0005/file-0004.bin
FX 2005 [..] 447933dc dir-0005/file-0005.bin
Output Format
With the default output format, the first two letters show the path type and recheck method, respectively.
For example, if you track dir-0001 as copy, the first letter is F for the files and D for the directories. The second letter is C for files, meaning the file is a copy of the cached file, and X for directories, meaning they are not in the cache. Similar to Git, Xvc tracks only files; directories are considered collections of files.
$ xvc file track dir-0001/
$ xvc file list dir-0001/
FC 2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
FC 2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC 2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC 2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC 2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
If you add another set of files as hardlinks to the cached copies, it will print the second letter as H.
$ xvc file track dir-0002/ --recheck-method hardlink
$ xvc file list dir-0002
FH 2005 [..] 447933dc 447933dc dir-0002/file-0005.bin
FH 2004 [..] 63535612 63535612 dir-0002/file-0004.bin
FH 2003 [..] d2432259 d2432259 dir-0002/file-0003.bin
FH 2002 [..] 7e807161 7e807161 dir-0002/file-0002.bin
FH 2001 [..] 1953f05d 1953f05d dir-0002/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
Note that, as hardlinks are files with the same inode in the file system with alternative paths, they are detected as F.
Symbolic links are typically reported as SS in the first letters. It means they are symbolic links on the file system and their recheck method is also symbolic link.
$ xvc file track dir-0003 --recheck-method symlink
$ xvc file list dir-0003
SS [..] 447933dc dir-0003/file-0005.bin
SS [..] 63535612 dir-0003/file-0004.bin
SS [..] d2432259 dir-0003/file-0003.bin
SS [..] 7e807161 dir-0003/file-0002.bin
SS [..] 1953f05d dir-0003/file-0001.bin
Total #: 5 Workspace Size: [..] Cached Size: 10015
Although not all file systems support it, R represents reflinks.
You may use globs to list files.
$ xvc file list 'dir-*/*-0001.bin'
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
SS [..] 1953f05d dir-0003/file-0001.bin
FH 2[..] 1953f05d 1953f05d dir-0002/file-0001.bin
FC 2[..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size: [..] Cached Size: 2001
Note that all these files are identical. They are cached once, and only one of them takes space in the cache.
You can also use multiple targets as globs.
$ xvc file list '*/*-0001.bin' '*/*-0002.bin'
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
SS [..] 7e807161 dir-0003/file-0002.bin
SS [..] 1953f05d dir-0003/file-0001.bin
FH [..] 7e807161 7e807161 dir-0002/file-0002.bin
FH [..] 1953f05d 1953f05d dir-0002/file-0001.bin
FC [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 10 Workspace Size: [..] Cached Size: 4003
You may sort xvc file list output by name, by modification time and by file size. Use the --sort option to specify the sort criteria.
$ xvc file list --sort name-desc dir-0001/
FC 2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
FC 2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC 2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC 2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC 2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
$ xvc file list --sort name-asc dir-0001/
FC 2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
FC 2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC 2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC 2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC 2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
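The sort criteria names follow a field-direction pattern (name-asc, size-desc, ts-asc and so on). As a rough illustration only, the Python sketch below applies such a criterion to a list of rows. The row dictionaries and the function `sort_rows` are hypothetical and say nothing about how Xvc stores its records.

```python
def sort_rows(rows, criteria="none"):
    """Sort rows by a criterion such as 'name-asc' or 'size-desc'."""
    if criteria == "none":
        return list(rows)
    field, _, direction = criteria.partition("-")
    # Map the criterion prefix to a (hypothetical) row field.
    key = {"name": "name", "size": "size", "ts": "timestamp"}[field]
    return sorted(rows, key=lambda row: row[key], reverse=(direction == "desc"))


rows = [
    {"name": "dir-0001/file-0002.bin", "size": 2002, "timestamp": 20},
    {"name": "dir-0001/file-0001.bin", "size": 2001, "timestamp": 10},
]
for row in sort_rows(rows, "name-asc"):
    print(row["size"], row["name"])
```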
Column Format
You can specify the columns that the command prints.
For example, if you only want to see the file names, use {{name}} as the format string.
The following command sorts all files by their sizes in the workspace, and prints their size and name.
$ xvc file list --format '{{asz}} {{name}}' --sort size-desc dir-0001/
2005 dir-0001/file-0005.bin
2004 dir-0001/file-0004.bin
2003 dir-0001/file-0003.bin
2002 dir-0001/file-0002.bin
2001 dir-0001/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: [..]
If you want to compare the recorded (cached) hashes and actual hashes in the workspace, you can use the {{acd8}} {{rcd8}} {{name}} format string.
$ xvc file list --format '{{acd8}} {{rcd8}} {{name}}' --sort ts-asc dir-0001
1953f05d 1953f05d dir-0001/file-0001.bin
7e807161 7e807161 dir-0001/file-0002.bin
d2432259 d2432259 dir-0001/file-0003.bin
63535612 63535612 dir-0001/file-0004.bin
447933dc 447933dc dir-0001/file-0005.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
If {{acd8}} or {{acd64}} is not present in the format string, Xvc doesn't calculate these hashes. If you have a large number of files and the default format (which includes actual content hashes) runs slowly, you can customize it to exclude these columns.
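A format string of this kind can be thought of as a template with {{key}} placeholders. The Python sketch below expands such a template and also shows how a tool could inspect the template first, so that expensive columns are computed only when requested. The names `render_row` and `needs_content_digest` are hypothetical and the sketch is not Xvc's implementation.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def needs_content_digest(fmt):
    """True if the format asks for the actual content digest columns."""
    keys = set(PLACEHOLDER.findall(fmt))
    return bool(keys & {"acd8", "acd64"})


def render_row(fmt, values):
    """Replace every {{key}} in fmt with the matching value."""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), fmt)


row = {"asz": 2005, "name": "dir-0001/file-0005.bin"}
print(render_row("{{asz}} {{name}}", row))
print(needs_content_digest("{{asz}} {{name}}"))
```

Checking the template before walking the files is what allows the hashing step to be skipped entirely for formats like `{{asz}} {{name}}`.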
If you want to get a quick glimpse of what needs to be carried in or rechecked, you can use the cache status column, {{cst}}.
$ xvc-test-helper generate-random-file --size 100 dir-0001/a-new-file.bin
$ xvc file list --format '{{cst}} {{name}}' dir-0001/
= dir-0001/file-0005.bin
= dir-0001/file-0004.bin
= dir-0001/file-0003.bin
= dir-0001/file-0002.bin
= dir-0001/file-0001.bin
X dir-0001/a-new-file.bin
Total #: 6 Workspace Size: 10115 Cached Size: 0
The cache status column shows = for unchanged files in the cache, X for untracked files, > for files that have a newer version in the cache, and < for files that have a newer version in the workspace. The comparison is done between the recorded timestamp and the actual timestamp with an accuracy of 1 second.
Ignored Files
Ignored files and directories in .xvcignore are not listed in the results.
$ zsh -c "echo 'dir-0005' > .xvcignore"
$ xvc file list --format='{{name}}' --no-summary
xvc file hash
$ xvc file hash --help
Get digest hash of files with the supported algorithms
Usage: xvc file hash [OPTIONS] [TARGETS]...
Files to process
-a, --algorithm <ALGORITHM>
Algorithm to calculate the hash. One of blake3, blake2, sha2, sha3. All algorithm variants produce 32-bytes digest
--text-or-binary <TEXT_OR_BINARY>
For "text" remove line endings before calculating the digest. Keep line endings if "binary". "auto" (default) detects the type by checking 0s in the first 8Kbytes, similar to Git
[default: auto]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
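The auto mode described in the help text can be sketched in Python as follows. The assumptions are labeled in the comments: a file counts as binary if a zero byte appears in its first 8192 bytes, "removing line endings" is modeled as dropping all CR and LF bytes, and BLAKE2 with a 32-byte digest is used from the standard library. The function names are hypothetical.

```python
import hashlib


def is_binary(data: bytes) -> bool:
    """Git-like heuristic: a NUL byte in the first 8 KB marks binary content."""
    return b"\0" in data[:8192]


def file_digest(data: bytes, text_or_binary: str = "auto") -> str:
    """Digest data as text or binary; 'auto' decides with is_binary()."""
    if text_or_binary == "auto":
        text_or_binary = "binary" if is_binary(data) else "text"
    if text_or_binary == "text":
        # Assumption: line endings are removed by dropping CR and LF bytes.
        data = data.replace(b"\r", b"").replace(b"\n", b"")
    return hashlib.blake2b(data, digest_size=32).hexdigest()


unix = b"first line\nsecond line\n"
windows = b"first line\r\nsecond line\r\n"
print(file_digest(unix) == file_digest(windows))
print(file_digest(unix, "binary") == file_digest(windows, "binary"))
```

Treating text this way makes the digest independent of the platform's line ending convention, while binary files are hashed byte for byte.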
xvc file recheck
$ xvc file recheck --help
Get files from cache by copy or *link
Usage: xvc file recheck [OPTIONS] [TARGETS]...
Files/directories to recheck
--recheck-method <RECHECK_METHOD>
How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
Note: Reflink support requires "reflink" feature to be enabled and uses copy if the underlying file system doesn't support it.
Don't use parallelism
Force even if target exists
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
This command has an alias xvc file checkout if you feel more at home with Git terminology.
Rechecking is analogous to git checkout. It copies or links a cached file to the workspace.
Let's create an example directory hierarchy as a showcase.
$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 231123
$ tree
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
└── dir-0002
  ├── file-0001.bin
  ├── file-0002.bin
  └── file-0003.bin
3 directories, 6 files
Start by tracking files.
$ git init
$ xvc init
$ xvc file track dir-*
Once you have added the files to the cache, you can delete the workspace copy.
$ rm dir-0001/file-0001.bin
$ lsd -l dir-0001/file-*
drwxr-xr-x [..] dir-0001
drwxr-xr-x [..] dir-0002
Then, recheck the file. By default, it makes a copy of the file.
$ xvc file recheck dir-0001/file-0001.bin
$ lsd -l
.rw-rw-rw- [..] data.txt
You can track and recheck complete directories.
$ xvc file track dir-0002/
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/
$ lsd -l dir-0002/
total 24
-rw-rw-rw-[..] file-0001.bin
-rw-rw-rw-[..] file-0002.bin
-rw-rw-rw-[..] file-0003.bin
You can use glob patterns to recheck files.
$ xvc file track 'dir-*'
You can update the recheck method of a file. Otherwise, the previous method is kept.
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/ --as symlink
$ ls -l dir-0002/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/
$ ls -l dir-0002/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin
Symlinks and hardlinks are read-only. You can recheck as copy to update the file.
$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'
? 1
zsh:1: permission denied: dir-0002/file-0001.bin
$ xvc file recheck dir-0002/file-0001.bin --as copy
$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'
Note that, as files in the cache are kept read-only, hardlinks and symlinks are also read only. Files rechecked as copy are made read-write explicitly.
$ xvc -vv file recheck data.txt --as hardlink
$ ls -l
drwxr-xr-x[..] dir-0001
drwxr-xr-x[..] dir-0002
Reflinks are supported by Xvc, but the underlying file system should also support them. Otherwise, it uses copy.
$ rm -f data.txt
$ xvc file recheck data.txt --as reflink
The above command will create a read-only link on macOS APFS and a copy on ext4 or NTFS file systems.
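The difference between the recheck methods can be illustrated with standard file system calls. The Python sketch below is a simplified model and not Xvc's code: the function `recheck` is hypothetical, reflinks are left out because the standard library has no portable call for them, and only the copy is made writable, matching the note above about read-only cache files.

```python
import os
import shutil
import stat


def recheck(cache_file, workspace_file, method="copy"):
    """Connect a workspace path to a cached file by copy, symlink or hardlink."""
    # Replace whatever is currently at the workspace path, including links.
    if os.path.lexists(workspace_file):
        os.remove(workspace_file)
    if method == "symlink":
        # The workspace entry only points at the cache; no extra space is used.
        os.symlink(os.path.abspath(cache_file), workspace_file)
    elif method == "hardlink":
        # A second name for the same inode; shares the cache file's permissions.
        os.link(cache_file, workspace_file)
    elif method == "copy":
        # An independent file that is explicitly made read-write for its owner.
        shutil.copyfile(cache_file, workspace_file)
        os.chmod(workspace_file, stat.S_IRUSR | stat.S_IWUSR)
    else:
        raise ValueError(f"unsupported method in this sketch: {method}")
```

Only the copy can be edited without affecting the cached content, which is why updating a file requires rechecking it as a copy first.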
xvc file carry-in
Copies the file changes to cache.
$ xvc file carry-in --help
Carry in changed files to cache
Usage: xvc file carry-in [OPTIONS] [TARGETS]...
Files/directories to carry in to the cache
--text-or-binary <TEXT_OR_BINARY>
Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)
Carry in targets even their content digests are not changed.
This removes the file in cache and re-adds it.
Don't use parallelism
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
The carry-in command works in Xvc repositories.
$ git init
$ xvc init
We first track a file.
$ xvc file track data.txt
$ xvc file list data.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
We update the file with a command.
$ perl -i -pe 's/a/ee/g' data.txt
$ cat data.txt
Oh, deetee, my, deetee
$ xvc file list data.txt
FC 23 [..] c85f3e81 e37c686a data.txt
Total #: 1 Workspace Size: 23 Cached Size: 19
Note that the size of the file has increased, as we replaced each a with ee.
$ xvc file carry-in data.txt
$ xvc file list data.txt
FC 23 [..] e37c686a e37c686a data.txt
Total #: 1 Workspace Size: 23 Cached Size: 23
xvc file send
$ xvc file send --help
Send files to external storages
Usage: xvc file send [OPTIONS] --storage <STORAGE> [TARGETS]...
[TARGETS]... Targets to send/push/upload to storage
-s, --storage <STORAGE> Storage name or guid to send the files
--force Force even if the files are already present in the storage
-h, --help Print help
xvc file bring
$ xvc file bring --help
Bring files from external storages
Usage: xvc file bring [OPTIONS] --storage <STORAGE> [TARGETS]...
Targets to bring from the storage
-s, --storage <STORAGE>
Storage name or guid to send the files
Force even if the files are already present in the workspace
Don't recheck (checkout) after bringing the file to cache.
This makes the command similar to `git fetch` in Git. It just updates the cache, and doesn't copy/link the file to workspace.
--recheck-as <RECHECK_AS>
Recheck (checkout) the file in one of the four alternative ways. (See `xvc file recheck`) and [RecheckMethod]
-h, --help
Print help (see a summary with '-h')
xvc file share
$ xvc file share --help
Share a file from (S3 compatible) storage for a limited time
Usage: xvc file share [OPTIONS] --storage <STORAGE> <TARGET>
<TARGET> File to send/push/upload to storage
-s, --storage <STORAGE> Storage name or guid to send the files
-d, --duration <DURATION> Period to send the files to. You can use s, m, h, d, w suffixes [default: 24h]
-h, --help Print help
This command requires an Xvc repository to share files from S3 and compatible storages.
$ git init
Initialized empty Git repository in [CWD]/.git/
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20240228
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
You can share a file tracked by Xvc by first configuring an S3 compatible storage.
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new s3 --name backup --bucket-name xvc-test --region eu-central-1 --storage-prefix xvc-storage
You must first send files to the remote storage.
$ xvc file send --storage backup dir-0001/
Now you can share the files. It will create a URL for you to share that file. (Here we use cut to make the command repeatable)
$ zsh -cl 'xvc file share --storage backup dir-0001/file-0001.bin | cut -c -50'
Note that the default period is 24 hours. You can set another period with --duration.
$ zsh -cl 'xvc file share --duration 1h --storage backup dir-0001/file-0002.bin | cut -c -50'
You can get another URL for a shared file with a different period.
$ zsh -cl 'xvc file share --duration 1m --storage backup dir-0001/file-0002.bin | cut -c -50'
See humantime duration parsing documentation for duration expressions.
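As a rough illustration of the suffixes listed in the help text (s, m, h, d, w), the Python sketch below converts a simple duration expression to seconds. It is a hypothetical helper that accepts only a single number followed by one suffix, and it assumes m means minutes; the actual parser accepts richer expressions.

```python
import re

# Seconds per unit for the suffixes named in the help text.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text: str) -> int:
    """Convert expressions such as '24h' or '1m' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(parse_duration("24h"))
print(parse_duration("1m"))
```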
xvc file move
$ xvc file move --help
Move files to another location in the workspace
Usage: xvc file move [OPTIONS] <SOURCE> <DESTINATION>
Source file, glob or directory within the workspace.
If the source ends with a slash, it's considered a directory and all files in that directory are copied.
If there are multiple source files, the destination must be a directory.
Location we move file(s) to within the workspace.
If this ends with a slash, it's considered a directory and created if it doesn't exist.
If the number of source files is more than one, the destination must be a directory.
--recheck-method <RECHECK_METHOD>
How the destination should be rechecked: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
Do not recheck the destination files. This is useful when you want to copy only records, without updating the workspace
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
This command is used to move a set of files to another location in the workspace.
By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.
xvc file move works only with tracked files.
$ git init
$ xvc init
$ xvc file track data.txt
$ lsd -l
.rw-rw-rw- [..] data.txt
Once you add the file to the cache, you can move the file to another location.
$ xvc file move data.txt data2.txt
$ ls
Xvc can change the destination file's recheck method.
$ xvc file move data2.txt data3.txt --as symlink
$ ls -l
lrwxr-xr-x[..] data3.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
You can move files without them being in the workspace if they are in the cache.
$ rm -f data3.txt
$ xvc file move data3.txt data4.txt
$ ls -l
total 0
lrwxr-xr-x[..] data4.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
You can use glob patterns to move multiple files. In this case, the destination must be a directory.
$ xvc file copy data4.txt data5.txt
$ xvc file move d*.txt another-set/ --as hardlink
$ xvc file list another-set/
FH [..] c85f3e81 c85f3e81 another-set/data5.txt
FH [..] c85f3e81 c85f3e81 another-set/data4.txt
Total #: 2 Workspace Size: 38 Cached Size: 19
You can also skip rechecking.
In this case, Xvc won't create any copies in the workspace, and you don't need them to be available in the cache.
They are still listed by xvc file list.
$ xvc file move another-set/data5.txt data6.txt --no-recheck
$ xvc file list
XH c85f3e81 data6.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data4.txt
Total #: 2 Workspace Size: 19 Cached Size: 19
Later, you can recheck them in the workspace.
$ xvc file recheck data6.txt
$ lsd -l data6.txt
.rw-rw-rw- [..] data6.txt
xvc file copy
$ xvc file copy --help
Copy from source to another location in the workspace
Usage: xvc file copy [OPTIONS] <SOURCE> <DESTINATION>
Source file, glob or directory within the workspace.
If the source ends with a slash, it's considered a directory and all files in that directory are copied.
If the number of source files is more than one, the destination must be a directory.
Location we copy file(s) to within the workspace.
If the target ends with a slash, it's considered a directory and created if it doesn't exist.
If the number of source files is more than one, the destination must be a directory.
--recheck-method <RECHECK_METHOD>
How the targets should be rechecked: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
Force even if target exists
Do not recheck the destination files. This is useful when you want to copy only records, without updating the workspace.
When copying multiple files, the whole path is copied to the destination by default. This option creates the destination with the file name only.
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
This command is used to copy a set of files to another location in the workspace.
By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.
xvc file copy works only with tracked files.
$ git init
$ xvc init
$ xvc file track data.txt
$ lsd -l
.rw-rw-rw- [..] data.txt
Once you add the file to the cache, you can copy the file to another location.
$ xvc file copy data.txt data2.txt
$ ls
Note that multiple copies of the same content don't increase the cache size.
$ xvc file list data.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ xvc file list 'data*'
FC 19 [..] c85f3e81 c85f3e81 data2.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 2 Workspace Size: 38 Cached Size: 19
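The totals above can be reproduced with a small calculation: the workspace size counts every file, while the cached size counts each distinct digest once. This is a minimal sketch of that arithmetic in Python. The TrackedFile type and the function name are hypothetical illustrations, not Xvc data structures.

```python
from typing import NamedTuple


class TrackedFile(NamedTuple):
    path: str
    digest: str  # content digest, identical for identical content
    size: int    # size in bytes


def workspace_and_cache_size(files: list[TrackedFile]) -> tuple[int, int]:
    """Return (workspace size, cached size) for a list of tracked files."""
    workspace = sum(f.size for f in files)
    # Identical content is stored once, so each digest is counted a single time.
    unique = {f.digest: f.size for f in files}
    return workspace, sum(unique.values())


files = [
    TrackedFile("data.txt", "c85f3e81", 19),
    TrackedFile("data2.txt", "c85f3e81", 19),
]
print(workspace_and_cache_size(files))
```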
Xvc can change the destination file's recheck method.
$ xvc file copy data.txt data3.txt --as symlink
$ lsd -l
.rw-rw-rw- [..] data.txt
.rw-rw-rw- [..] data2.txt
lrwxr-xr-x [..] data3.txt ⇒ [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
You can create views of your data by copying it to another location.
$ xvc file copy 'd*' another-set/ --as hardlink
$ xvc file list another-set/
FH 19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data.txt
Total #: 3 Workspace Size: 57 Cached Size: 19
If the source files you specify have changed since they were last carried in, Xvc cancels the copy operation. Either recheck the old versions or carry in the new versions first.
$ perl -i -pe 's/a/ee/g' data.txt
$ xvc file copy data.txt data5.txt
? 1
[ERROR] File Error: Sources have changed, please carry-in or recheck following files before copying: data.txt
Error: FileError { source: SourcesHaveChanged { message: "Sources have changed, please carry-in or recheck following files before copying", files: "data.txt" } }
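The check behind this error can be pictured as comparing each source's current digest with the digest recorded when it was tracked. The sketch below illustrates that idea under stated assumptions: SHA-256 from the Python standard library stands in for the digest algorithm, and changed_sources is a hypothetical helper, not an Xvc function. A source that is missing from the workspace is not reported as changed, which matches the cache-only copy shown later in this section.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    # SHA-256 is a stand-in here; the cache paths in this section use a "b3" prefix.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_sources(recorded: dict[Path, str]) -> list[Path]:
    """Return sources whose current content differs from the recorded digest."""
    return [
        path
        for path, digest in recorded.items()
        if path.exists() and content_digest(path) != digest
    ]


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.txt"
    source.write_text("Oh, data, my data\n")
    recorded = {source: content_digest(source)}
    unchanged = changed_sources(recorded)       # nothing was edited yet
    source.write_text("Oh, deetee, my deetee\n")  # edit the file in place
    changed = [path.name for path in changed_sources(recorded)]

print(unchanged, changed)
```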
You can copy files without them being in the workspace if they are in the cache.
$ rm -f data.txt
$ xvc file copy data.txt data6.txt
$ lsd -l data6.txt
.rw-rw-rw- [..] data6.txt
You can also skip rechecking.
In this case, Xvc won't create any copies in the workspace, and you don't need them to be available in the cache.
They are still listed by xvc file list.
$ xvc file copy data.txt data7.txt --no-recheck
$ ls
$ xvc file list
XC [..] c85f3e81 data7.txt
FC 19 [..] c85f3e81 c85f3e81 data6.txt
SS [..] [..] c85f3e81 data3.txt
FC 19 [..] c85f3e81 c85f3e81 data2.txt
XC [..] c85f3e81 data.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data.txt
Total #: 8 Workspace Size: [..] Cached Size: 19
Later, you can recheck them to the workspace.
$ xvc file recheck data7.txt
$ lsd -l data7.txt
.rw-rw-rw- [..] data7.txt
xvc file remove
$ xvc file remove --help
Remove files from Xvc cache and storages
Usage: xvc file remove [OPTIONS] [TARGETS]...
Files/directories to remove
Remove files from cache
--from-storage <FROM_STORAGE>
Remove files from storage
Remove all versions of the file
--only-version <ONLY_VERSION>
Remove only the specified version of the file
Versions are specified with the content hash 123-456-789abcd. Dashes are optional. Prefix must be unique. If the prefix is not unique, the command will fail.
Remove file versions even if they are also pointed by other targets (via deduplication)
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
This command deletes files from the Xvc cache or storage. It doesn't remove the file from Xvc tracking.
If you want to remove a workspace file or link, you can use the usual rm command. If the file is tracked and carried in to the cache, you can always recheck it.
This command only works if the file is tracked by Xvc.
$ git init
$ xvc init
$ xvc file track 'd*.txt'
$ xvc file list
FC [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ tree .xvc/b3/
└── c85
└── f3e
└── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
└── 0.txt
4 directories, 1 file
If you specify neither --from-cache nor --from-storage, the command fails with an error.
$ xvc file remove data.txt
? failed
error: the following required arguments were not provided:
--from-storage <FROM_STORAGE>
Usage: xvc file remove --from-cache --from-storage <FROM_STORAGE> <TARGETS>...
For more information, try '--help'.
You can remove the file from the cache. The file is still tracked by Xvc and available in the workspace.
$ xvc file remove --from-cache data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
$ ls
$ ls .xvc/
You can carry the missing file from the workspace to the cache. Use --force to overwrite the cache, as carry-in doesn't overwrite the cache by default.
$ xvc file carry-in --force data.txt
$ xvc file list
FC [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ tree .xvc/b3/
└── c85
└── f3e
└── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
└── 0.txt
4 directories, 1 file
You can specify a version of a file to delete from the cache. Versions are specified by content hash, like 123-456-789abcd. Dashes are optional. The prefix must be unique.
$ perl -pi -e 's/a/e/g' data.txt
$ xvc file carry-in data.txt
$ tree .xvc/b3/
├── 660
│ └── 2cf
│ └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
│ └── 0.txt
└── c85
└── f3e
└── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
└── 0.txt
7 directories, 2 files
$ xvc file list
FC [..] 6602cff6 6602cff6 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ xvc file remove --from-cache --only-version c85-f3e data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
$ tree .xvc/b3/
└── 660
└── 2cf
└── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
└── 0.txt
4 directories, 1 file
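The version selection used above (dashes optional, prefix must be unique) can be sketched as a prefix lookup over the known digests. This is a minimal Python illustration of that rule. The function resolve_version is hypothetical and only mirrors the behavior described in the help text, not Xvc's actual implementation.

```python
def resolve_version(prefix: str, known_digests: list[str]) -> str:
    """Return the single digest that starts with the prefix, ignoring dashes."""
    wanted = prefix.replace("-", "").lower()
    matches = [digest for digest in known_digests if digest.startswith(wanted)]
    if len(matches) != 1:
        # An ambiguous or unknown prefix is rejected rather than guessed.
        raise ValueError(
            f"prefix {prefix!r} matches {len(matches)} versions, expected exactly 1"
        )
    return matches[0]


versions = [
    "c85f3e8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496",
    "6602cff6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367",
]
selected = resolve_version("c85-f3e", versions)
print(selected)
```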
You can also remove all versions of a file from the cache.
$ xvc-test-helper generate-random-file --seed 0 data.txt
$ xvc file carry-in data.txt
$ rm data.txt
$ xvc-test-helper generate-random-file --seed 1 data.txt
$ xvc file carry-in data.txt
$ tree .xvc/b3/
├── 017
│ └── ad8
│ └── 6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0
│ └── 0.txt
├── 660
│ └── 2cf
│ └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
│ └── 0.txt
└── fef
└── e16
└── d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152
└── 0.txt
10 directories, 3 files
$ xvc file remove --from-cache --all-versions data.txt
[DELETE] [CWD]/.xvc/b3/017/ad8/6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
[DELETE] [CWD]/.xvc/b3/fef/e16/d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152/0.txt
$ ls .xvc/
You can use this command to remove cached files from (remote) storages as well.
$ xvc-test-helper generate-random-file --seed 2 data.txt
$ xvc file carry-in data.txt
$ xvc storage new local --name local-storage --path '../local-storage'
$ xvc file send data.txt --to local-storage
$ tree ../local-storage/
└── [..]
└── b3
└── 218
└── 2b7
└── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
└── 0.txt
6 directories, 1 file
$ xvc file remove data.txt --from-storage local-storage
$ tree ../local-storage/
└── [..]
└── b3
└── 218
└── 2b7
└── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
6 directories, 0 files
Note that storage delete implementations leave the emptied directories in place. This avoids unnecessary round-trip existence checks.
If multiple paths point to the same cache file (deduplication), the cache file is not deleted.
In this case, remove reports the other paths pointing to the same cache file. You must use --force to delete the cache file.
$ xvc-test-helper generate-random-file --seed 3 data.txt
$ xvc file carry-in data.txt
$ xvc file copy data.txt data2.txt --as symlink
$ xvc file list
SS [..] [..] 4a2e9d7c data2.txt
FC 1024 [..] 4a2e9d7c 4a2e9d7c data.txt
Total #: 2 Workspace Size: [..] Cached Size: 1024
$ xvc file remove --from-cache data.txt
Not deleting b3/4a2/e9d/7c40d2cf892c41351a2465b54b85f62a0052e25a63950c8ab4ac48b2ee/0.txt (for data.txt) because it's also used by data2.txt
$ tree .xvc/b3/
├── 218
│ └── 2b7
│ └── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
│ └── 0.txt
└── 4a2
└── e9d
└── 7c40d2cf892c41351a2465b54b85f62a0052e25a63950c8ab4ac48b2ee
└── 0.txt
7 directories, 2 files
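The guard shown above amounts to checking whether any other tracked path shares the digest of the target before deleting the cache file. Below is a minimal Python sketch of that decision. The function plan_cache_removal and its return shape are hypothetical and are meant only to illustrate the deduplication rule.

```python
def plan_cache_removal(
    target: str, tracked: dict[str, str], force: bool = False
) -> tuple[bool, list[str]]:
    """Decide whether the cache file behind target may be deleted.

    tracked maps workspace paths to content digests.
    Returns (delete, other_paths_using_the_same_cache_file).
    """
    digest = tracked[target]
    others = sorted(
        path for path, other in tracked.items() if other == digest and path != target
    )
    # Delete only when nobody else uses the cache file, unless forced.
    return (force or not others), others


tracked = {"data.txt": "4a2e9d7c", "data2.txt": "4a2e9d7c"}
print(plan_cache_removal("data.txt", tracked))
print(plan_cache_removal("data.txt", tracked, force=True))
```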
Data-Model Pipelines
$ xvc pipeline --help
Pipeline management commands
Usage: xvc pipeline [OPTIONS] <COMMAND>
new Create a new pipeline [aliases: n]
update Update the name and other attributes of a pipeline [aliases: u]
delete Delete a pipeline [aliases: D]
run Run a pipeline [aliases: r]
list List all pipelines [aliases: l]
dag Generate a Graphviz or mermaid diagram of the pipeline [aliases: d]
export Export the pipeline to a YAML or JSON file to edit [aliases: e]
import Import the pipeline from a file [aliases: i]
step Step creation, dependency, output commands [aliases: s]
help Print this message or the help of the given subcommand(s)
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-h, --help Print help
xvc pipeline new
$ xvc pipeline new --help
Create a new pipeline
Usage: xvc pipeline new [OPTIONS]
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-w, --workdir <WORKDIR> Default working directory
-h, --help Print help
This command works only in Xvc repositories.
$ git init
$ xvc init
You can create a new pipeline with a name.
$ xvc pipeline new --pipeline-name my-pipeline
By default it will run the commands in the repository root.
$ xvc pipeline list
| Name | Run Dir |
| default | |
| my-pipeline | |
If you want to define a pipeline specific to a directory, you can set the working directory.
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230215
$ xvc pipeline new --pipeline-name another-pipeline --workdir dir-0001
The pipeline will run the commands in the specified directory.
$ xvc pipeline list
| Name | Run Dir |
| default | |
| my-pipeline | |
| another-pipeline | dir-0001 |
xvc pipeline list
$ xvc pipeline list --help
List all pipelines
Usage: xvc pipeline list [OPTIONS]
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-h, --help Print help
Please see xvc pipeline new for examples.
xvc pipeline step
$ xvc pipeline step --help
Step creation, dependency, output commands
Usage: xvc pipeline step [OPTIONS] <COMMAND>
list List steps in a pipeline [aliases: l]
new Add a new step [aliases: n]
remove Remove a step from a pipeline [aliases: R]
update Update a step's command or when options [aliases: U]
dependency Add a dependency to a step [aliases: d]
output Add an output to a step [aliases: o]
show Print step configuration [aliases: s]
help Print this message or the help of the given subcommand(s)
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-h, --help Print help
xvc pipeline step new
Create a new step in the pipeline.
$ xvc pipeline step new --help
Add a new step
Usage: xvc pipeline step new [OPTIONS] --step-name <STEP_NAME> --command <COMMAND>
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-s, --step-name <STEP_NAME> Name of the new step
-c, --command <COMMAND> Step command to run
--when <WHEN> When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
-h, --help Print help
This command works only in Xvc repositories.
$ git init
$ xvc init
You can create a new step with a name and a command.
$ xvc pipeline step new --step-name hello --command "echo hello"
By default, a step runs only if its dependencies have changed (--when by_dependencies).
If you want to run the command always, regardless of changes in dependencies, you can set --when to always.
$ xvc pipeline step new --step-name world --command "echo world" --when always
If you want a step to never run, you can set --when to never.
$ xvc pipeline step new --step-name never --command "echo never" --when never
You can update when the step will run with xvc pipeline step update.
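The three --when values described above reduce to a small decision rule. The following Python sketch shows that rule. The enum and the should_run function are hypothetical names used for illustration, with the values taken from the help text.

```python
from enum import Enum


class When(Enum):
    BY_DEPENDENCIES = "by_dependencies"
    ALWAYS = "always"
    NEVER = "never"


def should_run(when: When, dependencies_changed: bool) -> bool:
    """Decide whether a step runs, given its --when setting."""
    if when is When.ALWAYS:
        return True   # runs regardless of dependency changes
    if when is When.NEVER:
        return False  # frozen step, never runs
    return dependencies_changed


for when in When:
    print(when.value, should_run(when, dependencies_changed=False))
```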
You can get the list of steps in the pipeline with export or dag.
$ xvc pipeline export
{
  "name": "default",
  "steps": [
    {
      "command": "echo hello",
      "dependencies": [],
      "invalidate": "ByDependencies",
      "name": "hello",
      "outputs": []
    },
    {
      "command": "echo world",
      "dependencies": [],
      "invalidate": "Always",
      "name": "world",
      "outputs": []
    },
    {
      "command": "echo never",
      "dependencies": [],
      "invalidate": "Never",
      "name": "never",
      "outputs": []
    }
  ],
  "version": 1,
  "workdir": ""
}
xvc pipeline step list
List the steps and their commands in a pipeline.
$ xvc pipeline step list --help
List steps in a pipeline
Usage: xvc pipeline step list [OPTIONS]
--names-only Show only the names, otherwise print commands as well
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-h, --help Print help
This command works only in Xvc repositories.
$ git init
$ xvc init
You may want to list the steps of a pipeline and their commands.
$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline step new --step-name world --command "echo world" --when always
$ xvc pipeline step list
hello: echo hello (by_dependencies)
world: echo world (always)
By default, it lists the commands and when they will run (always, never, by_dependencies). If you only need the names of the steps, you can use --names-only.
$ xvc pipeline step list --names-only
xvc pipeline step dependency
Define a dependency for an existing step in the pipeline.
$ xvc pipeline step dependency --help
Add a dependency to a step
Usage: xvc pipeline step dependency [OPTIONS] --step-name <STEP_NAME>
-p, --pipeline-name <PIPELINE_NAME>
Name of the pipeline this command applies to
-s, --step-name <STEP_NAME>
Name of the step to add the dependency to
[aliases: for, to]
-G, --generic <GENERICS>
Add a generic command output as a dependency. Can be used multiple times. Please delimit the command with ' ' to avoid shell expansion
-u, --url <URLS>
Add a URL dependency to the step. Can be used multiple times
-f, --file <FILES>
Add a file dependency to the step. Can be used multiple times
-S, --step <STEPS>
Add a step dependency to a step. Can be used multiple times. Steps are referred with their names
--glob_items <GLOB_ITEMS>
Add a glob items dependency to the step.
You can depend on multiple files and directories with this dependency.
The difference between this and the glob option is that this option keeps track of all matching files, but glob only keeps track of the matched files' digest. When you want to use ${XVC_GLOB_ITEMS}, ${XVC_ADDED_GLOB_ITEMS}, or ${XVC_REMOVED_GLOB_ITEMS} environment variables in the step command, use the glob-items dependency. Otherwise, you can use the glob option to save disk space.
[aliases: glob-items, glob-i]
--glob <GLOBS>
Add a glob dependency to the step. Can be used multiple times.
You can depend on multiple files and directories with this dependency.
The difference between this and the glob-items option is that the glob-items option keeps track of all matching files individually, but this option only keeps track of the matched files' digest. This dependency uses considerably less disk space.
--param <PARAMS>
Add a parameter dependency to the step in the form filename.yaml::model.units
The file can be a JSON, TOML, or YAML file. You can specify hierarchical keys like my.dict.key
--regex_items <REGEX_ITEMS>
Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times.
The difference between this and the regex option is that the regex-items option keeps track of all matching lines, but regex only keeps track of the matched lines' digest. When you want to use ${XVC_REGEX_ITEMS}, ${XVC_ADDED_REGEX_ITEMS}, ${XVC_REMOVED_REGEX_ITEMS} environment variables in the step command, use the regex-items option. Otherwise, you can use the regex option to save disk space.
--regex <REGEXES>
Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times.
The difference between this and the regex-items option is that the regex-items option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest.
--line_items <LINE_ITEMS>
Add a line dependency in the form filename.txt::123-234
The difference between this and the lines option is that the line-items option keeps track of all matching lines that can be used in the step command, while the lines option only keeps track of the matched lines' digest. When you want to use ${XVC_ALL_LINE_ITEMS}, ${XVC_ADDED_LINE_ITEMS}, ${XVC_CHANGED_LINE_ITEMS} environment variables in the step command, use the line-items option. Otherwise, you can use the lines option to save disk space.
--lines <LINES>
Add a line digest dependency in the form filename.txt::123-234
The difference between this and the line-items dependency is that the line-items option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest. If you don't need individual lines to be kept, use this option to save space.
--sqlite-query <SQLITE_FILE> <SQLITE_QUERY>
Add a sqlite query dependency to the step with the file and the query. Can be used once.
The step is invalidated when the query is run and the result is different from previous runs, e.g. when an aggregate changes or a new row is added to a table.
-h, --help
Print help (see a summary with '-h')
This command works only in Xvc repositories.
$ git init
$ xvc init
Begin by adding a new step.
$ xvc pipeline step new --step-name file-dependency --command "echo data.txt has changed"
Add a file dependency to the step.
$ xvc pipeline step dependency --step-name file-dependency --file data.txt
When you run the pipeline, it will print data.txt has changed if the file data.txt has changed.
$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed
[DONE] [file-dependency] (echo data.txt has changed)
You can add multiple dependencies to a step with multiple invocations.
$ xvc pipeline step dependency --step-name file-dependency --file data2.txt
A step will run if any of its dependencies have changed.
$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed
[DONE] [file-dependency] (echo data.txt has changed)
By default, steps are not run if none of their dependencies have changed.
$ xvc pipeline run
However, if you want to run the step even if none of the dependencies have changed, you can set the --when option to always.
$ xvc pipeline step update --step-name file-dependency --when always
Now the step will run even if none of the dependencies have changed.
$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed
[DONE] [file-dependency] (echo data.txt has changed)
A step can depend on multiple files specified with globs. Unlike the glob-items dependency, this one doesn't track the individual files and doesn't pass the list of files to the command in environment variables.
This command works only in Xvc repositories.
$ git init
$ xvc init
Let's create a set of files:
$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 2023
$ tree
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
└── dir-0002
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
3 directories, 6 files
Add a step that reports when the files have changed.
$ xvc pipeline step new --step-name files-changed --command "echo 'Files have changed.'"
$ xvc pipeline step dependency --step-name files-changed --glob 'dir-*/*'
The step is invalidated when a file described by the glob is added, removed or changed.
$ xvc pipeline run
[OUT] [files-changed] Files have changed.
[DONE] [files-changed] (echo 'Files have changed.')
$ xvc pipeline run
When a file is removed from the files described by the glob, the step is invalidated.
$ rm dir-0001/file-0001.bin
$ xvc pipeline run
[OUT] [files-changed] Files have changed.
[DONE] [files-changed] (echo 'Files have changed.')
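A glob dependency of this kind can be pictured as one combined digest over everything the glob matches: if a file is added, removed or modified, the combined digest changes and the step is invalidated. The Python sketch below illustrates the idea under stated assumptions. SHA-256 stands in for the digest algorithm, and glob_digest is a hypothetical helper rather than Xvc's implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def glob_digest(root: Path, pattern: str) -> str:
    """Combine the relative paths and content digests of all matched files."""
    combined = hashlib.sha256()
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            combined.update(path.relative_to(root).as_posix().encode())
            combined.update(hashlib.sha256(path.read_bytes()).digest())
    return combined.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for directory in ("dir-0001", "dir-0002"):
        (root / directory).mkdir()
        for index in (1, 2, 3):
            content = f"{directory}-{index}".encode()
            (root / directory / f"file-000{index}.bin").write_bytes(content)

    first_run = glob_digest(root, "dir-*/*")
    second_run = glob_digest(root, "dir-*/*")      # nothing changed in between
    (root / "dir-0001" / "file-0001.bin").unlink()  # remove one matched file
    after_removal = glob_digest(root, "dir-*/*")

print(first_run == second_run, first_run == after_removal)
```

Only one digest has to be stored per glob in this scheme, which is consistent with the help text's remark that the glob option uses less disk space than glob-items.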
You can specify a regular expression matched against the lines of a file as a dependency. The step is invalidated when the matched results change.
This command works only in Xvc repositories.
$ git init
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Now, let's add a step to the pipeline to count females in the file:
$ xvc pipeline step new --step-name count-females --command "grep -c '\"F\",' people.csv"
This command is run when the regex dependency changes.
$ xvc pipeline step dependency --step-name count-females --regex 'people.csv:/^.*"F",.*$'
When you run the pipeline initially, the step is run.
$ xvc pipeline run
[OUT] [count-females] 7
[DONE] [count-females] (grep -c '"F",' people.csv)
When you run the pipeline again, the step is not run because the regex result didn't change.
$ xvc pipeline run
When you add a new female record to the file, the step is run and the command prints the new count.
$ zsh -c "echo '\"Asude\", \"F\", 12, 55, 110' >> people.csv"
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
"Asude", "F", 12, 55, 110
$ xvc pipeline run
[OUT] [count-females] 8
[DONE] [count-females] (grep -c '"F",' people.csv)
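The behavior above can be sketched as a digest over only the lines that match the regular expression: appending a matching line changes the digest, while appending a non-matching line does not. This is a minimal Python illustration. SHA-256 and the regex_digest helper are assumptions made for the sketch, not Xvc internals.

```python
import hashlib
import re


def regex_digest(text: str, pattern: str) -> str:
    """Digest only the lines of text that match the pattern."""
    regex = re.compile(pattern)
    matched = [line for line in text.splitlines() if regex.search(line)]
    return hashlib.sha256("\n".join(matched).encode()).hexdigest()


PATTERN = r'^.*"F",.*$'
people = '"Name", "Sex", "Age"\n"Alex", "M", 41\n"Elly", "F", 30\n'

base = regex_digest(people, PATTERN)
with_new_male = regex_digest(people + '"Bert", "M", 42\n', PATTERN)
with_new_female = regex_digest(people + '"Asude", "F", 12\n', PATTERN)

print(base == with_new_male, base == with_new_female)
```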
You can make your steps depend on lines of text files. The lines are defined by starting and ending indices.
When the text in those lines changes, the step is invalidated.
This command works only in Xvc repositories.
$ git init
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Let's add a step to show the first 10 lines of the file:
$ xvc pipeline step new --step-name print-top-10 --command "head people.csv"
The command is run only when those lines change.
$ xvc pipeline step dependency --step-name print-top-10 --lines 'people.csv::1-10'
When you run the pipeline initially, the step is run.
$ xvc pipeline run
[OUT] [print-top-10] "Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
[DONE] [print-top-10] (head people.csv)
When you run the pipeline again, the step is not run because the specified lines didn't change.
$ xvc pipeline run
When you change one of those lines in the file, the step is invalidated.
$ perl -i -pe 's/Hank/Ferzan/g' people.csv
Now, when you run the pipeline, it will print the first 10 lines again.
$ xvc pipeline run
[OUT] [print-top-10] "Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Ferzan", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
[DONE] [print-top-10] (head people.csv)
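A lines dependency can be thought of as a digest over a slice of the file: edits inside the range change the digest, edits outside it do not. The Python sketch below shows this under explicit assumptions. The range is treated as 1-based and inclusive, which this section does not state, SHA-256 stands in for the digest algorithm, and lines_digest is a hypothetical helper.

```python
import hashlib


def lines_digest(text: str, first: int, last: int) -> str:
    """Digest lines first..last of text (assumed 1-based and inclusive)."""
    selected = text.splitlines()[first - 1:last]
    return hashlib.sha256("\n".join(selected).encode()).hexdigest()


rows = [f"row {number}" for number in range(1, 20)]

edited_inside = rows.copy()
edited_inside[8] = "row nine"       # line 9, inside the 1-10 range

edited_outside = rows.copy()
edited_outside[14] = "row fifteen"  # line 15, outside the 1-10 range

original = lines_digest("\n".join(rows), 1, 10)
inside = lines_digest("\n".join(edited_inside), 1, 10)
outside = lines_digest("\n".join(edited_outside), 1, 10)

print(original == inside, original == outside)
```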
Glob Items
A step can depend on multiple files specified with globs. When any of the files changes, or a file is added to or removed from the files specified by the glob, the step is invalidated.
Unlike the glob dependency, the glob items dependency keeps track of the individual files that match a glob. If your command runs with the list of files from a glob and you want to track added and removed files, use this dependency. Otherwise, if your command runs for all the files in a glob and you don't need to track which files have changed, use the glob dependency.
This dependency injects ${XVC_ADDED_GLOB_ITEMS} and related environment variables into the command.
This command works only in Xvc repositories.
$ git init
$ xvc init
Let's create a set of files:
$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 2023
$ tree
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
└── dir-0002
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
3 directories, 6 files
Add a step to list the added, removed and changed files.
$ xvc pipeline step new --step-name files-changed --command 'echo "### Added Files:\n${XVC_ADDED_GLOB_ITEMS}\n### Removed Files:\n${XVC_REMOVED_GLOB_ITEMS}\n### Changed Files:\n${XVC_CHANGED_GLOB_ITEMS}"'
$ xvc pipeline step dependency --step-name files-changed --glob-items 'dir-*/*'
The step is invalidated when a file described by the glob is added, removed or changed.
$ xvc pipeline run
[OUT] [files-changed] ### Added Files:
### Removed Files:
### Changed Files:
[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")
$ xvc pipeline run
If you add or remove a file from the files specified by the glob, they are printed.
$ rm dir-0001/file-0001.bin
$ xvc pipeline run
[OUT] [files-changed] ### Added Files:
### Removed Files:
### Changed Files:
[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")
When you change a file, it's printed in both added and removed files:
$ xvc-test-helper generate-filled-file dir-0001/file-0002.bin
$ xvc pipeline run
[OUT] [files-changed] ### Added Files:
### Removed Files:
### Changed Files:
[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")
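Keeping track of individual glob items makes it possible to compare two snapshots of the matched files. The following Python sketch computes added, removed and changed paths from two path-to-digest mappings. It is only an illustration: diff_glob_items is a hypothetical helper, and reporting a modified file solely under "changed" is a simplification of this sketch. The text above describes how Xvc itself reports a modified file.

```python
def diff_glob_items(
    previous: dict[str, str], current: dict[str, str]
) -> dict[str, list[str]]:
    """Compare two {path: digest} snapshots of the files matched by a glob."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        path
        for path in current.keys() & previous.keys()
        if current[path] != previous[path]
    )
    return {"added": added, "removed": removed, "changed": changed}


previous = {
    "dir-0001/file-0001.bin": "aaa",
    "dir-0001/file-0002.bin": "bbb",
}
current = {
    "dir-0001/file-0002.bin": "ccc",  # content modified
    "dir-0002/file-0001.bin": "ddd",  # newly matched file
}
print(diff_glob_items(previous, current))
```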
Regex Items
You can specify a regular expression matched against the lines of a file as a dependency. The step is invalidated when the matched results change.
Unlike regex dependencies, regex item dependencies keep track of the matched items. You can access them with environment variables.
This command works only in Xvc repositories.
$ git init
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Now, let's add steps to the pipeline to list the new males and females in the file:
$ xvc pipeline step new --step-name new-males --command 'echo "New Males:\n ${XVC_ADDED_REGEX_ITEMS}"'
$ xvc pipeline step new --step-name new-females --command 'echo "New Females:\n ${XVC_ADDED_REGEX_ITEMS}"'
$ xvc pipeline step dependency --step-name new-females --step new-males
We also added a step dependency so that the steps always run in the same order.
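A step dependency fixes the run order because a step can only start after the steps it depends on. This ordering can be sketched with a topological sort from the Python standard library. The mapping below is a hand-written illustration of the two steps in this example, not something exported by Xvc.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on.
step_dependencies = {
    "new-males": set(),
    "new-females": {"new-males"},
}

# static_order() yields every step after all of its dependencies.
run_order = list(TopologicalSorter(step_dependencies).static_order())
print(run_order)
```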
These commands are run when the results of the following regexes change.
$ xvc pipeline step dependency --step-name new-males --regex-items 'people.csv:/^.*"M",.*$'
$ xvc pipeline step dependency --step-name new-females --regex-items 'people.csv:/^.*"F",.*$'
When you run the pipeline initially, the steps are run.
$ xvc pipeline run
[OUT] [new-males] New Males:
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Luke", "M", 34, 72, 163
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Quin", "M", 29, 71, 176
[DONE] [new-males] (echo "New Males:/n ${XVC_ADDED_REGEX_ITEMS}")
[OUT] [new-females] New Females:
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Kate", "F", 47, 69, 139
"Myra", "F", 23, 62, 98
"Page", "F", 31, 67, 135
"Ruth", "F", 28, 65, 131
[DONE] [new-females] (echo "New Females:/n ${XVC_ADDED_REGEX_ITEMS}")
When you run the pipeline again, the steps are not run because the regexes didn't change.
$ xvc pipeline run
When you add a new female record to the file, only the new-females step is run.
$ zsh -c "echo '\"Asude\", \"F\", 12, 55, 110' >> people.csv"
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
"Asude", "F", 12, 55, 110
$ xvc pipeline run
[OUT] [new-females] New Females:
"Asude", "F", 12, 55, 110
[DONE] [new-females] (echo "New Females:/n ${XVC_ADDED_REGEX_ITEMS}")
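The behavior above can be pictured as a set difference over the matching lines. The following is a minimal sketch of that idea, not Xvc's implementation: the helper names `matching_lines` and `added_items` are hypothetical, and the sample data is a shortened stand-in for people.csv.

```python
import re


def matching_lines(text: str, pattern: str) -> list[str]:
    """Return the lines of `text` that match `pattern`, in file order."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


def added_items(old: list[str], new: list[str]) -> list[str]:
    """Return the lines present in `new` but missing from `old`."""
    seen = set(old)
    return [line for line in new if line not in seen]


# Shortened stand-in for people.csv before and after appending a record.
before = '"Quin", "M", 29, 71, 176\n"Ruth", "F", 28, 65, 131\n'
after = before + '"Asude", "F", 12, 55, 110\n'

female = r'^.*"F",.*$'
male = r'^.*"M",.*$'

new_females = added_items(matching_lines(before, female), matching_lines(after, female))
new_males = added_items(matching_lines(before, male), matching_lines(after, male))

print(new_females)  # ['"Asude", "F", 12, 55, 110']
print(new_males)    # []
```

An empty difference for the male pattern corresponds to the new-males step being skipped in the last run.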
Line Items
You can make your steps depend on lines of text files. The lines are defined by starting and ending indices.
When the text in those lines changes, the step is invalidated.
Unlike line dependencies, this dependency type keeps track of the lines in the file. You can use the ${XVC_ALL_LINE_ITEMS}, ${XVC_ADDED_LINE_ITEMS}, and ${XVC_REMOVED_LINE_ITEMS} environment variables in the command. Please be aware that for a large set of lines, this dependency can take up considerable space to keep track of all lines. If you don't need to keep track of changed lines, you can use the --lines dependency instead.
This command works only in Xvc repositories.
$ git init
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Let's add a step to show the first 10 lines of the file:
$ xvc pipeline step new --step-name print-top-10 --command 'echo "Added Lines:\n ${XVC_ADDED_LINE_ITEMS}\nRemoved Lines:\n${XVC_REMOVED_LINE_ITEMS}"'
The command is run only when those lines change.
$ xvc pipeline step dependency --step-name print-top-10 --line-items 'people.csv::1-10'
When you run the pipeline initially, the step is run.
$ xvc pipeline run
[OUT] [print-top-10] Added Lines:
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
Removed Lines:
[DONE] [print-top-10] (echo "Added Lines:/n ${XVC_ADDED_LINE_ITEMS}/nRemoved Lines:/n${XVC_REMOVED_LINE_ITEMS}")
When you run the pipeline again, the step is not run because the specified lines didn't change.
$ xvc pipeline run
When you change a line in the file, the step is invalidated.
$ perl -i -pe 's/Hank/Ferzan/g' people.csv
Now, when you run the pipeline, it will print the changed line, with its new and old versions.
$ xvc pipeline run
[OUT] [print-top-10] Added Lines:
"Ferzan", "M", 30, 71, 158
Removed Lines:
"Hank", "M", 30, 71, 158
[DONE] [print-top-10] (echo "Added Lines:/n ${XVC_ADDED_LINE_ITEMS}/nRemoved Lines:/n${XVC_REMOVED_LINE_ITEMS}")
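The added and removed lines shown above can be reproduced with a small sketch. It assumes the range 1-10 behaves like a 0-based, end-exclusive slice, which is only an inference from the sample output (the header line is skipped and nine records are printed). The helpers `line_items` and `diff_lines` are hypothetical names, not part of Xvc.

```python
def line_items(text: str, begin: int, end: int) -> list[str]:
    """Select lines by index.

    ASSUMPTION: the range is 0-based and end-exclusive, inferred from the
    sample output above rather than from Xvc's documentation.
    """
    return text.splitlines()[begin:end]


def diff_lines(old: list[str], new: list[str]) -> tuple[list[str], list[str]]:
    """Return (added, removed) lines between two selections."""
    added = [line for line in new if line not in old]
    removed = [line for line in old if line not in new]
    return added, removed


# Shortened stand-in for people.csv.
old_text = "\n".join([
    '"Name", "Sex", "Age"',
    '"Gwen", "F", 26',
    '"Hank", "M", 30',
    '"Ivan", "M", 53',
])
new_text = old_text.replace("Hank", "Ferzan")

added, removed = diff_lines(line_items(old_text, 1, 3), line_items(new_text, 1, 3))
print(added)    # ['"Ferzan", "M", 30']
print(removed)  # ['"Hank", "M", 30']
```

A changed line appears once in each list, which matches the "Added Lines" and "Removed Lines" output of the print-top-10 step.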
SQLite Query
You can create a step dependency with an SQLite query. When the query results change, the step is invalidated.
SQLite dependencies don't track the results of the query. They only check whether the query results have changed.
This command works only in Xvc repositories.
$ git init
$ xvc init
Suppose we have an SQLite database people.db with the following schema and data:
CREATE TABLE People (
Name TEXT,
Sex TEXT,
Age INTEGER,
Height_in INTEGER,
Weight_lbs INTEGER
);
INSERT INTO People (Name, Sex, Age, Height_in, Weight_lbs) VALUES
('Alex', 'M', 41, 74, 170),
('Bert', 'M', 42, 68, 166),
('Carl', 'M', 32, 70, 155),
('Dave', 'M', 39, 72, 167),
('Elly', 'F', 30, 66, 124),
('Fran', 'F', 33, 66, 115),
('Gwen', 'F', 26, 64, 121),
('Hank', 'M', 30, 71, 158),
('Ivan', 'M', 53, 72, 175),
('Jake', 'M', 32, 69, 143),
('Kate', 'F', 47, 69, 139),
('Luke', 'M', 34, 72, 163),
('Myra', 'F', 23, 62, 98),
('Neil', 'M', 36, 75, 160),
('Omar', 'M', 38, 70, 145),
('Page', 'F', 31, 67, 135),
('Quin', 'M', 29, 71, 176),
('Ruth', 'F', 28, 65, 131);
Now, we'll add a step to the pipeline to calculate the average age of these people.
$ xvc pipeline step new --step-name average-age --command "sqlite3 people.db 'SELECT AVG(Age) FROM People;'"
Let's run the step without a dependency first.
$ xvc pipeline run
[OUT] [average-age] 34.6666666666667
[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')
Now, we'll add a dependency to this step so that it only runs when the results of that query change.
$ xvc pipeline step dependency --step-name average-age --sqlite-query people.db 'SELECT count(*) FROM People;'
The dependency query is run every time the pipeline runs, so it should be lightweight to avoid performance issues.
So, when the number of people in the table changes, the step will run. Initially it doesn't keep track of the query results, so it will run again.
$ xvc pipeline run
[OUT] [average-age] 34.6666666666667
[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')
But it won't run the step a second time, as the table didn't change.
$ xvc pipeline run
Let's add another row to the table:
$ sqlite3 people.db "INSERT INTO People (Name, Sex, Age, Height_in, Weight_lbs) VALUES ('Asude', 'F', 10, 74, 170);"
This time, the step will run again as the result of the dependency query (SELECT count(*) FROM People) changed.
$ xvc pipeline run
[OUT] [average-age] 33.3684210526316
[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')
Xvc opens the database in read-only mode to avoid locking.
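The idea of checking whether query results changed, without keeping the results themselves, can be sketched by storing only a digest. This is an illustration under assumptions: SHA-256 and the `query_digest` helper are stand-ins chosen for the example, and an in-memory database with a reduced schema replaces people.db.

```python
import hashlib
import sqlite3


def query_digest(conn: sqlite3.Connection, query: str) -> str:
    """Hash the rows returned by `query`; only the digest needs to be kept."""
    rows = conn.execute(query).fetchall()
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


# In-memory stand-in for people.db with a reduced schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Name TEXT, Age INTEGER)")
conn.executemany("INSERT INTO People VALUES (?, ?)", [("Alex", 41), ("Bert", 42)])

query = "SELECT count(*) FROM People;"
first = query_digest(conn, query)
second = query_digest(conn, query)  # nothing changed in between

conn.execute("INSERT INTO People VALUES ('Asude', 10)")
third = query_digest(conn, query)   # the row count changed

print(first == second)  # True
print(first == third)   # False
```

For a database file on disk, Python's `sqlite3.connect("file:people.db?mode=ro", uri=True)` opens it read-only, which is comparable in spirit to the read-only access described above.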
You may be keeping pipeline-wide parameters in structured text files. You can specify such parameters found in JSON, TOML and YAML files as dependencies.
This command works only in Xvc repositories.
$ git init
$ xvc init
Suppose we have a YAML file, myparams.yaml, in which we specify various parameters for the whole connection.
param: value
port: 5432
timeout: 5000
numeric_param: 13
Now, we create two steps to read different variables from the file, and a dependency between them so that they always run in the same order.
$ xvc pipeline step new --step-name read-database-config --command 'echo "Updated Database Configuration"'
$ xvc pipeline step new --step-name read-hyperparams --command 'echo "Update Hyperparameters"'
$ xvc pipeline step dependency --step-name read-database-config --step read-hyperparams
Let's add dependencies on different pieces of this parameters file to each step:
$ xvc pipeline step dependency --step-name read-database-config --param 'myparams.yaml::database.port' --param 'myparams.yaml::database.server' --param 'myparams.yaml::database.connection'
$ xvc pipeline step dependency --step-name read-hyperparams --param 'myparams.yaml::param' --param 'myparams.yaml::numeric_param'
Run for the first time, as initially all dependencies are invalid:
$ xvc pipeline run
[OUT] [read-hyperparams] Update Hyperparameters
[DONE] [read-hyperparams] (echo "Update Hyperparameters")
[OUT] [read-database-config] Updated Database Configuration
[DONE] [read-database-config] (echo "Updated Database Configuration")
The second time, it won't read the configuration as nothing has changed:
$ xvc pipeline run
When you update a value in this file, it will only invalidate the steps that depend on that value, not other steps that rely on the same file.
Let's update the database port:
$ perl -pi -e 's/5432/9876/g' myparams.yaml
$ xvc pipeline run
[OUT] [read-database-config] Updated Database Configuration
[DONE] [read-database-config] (echo "Updated Database Configuration")
Note that read-hyperparams is not invalidated, though the values are in the same file.
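Key-level invalidation can be sketched by looking up dotted keys in the parsed file and comparing only those values. The nested layout below is an assumption inferred from the key names used in the commands (for example database.port), and `lookup` and `changed_keys` are hypothetical helpers. Plain dictionaries stand in for the parsed YAML to keep the sketch dependency-free.

```python
from functools import reduce


def lookup(params: dict, dotted_key: str):
    """Follow a dotted key such as 'database.port' through nested dicts."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), params)


def changed_keys(old: dict, new: dict, keys: list[str]) -> list[str]:
    """Return the keys whose values differ between two versions of the file."""
    return [key for key in keys if lookup(old, key) != lookup(new, key)]


# ASSUMPTION: nested layout inferred from the dotted keys in the commands.
old = {"param": "value", "database": {"port": 5432}, "numeric_param": 13}
new = {"param": "value", "database": {"port": 9876}, "numeric_param": 13}

print(changed_keys(old, new, ["database.port"]))           # ['database.port']
print(changed_keys(old, new, ["param", "numeric_param"]))  # []
```

The second call returns an empty list, which mirrors read-hyperparams staying valid after the port change.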
This command works only in Xvc repositories.
$ git init
$ xvc init
You can add a step dependency to a step. These steps specify dependency relationships explicitly, without relying on changed files or directories.
$ xvc pipeline step new --step-name world --command "echo world"
$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline step dependency --step-name world --step hello
When run, the dependency will be run first and the step will be run after.
$ xvc pipeline run
[OUT] [hello] hello
[DONE] [hello] (echo hello)
[OUT] [world] world
[DONE] [world] (echo world)
If the dependency is not run, the dependent step won't run either.
$ xvc pipeline step update --step-name hello --when never
$ xvc pipeline run
If you want the dependent step to always run, you can set it to run always explicitly.
$ xvc pipeline step update --step-name world --when always
$ xvc pipeline run
[OUT] [world] world
[DONE] [world] (echo world)
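Running a dependency before the step that needs it is a topological ordering problem. The sketch below uses Python's standard `graphlib` module to show the ordering for the two steps above. It illustrates the concept only and says nothing about how Xvc schedules steps internally.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on.
dependencies = {
    "world": {"hello"},  # world depends on hello
    "hello": set(),      # hello has no dependencies
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['hello', 'world']
```

`static_order()` always yields a step after all of its dependencies, which is the order seen in the first run above.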
This command works only in Xvc repositories.
$ git init
$ xvc init
You can use a web URL as a dependency to a step. When the URL is fetched, a hash of its content is saved, and the step is invalidated when the content at the URL changes.
You can use this with any URL.
$ xvc pipeline step new --step-name xvc-docs-update --command "echo 'Xvc docs updated!'"
$ xvc pipeline step dependency --step-name xvc-docs-update --url
The step is invalidated when the page is updated.
$ xvc pipeline run
[OUT] [xvc-docs-update] Xvc docs updated!
[DONE] [xvc-docs-update] (echo 'Xvc docs updated!')
The step won't run again until a new version of the page is published.
$ xvc pipeline run
Note that Xvc doesn't download the page every time. It checks the Last-Modified and ETag headers and only downloads the page if it has changed.
If there are more complex requirements than just the URL changing, you can use a generic dependency to get the output of a command and use that as a dependency.
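The header check mentioned above can be sketched as a pure decision function over stored and current response headers. The precedence used here (ETag first, then Last-Modified, and download when neither is comparable) is an assumption made for the example, as is the helper name `needs_download`. Xvc's actual logic may differ.

```python
def needs_download(stored: dict, current: dict) -> bool:
    """Decide whether the page body should be fetched again.

    ASSUMPTION: the first validator present on both sides decides, and a
    missing validator means the content must be downloaded and hashed.
    """
    for header in ("ETag", "Last-Modified"):
        old, new = stored.get(header), current.get(header)
        if old is not None and new is not None:
            return old != new
    return True


same = needs_download({"ETag": '"abc"'}, {"ETag": '"abc"'})
changed = needs_download({"ETag": '"abc"'}, {"ETag": '"def"'})
first_fetch = needs_download({}, {"ETag": '"abc"'})

print(same, changed, first_fetch)  # False True True
```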
Generic Command
This command works only in Xvc repositories.
$ git init
$ xvc init
You can use the output of a shell command as a dependency to a step. When the command is run, a hash of its output is saved, and the step is invalidated when the output of the command changes.
You can use this for any command that outputs a string.
$ xvc pipeline step new --step-name morning-message --command "echo 'Good Morning!'"
$ xvc pipeline step dependency --step-name morning-message --generic 'date +%F'
The step is invalidated when the date changes and the step is run again.
$ xvc pipeline run
[OUT] [morning-message] Good Morning!
[DONE] morning-message (echo 'Good Morning!')
The step won't run until tomorrow, when the output of date +%F changes.
$ xvc pipeline run
[OUT] [morning-message] Good Morning!
[DONE] [morning-message] (echo 'Good Morning!')
You can mimic all kinds of pipeline behavior with this generic dependency.
For example, if you want to run a command when directory contents change, you can depend on the output of ls -lR
$ xvc pipeline step new --step-name directory-contents --command "echo 'Files changed'"
$ xvc pipeline step dependency --step-name directory-contents --generic 'ls'
$ xvc pipeline run
[OUT] [directory-contents] Files changed
[DONE] [directory-contents] (echo 'Files changed')
When you add a file to the directory, the step is invalidated and run again:
$ xvc pipeline run
$ xvc-test-helper generate-random-file new-file.txt
$ xvc pipeline run
[OUT] [directory-contents] Files changed
[DONE] [directory-contents] (echo 'Files changed')
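A generic dependency boils down to hashing a command's output and comparing the hash with the previous one. The sketch below shows that comparison. SHA-256 and the `output_digest` helper are assumptions made for illustration, and the Python interpreter printing a fixed date stands in for date +%F so the example is reproducible.

```python
import hashlib
import subprocess
import sys


def output_digest(command: list[str]) -> str:
    """Run `command` and return a digest of its standard output."""
    result = subprocess.run(command, capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


# Hypothetical stand-ins for running `date +%F` on two different days.
today = [sys.executable, "-c", "print('2024-01-01')"]
tomorrow = [sys.executable, "-c", "print('2024-01-02')"]

unchanged = output_digest(today) == output_digest(today)
changed = output_digest(today) != output_digest(tomorrow)

print(unchanged, changed)  # True True
```

Because only the digest is compared, any command works as long as its output is deterministic while nothing relevant has changed.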
Most shells support editing longer commands with an editor. For bash, you can use Ctrl+X Ctrl+E.
Pipeline commands can get long quickly. You can use xvc aliases for shorter versions. Type source $(xvc aliases) to load the aliases into your shell.
xvc pipeline step output
Define an output (file, metrics or plots) to an already existing step in the pipeline.
$ xvc pipeline step output --help
Add an output to a step
Usage: xvc pipeline step output [OPTIONS] --step-name <STEP_NAME>
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-s, --step-name <STEP_NAME> Name of the step to add the output to
--output-file <FILES> Add a file output to the step. Can be used multiple times
--output-metric <METRICS> Add a metric output to the step. Can be used multiple times
--output-image <IMAGES> Add an image output to the step. Can be used multiple times
-h, --help Print help
xvc pipeline step show
Print the steps of a pipeline.
$ xvc pipeline step show --help
Print step configuration
Usage: xvc pipeline step show [OPTIONS] --step-name <STEP_NAME>
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-s, --step-name <STEP_NAME> Name of the step to show
-h, --help Print help
xvc pipeline step update
Update the running condition or command of a step.
$ xvc pipeline step update --help
Update a step's command or when options
Usage: xvc pipeline step update [OPTIONS] --step-name <STEP_NAME>
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-s, --step-name <STEP_NAME> Name of the step to update. The step should already be defined
-c, --command <COMMAND> Step command to run
--when <WHEN> When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
-h, --help Print help
xvc pipeline step remove
Remove a step and all its dependencies and outputs from the pipeline.
$ xvc pipeline step remove --help
Remove a step from a pipeline
Usage: xvc pipeline step remove [OPTIONS] --step-name <STEP_NAME>
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-s, --step-name <STEP_NAME> Name of the step to remove
-h, --help Print help
This command works only in Xvc repositories.
$ git init
$ xvc init
Let's create a few steps and make them depend on each other.
$ xvc pipeline step new --step-name hello --command 'echo hello >> hello.txt'
$ xvc pipeline step new --step-name world --command 'echo world >> world.txt'
$ xvc pipeline step new --step-name from --command 'echo from >> from.txt'
$ xvc pipeline step new --step-name xvc --command 'echo xvc >> xvc.txt'
Let's specify the outputs as well.
$ xvc pipeline step output --step-name hello --output-file hello.txt
$ xvc pipeline step output --step-name world --output-file world.txt
$ xvc pipeline step output --step-name from --output-file from.txt
$ xvc pipeline step output --step-name xvc --output-file xvc.txt
Now we can add dependencies between them.
$ xvc pipeline step dependency --step-name xvc --step from
$ xvc pipeline step dependency --step-name from --file world.txt
$ xvc pipeline step dependency --step-name world --step hello
Now the pipeline looks like this:
$ xvc pipeline step list
hello: echo hello >> hello.txt (by_dependencies)
world: echo world >> world.txt (by_dependencies)
from: echo from >> from.txt (by_dependencies)
xvc: echo xvc >> xvc.txt (by_dependencies)
$ xvc pipeline dag --format mermaid
flowchart TD
n1["hello.txt"] --> n0
n0["hello"] --> n2
n3["world.txt"] --> n2
n3["world.txt"] --> n4
n5["from.txt"] --> n4
n4["from"] --> n6
n7["xvc.txt"] --> n6
When we remove a step, all its dependencies and outputs are removed as well.
$ xvc -vv pipeline step remove --step-name from
[INFO] Removing dep: file(world.txt)
[INFO] Removing dep step(from) from xvc
[INFO] Removing output: File
[INFO] Removing step: from
$ xvc pipeline step list
hello: echo hello >> hello.txt (by_dependencies)
world: echo world >> world.txt (by_dependencies)
xvc: echo xvc >> xvc.txt (by_dependencies)
$ xvc pipeline dag --format mermaid
flowchart TD
n1["hello.txt"] --> n0
n0["hello"] --> n2
n3["world.txt"] --> n2
n5["xvc.txt"] --> n4
xvc pipeline run
$ xvc pipeline run --help
Run a pipeline
Usage: xvc pipeline run [OPTIONS]
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-h, --help Print help
Pipelines require Xvc to be initialized before running.
$ git init
$ xvc init
Xvc defines a default pipeline and any steps added without specifying the pipeline will be added to it.
$ xvc pipeline list
| Name | Run Dir |
| default | |
Create a new step in this pipeline with the xvc pipeline step new command.
$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline dag --format=mermaid
flowchart TD
You can run the default pipeline without specifying its name.
$ xvc pipeline run
[OUT] [hello] hello
[DONE] [hello] (echo hello)
Note that when a step has no dependencies, it always runs unless it's explicitly set to never run.
$ xvc pipeline step update --step-name hello --when never
$ xvc pipeline run
Run a specific pipeline
You can run a specific pipeline by specifying its name with --pipeline-name.
$ xvc pipeline new --pipeline-name my-pipeline
$ xvc pipeline --pipeline-name my-pipeline step new --step-name my-hello --command "echo 'hello from my-pipeline'"
$ xvc pipeline run --pipeline-name my-pipeline
[OUT] [my-hello] hello from my-pipeline
[DONE] [my-hello] (echo 'hello from my-pipeline')
xvc pipeline delete
$ xvc pipeline delete --help
Delete a pipeline
Usage: xvc pipeline delete [OPTIONS]
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-h, --help Print help
xvc pipeline export
$ xvc pipeline export --help
Export the pipeline to a YAML or JSON file to edit
Usage: xvc pipeline export [OPTIONS]
--file <FILE> File to write the pipeline. Writes to stdout if not set
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
--format <FORMAT> Output format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
-h, --help Print help
You can export the pipeline you created to a JSON or YAML file to edit and restore using xvc pipeline import. This allows you to fix typos and update commands in place, and to see pipeline internals for debugging.
Xvc doesn't guarantee that the format of these files will be compatible across versions. You can use these files to share pipeline definitions but it may not be a good way to store pipeline definitions for longer periods.
This command works only in Xvc repositories.
$ git init
$ xvc init
Let's start by defining a few steps in the pipeline.
$ xvc pipeline step new --step-name step1 --command 'touch abc.txt'
$ xvc pipeline step new --step-name step2 --command 'touch def.txt'
Add a few dependencies and outputs.
$ xvc pipeline step dependency -s step2 --step step1
$ xvc pipeline step dependency -s step2 --glob '*.txt'
$ xvc pipeline step dependency -s step2 --glob-items '*.txt'
$ xvc pipeline step dependency -s step2 --param model.conv_units
$ xvc pipeline step dependency -s step2 --regex requirements.txt:/^tensorflow
$ xvc pipeline step dependency -s step2 --regex-items requirements.txt:/^tensorflow
$ xvc pipeline step dependency -s step2 --line-items params.yaml::1-20
$ xvc pipeline step dependency -s step2 --lines params.yaml::1-20
$ xvc pipeline step dependency -s step2 --url ''
$ xvc pipeline step dependency -s step2 --generic 'ping -c 2'
$ xvc pipeline step output -s step2 --output-metric metrics.json
$ xvc pipeline step output -s step2 --output-file def.txt
$ xvc pipeline step output -s step2 --output-image plots/confusion.png
If you don't specify a filename, the default format is JSON and the output will be sent to stdout.
$ xvc pipeline export
"name": "default",
"steps": [
"command": "touch abc.txt",
"dependencies": [],
"invalidate": "ByDependencies",
"name": "step1",
"outputs": []
"command": "touch def.txt",
"dependencies": [
"Step": {
"name": "step1"
"Generic": {
"generic_command": "ping -c 2",
"output_digest": null
"GlobItems": {
"glob": "*.txt",
"xvc_path_content_digest_map": {},
"xvc_path_metadata_map": {}
"Glob": {
"content_digest": null,
"glob": "*.txt",
"xvc_metadata_digest": null,
"xvc_paths_digest": null
"RegexItems": {
"lines": [],
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
"Regex": {
"lines_digest": null,
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
"Param": {
"format": "YAML",
"key": "model.conv_units",
"path": "params.yaml",
"value": null,
"xvc_metadata": null
"LineItems": {
"begin": 1,
"end": 20,
"lines": [],
"path": "params.yaml",
"xvc_metadata": null
"Lines": {
"begin": 1,
"digest": null,
"end": 20,
"path": "params.yaml",
"xvc_metadata": null
"UrlDigest": {
"etag": null,
"last_modified": null,
"url": "",
"url_content_digest": null
"invalidate": "ByDependencies",
"name": "step2",
"outputs": [
"File": {
"path": "def.txt"
"Metric": {
"format": "JSON",
"path": "metrics.json"
"Image": {
"path": "plots/confusion.png"
"version": 1,
"workdir": ""
If you want to set the format, you can specify the --format option.
$ xvc pipeline export --format yaml
version: 1
name: default
workdir: ''
- name: step1
command: touch abc.txt
invalidate: ByDependencies
dependencies: []
outputs: []
- name: step2
command: touch def.txt
invalidate: ByDependencies
- !Step
name: step1
- !Generic
generic_command: ping -c 2
output_digest: null
- !GlobItems
glob: '*.txt'
xvc_path_metadata_map: {}
xvc_path_content_digest_map: {}
- !Glob
glob: '*.txt'
xvc_paths_digest: null
xvc_metadata_digest: null
content_digest: null
- !RegexItems
path: requirements.txt
regex: ^tensorflow
lines: []
xvc_metadata: null
- !Regex
path: requirements.txt
regex: ^tensorflow
lines_digest: null
xvc_metadata: null
- !Param
format: YAML
path: params.yaml
key: model.conv_units
value: null
xvc_metadata: null
- !LineItems
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
lines: []
- !Lines
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
digest: null
- !UrlDigest
etag: null
last_modified: null
url_content_digest: null
- !File
path: def.txt
- !Metric
path: metrics.json
format: JSON
- !Image
path: plots/confusion.png
When you specify a file name, the output format is inferred from the extension.
$ xvc pipeline export --file pipeline.yaml
$ cat pipeline.yaml
version: 1
name: default
workdir: ''
- name: step1
command: touch abc.txt
invalidate: ByDependencies
dependencies: []
outputs: []
- name: step2
command: touch def.txt
invalidate: ByDependencies
- !Step
name: step1
- !Generic
generic_command: ping -c 2
output_digest: null
- !GlobItems
glob: '*.txt'
xvc_path_metadata_map: {}
xvc_path_content_digest_map: {}
- !Glob
glob: '*.txt'
xvc_paths_digest: null
xvc_metadata_digest: null
content_digest: null
- !RegexItems
path: requirements.txt
regex: ^tensorflow
lines: []
xvc_metadata: null
- !Regex
path: requirements.txt
regex: ^tensorflow
lines_digest: null
xvc_metadata: null
- !Param
format: YAML
path: params.yaml
key: model.conv_units
value: null
xvc_metadata: null
- !LineItems
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
lines: []
- !Lines
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
digest: null
- !UrlDigest
etag: null
last_modified: null
url_content_digest: null
- !File
path: def.txt
- !Metric
path: metrics.json
format: JSON
- !Image
path: plots/confusion.png
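Editing an exported pipeline programmatically could look like the sketch below. It assumes the JSON export is a top-level object with a steps list whose entries carry name and command keys, as the key names in the listings above suggest. The helper `update_command` is hypothetical, and the format is not guaranteed to stay stable across Xvc versions.

```python
import json

# Reduced stand-in for the output of `xvc pipeline export`.
exported = json.dumps({
    "name": "default",
    "steps": [
        {
            "command": "touch abc.txt",
            "dependencies": [],
            "invalidate": "ByDependencies",
            "name": "step1",
            "outputs": [],
        },
    ],
    "version": 1,
    "workdir": "",
})


def update_command(pipeline_json: str, step_name: str, new_command: str) -> str:
    """Return the pipeline JSON with one step's command replaced."""
    pipeline = json.loads(pipeline_json)
    for step in pipeline["steps"]:
        if step["name"] == step_name:
            step["command"] = new_command
    return json.dumps(pipeline, indent=2)


edited = json.loads(update_command(exported, "step1", "touch xyz.txt"))
print(edited["steps"][0]["command"])  # touch xyz.txt
```

The edited text can then be written to a file and passed to xvc pipeline import with the --file option.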
xvc pipeline import
$ xvc pipeline import --help
Import the pipeline from a file
Usage: xvc pipeline import [OPTIONS]
--file <FILE> File to read the pipeline. Use stdin if not specified
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
--format <FORMAT> Input format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
--overwrite Overwrite the pipeline even if the name already exists
-h, --help Print help
This command is used to import pipelines exported with xvc pipeline export. You can edit the exported pipelines and import them with this command.
Xvc doesn't guarantee that the format of these files will be compatible across versions. You can use these files to share pipeline definitions but it may not be a good way to store pipeline definitions for longer periods.
This command works only in Xvc repositories.
$ git init
$ xvc init
The following file is generated with xvc pipeline export:
$ cat pipeline.yaml
version: 1
name: default
workdir: ''
- name: step1
command: touch abc.txt
invalidate: ByDependencies
dependencies: []
outputs: []
- name: step2
command: touch def.txt
invalidate: ByDependencies
- !Step
name: step1
- !Generic
generic_command: ping -c 2
output_digest: null
- !GlobItems
glob: '*.txt'
xvc_path_metadata_map: {}
xvc_path_content_digest_map: {}
- !Glob
glob: '*.txt'
xvc_paths_digest: null
xvc_metadata_digest: null
content_digest: null
- !RegexItems
path: requirements.txt
regex: ^tensorflow
lines: []
xvc_metadata: null
- !Regex
path: requirements.txt
regex: ^tensorflow
lines_digest: null
xvc_metadata: null
- !Param
format: YAML
path: params.yaml
key: model.conv_units
value: null
xvc_metadata: null
- !LineItems
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
lines: []
- !Lines
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
digest: null
- !UrlDigest
etag: null
last_modified: null
url_content_digest: null
- !File
path: def.txt
- !Metric
path: metrics.json
format: JSON
- !Image
path: plots/confusion.png
You can import this file to construct the pipeline at once.
Note that the export command outputs JSON by default.
$ xvc pipeline import --file pipeline.yaml --overwrite
$ xvc pipeline export
"name": "default",
"steps": [
"command": "touch abc.txt",
"dependencies": [],
"invalidate": "ByDependencies",
"name": "step1",
"outputs": []
"command": "touch def.txt",
"dependencies": [
"Step": {
"name": "step1"
"Generic": {
"generic_command": "ping -c 2",
"output_digest": null
"GlobItems": {
"glob": "*.txt",
"xvc_path_content_digest_map": {},
"xvc_path_metadata_map": {}
"Glob": {
"content_digest": null,
"glob": "*.txt",
"xvc_metadata_digest": null,
"xvc_paths_digest": null
"RegexItems": {
"lines": [],
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
"Regex": {
"lines_digest": null,
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
"Param": {
"format": "YAML",
"key": "model.conv_units",
"path": "params.yaml",
"value": null,
"xvc_metadata": null
"LineItems": {
"begin": 1,
"end": 20,
"lines": [],
"path": "params.yaml",
"xvc_metadata": null
"Lines": {
"begin": 1,
"digest": null,
"end": 20,
"path": "params.yaml",
"xvc_metadata": null
"UrlDigest": {
"etag": null,
"last_modified": null,
"url": "",
"url_content_digest": null
"invalidate": "ByDependencies",
"name": "step2",
"outputs": [
"File": {
"path": "def.txt"
"Metric": {
"format": "JSON",
"path": "metrics.json"
"Image": {
"path": "plots/confusion.png"
"version": 1,
"workdir": ""
If you don't supply the --overwrite option, Xvc will report an error and quit.
$ xvc pipeline import --file pipeline.yaml
? 1
[ERROR] Pipeline Error: Pipeline default already found
Error: PipelineError { source: PipelineAlreadyFound { name: "default" } }
You can specify a new name for the pipeline and it will override the name set in the file. This way you can edit and import similar pipelines with minor differences.
$ xvc pipeline import --pipeline-name another-pipeline --file pipeline.yaml
You can also use stdin to import a pipeline but you must specify the input format.
xvc pipeline update
$ xvc pipeline update --help
Update the name and other attributes of a pipeline
Usage: xvc pipeline update [OPTIONS]
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
--rename <RENAME> Rename the pipeline to
--workdir <WORKDIR> Set the working directory
--set-default set this pipeline default
-h, --help Print help
xvc pipeline dag
$ xvc pipeline dag --help
Generate a Graphviz or mermaid diagram of the pipeline
Usage: xvc pipeline dag [OPTIONS]
--file <FILE> Output file. Writes to stdout if not set
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
--format <FORMAT> Format for graph. Either graphviz or mermaid [default: graphviz]
-h, --help Print help
You can visualize the pipeline you defined with the xvc pipeline set of commands by using the xvc pipeline dag command. It generates a Graphviz (dot) or mermaid diagram of the pipeline.
Like all other pipeline commands, this requires an Xvc repository.
$ git init --initial-branch=main
Initialized empty Git repository in [CWD]/.git/
$ xvc init
All steps of the pipeline are shown as nodes in the graph.
We create a dependency between the two steps with the xvc pipeline step dependency command and its --step option to make them run sequentially.
$ xvc pipeline step new --step-name preprocess --command "echo 'preprocess'"
$ xvc pipeline step new --step-name train --command "echo 'train'"
$ xvc pipeline step dependency --step-name train --step preprocess
It's not very readable but you can supply the result directly to dot and get a more useful output.
$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n1;}
The output after dot -Tsvg
When you add a dependency to a step, the graph shows the dependency as a node. For example,
$ xvc pipeline step dependency --step-name preprocess --glob 'data/*'
$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=folder;label="data/*";];n1->n0;n2[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n2;}
You can use the --format=mermaid option to get a mermaid.js diagram.
$ xvc pipeline dag --format=mermaid
flowchart TD
n1["data/*"] --> n0
n0["preprocess"] --> n2
The output can be used in Mermaid Live Editor or any web page that supports the format.
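To see how such a diagram is put together, the sketch below builds mermaid flowchart text from a list of dependency edges. It is a simplified illustration: the `to_mermaid` helper is hypothetical and its node numbering does not reproduce the exact identifiers Xvc emits.

```python
def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render (source, target) edges as a top-down mermaid flowchart."""
    ids: dict[str, str] = {}

    def node_id(label: str) -> str:
        # Assign n0, n1, ... in order of first appearance.
        if label not in ids:
            ids[label] = f"n{len(ids)}"
        return ids[label]

    lines = ["flowchart TD"]
    for source, target in edges:
        s, t = node_id(source), node_id(target)
        lines.append(f'    {s}["{source}"] --> {t}["{target}"]')
    return "\n".join(lines)


diagram = to_mermaid([("data/*", "preprocess"), ("preprocess", "train")])
print(diagram)
```

Each edge points from a dependency to the step that depends on it, the same direction used in the diagrams above.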
Storage management commands (xvc storage)
Xvc allows you to keep tracked content in storages. These can be either on the local file system or in the cloud. The xvc storage set of commands allows you to configure, list and delete these storages.
$ xvc storage --help
Storage (cloud) management commands
Usage: xvc storage <COMMAND>
list List all configured storages [aliases: l]
remove Remove a storage configuration [aliases: R]
new Configure a new storage [aliases: n]
help Print this message or the help of the given subcommand(s)
-h, --help Print help
xvc storage list
List all configured storages with their names and GUIDs.
$ xvc storage list --help
List all configured storages
Usage: xvc storage list
-h, --help Print help
The command works only in Xvc repositories.
$ git init
$ xvc init
Define two local storages:
$ xvc storage new local --name backup-1 --path '../backup-1'
$ xvc storage new local --name backup-2 --path '../backup-2'
You can list the storages and their GUIDs.
$ xvc storage list
Local: backup-1 [..] ../backup-1
Local: backup-2 [..] ../backup-2
This command uses the local configuration and doesn't try to connect to the storages.
Being listed doesn't guarantee that you can pull from or push to a storage.
Xvc never stores credentials for storages.
xvc storage remove
Remove unused or inaccessible storages from the configuration.
$ xvc storage remove --help
Remove a storage configuration.
This doesn't delete any files in the storage.
Usage: xvc storage remove --name <NAME>
-n, --name <NAME>
Name of the storage to be deleted
-h, --help
Print help (see a summary with '-h')
The command works only in Xvc repositories.
$ git init
$ xvc init
Define two local storages:
$ xvc storage new local --name backup-1 --path '../backup-1'
$ xvc storage new local --name backup-2 --path '../backup-2'
You can list the storages and their GUIDs.
$ xvc storage list
Local: backup-1[..]../backup-1
Local: backup-2[..]../backup-2
Now when we remove backup-1 and get the list, only one of them is listed.
$ xvc storage remove --name backup-1
Removed Storage Local: backup-1[..]../backup-1
$ xvc storage list
Local: backup-2[..]../backup-2
This command uses the local configuration and doesn't try to connect to the storages.
Being listed doesn't guarantee that you can pull from or push to a storage.
Xvc never stores credentials for storages.
xvc storage new
$ xvc storage new --help
Configure a new storage
Usage: xvc storage new <COMMAND>
local Add a new local storage
generic Add a new generic storage
rsync Add a new rsync storages
s3 Add a new S3 storage
minio Add a new Minio storage
digital-ocean Add a new Digital Ocean storage
r2 Add a new R2 storage
gcs Add a new Google Cloud Storage storage
wasabi Add a new Wasabi storage
help Print this message or the help of the given subcommand(s)
-h, --help Print help
xvc storage new local
Create a new storage reachable from the local filesystem. It lets you keep tracked file contents in a different directory for backup or sharing purposes.
$ xvc storage new local --help
Add a new local storage
A local storage is a directory accessible from the local file system. Xvc will use common file operations for this directory without accessing the network.
Usage: xvc storage new local --path <PATH> --name <NAME>
--path <PATH>
Directory (outside the repository) to be set as a storage
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
-h, --help
Print help (see a summary with '-h')
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
Now, you can define a local directory as storage and begin to use it.
$ xvc storage new local --name backup --path '../my-local-storage'
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d/0.bin
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb/0.bin
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
--name NAME is not checked to be unique, but you should use unique storage names to refer to them later.
--path PATH should be accessible for writing and shouldn't already exist.
Technical Details
The command creates the PATH and a new file under PATH called .xvc-guid.
The file contains the unique identifier for this storage.
The same identifier is also recorded to the project.
A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to PATH/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}.
{{REPO_ID}} is the unique identifier for the repository created during xvc init.
Hence if you use a common storage for different Xvc projects, their files are kept under different directories.
There is no inter-project deduplication. (yet)
In the future, there may be an option to have a common storage for multiple projects at the same location. Please comment below if this is a common use case.
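The mapping from a cache path to PATH/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}} can be sketched in Python. This is only an illustration of the layout described above, not Xvc's implementation; the function name storage_path_for and the repository ID used in the example are hypothetical.

```python
from pathlib import PurePosixPath


def storage_path_for(cache_path: str, storage_root: str, repo_id: str) -> PurePosixPath:
    """Map a path under .xvc/ to its location in a local storage.

    Follows the layout PATH/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}.
    """
    parts = PurePosixPath(cache_path).parts
    if ".xvc" not in parts:
        raise ValueError(f"not a cache path: {cache_path}")
    # Keep everything after the .xvc/ component: {{HASH_PREFIX}}/{{CACHE_PATH}}
    relative = PurePosixPath(*parts[parts.index(".xvc") + 1:])
    return PurePosixPath(storage_root) / repo_id / relative


# "hypothetical-repo-id" stands in for the identifier created by xvc init.
example = storage_path_for(
    ".xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin",
    "../my-local-storage",
    "hypothetical-repo-id",
)
print(example)
```

Because the repository ID is part of the path, two projects that share the same storage directory never write to the same location.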
xvc storage new generic
Create a new storage that uses shell commands to send and retrieve cache files. It lets you keep tracked files in any kind of service that can be driven from the command line.
$ xvc storage new generic --help
Add a new generic storage.
⚠️ Please note that this is an advanced method to configure storages. You may damage your repository and local and storage files with incorrect configurations.
Please see for examples and make necessary backups.
Usage: xvc storage new generic [OPTIONS] --name <NAME> --init <INIT_COMMAND> --list <LIST_COMMAND> --download <DOWNLOAD_COMMAND> --upload <UPLOAD_COMMAND> --delete <DELETE_COMMAND>
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
-i, --init <INIT_COMMAND>
Command to initialize the storage. This command is run once after defining the storage.
You can use {URL} and {STORAGE_DIR} as shortcuts.
-l, --list <LIST_COMMAND>
Command to list the files in storage
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-d, --download <DOWNLOAD_COMMAND>
Command to download a file from storage.
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-u, --upload <UPLOAD_COMMAND>
Command to upload a file to storage.
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-D, --delete <DELETE_COMMAND>
The delete command to remove a file from storage You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options
-M, --processes <MAX_PROCESSES>
Number of maximum processes to run simultaneously
[default: 1]
--url <URL>
You can set a string to replace {URL} placeholder in commands
--storage-dir <STORAGE_DIR>
You can set a string to replace {STORAGE_DIR} placeholder in commands
-h, --help
Print help (see a summary with '-h')
You can use the following placeholders in your commands. They are replaced with the actual values at runtime, and the commands are run with concrete paths.
{URL}: The content of the --url option. (default "")
{STORAGE_DIR}: The content of the --storage-dir option. (default "")
{RELATIVE_CACHE_PATH}: The portion of the cache path after .xvc/
{ABSOLUTE_CACHE_PATH}: The absolute local path for the cache element.
{RELATIVE_CACHE_DIR}: The portion of the directory that contains the file after .xvc/
{ABSOLUTE_CACHE_DIR}: The portion of the local directory that contains the file after .xvc
{XVC_GUID}: Repository GUID used in storages to differ repository elements
{FULL_STORAGE_PATH}: The full path of the file in the storage.
{FULL_STORAGE_DIR}: The full path of the directory that contains the file in the storage.
{LOCAL_GUID_FILE_PATH}: The path that contains guid of the storage locally. Used only in --init
{STORAGE_GUID_FILE_PATH}: The path that should have guid of the storage, in storage. Used only in --init
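A minimal sketch of how placeholder substitution of this kind could work is given here in Python. It is an illustration only; Xvc's actual implementation may differ, and the function name expand_command and the sample storage directory are hypothetical.

```python
import re


def expand_command(template: str, values: dict[str, str]) -> str:
    """Replace each {NAME} placeholder in template with its value.

    Placeholders that have no value in the mapping are left untouched.
    """
    def substitute(match: re.Match) -> str:
        return values.get(match.group(1), match.group(0))

    # A single pass, so that substituted values are never expanded again.
    return re.sub(r"\{([A-Z_]+)\}", substitute, template)


list_command = expand_command(
    "ls -1 {URL}{STORAGE_DIR}",
    {"URL": "", "STORAGE_DIR": "/tmp/my-xvc-storage"},
)
print(list_command)  # ls -1 /tmp/my-xvc-storage
```

Leaving unknown placeholders in place makes a misspelled name visible in the expanded command instead of silently turning it into an empty string.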
Create a generic storage in the same filesystem
You can create a storage that uses shell commands to send and receive files to another location in the file system.
There are two variables that you can use in the commands: {URL} and {STORAGE_DIR}.
For a storage in the same file system, --url could be blank and --storage-dir could be the location you want to define.
$ xvc storage new-generic
--url ""
--storage-dir $HOME/my-xvc-storage
You need to specify the commands for the following operations:
init: The command that's used to create the directory that will be used as a storage. It should also copy the storage GUID file (XVC_STORAGE_GUID_FILENAME) to that location. This file is used to identify the location as an Xvc storage.
$ xvc storage new-generic
Note that if the command doesn't contain the {LOCAL_GUID_FILE_PATH} variable, it won't be run and Xvc will report an error.
list: This operation should list all files under {URL}{STORAGE_DIR}. The list is filtered through a regex that matches the format of the paths. Hence, even if the command lists all files in the storage, Xvc will consider only the relevant paths.
All paths should be listed on separate lines.
$ xvc storage new-generic
--list 'ls -1 {URL}{STORAGE_DIR}'
upload: The command that will copy a file from the local cache to the storage. Normally, it uses the {ABSOLUTE_CACHE_PATH} variable. For the local file system, we also need to create a directory before copying.
$ xvc storage new-generic
download: This command will be used to copy from the storage to the local cache. It must create the local cache directory as well.
$ xvc storage new-generic
delete: This operation is used to delete the storage file. It shouldn't touch the local file in any way, otherwise you may lose data.
$ xvc storage new-generic
--delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
In total, the command you write is the following. It defines all operations of this storage.
$ xvc storage new-generic
--url ""
--storage-dir $HOME/my-xvc-storage
--list 'ls -1 {URL}{STORAGE_DIR}'
--delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
Create a storage using rsync
Rsync is available on all popular platforms for copying file contents. Xvc can use it to maintain a storage if you already have a working rsync setup.
We need to define operations for init, upload, download, list and delete with rsync or ssh.
Some of the commands need ssh to perform operations, like creating a directory.
We'll use placeholders for paths.
As the rsync URL format is slightly different from the SSH one, we will define the commands verbosely.
Suppose you want to use your account at
to store your Xvc files.
You want to store the files under /home/user/my-xvc-storage
We assume you have configured public key authentication for your account. Xvc doesn't receive user input during storage operations, and can't receive your password during runs.
We first define these as our --url and --storage-dir:
$ xvc --url
--storage-dir '/home/user/my-xvc-storage'
The initialization command must create this directory and copy the storage GUID file to its respective location.
$ xvc
--init "ssh {URL} 'mkdir -p {STORAGE_DIR}' ; rsync -av '{LOCAL_GUID_FILE_PATH}' '{URL}:{STORAGE_GUID_FILE_PATH}'"
Note the use of : in the rsync command.
As rsync doesn't support ssh:// URLs currently, we are using a URL form that's compatible with both ssh and rsync.
It may be possible to use && between the ssh and rsync commands, but if the first command fails (e.g. the directory already exists), we still want to copy the guid file.
Technical Details
The paths in list commands are filtered through a regex.
They are matched against the {REPO_GUID}/{RELATIVE_CACHE_DIR}/0 pattern and only the {RELATIVE_CACHE_DIR} portion is reported.
Any line that doesn't conform to this pattern is ignored.
You can use any listing command that returns a recursive file list; only the elements that match the pattern are considered.
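The filtering described above can be sketched as follows. This is a hedged illustration, not the regex Xvc uses: the function name relevant_cache_dirs and the GUID values are hypothetical, and the shape assumed for a cache directory (an algorithm prefix such as b3 followed by hexadecimal segments) is inferred from the cache paths printed elsewhere on this page.

```python
import re


def relevant_cache_dirs(listing: str, repo_guid: str) -> list[str]:
    """Return the cache directories found in the output of a list command.

    Each line is matched against <repo_guid>/<cache dir>/0; other lines are ignored.
    """
    # Assumption: a cache dir looks like b3/3c6/70f/<rest of the digest>.
    pattern = re.compile(re.escape(repo_guid) + r"/([a-z0-9]+(?:/[0-9a-f]+)+)/0")
    found = []
    for line in listing.splitlines():
        match = pattern.search(line)
        if match:
            found.append(match.group(1))
    return found


listing = """\
.xvc-guid
my-repo-guid/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
other-repo-guid/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d/0.bin
notes.txt
"""
print(relevant_cache_dirs(listing, "my-repo-guid"))
```

Only the second line of the sample listing belongs to the repository in question, so the GUID file, the other repository's file and the unrelated file are all dropped.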
xvc storage new rsync
Configure a directory on a remote host reachable with rsync over SSH as an Xvc storage.
$ xvc storage new rsync --help
Add a new rsync storages
Uses rsync in separate processes to communicate. This can be used when you already have an SSH/Rsync connection. It doesn't prompt for any passwords. The connection must be set up with ssh keys beforehand.
Usage: xvc storage new rsync [OPTIONS] --name <NAME> --host <HOST> --storage-dir <STORAGE_DIR>
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
--host <HOST>
Hostname for the connection in the form (without @, : or protocol)
--port <PORT>
Port number for the connection in the form 22. Doesn't add port number to connection string if not given
--user <USER>
User name for the connection, the part before @ in (without @, hostname). User name isn't included in connection strings if not given
--storage-dir <STORAGE_DIR>
storage directory in the host to store the files
-h, --help
Print help (see a summary with '-h')
You must set up an SSH connection with key-based authentication before using this command.
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a directory on the remote host as storage and begin to use it.
$ xvc storage new rsync --name backup --host --user iex --storage-dir /tmp/xvc-backup/
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d/0.bin
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb/0.bin
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
xvc storage new s3
Configure an S3 (or a compatible) service as an Xvc storage.
$ xvc storage new s3 --help
Add a new S3 storage
Reads credentials from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new s3 [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
--bucket-name <BUCKET_NAME>
S3 bucket name
--region <REGION>
AWS region
-h, --help
Print help (see a summary with '-h')
Before calling any commands that use this storage, you must set the following environment variables.
AWS_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Amazon Web Services account. The second form is used when you have multiple accounts and you want to use a specific one.
AWS_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Amazon Web Services account. The second form is used when you have multiple accounts and you want to use a specific one.
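When Xvc is called from a script, the storage-specific variables can be prepared per storage. The following Python sketch builds such an environment; the function name storage_environment and the key values are hypothetical, and it assumes that the storage name is appended to the variable name verbatim, as the <storage_name> form in the help text suggests.

```python
import os


def storage_environment(storage_name: str, access_key: str, secret_key: str) -> dict[str, str]:
    """Return a copy of the current environment with storage-specific credentials added."""
    env = dict(os.environ)
    # Assumption: <storage_name> is used as-is in the variable names.
    env[f"XVC_STORAGE_ACCESS_KEY_ID_{storage_name}"] = access_key
    env[f"XVC_STORAGE_SECRET_ACCESS_KEY_{storage_name}"] = secret_key
    return env


env = storage_environment("backup", "hypothetical-access-key", "hypothetical-secret-key")
# The mapping could then be passed to subprocess.run([...], env=env)
# when calling a command such as `xvc file send dir-0001 --to backup`.
```

Keeping the credentials in the environment of the child process only fits the rule that Xvc never stores credentials for storages.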
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new s3 --name backup --bucket-name xvc-test --region eu-central-1 --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
xvc storage new gcs
Configure a Google Cloud Storage service as an Xvc storage.
$ xvc storage new gcs --help
Add a new Google Cloud Storage storage
Reads credentials from `GCS_ACCESS_KEY_ID` and `GCS_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new gcs [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server, e.g., europe-west3
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Please configure the S3-compatible interface of your Google Cloud Storage account before using this command.
Before calling any commands that use this storage, you must set the following environment variables.
GCS_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Google Cloud Storage account. The second form is used when you have multiple storages with different access keys.
GCS_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Google Cloud Storage account. The second form is used when you have multiple storages with different access keys.
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new gcs --name backup --bucket-name xvc-test --region europe-west3 --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
xvc storage new minio
Create a new Xvc storage on a MinIO instance. It lets you store tracked file contents on a MinIO server.
$ xvc storage new minio --help
Add a new Minio storage
Reads credentials from `MINIO_ACCESS_KEY` and `MINIO_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new minio [OPTIONS] --name <NAME> --endpoint <ENDPOINT> --bucket-name <BUCKET_NAME> --region <REGION>
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--endpoint <ENDPOINT>
Minio server url in the form
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Before calling any commands that use this storage, you must set the following environment variables.
MINIO_ACCESS_KEY or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the MinIO account. The second form is used when you have multiple MinIO accounts and you want to use a specific one.
MINIO_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the MinIO account. The second form is used when you have multiple MinIO accounts and you want to use a specific one.
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new minio --name backup --endpoint --bucket-name xvc-tests --region us-east-1 --storage-prefix xvc
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
--name NAME is not verified to be unique, but you should use unique storage names to refer to them later.
You can also use storage GUIDs listed by xvc storage list to refer to storages.
You must have a valid connection to the server.
Xvc uses Minio API port (9001, by default) to connect to the server. Ensure that it's accessible.
For reasons caused from the underlying library, Xvc tries to connect
if you give
as the endpoint, and xvc-bucket
as the bucket name.
You may need to consider this when you have servers running in exact URLs.
If you have a
as a Minio server, you may want to supply
as the endpoint, and minio
as the bucket name to form the correct URL.
This behavior may change in the future.
Technical Details
This command requires Xvc to be compiled with the minio feature, which is on by default.
It uses Rust async features via the rust-s3 crate, and may add some bulk to the binary.
If you want to compile Xvc without these features, please refer to the How to Compile Xvc document.
The command creates a .xvc-guid file at http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/.xvc-guid.
The file contains the unique identifier for this storage.
The same identifier is also recorded to the project.
A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved in the storage under a directory named after {{REPO_ID}}.
{{REPO_ID}} is the unique identifier for the repository created during xvc init.
Hence if you use a common storage for different Xvc projects, their files are kept under different directories.
There is no inter-project deduplication.
xvc storage new r2
Use Cloudflare R2 as an Xvc storage.
$ xvc storage new r2 --help
Add a new R2 storage
Reads credentials from `R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new r2 [OPTIONS] --name <NAME> --account-id <ACCOUNT_ID> --bucket-name <BUCKET_NAME>
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--account-id <ACCOUNT_ID>
R2 account ID
--bucket-name <BUCKET_NAME>
Bucket name
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Before calling any commands that use this storage, you must set the following environment variables.
R2_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Cloudflare R2 account. The second form is used when you have multiple accounts and you want to use a specific one.
R2_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Cloudflare R2 account. The second form is used when you have multiple accounts and you want to use a specific one.
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new r2 --name backup --bucket-name xvc-test --account-id e5dcca29209558eb9de6c07ae53b0a6f --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
xvc storage new wasabi
Configure a Wasabi service as an Xvc storage.
$ xvc storage new wasabi --help
Add a new Wasabi storage
Reads credentials from `WASABI_ACCESS_KEY_ID` and `WASABI_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new wasabi [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME>
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--endpoint <ENDPOINT>
Endpoint for the server, complete with the region if there is
e.g. for eu-central-1 region, use as the endpoint.
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Before calling any commands that use this storage, you must set the following environment variables.
WASABI_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Wasabi account. The second form is used when you have multiple storage accounts with different access keys.
WASABI_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Wasabi account. The second form is used when you have multiple storage accounts with different access keys.
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new wasabi --name backup --bucket-name xvc-test --endpoint --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
xvc storage new digital-ocean
Configure a Digital Ocean Spaces service as an Xvc storage.
$ xvc storage new digital-ocean --help
Add a new Digital Ocean storage
Reads credentials from `DIGITAL_OCEAN_ACCESS_KEY_ID` and `DIGITAL_OCEAN_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new digital-ocean [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Before calling any commands that use this storage, you must set the following environment variables.
DIGITAL_OCEAN_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Digital Ocean account. The second form is used when you have multiple Digital Ocean accounts and you want to use a specific one.
DIGITAL_OCEAN_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Digital Ocean account. The second form is used when you have multiple Digital Ocean accounts and you want to use a specific one.
The command works only in Xvc repositories.
$ git init
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new digital-ocean --name backup --bucket-name xvc --region fra1 --storage-prefix xvc
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use xvc file remove
$ xvc file remove --from-storage backup dir-0001/
xvc root
Shows the Xvc root project directory where the .xvc/ directory resides.
$ xvc root --help
Find the root directory of a project
Usage: xvc root [OPTIONS]
--absolute Show absolute path instead of relative
-h, --help Print help
xvc root can be used in scripts to make paths relative to the Xvc project root.
By default, it shows the relative path.
$ xvc root
When you supply --absolute, it prints the absolute path.
$ xvc root --absolute
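A script could use xvc root --absolute as sketched here in Python. This is a minimal illustration under stated assumptions: the helper names xvc_root and relative_to_root are hypothetical, and the run parameter exists only so that the helper can be exercised without calling Xvc.

```python
import os
import subprocess


def xvc_root(run=subprocess.run) -> str:
    """Return the absolute project root as printed by `xvc root --absolute`."""
    result = run(["xvc", "root", "--absolute"], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def relative_to_root(path: str, root: str) -> str:
    """Express path relative to the given project root."""
    return os.path.relpath(os.path.abspath(path), root)


# Inside an Xvc repository, the two helpers can be combined:
#   relative_to_root("data/train.csv", xvc_root())
```

Passing check=True makes the script stop with an error when the command is run outside an Xvc repository, instead of continuing with an empty path.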
xvc check-ignore
Check whether a path is ignored or whitelisted by Xvc.
$ xvc check-ignore --help
Check whether files are ignored with `.xvcignore`
Usage: xvc check-ignore [OPTIONS] [TARGETS]...
Targets to check. If no targets are provided, they are read from stdin
--ignore-filename <IGNORE_FILENAME>
Filename that contains ignore rules
This can be set to .gitignore to test whether Git and Xvc work the same way.
[default: .xvcignore]
-h, --help
Print help (see a summary with '-h')
$ git init
$ xvc init
You can add files and directories to be ignored by Xvc to .xvcignore
$ zsh -cl "echo 'my-dir/my-file' >> .xvcignore"
By default it checks the files supplied from stdin
$ zsh -cl 'echo my-dir/my-file | xvc check-ignore'
[IGNORE] [CWD]/my-dir/my-file
The .xvcignore file format is identical to the .gitignore file format.
$ cat .xvcignore
# Add patterns of files xvc should ignore, which could improve
# the performance.
# It's in the same format as .gitignore files.
If you supply paths from the CLI, they are checked against the ignore rules in .xvcignore
$ xvc check-ignore my-dir/my-file another-dir/another-file
[IGNORE] [CWD]/my-dir/my-file
[NO MATCH] [CWD]/another-dir/another-file
You can also add whitelist patterns to .xvcignore.
$ zsh -cl "echo '!another-dir/*' >> .xvcignore"
$ xvc check-ignore my-dir/my-file another-dir/another-file
[IGNORE] [CWD]/my-dir/my-file
[WHITELIST] [CWD]/another-dir/another-file
This utility can be used to check any other ignore rules in other files as well.
You can specify an alternative ignore filename with --ignore-filename.
The command below is identical to git check-ignore and should give the same results.
$ xvc check-ignore --ignore-filename .gitignore
This command is not implemented yet. Please see for its progress.
xvc completions
Rust API
See for latest version of the Xvc API
Xvc Architecture
The malleability of the material (bits and bytes) we're working with leads to difficulties in architecting software. Unlike real architecture, bits and bytes don't bring natural restrictions. It's not possible to build skyscrapers with mud bricks, and our material is much more malleable. There are so many options and so many ways to solve problems that it's easy to sink into technical mud with the decisions we make.
Software developers created a set of architectural principles to overcome this lack of limits. Most of these principles are bogus. They are not tested in the field. We seldom have software that's still perfectly maintainable after ten years. Usually, reading and understanding the code is more difficult than coming up with a new solution and rewriting it.
In this chapter, we describe the problems, assumptions, and solutions in Xvc's intended domain. It's a work in progress but should give you ideas about the intentions behind decisions.
After two decades, I (un)learned a few basic principles regarding software development.
Object Oriented Programming doesn't work. Mixing data and functions (methods) isn't a good way to write programs. It leads to artificial layers and structures that become burdensome in the long run. It forces the developer to think about both the data and functionality at the same time. This makes reasoning and solving the problem harder than it should be.
Data structures are more important than algorithms. Using a few distinct, well-thought-out data structures is more important than creating the best algorithm. Algorithms are replaceable locally without much peripheral impact. Modifying data structures usually requires updates to all related elements.
DRY is overrated. It may be a good principle after you write the first version. However, during the actual development phase, it's not a good idea to try not to repeat yourself. What parts of the program repeat, what parts rhyme, and what should be abstracted can be seen after we write the whole. Trying to apply abstract principles to exploratory development hinders the ability to solve problems as plainly as possible.
More errors are made in the name of abstraction than the reverse. Abstractions don't always help. They usually distribute a single functionality across arbitrary layers. In the age of LSP, it's easier to find repeating functionality and merge/rewrite, rather than fixing incorrect assumptions about abstractions. Problems with repeating code are obvious and easier to fix than problems with abstractions.
Vertical architecture is more important than horizontal architecture. Vertical architecture means the lower the number of layers between the user and their intention, the better. If the user wants to copy a file, creating a layer of abstract classes to make this more modular doesn't result in more resilient software. If you want to detect whether we're in a Git repository, checking the presence of the .git directory is simpler than creating a few abstract classes that work for more than one SCM, and implementing abstract methods for them. The architecture shouldn't try to satisfy abstract patterns; it should make the path between the user's action and effect as direct as possible.
Xvc Modules (Crates)
Xvc is composed of modules that can be tested and used independently.
The xvc-core module is in the middle of the architecture.
Lower-level crates interface with the OS and convert what they read into data structures.
Higher levels use these data structures to implement functionality.
For example, the xvc-walker crate interfaces with the directories, paths, and ignore rules, and serves a set of paths with their metadata.
A higher-level crate uses these to check whether a file has changed.
- logging: Logger definitions and debugging macros.
- walker: A file system directory walker that checks ignore files. It can also notify the changes in the directory via channels after the initial traversal.
- config: Configuration framework that loads configuration from various levels (Default, System, User, Project, Environment) and merges these with command line options for each module.
- ecs: The entity-component system responsible for saving and loading state of all data structures, along with their associations and queries.
- storage: Commands and functionality to configure external (local or cloud) locations to store file content.
- core: Xvc specific data structures and utilities. All user level modules use this module for shared functionality.
- file: Commands to track files and utilities around file management.
- pipeline: Commands to define data pipelines as DAGs and run them.
The current dependency graph where lower-level modules are used directly is this:
After the crate interfaces are stabilized, all lower-level functions will be reused from xvc-core.
It will provide the basic Xvc API.
In this case, the graph will be simplified.
Any improvement in user-level API will be done higher than xvc-core.
Any improvement in lower-level modules will be done in dependencies of xvc-core.
Xvc is a CLI MLOps tool to track file, data, pipeline, experiment, and model versions.
It has the following goals:
- Enable tracking any kind of file in Git, including large binary, data, and model files.
- Enable getting a subset of these files.
- Enable removing files from the workspace temporarily and retrieving them from the cache.
- Enable uploading and downloading these files to/from a central server.
- Enable users to run pipelines composed of commands.
- Be able to invalidate pipelines partially.
- Enable running a pipeline or arbitrary commands as experiments, and storing and retrieving them.
Xvc users are data and machine learning professionals that need to track large amounts of data. They also want to run arbitrary commands on this data when it changes. Their goal is to produce better machine learning models and better suited data for their problems.
We have three quality goals:
- Robustness: The system should be robust for basic operations.
- Performance: The overall system performance must be within the ballpark of usual commands.
- Availability: The system must run on all major operating systems.
Xvc users work with large amounts of data. They want to depend on Xvc for basic operations like tracking file versions, and uploading these to a central location.
They don't want to wait too long for these operations on common hardware.
They would like to download their data to any system running various operating systems.
Xvc Cache
The cache is where Xvc copies the files it tracks.
It's located under the .xvc directory.
Instead of the file tree that's normally used to address files, it uses the content digest of files to organize them.
In a standard file hierarchy, we have files in paths like /home/iesahin/Photos/my-photo.png.
Xvc doesn't use such a tree in its cache.
It uses paths like .xvc/b3/a12/b45/d789a...f54/0.png to refer to files.
Producing the cache path from the file's content causes cache paths to change when the files are updated.
For example, in a standard file system, if you save another photo on top of my-photo.png, the first version will be lost.
Xvc stores these two versions in different locations in the cache, so they are not lost.
There are 4 parts of this cache path.
The first part, .xvc, is the standard directory the xvc init command creates. It resides in the root folder of your project.
The second part, b3, denotes the digest type of the content digest.
Xvc supports more than one algorithm to calculate content digests.
The HashAlgorithm enum shows which algorithms are supported.
Each of these algorithms has a 2-letter prefix.
- b3: BLAKE3
- b2: BLAKE2s
- s3: SHA2-256
- s2: SHA3-256
Note that all these digest algorithms produce 256-bit (32-byte) digests. This digest is converted to 64 hexadecimal digits. To keep the total path length shorter, Xvc requires digests to be 32 bytes in length.
The third part in the cache path is these 64 hexadecimal digits in the form a12/b45/d789...f54/.
The 64 digits are split into directories to keep the number of entries in a single directory low.
Had Xvc put all cache elements in a single directory, it could lead to degraded performance in some file systems.
With this arrangement, b3/ can contain at most 4096 directories, each of which can contain at most 4096 directories.
With usual distribution and good hash algorithms, there won't be more than 4000 elements per directory until 64 billion files are in the cache. (4000³)
The fourth part is the 0.png part, that's the file itself with the same extension but with 0 as the basename.
Xvc uses the digest as a directory instead of the file name.
There may be times when the file in the cache should be used manually, on cloud storage for example.
The extension is kept for this reason, to make sure that the OS recognizes the file type correctly.
Renaming the file to 0 means that this is the whole file.
In the future, when Xvc supports splitting large files to transfer to remotes, all parts of the file will be put into this directory.
Storages also use the same cache structure, with an added GUID part to use a single storage for multiple projects.
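The four-part layout described above can be sketched in a few lines of Rust. This is only an illustration of the path arithmetic, not Xvc's implementation. The function name cache_path is hypothetical, and the 3/3/58 split of the digits is taken from the paths printed by xvc file remove earlier in this document.

```rust
use std::path::PathBuf;

/// Builds a cache path from a 2-letter algorithm prefix, a 64-digit
/// hexadecimal digest, and the original file extension.
/// Hypothetical helper for illustration, not part of the Xvc API.
fn cache_path(prefix: &str, digest_hex: &str, extension: &str) -> Option<PathBuf> {
    // The layout relies on 32-byte digests, i.e. 64 hexadecimal digits.
    if digest_hex.len() != 64 || !digest_hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut path = PathBuf::from(".xvc");
    path.push(prefix); // digest type, e.g. "b3"
    path.push(&digest_hex[0..3]); // first 3 digits: at most 4096 directories
    path.push(&digest_hex[3..6]); // next 3 digits: at most 4096 directories each
    path.push(&digest_hex[6..]); // remaining 58 digits
    path.push(format!("0.{extension}")); // the whole file, original extension kept
    Some(path)
}

fn main() {
    let digest = "1bcb8280fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88";
    let path = cache_path("b3", digest, "bin").expect("digest must be 64 hex digits");
    println!("{}", path.display());
}
```

Returning None for anything that isn't 64 hexadecimal digits mirrors the requirement that digests are 32 bytes long.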
The Architecture of Xvc Entity Component System
Xvc uses an entity component system (ECS) in its core. ECS architecture is popular in game development but hasn't found popularity in other areas. It's an alternative to Object-Oriented Programming.
There are a few basic notions of ECS architecture. Although it may differ in other frameworks, Xvc assumes the following:
- An entity is a neutral way of tracking components and their relationships. It doesn't contain any semantics other than being an entity. An entity in Xvc is an atomic integer tuple.
- A component is a bundle of associated data about these entities. All semantics of entities are described through components. Xvc uses components to keep track of different aspects of file system objects, dependencies, storages, etc.
- A system is where the components are created and modified. Xvc considers all modules that interact with components as separate systems.
Suppose you want to track a new file in Xvc.
- Xvc creates a new entity for this file.
- It associates the path (XvcPath) with this entity.
- It creates an instance of XvcMetadata that represents file size and timestamp, and associates it with this entity.
- An XvcDigest struct is associated with the entity to show the file's content digest.
The difference from OOP is that there is no basic or main object. There is no file object that contains a path, or a directory object that is inherited from files.
If you want to work only with digests and want to find the workspace paths associated with them, you can write a function (a system in Entity-Component-System) that starts from XvcDigest records and collects the associated paths.
If you want to get only the files larger than a certain size, you can work with XvcMetadata, filter them and get the paths later.
In contrast, in an OOP setting, these data are associated with paths and when you want to do such operations, you need to load paths and their associations first.
The OOP way of doing things is usually against the principle of locality.
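A minimal sketch of this idea in Rust follows. The types Path, Metadata, Digest and World are simplified, hypothetical stand-ins for XvcPath, XvcMetadata, XvcDigest and the stores that hold them, and they don't reflect the actual Xvc API. Each component type sits in its own map keyed by entity, and each "system" starts from whichever component it needs.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for XvcEntity, XvcPath, XvcMetadata and XvcDigest.
type Entity = (u64, u64);

#[derive(Debug, Clone, PartialEq)]
struct Path(String);

#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    size: u64,
}

#[derive(Debug, Clone, PartialEq)]
struct Digest(String);

// Each component type lives in its own map keyed by entity.
// No component contains or inherits from another.
#[derive(Default)]
struct World {
    paths: BTreeMap<Entity, Path>,
    metadata: BTreeMap<Entity, Metadata>,
    digests: BTreeMap<Entity, Digest>,
}

impl World {
    // Start from metadata, filter by size, and look up the paths last.
    fn paths_larger_than(&self, min_size: u64) -> Vec<&Path> {
        self.metadata
            .iter()
            .filter(|(_, m)| m.size > min_size)
            .filter_map(|(entity, _)| self.paths.get(entity))
            .collect()
    }

    // Start from digest records and collect the associated paths.
    fn paths_with_digest(&self, digest: &Digest) -> Vec<&Path> {
        self.digests
            .iter()
            .filter(|(_, d)| *d == digest)
            .filter_map(|(entity, _)| self.paths.get(entity))
            .collect()
    }
}

fn main() {
    let mut world = World::default();
    let small = (1, 42);
    let large = (2, 42);
    world.paths.insert(small, Path("data/small.csv".into()));
    world.metadata.insert(small, Metadata { size: 10 });
    world.digests.insert(small, Digest("aaaa".into()));
    world.paths.insert(large, Path("data/large.bin".into()));
    world.metadata.insert(large, Metadata { size: 10_000 });
    world.digests.insert(large, Digest("bbbb".into()));

    println!("{:?}", world.paths_larger_than(1_000));
    println!("{:?}", world.paths_with_digest(&Digest("aaaa".into())));
}
```

Neither query needs to load every path first; paths are looked up only for the entities that pass the filter.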
The whole idea is to be flexible for further changes.
For example, these days Xvc doesn't have notions of data and models. Files are just files.
It doesn't have different functionality for files that are models or data.
When this distinction is added, an XvcModel component will be created and associated with the same entity as the file's XvcPath, and a set of XvcFeatures will be associated in the same way XvcMetadata is associated with XvcPath.
It will allow working with some paths as model files but it won't require paths to be known beforehand.
There may be other metadata, like features or version associated with models that are more important.
There may be some models without a file system path, maybe living only in memory or in the cloud.
In contrast, OOP would define this either by inheritance (a model is a path) or containment (a model has a path). When you select any of these, it becomes a relationship that must be maintained indefinitely. When you only have an integer that identifies these components, it's much easier to describe models without a path later. There is no predefined relationship between paths and models. You can have paths without models, or models without paths.
The architecture is approximately similar to database modeling. Components are in-memory tables, albeit they are small and mostly contain a few fields. Entities are numeric primary keys. Systems are insert, query and update mechanisms.
An XvcStore in its basic definition is a map structure between XvcEntity and a component type T.
It has facilities for persistence, iteration, search and filtering.
It can be considered a system in the usual ECS sense.
Loading and Saving Stores
As our goal is to track data files with Git, stores save and load binary files' metadata to text files. Instead of storing the binary data itself in Git, Xvc stores information about these files to track whether they are changed. By default, these metadata are persisted to JSON, so component types must be serializable. As they are almost always composed of basic types that serde supports, this doesn't pose a difficulty in usage. The JSON files are then committed to Git.
Note that there are usually multiple branches in Git repositories. Also, multiple users may work on the same branch.
When these text files are reused by the stores, they are modified and this may lead to merge conflicts. We don't want our users to deal with merge conflicts with entities and components in text files. This also makes it possible to use binary formats like MessagePack in the future.
Suppose user A made a change in XvcStore<XvcPath> by adding a few files.
Another user B made another change to the project, by adding another set of files in another copy of the project.
This will lead to merge conflicts:
- The entity counter will have different values in A and B's repositories.
- XvcStore<XvcPath> will have different records in A and B's repositories.
Instead of saving and loading monolithic files, XvcStore saves and loads event logs.
There are two kinds of events in a store:
- Add(XvcEntity, T): Adds an element T to a store.
- Remove(XvcEntity): Removes the element with the given entity id.
These events are saved into files. When the store is loaded, all files after the last full snapshot are loaded and replayed.
When you add an item to a store, it saves the Add event to a log.
These events are then put into a vector.
A BTreeMap is also created from this vector.
When an item is deleted, a Remove event is added to the event vector.
While loading, stores remove the elements with Remove events from the BTreeMap.
So the final set of elements doesn't contain the removed item.
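The replay step can be illustrated with a short Rust sketch. The Event enum mirrors the two events named above. The replay function and the plain (u64, u64) entity alias are assumptions made for this example and are not the XvcStore API.

```rust
use std::collections::BTreeMap;

// Stand-in for XvcEntity in this sketch.
type Entity = (u64, u64);

#[derive(Debug, Clone, PartialEq)]
enum Event<T> {
    Add(Entity, T),
    Remove(Entity),
}

// Replays an event log in order and returns the resulting map.
// Hypothetical helper; it only illustrates the replay idea.
fn replay<T: Clone>(events: &[Event<T>]) -> BTreeMap<Entity, T> {
    let mut map = BTreeMap::new();
    for event in events {
        match event {
            Event::Add(entity, value) => {
                map.insert(*entity, value.clone());
            }
            Event::Remove(entity) => {
                map.remove(entity);
            }
        }
    }
    map
}

fn main() {
    let log = vec![
        Event::Add((1, 7), "data/a.bin".to_string()),
        Event::Add((2, 7), "data/b.bin".to_string()),
        Event::Remove((1, 7)),
    ];
    let current = replay(&log);
    // Only the second element survives the replay.
    println!("{current:?}");
}
```

Because the log is append-only, two users who add different files produce different log entries rather than edits to the same lines of a single file.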
The second problem with multiple branches is duplicate entities in separate branches. Xvc uses a counter to generate unique entity ids. When a store is loaded, it checks the last entity id in the event log and uses it as the starting point for the counter. But using this counter as is causes duplicate values in different branches. Xvc solves this by adding a random value to these counter values.
Since v0.5, XvcEntity is a tuple of 64-bit integers. The first is loaded from the disk and is an atomic counter. The second is a random value that is renewed at every command invocation. Therefore we have a unique entity id for every run, that's also sortable by the first value. Easy sorting with integers is sometimes required for stable lists.
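As a rough sketch of this scheme, the generator below pairs an atomic counter with a per-run value. EntityGenerator and its methods are hypothetical names, and the per-run value is passed in as a parameter here instead of being drawn from a random source, to keep the example deterministic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for XvcEntity: (counter, per-run random value).
type Entity = (u64, u64);

// Hypothetical generator, not the Xvc implementation.
struct EntityGenerator {
    counter: AtomicU64,
    run_salt: u64,
}

impl EntityGenerator {
    // `last_counter` is the last value found in the event log;
    // `run_salt` stands for the value renewed at every command invocation.
    fn new(last_counter: u64, run_salt: u64) -> Self {
        Self {
            counter: AtomicU64::new(last_counter + 1),
            run_salt,
        }
    }

    fn next_entity(&self) -> Entity {
        (self.counter.fetch_add(1, Ordering::SeqCst), self.run_salt)
    }
}

fn main() {
    // Two branches load the same event log, so both continue after counter 10.
    let branch_a = EntityGenerator::new(10, 0x5eed_0001);
    let branch_b = EntityGenerator::new(10, 0x5eed_0002);
    let a = branch_a.next_entity();
    let b = branch_b.next_entity();
    // The counters collide, but the entities differ in their second element.
    assert_ne!(a, b);
    println!("{a:?} {b:?}");
}
```

Tuples compare element by element in Rust, so a list of such entities sorts by the counter first.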
Inverted Index
Stores also have an inverted index for quick lookup.
They store the value of T as key and a list of entities that correspond to this key.
For example, when we have a path that we stored, it's a single operation to get the corresponding XvcEntity, and after this, all recorded metadata about this path is available.
All search, iteration and filtering functionality is performed using these two internal maps.
In summary, a store has four components.
- An immutable log of previous events
- A mutable log of current events
- A mutable map of the current data: BTreeMap<XvcEntity, T>
- A mutable map of the entities from values: BTreeMap<T, Vec<XvcEntity>>
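The relation between the two maps can be sketched as follows. The invert function is a hypothetical helper written for this example; it derives the value-to-entities map from the entity-to-value map and is not taken from the Xvc code.

```rust
use std::collections::BTreeMap;

// Stand-in for XvcEntity in this sketch.
type Entity = (u64, u64);

// Builds the inverted index (value -> entities) from the entity map.
fn invert<T: Ord + Clone>(map: &BTreeMap<Entity, T>) -> BTreeMap<T, Vec<Entity>> {
    let mut index: BTreeMap<T, Vec<Entity>> = BTreeMap::new();
    for (entity, value) in map {
        index.entry(value.clone()).or_default().push(*entity);
    }
    index
}

fn main() {
    let mut paths: BTreeMap<Entity, String> = BTreeMap::new();
    paths.insert((1, 7), "data/a.bin".to_string());
    paths.insert((2, 7), "data/b.bin".to_string());

    let index = invert(&paths);
    // A single lookup gives the entities recorded for a value.
    println!("{:?}", index.get("data/b.bin"));
}
```

The value type needs to be Ord to serve as a BTreeMap key, which is a constraint of this sketch and of BTreeMap in general.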
Note that when two branches perform the same operation, the event logs will be different, as the random part of XvcEntity is different. When two branches merge, the inverted index may contain conflicting values. In this case, a fsck command is used to merge the store files and merge conflicting entities.
Insert, update and delete operations affect the mutable log and maps.
Queries, iteration and such non-destructive operations are done with the maps.
When loading, all log files are merged into the immutable log.
No standard operation touches the event logs.
All log modifications are done outside of the normal workflow.
When saving, only the mutable log is saved.
Note that events can only be added to the log; they are not removed.
(See xvc fsck --merge-stores for merging store files.)
Relationship Stores
An XvcStore keeps one component per entity.
Each component is a flat structure that doesn't refer to other components.
Xvc also has relation stores that represent relationships between entities and components. Similar to the database Entity-Relationship model, there are three kinds of relationship stores:
R11Store<T, U> keeps two sets of components associated with the same entity.
It represents a 1-1 relationship between T and U.
It contains two XvcStores, one for each component type.
These two stores are indexed with the same XvcEntity.
For example, an R11Store<XvcPath, XvcMetadata> keeps track of path metadata for the identical XvcEntity.
R1NStore<T, U> keeps parent-child relationships.
It represents a 1-N relationship between T and U.
On top of two XvcStores, this one keeps track of relationships with a third XvcStore<XvcEntity>.
It lists which Us are children of T.
For example, a value of XvcPipeline can have multiple XvcStep values.
These are represented with R1NStore<XvcPipeline, XvcStep>.
This struct has parent-to-child and child-to-parent functions that can be used to get the children of a parent, or the parent of a child element.
The third type is RMNStore<T, U>.
This one keeps an arbitrary number of relationships between T and U.
Any number of Ts may correspond to any number of Us.
This type of store keeps the relationships in two XvcStore<XvcEntity>s.
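To make the 1-N case concrete, here is a simplified sketch with plain maps. R1NSketch and its method names are hypothetical and only imitate the shape described above: two component maps plus a third map from each child entity to its parent entity.

```rust
use std::collections::BTreeMap;

// Stand-in for XvcEntity in this sketch.
type Entity = (u64, u64);

// Hypothetical, simplified take on a 1-N relationship store.
struct R1NSketch<T, U> {
    parents: BTreeMap<Entity, T>,
    children: BTreeMap<Entity, U>,
    // Maps each child entity to its parent entity.
    child_parents: BTreeMap<Entity, Entity>,
}

impl<T, U> R1NSketch<T, U> {
    fn new() -> Self {
        Self {
            parents: BTreeMap::new(),
            children: BTreeMap::new(),
            child_parents: BTreeMap::new(),
        }
    }

    fn insert_parent(&mut self, entity: Entity, parent: T) {
        self.parents.insert(entity, parent);
    }

    // Records a child and points it to its parent entity.
    fn insert_child(&mut self, parent: Entity, entity: Entity, child: U) {
        self.children.insert(entity, child);
        self.child_parents.insert(entity, parent);
    }

    // Parent-to-child lookup.
    fn children_of(&self, parent: Entity) -> Vec<&U> {
        self.child_parents
            .iter()
            .filter(|(_, p)| **p == parent)
            .filter_map(|(child, _)| self.children.get(child))
            .collect()
    }

    // Child-to-parent lookup.
    fn parent_of(&self, child: Entity) -> Option<&T> {
        self.child_parents
            .get(&child)
            .and_then(|parent| self.parents.get(parent))
    }
}

fn main() {
    let mut store: R1NSketch<&str, &str> = R1NSketch::new();
    store.insert_parent((1, 7), "default pipeline");
    store.insert_child((1, 7), (2, 7), "train");
    store.insert_child((1, 7), (3, 7), "evaluate");
    println!("{:?}", store.children_of((1, 7)));
    println!("{:?}", store.parent_of((3, 7)));
}
```

Neither component refers to the other; the relationship lives only in the map of entities.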
Xvc Pipelines State Machine
Xvc pipelines use a state machine to track the progress of each step. Each step has a state that is updated as the pipeline is executed.
A step starts in the Begin state.
It must wait for all its dependency steps if --when is set to by_dependencies (the default) in xvc pipeline step new or xvc pipeline step update.
If this option is set to never, the step will never run and will move to the DoneWithoutRunning state just after Begin.
If this option is set to always, the step will run regardless of the changes in the dependencies, even if dependencies are missing, broken, or have not changed.
If the --when option is set to by_dependencies, the steps check the following conditions before running:
- All dependency steps must be in the required state.
- There should be no missing dependency files.
- There should be no broken dependency processes.
- Dependency files should be newer, or the content digest should be different from the step outputs.
If any of these conditions are met, the step will move to the WaitingDependencySteps state.
To avoid unnecessary work, we need to find differences across versions.
What has changed between the previous version and this version of type T?
Xvc is built bottom up, with vertical, long functions that do one thing.
For example, xvc file track is written separately from xvc file recheck, and the commonalities have arisen after these implementations.
I didn't start from traits and try to fit everything to a model. Instead, we began from concrete enums and structs. Then we saw that some of these share common functionality and decided to group it as a trait after implementing and refactoring concrete functions.
I saw the diff pattern across all comparison functions.
In xvc pipeline, dependencies need to detect changes to decide whether to invalidate them.
In xvc file, files and directories need to detect changes to decide whether they should be carried into the cache.
It's easy to make comparison/subtraction when the data types are numeric.
For a signed integer, you can get a single numeric value as diff with diff = a - b.
For complex data structures, representing the change is not straightforward.
We keep track of everything in the repository in stores.
These serialize a type T to a file, and get it back when needed.
The diff pattern works with these types.
Sometimes, there happens to be no record of something we have in the repository.
Sometimes, we have only the record, and not the actual thing on disk.
The diff should also handle this.
Instead of trying to come up with wizardry, we decided to represent this with five conditions.
- Identical: When two things of the same type T are equal. Nothing has changed between the actual version and its record.
- RecordMissing { actual: T }: If we have something in the workspace, but can't find the respective record. For example, a new file is added to the workspace, and xvc file track detects it for the first time.
- ActualMissing { record: T }: We found a record in the store, but the corresponding file in the workspace is not where it should be. For example, a tracked file is deleted by the user, but the record is still there.
- Difference { record: T, actual: T }: There is a record, but the actual file in the workspace isn't identical to it. When a tracked file is changed, and its content now returns a different value, this can be reflected with Difference.
- Skipped: When the comparison seems unnecessary or irrelevant. For example, if we know a file hasn't changed by checking its metadata. In this case, we don't calculate its content digest and set it to Skipped.
These five conditions are represented in Diff.
As an entity may have more than one component, a comparison may require multiple Diff values.
For example, we may want to compare an XvcPath to see whether it has changed.
This requires comparing its XvcMetadata, its ContentDigest if it's a file, its CollectionDigest if it's a directory, etc.
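The five conditions can be written as a Rust enum along the following lines. This is a sketch, not the definition in the Xvc source: variant names follow the text where they are given, any name not in the text (such as Identical) is an assumption, and the diff function with its Option arguments is a hypothetical helper. Mapping the case where neither side exists to Skipped is also an assumption of this example.

```rust
// A sketch of the five conditions as an enum.
#[derive(Debug, Clone, PartialEq)]
enum Diff<T> {
    Identical,
    RecordMissing { actual: T },
    ActualMissing { record: T },
    Difference { record: T, actual: T },
    Skipped,
}

// Compares an optional record with an optional actual value.
fn diff<T: PartialEq>(record: Option<T>, actual: Option<T>) -> Diff<T> {
    match (record, actual) {
        // Nothing to compare on either side (assumption of this sketch).
        (None, None) => Diff::Skipped,
        (None, Some(actual)) => Diff::RecordMissing { actual },
        (Some(record), None) => Diff::ActualMissing { record },
        (Some(record), Some(actual)) if record == actual => Diff::Identical,
        (Some(record), Some(actual)) => Diff::Difference { record, actual },
    }
}

fn main() {
    // A file size recorded in the store versus the one found in the workspace.
    let recorded_size: Option<u64> = Some(1024);
    let actual_size: Option<u64> = Some(2048);
    println!("{:?}", diff(recorded_size, actual_size));

    // A new file with no record yet.
    println!("{:?}", diff(None, Some(512u64)));
}
```

Unlike a - b for integers, the result carries both sides of the comparison, so the caller can decide what to do with the record and the actual value.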
Xvc uses storages to store content of the files. These storages are different from Git remotes. They don't contain Git history of a repository, but they can store contents of the files tracked by Xvc.
A storage uses the same content addresses used in the Xvc cache to store the files.
For example, if there is a file in an Xvc repository that points to /b3/1886572424...defa/0.png in the local cache, this path will be used to identify the content in storage as well.
Additionally, Xvc stores storage event logs that list which operations are performed on that storage. By using these event logs, it's possible to identify what has happened with storages without checking the file lists. These event logs are also shared with the other users, and a user can identify which files are present in a storage even without a connection.
Basic Operations
All storages should support the following operations:
- Init to initialize a storage
- List to list the files available in the storage.
- Send to upload files from local cache to a storage.
- Receive to download files from a storage to local cache.
- Delete to delete files from a storage.
All these operations record a distinct event to the event log.
Events record the event type, the GUID of the storage and the event content.
Event contents are like the following:
- Init creates the necessary directories and the GUID file in a storage.
- List includes the listing retrieved from the storage. Once a list is retrieved from the storage, it's available for local operations. The most recent list is the starting point to determine the files available in a storage.
- Send event contains the affected paths. These paths are added to the storage file list.
- Receive event contains the affected paths. These paths are added to the storage file list.
- Delete can remove multiple files at once. These paths are removed from the storage file list.
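The way a file list can be derived from such an event log is sketched below. StorageEvent and known_files are hypothetical names for this example. The storage GUID that real events carry is left out, and treating Init as a no-op for the file list is an assumption.

```rust
use std::collections::BTreeSet;

// Hypothetical event type; the storage GUID is omitted for brevity.
#[derive(Debug, Clone)]
enum StorageEvent {
    Init,
    List(Vec<String>),
    Send(Vec<String>),
    Receive(Vec<String>),
    Delete(Vec<String>),
}

// Derives the set of files known to be in a storage from its event log,
// without connecting to the storage.
fn known_files(events: &[StorageEvent]) -> BTreeSet<String> {
    let mut files = BTreeSet::new();
    for event in events {
        match event {
            // Assumption: Init doesn't change the file list.
            StorageEvent::Init => {}
            // The most recent listing becomes the new starting point.
            StorageEvent::List(paths) => {
                files = paths.iter().cloned().collect();
            }
            // Send and Receive paths are added to the storage file list.
            StorageEvent::Send(paths) | StorageEvent::Receive(paths) => {
                files.extend(paths.iter().cloned());
            }
            // Deleted paths are removed from the storage file list.
            StorageEvent::Delete(paths) => {
                for path in paths {
                    files.remove(path);
                }
            }
        }
    }
    files
}

fn main() {
    let log = vec![
        StorageEvent::Init,
        StorageEvent::Send(vec!["b3/aaa/0.bin".into(), "b3/bbb/0.bin".into()]),
        StorageEvent::Delete(vec!["b3/aaa/0.bin".into()]),
    ];
    println!("{:?}", known_files(&log));
}
```

Since the log is shared with other users, each of them can run the same fold locally and get the same file list.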
Storage types
Local Storages
A local storage is a directory in the local file system. It may be a mount point shared with others, or another disk that you use for backups and sharing.
- Init copies the GUID file to the appropriate directory.
- List lists the files in the storage directory.
- Send copies files in parallel with rayon.
- Receive copies files in parallel with rayon.
- Delete removes files in parallel with rayon.
Generic Storages
These storages define commands for each of the operations listed above.
This allows running external programs such as rsync, rclone, and s5cmd.
For such storages, commands for the above operations must be defined and they will be run in separate processes.
This storage type offloads the responsibility of exact operations to the user.
The user is expected to supply values for the following variables:
- {URL}: The URL for the storage. This can be anything the commands to send/receive/list will accept. It helps build the paths with fewer repeats.
- {STORAGE_DIR}: You can specify the storage directory separately.
- {PATH}: This is set by Xvc for each individual command. It's a path relative to the local cache directory.
- The number of processes: This value sets the number of processes used to perform operations. Setting this to 1 makes all operations sequential.
- List Command: A command to list the files at {URL}. For example, rsync --list-only {URL}{STORAGE_DIR}
- Send Command: A command to send a file to {URL}{STORAGE_DIR}. It can use {URL} and should use {PATH} in the command. An example may be rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}
- Receive Command: A command to receive a file from a storage. It can use {URL}, and should use {PATH} in the command. Example: rsync -a {URL}{STORAGE_DIR}{PATH} {PATH}
- Delete Command: A command to delete a file from the storage. It can use {URL}, and should use {PATH} in the command. Example: ssh {URL} "rm {STORAGE_DIR}{PATH}"
Generic storages use these commands to create multiple processes to send/receive/delete files. It's not as fast as using other types because of the overhead involved, but its flexibility is useful.
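The placeholder substitution in these command templates can be pictured with a small Rust sketch. The expand function is hypothetical and only shows plain string replacement of {URL}, {STORAGE_DIR} and {PATH}; how Xvc actually expands and runs the commands isn't covered here, and the URL and directory values are made-up examples.

```rust
// Replaces the placeholders in a command template.
// Hypothetical helper, for illustration only.
fn expand(template: &str, url: &str, storage_dir: &str, path: &str) -> String {
    template
        .replace("{URL}", url)
        .replace("{STORAGE_DIR}", storage_dir)
        .replace("{PATH}", path)
}

fn main() {
    // Example values; none of them come from a real configuration.
    let send_template = "rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}";
    let command = expand(send_template, "user@example.com:", "/backup/", "b3/abc/0.bin");
    println!("{command}");
}
```

One command like this is produced per file, which is where the per-process overhead mentioned above comes from.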
Git and Xvc
Xvc aims to fill the gap Git leaves for certain workflows. These workflows involve large binary data that shouldn't be replicated in each repository.
Xvc tracks all its metadata on top of Git. In most cases, Xvc assumes the presence of a Git repository where the user tracks the history, text files, and metadata. However, the relationship between these should be clear and separate.
Xvc doesn't (and shouldn't) use Git more than a user could use manually. Our aim is not to replace Git operations with Xvc operations or tamper with the internal structure of the Git repository. When Xvc uses Git to track ECS or other metadata, the operations must be separate and sandwich Xvc operations.
Any Git operation that involves checking out commits, branches, tags, or other references must come before any Xvc operation. As Xvc relies on the files tracked by Git, resuming any state for Xvc operations should be complete before these operations start.
Xvc helps to stage and commit certain files to Git. By default, any state-changing operation in Xvc adds a commit to Git.
Xvc also helps to store this changed metadata in a new or existing branch. In this case, a checkout must be done before Xvc records the files.
Note that if the user has some already staged files, these are stashed and unstashed to the requested branch.
This is a side effect of doing xvc commit operations on behalf of the user.
The other option is to report an error and quit if the user has the --to-branch option set.
The behavior may change in the future.
For the time being, we will keep this stash-unstash operation for the user files.
One other issue is the library that we're going to use. I checked several options when I was writing auto-commit functionality.
At that time, I decided that the number of Git operations for each Xvc operation is less than five.
These can be done by creating a Git process.
The libraries are not 100% identical in features.
Even the most widely used, libgit2, doesn't provide shallow clones, nor is it possible to use git stash --staged with it.
The second reason for this is explainability. Instead of trying to explain to the user what we are doing with Git, we can report the commands we are running. The library interfaces are different from Git CLI. They need to be learned before reading the code. Using Git CLI is more dependable, observable, and understandable than trying to come up with a set of library calls.
- Digest: A digest is a 32-byte numeric sequence that identifies a file, content or any other data. Xvc uses different algorithms to generate this sequence.
- Associated Digest: This is a specific kind of digest associated with an entity. An entity can have more than one digest, like a content digest or a metadata digest. Xvc uses these different kinds of digests to avoid unnecessary digest calculations.
- Recheck: Recheck is the process of linking a file to its copy in the Xvc cache. Xvc uses different methods to recheck a file, like copy, symlink, hardlink or reflink.
- Workspace: A project is broadly divided into 3 different types of directories. .xvc/ contains the cache and metadata of the tracked files and pipelines, .git/ contains the Git repository, and the workspace contains the files that are tracked by either Xvc or Git. It's the place where you do your work.
- Carry-In: Carry-in is the process of adding a new version of a file to the Xvc cache. It's analogous to git commit.
A digest is a numerical summary of an entity. In Xvc, digests are 32 bytes long and are produced by BLAKE3 by default.
See Associated Digest for different types of digests.
Associated Digest
There may be multiple digests associated with an entity like path, directory or dependency. An associated digest is all digests associated with an entity.
Metadata Digest
Files and directories have metadata.
Metadata shows information about creation, modification, access time of the file, or the size of it.
Metadata is OS dependent in most cases.
Xvc abstracts file and directory metadata with XvcMetadata
The metadata digest represents this abstraction in 32 bytes to compare changes in files and directories.
Content Digest
The content digest of a file is calculated from the data it contains. Xvc calculates 32 bytes from the content. When the content changes, the result of this calculation also changes.
Collection Digest
Some entities in Xvc are composed of multiple elements. Examples are directories (composed of files), file lines, regex filter results, SQL query results etc. Instead of trying to compare all elements, Xvc creates a 32-byte digest of the collection with the same conditions. For example, when a new file is added to a directory, its collection digest also changes. This makes it easier to keep track of changed directories than comparing members one by one.
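The idea of summarizing a collection in a single value can be sketched as below. This is not how Xvc computes collection digests: the standard library's DefaultHasher stands in for BLAKE3 and yields a 64-bit value instead of 32 bytes, and sorting the element digests before hashing is an assumption made so that the result doesn't depend on listing order. The function name collection_digest is hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Combines the digests of the elements into one value for the collection.
// DefaultHasher is a stand-in for the real digest algorithm in this sketch.
fn collection_digest(element_digests: &[&str]) -> u64 {
    // Sort first, so that the same members in a different order
    // produce the same collection digest (assumption of this sketch).
    let mut sorted: Vec<&str> = element_digests.to_vec();
    sorted.sort_unstable();

    let mut hasher = DefaultHasher::new();
    for digest in &sorted {
        digest.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let before = collection_digest(&["aaa", "bbb"]);
    // Adding a new file to the directory adds a new element digest.
    let after = collection_digest(&["aaa", "bbb", "ccc"]);
    println!("changed: {}", before != after);
}
```

Comparing two such values is a single equality check, regardless of how many members the collection has.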
Code and Documentation Conventions
- Xvc is spelled capitalized in documentation. It's Xvc, not XVC, not xvc.