ML dataset versioning

Pin training datasets and model artifacts to a content-addressed ID so every run is reproducible, every node trains on byte-identical data, and unchanged data is never uploaded twice.

The problem

Machine-learning pipelines are only as reproducible as their inputs. A "v3" dataset folder drifts as someone re-exports a few files; a model checkpoint gets copied to three buckets with no way to prove they match; a distributed training job pulls slightly different shards onto different nodes and nobody notices until the eval numbers stop lining up. Tagging datasets by name or timestamp does not tell you whether two copies are actually the same bytes, and re-uploading a multi-terabyte corpus every time one shard changes is slow and expensive.

What you actually want is an identifier derived from the content itself: if two datasets have the same ID they are provably identical, and if a single record changes, the ID changes too.

Why snapdir

A snapshot ID is the BLAKE3 hash of the dataset's manifest, so it depends only on content — not on filesystem timestamps, upload order, or which machine produced it. That gives ML teams exactly the properties they need:

  • Reproducible pins. Record the snapshot ID next to your hyperparameters. A run is reproducible because the dataset ID is part of its configuration, and the ID can be re-verified at any time.

  • Identical data, identical ID, on every node. Each training node pulls the same --id, and snapdir re-hashes every object on fetch, so every node whose pull succeeds holds the same bytes.

  • Replicate once. Objects are stored at content-addressed keys, so re-publishing a dataset after changing a few shards only uploads the shards that changed.
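The idea behind a content-derived ID can be sketched with standard tools. This is an illustration only, not snapdir's manifest format: it assumes GNU coreutils and findutils, uses SHA-256 as a stand-in for BLAKE3, and content_id is a hypothetical helper name.

```shell
# Illustration only: NOT snapdir's manifest format or hash function.
# content_id DIR prints an ID derived purely from relative paths and file bytes.
content_id() {
  (
    cd "$1" || exit 1
    # Sort the paths so listing order cannot affect the result, then hash the
    # per-file checksum list (the stand-in "manifest").
    find . -type f -print0 | LC_ALL=C sort -z | xargs -0 sha256sum \
      | sha256sum | cut -d' ' -f1
  )
}
```

Two directories holding the same files produce the same output regardless of when or in what order the files were written, and changing any file changes the output.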

Walkthrough

Snapshot a prepared dataset and capture its ID. This is the value you commit alongside your experiment config:

dataset_id=$(snapdir push --store s3://ml-data/datasets ./data/imagenet-prep)
echo "$dataset_id"
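One way to keep the ID next to the hyperparameters is a small key=value file committed with the experiment. The sketch below is an assumption about how a team might do this; the helper names, file layout, and keys are hypothetical and not a snapdir convention.

```shell
# Hypothetical helpers: the file format and key names are illustrative only.

# write_run_config DATASET_ID LEARNING_RATE OUTPUT_FILE
# Stores the dataset ID in the same file as the hyperparameters.
write_run_config() {
  printf 'dataset_id=%s\nlearning_rate=%s\n' "$1" "$2" > "$3"
}

# read_dataset_id CONFIG_FILE
# Prints the pinned dataset ID, for example on a training node.
read_dataset_id() {
  sed -n 's/^dataset_id=//p' "$1"
}
```

Because the ID lives in the versioned config, checking out an old experiment also recovers exactly which dataset it used.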

On every training node, pull that exact ID. The pull fetches each object, re-hashes it against the manifest, and checks it out, so a successful pull is proof that the node has the right data:

snapdir pull --store s3://ml-data/datasets --id "$dataset_id" ./data/train

To confirm that two copies match, compare their IDs. There is no need to checksum the whole tree by hand:

snapdir id ./data/train   # prints $dataset_id if and only if the bytes match
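A training launcher could combine the pull and the ID check into a single gate that refuses to continue on a mismatch. This is a minimal sketch using only the two commands shown above; pull_and_verify is a hypothetical helper, and it assumes snapdir pull exits non-zero when it fails.

```shell
# pull_and_verify STORE ID DIR  (hypothetical helper)
# Pulls the pinned snapshot, then re-derives the ID from the checked-out tree.
# Returns non-zero if the pull fails or the IDs differ.
# Assumption: `snapdir pull` exits non-zero on failure.
pull_and_verify() {
  store=$1
  expected=$2
  dir=$3
  snapdir pull --store "$store" --id "$expected" "$dir" || return 1
  actual=$(snapdir id "$dir") || return 1
  if [ "$actual" != "$expected" ]; then
    echo "dataset mismatch: expected $expected, got $actual" >&2
    return 1
  fi
}
```

A launcher would then call something like pull_and_verify s3://ml-data/datasets "$dataset_id" ./data/train before starting the training process, and abort if it returns non-zero.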

Model checkpoints work the same way. Snapshot the output directory and pin the result in your model registry:

model_id=$(snapdir push --store s3://ml-data/models ./checkpoints/run-42)

When you re-export the dataset after fixing a few records, re-push: only the changed objects move, and the new ID records that this is a different dataset.
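Why only changed objects move can be seen in a toy content-addressed store. The sketch below is not snapdir's storage layout: a local directory stands in for the object store, SHA-256 (GNU sha256sum) stands in for BLAKE3, only top-level files are considered, and push_objects is a hypothetical name.

```shell
# Illustration only: this is not snapdir's storage layout or hash function.
# push_objects SRC STORE copies each top-level file in SRC to STORE/<hash>
# unless that key already exists, and prints how many objects were copied.
push_objects() {
  src=$1
  store=$2
  copied=0
  mkdir -p "$store"
  for f in "$src"/*; do
    [ -f "$f" ] || continue
    key=$(sha256sum "$f" | cut -d' ' -f1)
    # An object whose key already exists is byte-identical, so skip it.
    if [ ! -e "$store/$key" ]; then
      cp "$f" "$store/$key"
      copied=$((copied + 1))
    fi
  done
  echo "$copied"
}
```

Pushing the same directory twice copies nothing the second time, and editing one shard before a third push copies exactly one new object.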

Outcome

Every dataset and model is identified by what it contains, not by a mutable name. Experiments are reproducible because their inputs are pinned to verifiable IDs, distributed training nodes are guaranteed to agree on the data, and large corpora are uploaded once and deduplicated across versions and stores. To trace later which dataset a model was trained on, use snapdir ancestors over the recorded snapshot history.