Content addressing
Snapdir is a content-addressed system: every piece of data is named by a cryptographic hash of its own bytes, not by a path, a filename, or a timestamp. The address is the content's fingerprint. Change a single byte and the address changes; leave the bytes untouched and the address is stable forever, on any machine, in any store.
This one idea is what makes snapdir snapshots deterministic, deduplicating, and self-verifying.
The hash is the name
By default snapdir hashes content with BLAKE3, producing a 256-bit digest rendered as a 64-character lowercase hex string. Two kinds of things get a content address:
- Objects: the raw bytes of every regular file. An object's address is the BLAKE3 hash of its contents.
- Manifests: the text document describing a whole directory tree. A manifest's address is its snapshot ID, the BLAKE3 hash of the manifest text.
Because the name is derived purely from the bytes, the mapping is deterministic. Hashing the four bytes foo\n (the string foo followed by a newline) always yields 49dc870df1de7fd60794cebce449f5ccdae575affaa67a24b62acb03e039db92, and the empty file always yields af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262, whether the snapshot was produced today, last year, or by any tool that follows the same manifest spec. Identical content always lands at the same address; different content (with overwhelming probability) lands at a different one.
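The determinism described above can be illustrated with a minimal Python sketch. It rests on one loud assumption: SHA-256 from the standard library stands in for BLAKE3, which the standard library does not provide, so the digests it prints differ from the BLAKE3 values quoted above. The function name content_address is hypothetical and is not part of snapdir.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the lowercase hex digest that names `data`.

    Assumption: SHA-256 stands in for BLAKE3 (snapdir's default),
    because BLAKE3 is not in Python's standard library.
    """
    return hashlib.sha256(data).hexdigest()


a = content_address(b"foo\n")
b = content_address(b"foo\n")
c = content_address(b"fop\n")  # a single byte changed

print(a == b)  # identical bytes give an identical address
print(a == c)  # a changed byte gives a different address
print(len(a))  # a 256-bit digest rendered as 64 hex characters
```

Nothing but the bytes enters the function: no path, filename, or timestamp. That is why the same content maps to the same name on any machine.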
Deduplication falls out for free
Content addressing makes deduplication automatic rather than a feature bolted on top. If a file's bytes already exist at their address — in your local cache or in a remote store — there is nothing new to write. Snapdir's push path is skip-if-present: it only uploads objects whose content address is absent, so re-snapshotting a tree where only one file changed transfers just that one file's object.
Deduplication works at three levels at once:
- Within a snapshot: two files with identical bytes share one object.
- Across snapshots: unchanged files across revisions reuse the same objects; successive snapshots cost only the deltas.
- Across trees and stores: any two directories that happen to share content share its storage, regardless of where they came from.
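The skip-if-present behaviour and the first two levels of deduplication can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not snapdir's implementation: the ObjectStore class and its put method are hypothetical, and SHA-256 again stands in for BLAKE3.

```python
import hashlib


class ObjectStore:
    """Toy in-memory content-addressed store (hypothetical, not snapdir's API)."""

    def __init__(self):
        self.objects = {}  # content address -> bytes
        self.writes = 0    # number of objects actually written

    def put(self, data: bytes) -> str:
        # Assumption: SHA-256 stands in for BLAKE3.
        address = hashlib.sha256(data).hexdigest()
        if address not in self.objects:  # skip-if-present
            self.objects[address] = data
            self.writes += 1
        return address


store = ObjectStore()

# Two files in the first snapshot share the same bytes.
snapshot_1 = {"a.txt": b"hello\n", "copy-of-a.txt": b"hello\n", "b.txt": b"world\n"}
# In the second snapshot only b.txt changed.
snapshot_2 = {**snapshot_1, "b.txt": b"world!\n"}

for files in (snapshot_1, snapshot_2):
    for data in files.values():
        store.put(data)

print(store.writes)  # 3
```

Six files were put across the two snapshots, but only three distinct byte strings exist, so only three objects are written. The second snapshot costs one new object.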
The same property extends to directories. A directory's checksum is derived from the deduplicated set of its children's checksums (see the merkle rule), so subtrees with identical content collapse to one address too.
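As a rough illustration of how identical subtrees can collapse to one address, the sketch below assumes a simplified rule: remove duplicate child checksums, sort them, join them with newlines, and hash the result. This rule is an assumption made for the example. The actual merkle rule is defined elsewhere in the snapdir documentation and may differ in its details. The function names are hypothetical, and SHA-256 stands in for BLAKE3.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Assumption: SHA-256 stands in for BLAKE3.
    return hashlib.sha256(data).hexdigest()


def directory_checksum(child_checksums) -> str:
    # Assumed rule, for illustration only: dedupe, sort, join, hash.
    unique = sorted(set(child_checksums))
    return checksum("\n".join(unique).encode())


docs = [checksum(b"readme\n"), checksum(b"license\n")]
# Same content, listed in a different order and with one duplicate.
backup = [checksum(b"license\n"), checksum(b"readme\n"), checksum(b"readme\n")]
other = [checksum(b"readme\n"), checksum(b"changelog\n")]

print(directory_checksum(docs) == directory_checksum(backup))  # same set of children
print(directory_checksum(docs) == directory_checksum(other))   # different children
```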
Why a cryptographic hash
A content address is only trustworthy if collisions are infeasible to find. BLAKE3 is a modern cryptographic hash, so an address doubles as an integrity guarantee: if the bytes you fetch hash back to the address you asked for, they are, to cryptographic certainty, exactly the bytes that were stored. Snapdir leans on this end to end, re-verifying every object's hash on fetch. For interoperability with legacy tooling, the checksum function can be switched to md5 or sha256. md5 in particular is no longer collision-resistant, so BLAKE3 remains the default and the basis of snapdir's security claims (see Integrity).
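Verification on fetch can be sketched in a few lines. This is a minimal illustration, not snapdir's code: fetch_verified and IntegrityError are hypothetical names, the store is a plain dictionary, and SHA-256 stands in for BLAKE3.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when fetched bytes do not hash back to the requested address."""


def fetch_verified(store: dict, address: str) -> bytes:
    data = store[address]
    # Assumption: SHA-256 stands in for BLAKE3.
    actual = hashlib.sha256(data).hexdigest()
    if actual != address:
        raise IntegrityError(f"expected {address}, got {actual}")
    return data


payload = b"foo\n"
address = hashlib.sha256(payload).hexdigest()
store = {address: payload}

print(fetch_verified(store, address) == payload)  # intact object is returned

store[address] = b"fox\n"  # simulate corruption or tampering in the store
try:
    fetch_verified(store, address)
except IntegrityError:
    print("rejected corrupted object")
```

Because the address is itself the expected digest, the reader needs no separate checksum file or trusted index to detect the corruption.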
Where to go next
- Manifests: how content addresses compose into a snapshot.
- Stores and cache: how addressed objects are laid out and shared.
- Integrity: how addresses are verified on every transfer.