Content addressing

Snapdir is a content-addressed system: every piece of data is named by a cryptographic hash of its own bytes, not by a path, a filename, or a timestamp. The address is the content's fingerprint. Change a single byte and the address changes; leave the bytes untouched and the address is stable forever, on any machine, in any store.

This one idea is what makes snapdir snapshots deterministic, deduplicating, and self-verifying.

The hash is the name

By default snapdir hashes content with BLAKE3, producing a 256-bit digest rendered as a 64-character lowercase hex string. Two kinds of things get a content address:

  • Objects — the raw bytes of every regular file. An object's address is the BLAKE3 hash of its contents.

  • Manifests — the text document describing a whole directory tree. A manifest's address is its snapshot ID, the BLAKE3 hash of the manifest text.

Because the name is derived purely from the bytes, the mapping is deterministic. Hashing foo\n always yields 49dc870df1de7fd60794cebce449f5ccdae575affaa67a24b62acb03e039db92, and the empty file always yields af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 — whether the snapshot was produced today, last year, or by any tool that follows the same manifest spec. Identical content always lands at the same address; different content (with overwhelming probability) lands at a different one.
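
The determinism described above can be sketched in a few lines. This is an illustration only, not snapdir's implementation: the function name content_address is hypothetical, and SHA-256 from the Python standard library stands in for BLAKE3 (both produce 256-bit digests rendered as 64 hex characters), so the addresses it prints will not match the BLAKE3 values quoted above.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a 64-character lowercase hex address for the given bytes.

    Assumption: SHA-256 stands in for BLAKE3 so the sketch needs only the
    standard library. The name `content_address` is hypothetical.
    """
    return hashlib.sha256(data).hexdigest()


a = content_address(b"foo\n")
b = content_address(b"foo\n")
c = content_address(b"fop\n")

print(a == b)  # same bytes, same address
print(a == c)  # one byte changed, different address
print(len(a))  # number of hex characters in the address
```

Nothing but the bytes goes into the function: no path, no timestamp, no machine identity. That is why the result is reproducible anywhere.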

Deduplication falls out for free

Content addressing makes deduplication automatic rather than a feature bolted on top. If a file's bytes already exist at their address — in your local cache or in a remote store — there is nothing new to write. Snapdir's push path is skip-if-present: it only uploads objects whose content address is absent, so re-snapshotting a tree where only one file changed transfers just that one file's object.

Deduplication works at three levels at once:

  • Within a snapshot — two files with identical bytes share one object.

  • Across snapshots — unchanged files across revisions reuse the same objects; successive snapshots cost only the deltas.

  • Across trees and stores — any two directories that happen to share content share its storage, regardless of where they came from.
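
A minimal sketch of skip-if-present storage may make these levels concrete. Everything here is hypothetical (the ObjectStore class, the push function, the in-memory dictionary), and SHA-256 again stands in for BLAKE3; a real store would live on disk or in a remote bucket.

```python
import hashlib


def content_address(data: bytes) -> str:
    # Assumption: SHA-256 stands in for BLAKE3 (standard library only).
    return hashlib.sha256(data).hexdigest()


class ObjectStore:
    """Hypothetical in-memory store keyed by content address."""

    def __init__(self) -> None:
        self.objects = {}  # address -> bytes
        self.writes = 0    # counts objects actually written

    def put(self, data: bytes) -> str:
        address = content_address(data)
        if address not in self.objects:  # skip-if-present
            self.objects[address] = data
            self.writes += 1
        return address


def push(tree: dict, store: ObjectStore) -> dict:
    """Store every file of a tree; return a path -> address mapping."""
    return {path: store.put(data) for path, data in sorted(tree.items())}


store = ObjectStore()

# Within a snapshot: two files with identical bytes share one object.
v1 = {"a.txt": b"hello\n", "copy-of-a.txt": b"hello\n", "b.txt": b"world\n"}
push(v1, store)
writes_after_v1 = store.writes

# Across snapshots: only the changed file adds a new object.
v2 = {**v1, "b.txt": b"world!\n"}
push(v2, store)
writes_after_v2 = store.writes

print(writes_after_v1, writes_after_v2)
```

The first push writes two objects for three files, and the second push writes one more for the single changed file. The store never needs to compare file contents; checking whether the address already exists is enough.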

The same property extends to directories. A directory's checksum is derived from the deduplicated set of its children's checksums (see the Merkle rule), so subtrees with identical content collapse to one address too.
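
To illustrate the idea only, the sketch below hashes the sorted, deduplicated child checksums. This rule is an assumption made for the example, not snapdir's actual Merkle rule, which may order, join, or encode the children differently. The function names are hypothetical, and SHA-256 stands in for BLAKE3.

```python
import hashlib


def content_address(data: bytes) -> str:
    # Assumption: SHA-256 stands in for BLAKE3 (standard library only).
    return hashlib.sha256(data).hexdigest()


def directory_checksum(child_checksums: list) -> str:
    """Hypothetical rule: hash the sorted, deduplicated child checksums.

    NOT taken from snapdir's spec; the real rule may differ in detail.
    """
    unique_sorted = sorted(set(child_checksums))
    return content_address("".join(unique_sorted).encode("ascii"))


foo = content_address(b"foo\n")
bar = content_address(b"bar\n")

# Same set of child checksums, listed in a different order and with a duplicate.
dir_one = directory_checksum([foo, bar])
dir_two = directory_checksum([bar, foo, foo])

print(dir_one == dir_two)
```

Because the directory's address depends only on the set of child addresses, two subtrees with the same content produce the same address under this rule, wherever they sit in the tree.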

Why a cryptographic hash

A content address is only trustworthy if collisions are infeasible to forge. BLAKE3 is a modern cryptographic hash, so an address doubles as an integrity guarantee: if the bytes you fetch hash back to the address you asked for, they are, barring a collision that is computationally infeasible to find, exactly the bytes that were stored. Snapdir leans on this end to end, re-verifying every object's hash on fetch. For interoperability with legacy tooling, the checksum function can be switched to md5 or sha256. Note that md5 is no longer collision-resistant, so it does not carry the same guarantee; BLAKE3 remains the default and the basis of snapdir's security claims (see Integrity).
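
Verify-on-fetch can be sketched as follows. This is a sketch under stated assumptions, not snapdir's fetch code: fetch_verified and IntegrityError are hypothetical names, the store is a plain dictionary, and SHA-256 stands in for BLAKE3.

```python
import hashlib


def content_address(data: bytes) -> str:
    # Assumption: SHA-256 stands in for BLAKE3 (standard library only).
    return hashlib.sha256(data).hexdigest()


class IntegrityError(Exception):
    """Raised when fetched bytes do not hash back to the requested address."""


def fetch_verified(objects: dict, address: str) -> bytes:
    """Hypothetical fetch: look up bytes by address, then re-hash them."""
    data = objects[address]
    actual = content_address(data)
    if actual != address:
        raise IntegrityError(f"expected {address}, got {actual}")
    return data


good = b"foo\n"
addr = content_address(good)

store = {addr: good}
fetched = fetch_verified(store, addr)  # hashes back to addr, so it is returned

store[addr] = b"fop\n"  # simulate corruption or tampering in the store
try:
    fetch_verified(store, addr)
    corruption_detected = False
except IntegrityError:
    corruption_detected = True

print(fetched == good, corruption_detected)
```

The caller does not have to trust the store or the transport: the address it already holds is enough to check whatever comes back.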

Where to go next

  • Manifests — how content addresses compose into a snapshot.

  • Stores and cache — how addressed objects are laid out and shared.

  • Integrity — how addresses are verified on every transfer.