Parquet feature mirrors for inference

Serve slow-moving feature tables to long-running inference endpoints as ordinary local Parquet files, pinning the exact partition set by snapshot ID.

The problem

Inference endpoints often need feature tables that are larger and slower-moving than the model itself: daily partitions, window aggregates, reference data, and other signals that arrive over time. Keeping those files only in GCS makes every runtime depend on remote reads. Copying whole directories to each host makes updates clumsy, and a file for a partition that has since been removed upstream stays visible locally until something deletes it.

What the runtime needs is simpler: a local directory containing exactly the Parquet files the application should consider, updated when a new snapshot is published.

Why snapdir

Snapdir gives the feature set a content-derived snapshot ID, transfers only the objects a destination is missing, and can expose a local snapshot as read-only symlinks. The endpoint reads files from disk; deployment controls which snapshot ID is visible.

Walkthrough

A publisher prepares the partition set for the endpoint and pushes it to a GCS snapdir store:

features_id=$(snapdir push --store gs://ml-feature-snapshots ./features)

Each inference host keeps a local file store warm. Re-running sync copies only objects that are missing locally, so a new daily partition does not re-copy old partitions:

snapdir sync \
  --id "$features_id" \
  --from gs://ml-feature-snapshots \
  --to file:///srv/snapdir/features
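
The missing-only behavior can be pictured with plain shell tools. The sketch below is not snapdir's implementation: `sync_missing_objects` is a hypothetical helper, and the flat layout with one file per object (named by its content hash) is an assumption made for illustration.

```shell
# Hypothetical helper, not part of snapdir. Copies objects from one
# directory store to another and skips any object whose name (assumed to
# be its content hash) already exists at the destination.
# Prints the number of objects copied.
sync_missing_objects() (
  src="$1" dst="$2" copied=0
  mkdir -p "$dst"
  for obj in "$src"/*; do
    [ -f "$obj" ] || continue            # also skips the unmatched glob
    name=$(basename "$obj")
    if [ ! -e "$dst/$name" ]; then
      cp "$obj" "$dst/$name"
      copied=$((copied + 1))
    fi
  done
  echo "$copied"
)
```

Because an object's name is derived from its content, an existing name is taken to mean the bytes are already present, so a second run after one new partition arrives copies one object rather than the whole set.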

Then expose that snapshot as the directory the service reads:

snapdir pull \
  --store file:///srv/snapdir/features \
  --id "$features_id" \
  --linked --delete \
  /srv/inference/features

--linked makes the files in /srv/inference/features read-only symlinks into the local content-addressed store. --delete removes symlinks for partitions that are no longer in the snapshot, so the endpoint sees only the current feature set.
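
To make the effect of the two flags concrete, here is a minimal sketch of the same idea using only `ln` and `rm`. It is an illustration under stated assumptions, not how snapdir is implemented: `link_view` is a hypothetical helper, the manifest format (one `name<TAB>object` pair per line) is invented for the example, and partition files are assumed to sit flat in one directory.

```shell
# Hypothetical helper, not snapdir's implementation. Makes "$view" hold one
# symlink per "name<TAB>object" line of "$manifest", each pointing at
# "$store/<object>", then removes symlinks the manifest no longer lists.
link_view() (
  manifest="$1" store="$2" view="$3"
  tab=$(printf '\t')
  mkdir -p "$view"
  # Create or repoint a link for every partition file in the manifest.
  while IFS="$tab" read -r name object; do
    ln -sf "$store/$object" "$view/$name"
  done < "$manifest"
  # Remove links whose names are absent from the manifest.
  for link in "$view"/*; do
    [ -L "$link" ] || continue
    name=$(basename "$link")
    cut -f1 "$manifest" | grep -qxF -- "$name" || rm "$link"
  done
)
```

Only the links are removed in the second loop; the objects they pointed at stay in the store, which is why switching back to an earlier snapshot would not need a new download in this model.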

Outcome

The service uses normal local Parquet tooling. Snapdir handles distribution and integrity: new partitions are fetched once per host, unchanged partitions stay in the local store, and every runtime can report the snapshot ID it is serving. Snapdir verifies file bytes, not Parquet schema or row-level semantics, so those checks still belong in the feature pipeline.
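
One way a runtime could report the snapshot it is serving is to have the deploy step record the ID after a successful pull and have the service read it back. This is a sketch under assumptions: `record_snapshot_id`, `current_snapshot_id`, and the state-file path in the usage comment are hypothetical names, not snapdir features.

```shell
# Hypothetical helpers, not part of snapdir.

# Write the snapshot ID to a state file. The temp file plus rename means a
# reader sees either the old ID or the new one, never a partial write
# (rename is atomic within a single filesystem).
record_snapshot_id() (
  id="$1" state_file="$2"
  tmp="$state_file.tmp.$$"
  printf '%s\n' "$id" > "$tmp"
  mv "$tmp" "$state_file"
)

# Print the recorded snapshot ID, for example for a health endpoint or a log line.
current_snapshot_id() {
  cat "$1"
}

# Usage (paths are assumptions), run after the pull succeeds:
#   record_snapshot_id "$features_id" /srv/inference/features.snapshot-id
#   current_snapshot_id /srv/inference/features.snapshot-id
```

Keeping the state file outside the served directory avoids it being treated as a stray file the next time the view is reconciled.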