Registry: Sharing Data via OCI Container Registries

The registry command group lets you exchange pre-built paper databases, embeddings, and clustering caches between different instances of Abstracts Explorer via any OCI-compatible container registry (e.g. GitHub Container Registry — ghcr.io).

This eliminates the need to re-download paper data, regenerate embeddings, and re-cluster papers on every new instance.

Overview

Data is packaged and pushed as OCI artifacts using the ORAS protocol. Each artifact tag identifies a conference, year, and embedding model, for example:

ghcr.io/thawn/abstracts-data:neurips-2024_text-embedding-qwen3-embedding-4b

Each uploaded artifact contains:

  • Paper database (papers-YYYY.db) — all papers, clustering cache, hierarchical labels, and embedding metadata for the given conference and year

  • Embeddings (embeddings-YYYY.json) — ChromaDB vector embeddings for all papers

Data for all years of a conference is bundled into a single tag (e.g. neurips_model-name) where each year is stored as its own pair of layers. The OCI registry deduplicates blobs by content hash, so no data is stored twice even when individual year tags and the all-years tag both exist.

Authentication

Registry operations require a GitHub Personal Access Token (PAT) with read:packages (for download) and write:packages (for upload) scopes.

You can provide the token in three ways:

1. .env file (recommended)

GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxx
REGISTRY_REPOSITORY=ghcr.io/thawn/abstracts-data

2. Environment variable

export GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxx

3. CLI flag

abstracts-explorer registry upload --token $GITHUB_TOKEN -r ghcr.io/thawn/abstracts-data ...

Commands

registry upload

Upload paper database and embeddings for a conference/year to the registry.

Usage:

abstracts-explorer registry upload [OPTIONS]

Options:

Option

Description

-r, --repository TEXT

OCI repository (e.g. ghcr.io/thawn/abstracts-data). Falls back to REGISTRY_REPOSITORY env var.

--token TEXT

Registry authentication token. Falls back to GITHUB_TOKEN env var.

-c, --conference TEXT

Conference name to upload (e.g. neurips). Use all to upload every conference. Case-insensitive.

-y, --year INTEGER

Year to upload. When omitted, all available years are uploaded.

--yes

Skip confirmation prompts (for CI/non-interactive use).

Upload validation:

  • Raises an error if no papers exist for the specified conference/year

  • Raises an error if no embeddings exist for the specified conference/year

  • Raises an error if no embedding model is recorded in the local database (run create-embeddings first)

Examples:

# Upload NeurIPS 2024 (tag derived automatically from embedding model)
abstracts-explorer registry upload \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference neurips \
  --year 2024

# Upload all available NeurIPS years (pushes individual year tags then an all-years tag)
abstracts-explorer registry upload \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference neurips

# Upload all conferences (non-interactive, for CI)
abstracts-explorer registry upload \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference all \
  --yes

registry download

Download paper database and embeddings for a conference/year from the registry.

Usage:

abstracts-explorer registry download [OPTIONS]

Options:

Option

Description

-r, --repository TEXT

OCI repository. Falls back to REGISTRY_REPOSITORY env var.

--token TEXT

Registry authentication token. Falls back to GITHUB_TOKEN env var.

-c, --conference TEXT

Conference name to download (e.g. neurips). Use all to download every available conference. Case-insensitive.

-y, --year INTEGER

Year to download. When omitted, all available years are downloaded.

--embedding-model TEXT

Embedding model name, used to derive the correct tag when no local data exists. Falls back to local DB metadata or EMBEDDING_MODEL env var.

--yes

Skip confirmation prompts (for CI/non-interactive use).

Download behavior:

  • Existing local data for the specified conference/year is replaced (no merge)

  • Both paper database and embeddings are always downloaded together to ensure consistency

  • If the downloaded data was created with a different embedding model than the local database, the download is rejected to prevent data corruption

Embedding model mismatch recovery:

If your local database uses a different embedding model from the one stored in the registry artifact, but the configured EMBEDDING_MODEL matches the remote model, the CLI offers to:

  1. Clear all local embeddings, clustering cache, and embedding metadata

  2. Retry the download with the new model

⚠️  Embedding model mismatch detected:
  Local database:  'model-a'
  Downloaded data: 'model-b'

The configured model ('model-b') matches the downloaded data.
To proceed, all local embeddings, clustering cache, and embedding metadata
must be cleared so the new model's data can be imported.
⚠️  This will delete ALL local embeddings and clustering cache!
Clear all local embedding data and retry download? [y/N]:

With --yes, this clear-and-retry happens automatically without prompting.

Examples:

# Download NeurIPS 2024
abstracts-explorer registry download \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference neurips \
  --year 2024

# Download when no local data exists (specify embedding model explicitly)
abstracts-explorer registry download \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference neurips \
  --year 2024 \
  --embedding-model text-embedding-qwen3-embedding-4b

# Download all available conferences (non-interactive, for CI)
abstracts-explorer registry download \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference all \
  --yes

registry list

List available tags in the registry, or inspect a specific tag’s metadata.

Usage:

abstracts-explorer registry list [OPTIONS]

Options:

Option

Description

-r, --repository TEXT

OCI repository. Falls back to REGISTRY_REPOSITORY env var.

--token TEXT

Registry authentication token. Falls back to GITHUB_TOKEN env var.

--tag TEXT

Inspect a specific tag and display its metadata (conference, year, embedding model, etc.).

Examples:

# List all available tags
abstracts-explorer registry list -r ghcr.io/thawn/abstracts-data

# Inspect a specific tag
abstracts-explorer registry list \
  -r ghcr.io/thawn/abstracts-data \
  --tag neurips-2024_text-embedding-qwen3-embedding-4b

Tag Format

Tags are automatically derived from the conference, year, and embedding model:

Data scope

Tag format

Example

Single year

{conference}-{year}_{model}

neurips-2024_text-embedding-qwen3-embedding-4b

All years

{conference}_{model}

neurips_text-embedding-qwen3-embedding-4b

The embedding model name is sanitized for OCI tag compatibility (lowercased, special characters replaced with hyphens, consecutive hyphens collapsed).

Typical Workflow

Uploading data from one instance

# 1. Download papers
abstracts-explorer download --conference neurips --year 2025

# 2. Generate embeddings
abstracts-explorer create-embeddings

# 3. (Optional) cluster embeddings to pre-populate the cache
abstracts-explorer cluster-embeddings

# 4. Upload to registry
abstracts-explorer registry upload \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference neurips \
  --year 2025

Downloading data on another instance

# Download and import NeurIPS 2025 (no download/embedding generation needed!)
abstracts-explorer registry download \
  -r ghcr.io/thawn/abstracts-data \
  --token $GITHUB_TOKEN \
  --conference neurips \
  --year 2025

Data Integrity

The registry feature is designed for safe, conflict-free imports:

  • No merges: Downloads always replace existing data for the specified conference+year. This prevents ID conflicts between instances.

  • Atomic import: If either the paper DB or embedding import fails, any partial changes are rolled back.

  • Pre-validation: Both the paper DB and embeddings files must be present before any import begins.

  • Embedding model consistency: The local database rejects imported data created with a different embedding model, preventing silent data corruption.

  • Scoped cache invalidation: Only clustering cache entries for the imported conference+year are replaced; other conferences and years are unaffected.

Docker / Container Usage

When running Abstracts Explorer in Docker, you can pre-populate the databases by running a one-off download container before starting the main service, or by adding the registry credentials to your .env file and running the download inside the running container.

See Docker Setup for details.