Registry Module

The registry module uploads and downloads Abstracts Explorer data artifacts to and from OCI-compatible container registries such as GitHub Container Registry (ghcr.io).

Overview

Data artifacts include:

  • Paper database — conference papers with metadata

  • Embeddings — ChromaDB vector embeddings

  • Clustering cache — pre-computed clustering results

Artifacts are tagged by conference, year, and embedding model (e.g., neurips-2024_text-embedding-qwen3-embedding-4b).
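As a rough illustration of this naming scheme, the sketch below joins a conference, an optional year, and a model name in the same layout as the example tag. The helper `build_tag` is hypothetical and not part of the module; real tags may carry additional components (for example a version suffix), so treat this only as a reading aid.

```python
from typing import Optional


def build_tag(conference: str, model: str, year: Optional[int] = None) -> str:
    """Hypothetical helper: compose a tag shaped like the example above."""
    # Conference and year are joined with "-"; the model follows after "_".
    base = conference if year is None else f"{conference}-{year}"
    return f"{base}_{model}"


print(build_tag("neurips", "text-embedding-qwen3-embedding-4b", year=2024))
```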

Quick Start

from abstracts_explorer.registry import RegistryClient

# Initialize client
client = RegistryClient(
    repository="ghcr.io/thawn/abstracts-data",
    token="ghp_xxxxxxxxxxxxxxxxxxxx"
)

# List available tags
tags = client.list_tags()
print(tags)

# Download data for a specific conference and year
client.download(conference="neurips", year=2024)

# Upload local data
client.upload(conference="neurips", year=2024)

See the Registry Guide for CLI usage and full documentation.

API Reference

Registry module for uploading and downloading data to/from OCI-compatible container registries.

This module provides functionality to push and pull abstracts-explorer data artifacts (paper databases, embedding databases, clustering caches) to OCI-compatible registries such as GitHub Container Registry (ghcr.io).

Artifacts are pushed and pulled using the oras Python SDK. Each artifact tag is built from the conference (e.g. neurips) or from the conference and year (e.g. neurips-2024), combined with the embedding model. A conference-only tag contains all available years, with each year stored as its own set of OCI layers (paper DB + embeddings + clustering cache).

Examples

Upload data for a specific year:

from abstracts_explorer.registry import RegistryClient

client = RegistryClient(
    repository="ghcr.io/thawn/abstracts-data",
    token="ghp_xxxx",
)
client.upload(conference="neurips", year=2024)

Upload all available years for a conference:

client.upload(conference="neurips")

Download data from the registry:

client.download(conference="neurips", year=2024)

List available tags:

tags = client.list_tags()

exception abstracts_explorer.registry.RegistryError

Bases: Exception

Exception raised for registry operation errors.

exception abstracts_explorer.registry.EmbeddingModelMismatchError(local_model, remote_model)

Bases: RegistryError

Raised when the embedding model in the local database does not match the model in the downloaded artifact.

Variables:
  • local_model (str) – Embedding model currently stored in the local database.

  • remote_model (str) – Embedding model used by the downloaded artifact.

__init__(local_model, remote_model)

class abstracts_explorer.registry.RegistryClient(repository, token=None)

Bases: object

Client for pushing and pulling data artifacts to/from OCI-compatible registries.

Uses the oras Python SDK to interact with OCI registries.

The smallest unit of upload/download is a conference + year combination. Each artifact always contains the paper database, embeddings, and clustering cache together to prevent inconsistent data.

When year is omitted, all available years for the conference are uploaded or downloaded, with each year stored as its own set of OCI layers (paper database, embeddings, and clustering cache).

Parameters:
  • repository (str) – Full OCI repository path (e.g., ghcr.io/thawn/abstracts-data).

  • token (str, optional) – Authentication token (e.g., GitHub Personal Access Token). If not provided, the client falls back to the GITHUB_TOKEN environment variable.
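The token fallback described above can be pictured with a small sketch. The function `resolve_token` is a hypothetical stand-in, not the client's actual code; it only mirrors the documented order (explicit argument first, then the GITHUB_TOKEN environment variable).

```python
import os
from typing import Optional


def resolve_token(token: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper mirroring the documented fallback order."""
    # An explicit token wins; otherwise fall back to GITHUB_TOKEN.
    if token:
        return token
    return os.environ.get("GITHUB_TOKEN")
```

Reading the token from the environment keeps credentials out of source files, which is usually preferable to passing a literal token string.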

Raises:

RegistryError – If the repository format is invalid.

Examples

>>> client = RegistryClient("ghcr.io/thawn/abstracts-data", token="ghp_xxxx")
>>> client.list_tags()
['neurips-2024', 'iclr-2025']
__init__(repository, token=None)

list_tags()

List available tags in the repository.

Returns:

Available tags.

Return type:

list of str

Raises:

RegistryError – If listing fails.

static clear_local_embedding_data()

Clear all local embedding data — metadata, ChromaDB collection, and clustering cache.

This is a destructive operation that removes all embedding-related data from the local databases so that data with a different embedding model can be imported. Use with care.

After calling this method, the next download will import fresh data and establish a new embedding model association in the local database.

Return type:

None

upload(conference, year=None, tag=None, progress_callback=None)

Upload data for a conference (and optionally a specific year) to the registry.

Packages the paper database, embeddings, and clustering cache as OCI layers and pushes them together. All three must be present for every year; an error is raised if any data is missing.

When year is not None, a single per-year tag is pushed (e.g. neurips-2024_model).

When year is None, every year available locally is first pushed as its own individual tag (e.g. neurips-2024_model, neurips-2025_model) and then an all-years summary tag (e.g. neurips_model) is pushed containing all years’ files as layers. Because OCI blobs are content-addressed, the registry deduplicates the files — no data is actually stored twice.
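The deduplication claim follows from how content addressing works: a blob is identified by the digest of its bytes, so identical files referenced from two tags resolve to the same blob. The sketch below shows this with a SHA-256 digest in the `sha256:<hex>` form used by OCI; the byte string is a made-up stand-in for a real data file, and `blob_digest` is an illustrative helper rather than part of the module.

```python
import hashlib


def blob_digest(data: bytes) -> str:
    """Return an OCI-style digest string for a blob's bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Stand-in bytes for one year's paper database file.
paper_db = b"example paper database bytes"

# The same bytes referenced from a per-year tag and from an all-years tag
# produce the same digest, so a registry only needs to keep one copy.
per_year_layer = blob_digest(paper_db)
all_years_layer = blob_digest(paper_db)
print(per_year_layer == all_years_layer)  # True
```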

Parameters:
  • conference (str) – Conference name (e.g. neurips).

  • year (int, optional) – Conference year (e.g. 2024). When None, all available years are uploaded.

  • tag (str, optional) – Custom tag. If None, derived from embedding model, conference and year.

  • progress_callback (callable, optional) – Function called with status messages during upload.
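A progress callback might look like the sketch below. It assumes the callback receives a single status string per call, which the description above suggests but does not spell out; `make_progress_recorder` is a hypothetical helper, and the client call is left as a comment because it needs a real registry.

```python
from typing import Callable, List, Tuple


def make_progress_recorder() -> Tuple[Callable[[str], None], List[str]]:
    """Return a callback plus the list it appends messages to."""
    messages: List[str] = []

    def callback(message: str) -> None:
        # Keep every message and echo it so long uploads show activity.
        messages.append(message)
        print(f"[registry] {message}")

    return callback, messages


callback, messages = make_progress_recorder()
callback("example status message")  # simulates what the client would send
# client.upload(conference="neurips", year=2024, progress_callback=callback)
```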

Returns:

Upload summary with paper count, embedding count, years, tag, and (when multiple years are uploaded) year_tags listing the per-year tags pushed.

Return type:

dict

Raises:

RegistryError – If upload fails or required data is missing.

download(conference, year=None, tag=None, embedding_model=None, progress_callback=None, ignore_embedding_model_mismatch=False)

Download data for a conference (and optionally a specific year) from the registry.

Pulls the paper database, embeddings, and clustering cache and replaces existing local data for the specified conference and year(s).

When year is None, all years contained in the artifact are downloaded.

Parameters:
  • conference (str) – Conference name (e.g. neurips).

  • year (int, optional) – Conference year (e.g. 2024). When None, all years in the artifact are imported.

  • tag (str, optional) – Custom tag. If None, derived from embedding model, conference and year.

  • embedding_model (str, optional) – Embedding model name used for tag derivation. When None and tag is also None, the model is read from the EMBEDDING_MODEL configuration. A RegistryError is raised if the model cannot be determined.

  • progress_callback (callable, optional) – Function called with status messages during download.

  • ignore_embedding_model_mismatch (bool, optional) – When True, proceed with the download even if the artifact’s embedding model differs from the configured model. After a successful import the local embedding model metadata is updated to match embedding_model. Only use this option when the mismatch is caused by the same model having different names on different backends (e.g. LM Studio vs. Ollama). Default is False.

Returns:

Download summary with paper count and embedding count.

Return type:

dict

Raises:
  • EmbeddingModelMismatchError – If the artifact’s embedding model differs from embedding_model and ignore_embedding_model_mismatch is False.

  • RegistryError – If download fails or the embedding model cannot be determined.
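One way to handle a model mismatch is to catch the exception and report both model names before deciding how to proceed. The sketch below is only an illustration: `download_or_report` is a hypothetical wrapper, and the fallback exception class exists solely so the snippet runs where the package is not installed.

```python
try:
    from abstracts_explorer.registry import EmbeddingModelMismatchError
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    class EmbeddingModelMismatchError(Exception):
        def __init__(self, local_model, remote_model):
            super().__init__(f"{local_model} != {remote_model}")
            self.local_model = local_model
            self.remote_model = remote_model


def download_or_report(client, conference, year=None):
    """Try a download; on a model mismatch, report it and return None."""
    try:
        return client.download(conference=conference, year=year)
    except EmbeddingModelMismatchError as err:
        # The caller decides what to do next: clear the local embedding
        # data, or retry with ignore_embedding_model_mismatch=True if only
        # the model's name differs between backends.
        print(f"local model {err.local_model!r}, "
              f"artifact model {err.remote_model!r}")
        return None
```

Returning None instead of retrying automatically keeps the destructive choices (clearing local data or overriding the check) explicit.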

upload_all(progress_callback=None)

Upload data for all conferences available locally.

Each conference is uploaded as a separate OCI artifact with a conference-only tag (containing all years for that conference).

Parameters:

progress_callback (callable, optional) – Function called with status messages during upload.

Returns:

Upload summaries, one per conference.

Return type:

list of dict

Raises:

RegistryError – If no conferences are found or any upload fails.

download_all(progress_callback=None, ignore_embedding_model_mismatch=False)

Download data for all conference tags in the registry.

Lists available tags and downloads every conference-level tag (i.e. tags without a year suffix).
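The selection of conference-level tags can be sketched as a simple filter. This assumes that a per-year tag begins with the conference name followed by a hyphen and a four-digit year, as in the tag examples elsewhere on this page; the pattern and the helper `conference_level_tags` are illustrative, not the module's actual logic.

```python
import re
from typing import List

# Assumed pattern: a per-year tag starts with "<conference>-<4-digit year>".
_YEAR_TAG = re.compile(r"^[^_]+-\d{4}(?:_|$)")


def conference_level_tags(tags: List[str]) -> List[str]:
    """Keep only tags whose leading component carries no year suffix."""
    return [tag for tag in tags if not _YEAR_TAG.match(tag)]


print(conference_level_tags(["neurips_model", "neurips-2024_model", "iclr-2025"]))
```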

Parameters:
  • progress_callback (callable, optional) – Function called with status messages during download.

  • ignore_embedding_model_mismatch (bool, optional) – If True, ignore embedding model mismatches during download.

Returns:

Download summaries, one per conference tag.

Return type:

list of dict

Raises:

RegistryError – If no tags are found or any download fails.

get_artifact_info(tag)

Get metadata about a specific artifact tag.

Parameters:

tag (str) – Tag to inspect.

Returns:

Artifact metadata including version, conference, year, and counts.

Return type:

dict

Raises:

RegistryError – If the tag is not found or cannot be read.

delete_old_versions(below_version, conference=None, dry_run=False, progress_callback=None)

Delete registry package versions whose tag version is older than below_version.

Uses the GitHub Packages API to list and delete container image versions. Only versions that carry at least one OCI tag matching the abstracts-explorer tag format ({conference}[-{year}]_{model}_{version}) are considered; untagged (dangling) versions are left untouched.
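To make the tag format concrete, the sketch below splits a tag into its components with a regular expression. It assumes the conference name contains no underscore and that the version is whatever follows the last underscore; `TAG_PATTERN` and `parse_tag` are hypothetical names, not part of the module.

```python
import re
from typing import Dict, Optional

# Regex for {conference}[-{year}]_{model}_{version}; assumes the conference
# has no underscore and the version is the part after the last underscore.
TAG_PATTERN = re.compile(
    r"^(?P<conference>[^_]+?)(?:-(?P<year>\d{4}))?"
    r"_(?P<model>.+)_(?P<version>[^_]+)$"
)


def parse_tag(tag: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a tag into its parts, or return None if it does not match."""
    match = TAG_PATTERN.match(tag)
    return match.groupdict() if match else None


print(parse_tag("neurips-2024_text-embedding-qwen3-embedding-4b_0.4.0"))
print(parse_tag("latest"))  # None: not in the expected format
```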

Parameters:
  • below_version (str) – Threshold version string (PEP 440). Versions strictly older than this value are deleted. Example: "0.4.0" deletes all versions tagged with a version < 0.4.0.

  • conference (str, optional) – When provided, only tags whose base component starts with conference (case-insensitive) are examined. Tags for other conferences are ignored.

  • dry_run (bool, optional) – When True, log which versions would be deleted but perform no actual deletions (default: False).

  • progress_callback (callable, optional) – Function called with status messages during the operation.

Returns:

One entry per deleted (or, in dry-run mode, would-be-deleted) version. Each dict contains version_id, tags, and version.

Return type:

list of dict

Raises:
  • RegistryError – If the registry is not hosted on ghcr.io, if the GitHub API call fails, or if below_version cannot be parsed.

  • ValueError – If below_version is not a valid PEP 440 version string.
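The "strictly older than" rule can be illustrated with a simplified comparison. The sketch below handles only plain dotted releases such as 0.4.0 and is not a full PEP 440 parser (the `packaging.version.Version` class is the usual tool for that); `release_tuple` and `tags_below` are hypothetical helpers, and the version is assumed to be the part of the tag after the last underscore.

```python
from typing import List, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as "0.4.0" (not full PEP 440)."""
    return tuple(int(part) for part in version.split("."))


def is_older(candidate: Tuple[int, ...], threshold: Tuple[int, ...]) -> bool:
    """Compare after zero-padding so that 0.4 and 0.4.0 count as equal."""
    width = max(len(candidate), len(threshold))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(candidate) < pad(threshold)


def tags_below(tags: List[str], below_version: str) -> List[str]:
    """Return tags whose trailing version is strictly older than the threshold."""
    threshold = release_tuple(below_version)
    selected = []
    for tag in tags:
        # The version is assumed to be the part after the last underscore.
        _, sep, version = tag.rpartition("_")
        if not sep:
            continue  # no version component, e.g. "latest"
        try:
            if is_older(release_tuple(version), threshold):
                selected.append(tag)
        except ValueError:
            continue  # trailing part is not a plain dotted version
    return selected
```

Running a selection like this first, much as dry_run=True does for the real method, makes it possible to review what would be deleted before anything is removed.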