Embeddings Module

The embeddings module provides vector embeddings functionality for semantic search using ChromaDB.

Overview

The EmbeddingsManager class handles:

  • Creating vector embeddings from paper abstracts

  • Storing embeddings in ChromaDB

  • Semantic similarity search

  • Integration with LM Studio embedding models

Class Reference

This module provides functionality to generate text embeddings for paper abstracts and store them in a vector database with paper metadata.

The module uses an OpenAI-compatible API (such as LM Studio or blablador) to generate embeddings and stores them in ChromaDB for efficient similarity search.

class abstracts_explorer.embeddings.RateLimitedTransport(transport, requests_per_minute)

Bases: BaseTransport

An httpx transport that enforces a maximum requests-per-minute rate.

Wraps an existing transport and acquires tokens from a shared token bucket rate limiter before forwarding requests. This provides thread-safe, global rate limiting that works correctly under Waitress’s multi-threaded WSGI server.

Parameters:
  • transport (httpx.BaseTransport) – The underlying transport to delegate requests to.

  • requests_per_minute (int) – Maximum number of requests per minute. Must be > 0.

__init__(transport, requests_per_minute)

handle_request(request)

Send request after acquiring a rate-limit token.

Return type:

Response

close()

Close the underlying transport.

Return type:

None
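
The token bucket behind this transport can be sketched as follows. This is a minimal illustration, not the module's actual limiter: the class name TokenBucket, its attributes, and the injectable clock and sleep arguments are all assumptions made for the example.

```python
import threading
import time


class TokenBucket:
    """Hypothetical token bucket; names and structure are assumptions."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be > 0")
        self.capacity = float(requests_per_minute)
        self.refill_per_second = requests_per_minute / 60.0
        self.tokens = self.capacity  # start with a full bucket
        self._clock = clock
        self._sleep = sleep
        self._last = clock()
        self._lock = threading.Lock()  # one lock shared by all calling threads

    def _refill(self):
        # Add the tokens earned since the last refill, capped at capacity.
        now = self._clock()
        earned = (now - self._last) * self.refill_per_second
        self.tokens = min(self.capacity, self.tokens + earned)
        self._last = now

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            with self._lock:
                self._refill()
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                wait = (1.0 - self.tokens) / self.refill_per_second
            # Sleep outside the lock so other threads are not blocked.
            self._sleep(wait)
```

A transport built on this idea would call acquire() at the top of handle_request() and then delegate to the wrapped transport.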

class abstracts_explorer.embeddings.AsyncRateLimitedTransport(transport, requests_per_minute)

Bases: AsyncBaseTransport

An async httpx transport that enforces a maximum requests-per-minute rate.

Wraps an existing async transport and acquires tokens from a shared token bucket rate limiter before forwarding requests. This provides thread-safe, global rate limiting that works with Pydantic AI’s async OpenAI provider.

Parameters:
  • transport (httpx.AsyncBaseTransport) – The underlying async transport to delegate requests to.

  • requests_per_minute (int) – Maximum number of requests per minute. Must be > 0.

__init__(transport, requests_per_minute)

async handle_async_request(request)

Send request after acquiring a rate-limit token asynchronously.

Return type:

Response

async aclose()

Close the underlying async transport.

Return type:

None

exception abstracts_explorer.embeddings.EmbeddingsError

Bases: Exception

Exception raised for embedding operations.

class abstracts_explorer.embeddings.EmbeddingsManager(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None, requests_per_minute=None)

Bases: object

Manager for generating and storing text embeddings.

This class handles:

  • Connecting to an OpenAI-compatible API for embedding generation

  • Creating and managing a ChromaDB collection

  • Embedding paper abstracts with metadata

  • Similarity search operations

Parameters:
  • lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint, by default “http://localhost:1234”

  • model_name (str, optional) – Name of the embedding model, by default “text-embedding-qwen3-embedding-4b”

  • collection_name (str, optional) – Name of the ChromaDB collection, by default “papers”

  • requests_per_minute (int, optional) – Maximum number of API requests per minute. Set to 0 to disable rate limiting. If None, uses the value from config (default: 60).

Variables:
  • lm_studio_url (str) – OpenAI-compatible API endpoint URL.

  • model_name (str) – Embedding model name.

  • embedding_db (str) – ChromaDB configuration - URL for HTTP service or path for local storage.

  • collection_name (str) – ChromaDB collection name.

  • client (chromadb.Client) – ChromaDB client instance. Connected automatically on first access.

  • collection (chromadb.Collection) – Active ChromaDB collection. Created automatically on first access (which also connects the client if not yet connected).

Examples

>>> em = EmbeddingsManager()
>>> em.add_paper(paper_dict)  # connect() and create_collection() called automatically
>>> results = em.search_similar("machine learning", n_results=5)
>>> em.close()

__init__(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None, requests_per_minute=None)

Initialize the EmbeddingsManager.

Parameters are optional and will use values from environment/config if not provided.

Parameters:
  • lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint. If None, uses config value.

  • model_name (str, optional) – Name of the embedding model. If None, uses config value.

  • collection_name (str, optional) – Name of the ChromaDB collection. If None, uses config value.

  • requests_per_minute (int, optional) – Maximum number of API requests per minute. Set to 0 to disable rate limiting. If None, uses the value from config (default: 60).

property client: Any

Get the ChromaDB client, connecting automatically on first access.

Returns:

Initialized ChromaDB client instance.

Return type:

chromadb.Client

Raises:

EmbeddingsError – If connecting to ChromaDB fails.

property collection: Any

Get the ChromaDB collection, creating it automatically on first access.

Accessing this property for the first time also triggers connect() if the client has not been initialized yet.

Returns:

Initialized ChromaDB collection.

Return type:

chromadb.Collection

Raises:

EmbeddingsError – If connecting to ChromaDB or creating the collection fails.
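
The lazy behaviour of client and collection can be illustrated with a small stand-in class. This is only a sketch of the pattern: LazyResources, its calls log, and the placeholder objects are hypothetical and do not reflect the real implementation.

```python
class LazyResources:
    """Hypothetical stand-in showing lazy client/collection properties."""

    def __init__(self):
        self._client = None
        self._collection = None
        self.calls = []  # records the order in which setup steps ran

    def connect(self):
        self.calls.append("connect")
        self._client = object()  # placeholder for a real database client

    @property
    def client(self):
        # Connect only on first access.
        if self._client is None:
            self.connect()
        return self._client

    @property
    def collection(self):
        # Reading self.client here connects first if that has not happened yet.
        if self._collection is None:
            client = self.client
            self.calls.append("create_collection")
            self._collection = {"client": client, "name": "papers"}
        return self._collection
```

Reading collection on a fresh instance runs connect first and create_collection second; later reads reuse the cached objects.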

property openai_client: OpenAI

Get the OpenAI client, creating it lazily on first access.

When requests_per_minute is greater than 0 a RateLimitedTransport is wrapped around the default httpx transport and passed as the http_client argument so that every HTTP request is automatically throttled.

This lazy loading prevents API calls during test collection.

Returns:

Initialized OpenAI client instance.

Return type:

OpenAI

connect()

Connect to ChromaDB.

Uses HTTP client if embedding_db is a URL, otherwise uses persistent client with local storage directory.

Raises:

EmbeddingsError – If connection fails.

Return type:

None
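
One way to make the URL-versus-path decision is to parse the setting with the standard library. The helpers below (is_http_url, client_kind) are hypothetical names for illustration; the module's own check may differ.

```python
from urllib.parse import urlparse


def is_http_url(embedding_db: str) -> bool:
    """Return True when the setting looks like an HTTP(S) service URL."""
    parsed = urlparse(embedding_db)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def client_kind(embedding_db: str) -> str:
    """Name the kind of client that would be chosen for this setting."""
    return "http" if is_http_url(embedding_db) else "persistent"
```

Anything that does not parse as an http or https URL, including relative paths and Windows drive paths, is treated as a local storage directory in this sketch.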

close()

Close the ChromaDB connection.

Does nothing if not connected.

Return type:

None

__enter__()

Context manager entry.

__exit__(exc_type, exc_val, exc_tb)

Context manager exit.

test_lm_studio_connection()

Test connection to OpenAI-compatible API endpoint.

Returns:

True if connection is successful, False otherwise.

Return type:

bool

Examples

>>> em = EmbeddingsManager()
>>> if em.test_lm_studio_connection():
...     print("API is accessible")

generate_embedding(text)

Generate embedding for a given text using OpenAI-compatible API.

Rate limiting (if configured via requests_per_minute) is handled transparently by the underlying httpx transport.

Parameters:

text (str) – Text to generate embedding for.

Returns:

Embedding vector.

Return type:

List[float]

Raises:

EmbeddingsError – If embedding generation fails.

Examples

>>> em = EmbeddingsManager()
>>> embedding = em.generate_embedding("Sample text")
>>> len(embedding)
4096

create_collection(reset=False)

Create or get ChromaDB collection.

Parameters:

reset (bool, optional) – If True, delete existing collection and create new one, by default False

Raises:

EmbeddingsError – If collection creation fails or not connected.

Return type:

None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.create_collection(reset=True)  # Reset existing collection

paper_exists(paper_id)

Check if a paper already exists in the collection.

Parameters:

paper_id (int or str) – Unique identifier for the paper.

Returns:

True if paper exists in collection, False otherwise.

Return type:

bool

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_exists("uid1")
False
>>> em.add_paper(paper_dict)
>>> em.paper_exists("uid1")
True

paper_needs_update(paper)

Check if a paper needs to be updated in the collection.

Parameters:

paper (dict) – Dictionary containing paper information.

Returns:

True if the paper needs to be updated, False otherwise.

Return type:

bool

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_needs_update({"id": 1, "abstract": "Updated abstract"})
True
>>> em.paper_needs_update({"id": 1, "abstract": "This paper presents..."})
False

static embedding_text_from_paper(paper)

Extract text for embedding from a paper dictionary.

Parameters:

paper (dict) – Dictionary containing paper information.

Returns:

Text to be used for embedding.

Return type:

str

static parse_chromadb_metadata(metadata)

Parse a raw ChromaDB metadata dict through the LightweightPaper model.

ChromaDB stores all values as strings (see add_paper()). This method converts a raw metadata dict into one with properly typed values by running it through prepare_chroma_db_paper_data() and then validating via LightweightPaper.

Parameters:

metadata (dict) – Raw metadata dictionary from ChromaDB.

Returns:

Metadata dictionary with values converted to their canonical types. Authors will be a list[str] and keywords a list[str].

Return type:

dict

Examples

>>> raw = {"title": "My Paper", "year": "2024", "original_id": "42",
...        "authors": "Alice;Bob", "abstract": "An abstract",
...        "session": "ML", "poster_position": "1",
...        "conference": "NeurIPS"}
>>> parsed = EmbeddingsManager.parse_chromadb_metadata(raw)
>>> parsed["year"]
2024
>>> parsed["authors"]
['Alice', 'Bob']

See also

LightweightPaper

Pydantic model used for validation.

prepare_chroma_db_paper_data

Converts ChromaDB string fields to proper types before validation.
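
The string-to-type conversion can be approximated without Pydantic. The function below is a hypothetical simplification: coerce_metadata is not part of the module, the ";" separator is taken from the example above, and using the same separator for keywords is an assumption.

```python
def coerce_metadata(raw: dict) -> dict:
    """Convert selected string fields of a metadata dict to typed values."""
    parsed = dict(raw)  # leave the caller's dict untouched

    # Split semicolon-separated strings into lists (separator assumed).
    for key in ("authors", "keywords"):
        value = parsed.get(key)
        if isinstance(value, str):
            parsed[key] = [part.strip() for part in value.split(";") if part.strip()]

    # Convert a numeric year string to int.
    year = parsed.get("year")
    if isinstance(year, str) and year.strip().isdigit():
        parsed["year"] = int(year)

    return parsed
```

The real method additionally validates the result through LightweightPaper, which this sketch does not attempt.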

add_paper(paper)

Add a paper to the vector database.

Parameters:

paper (dict) – Dictionary containing paper information. Must follow the paper database schema.

Raises:

EmbeddingsError – If adding paper fails or collection not initialized.

Return type:

None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)

search_similar(query, n_results=10, where=None, ids=None)

Search for similar papers using semantic similarity.

Parameters:
  • query (str) – Query text to search for.

  • n_results (int, optional) – Number of results to return, by default 10

  • where (dict, optional) – Metadata filter conditions.

Returns:

Search results containing ids, distances, documents, and metadatas.

Return type:

dict

Raises:

EmbeddingsError – If search fails or collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> results = em.search_similar("deep learning transformers", n_results=5, where={"year": 2025})
>>> for i, paper_id in enumerate(results['ids'][0]):
...     print(f"{i+1}. Paper {paper_id}: {results['metadatas'][0][i]}")

get_collection_stats()

Get statistics about the collection.

Returns:

Statistics including count, name, and metadata.

Return type:

dict

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> stats = em.get_collection_stats()
>>> print(f"Collection has {stats['count']} papers")

check_model_compatibility()

Check if the current embedding model matches the one stored in the database.

Returns:

  • compatible: True if models match or no model is stored, False if they differ

  • stored_model: Name of the model stored in the database, or None if not set

  • current_model: Name of the current model

Return type:

tuple of (bool, str or None, str or None)

Raises:

EmbeddingsError – If database operations fail.

Examples

>>> em = EmbeddingsManager()
>>> compatible, stored, current = em.check_model_compatibility()
>>> if not compatible:
...     print(f"Model mismatch: stored={stored}, current={current}")
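
The compatibility rule described under Returns can be written as a pure function. The name models_compatible is hypothetical; only the rule itself (match, or nothing stored) comes from the description above.

```python
def models_compatible(stored_model, current_model):
    """Return (compatible, stored_model, current_model) per the documented rule."""
    # Compatible when no model was recorded yet, or when the names match.
    compatible = stored_model is None or stored_model == current_model
    return compatible, stored_model, current_model
```

A mismatch usually means the stored vectors came from a different model, so distances against new query embeddings would not be meaningful.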

embed_from_database(where_clause=None, progress_callback=None, force_recreate=False)

Embed papers from the database.

Reads papers from the database and generates embeddings for their abstracts.

Parameters:
  • where_clause (str, optional) – SQL WHERE clause to filter papers (e.g., "decision = 'Accept'")

  • progress_callback (callable, optional) – Callback function to report progress. Called with (current, total) number of papers processed.

  • force_recreate (bool, optional) – If True, skip checking for existing embeddings and recreate all, by default False

Returns:

Number of papers successfully embedded.

Return type:

int

Raises:

EmbeddingsError – If database reading or embedding fails.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> count = em.embed_from_database()
>>> print(f"Embedded {count} papers")
>>> # Only embed accepted papers
>>> count = em.embed_from_database(where_clause="decision = 'Accept'")
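
A progress callback only needs to accept the (current, total) pair described above. The two helpers below are hypothetical examples of such a callback, not functions provided by the module.

```python
def format_progress(current: int, total: int) -> str:
    """Render a short progress message for a (current, total) pair."""
    percent = 100.0 * current / total if total else 100.0
    return f"{current}/{total} papers ({percent:.0f}%)"


def print_progress(current: int, total: int) -> None:
    """Callback that rewrites a single console line as work proceeds."""
    print(format_progress(current, total), end="\r", flush=True)
```

It would then be passed as em.embed_from_database(progress_callback=print_progress).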

search_papers_semantic(query, database, limit=10, sessions=None, years=None, conferences=None, distance_threshold=1.1)

Perform semantic search for papers using embeddings.

This function combines embedding-based similarity search with metadata filtering and retrieves complete paper information from the database.

Supports field:"value" syntax in the query for filtering by any Paper model column. Recognised filters are resolved against the SQL database (using ILIKE substring matching), and only the matching paper UIDs are forwarded to ChromaDB as a {"uid": {"$in": …}} condition. The remaining query text is used for the semantic similarity search.

In addition, the query text is always checked against the authors field in the SQL database (unless an explicit authors: filter is already present in the query). Papers that match the query as an author name are prepended to the results so that author matches appear first.

Parameters:
  • query (str) – Search query text. May include field:"value" filters.

  • database (DatabaseManager) – Database manager for retrieving full paper details

  • limit (int, optional) – Maximum number of results to return, by default 10

  • sessions (list of str, optional) – Filter by paper sessions

  • years (list of int, optional) – Filter by publication years

  • conferences (list of str, optional) – Filter by conference names

  • distance_threshold (float, optional) – Maximum distance (in embedding space) for a result to be included. Papers with a distance greater than this value are excluded from the results. By default 1.1, matching the threshold used by count_papers_within_distance().

Returns:

List of paper dictionaries with complete information

Return type:

list of dict

Raises:

EmbeddingsError – If search fails

Examples

>>> papers = em.search_papers_semantic(
...     "transformers in vision",
...     database=db,
...     limit=5,
...     years=[2024, 2025]
... )
>>> papers = em.search_papers_semantic(
...     'authors:"Vaswani" attention',
...     database=db,
... )
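
Separating field:"value" filters from the free-text part of a query can be sketched with a regular expression. The pattern and the split_query helper are assumptions for illustration; the module's actual parser may accept a different grammar.

```python
import re

# Matches field:"value" pairs such as authors:"Vaswani" (assumed grammar).
FILTER_PATTERN = re.compile(r'(\w+):"([^"]*)"')


def split_query(query: str):
    """Return (filters, remaining_text) for a query string."""
    filters = {field: value for field, value in FILTER_PATTERN.findall(query)}
    # Remove the filters and collapse leftover whitespace.
    remaining = " ".join(FILTER_PATTERN.sub(" ", query).split())
    return filters, remaining
```

In this sketch the filters dictionary would drive the SQL lookup, and the remaining text would be embedded for the similarity search.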

count_papers_within_distance(database, query, distance_threshold=1.1, conferences=None, years=None)

Count papers within a distance threshold.

Delegates to find_papers_within_distance() and returns only the count of matching papers.

Parameters:
  • database (DatabaseManager) – Database manager instance for retrieving paper details.

  • query (str) – The search query text.

  • distance_threshold (float, optional) – Euclidean distance radius, by default 1.1.

  • conferences (list[str], optional) – Filter results to only include papers from these conferences.

  • years (list[int], optional) – Filter results to only include papers from these years.

Returns:

Number of papers within the distance threshold.

Return type:

int

Raises:

EmbeddingsError – If embeddings collection is empty or operation fails.

find_papers_within_distance(database, query, distance_threshold=1.1, conferences=None, years=None, query_embedding=None)

Find papers within a specified distance from a custom search query.

This method treats the search query as a clustering center and returns papers within the specified Euclidean distance radius in embedding space.

Parameters:
  • database (DatabaseManager) – Database manager instance for retrieving paper details

  • query (str) – The search query text

  • distance_threshold (float, optional) – Euclidean distance radius, by default 1.1

  • conferences (list[str], optional) – Filter results to only include papers from these conferences

  • years (list[int], optional) – Filter results to only include papers from these years

  • query_embedding (list[float], optional) – Pre-computed embedding for the query. When provided, the embedding generation step is skipped, which avoids redundant embedding API calls when calling this method repeatedly with the same query (e.g. once per year in topic-evolution analysis).

Returns:

Dictionary containing:

  • query: str - The search query

  • query_embedding: list[float] - The generated embedding for the query

  • distance: float - The distance threshold used

  • papers: list[dict] - Papers within the distance radius with their distances

  • count: int - Number of papers found within the distance threshold

  • total_considered: int - Total number of papers matching the conference/year filters (before distance filtering)

Return type:

dict

Raises:

EmbeddingsError – If embeddings collection is empty or operation fails

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> db = DatabaseManager()
>>> db.connect()
>>> results = em.find_papers_within_distance(db, "machine learning", 1.1)
>>> print(f"Found {results['count']} papers")
>>>
>>> # With filters
>>> results = em.find_papers_within_distance(
...     db, "deep learning", 1.1,
...     conferences=["NeurIPS"],
...     years=[2023, 2024]
... )
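
The radius filter itself is simple to express in plain Python. The sketch below assumes embeddings are available as lists of floats keyed by paper ID; euclidean_distance and within_distance are hypothetical helpers, and the real method also handles metadata filters and database lookups.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_distance(query_embedding, candidates, distance_threshold=1.1):
    """Return (paper_id, distance) pairs inside the radius, nearest first."""
    hits = []
    for paper_id, embedding in candidates.items():
        distance = euclidean_distance(query_embedding, embedding)
        # Only distances greater than the threshold are excluded.
        if distance <= distance_threshold:
            hits.append((paper_id, distance))
    return sorted(hits, key=lambda hit: hit[1])
```

The count returned by count_papers_within_distance() corresponds to the length of such a list.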

delete_embeddings_by_filter(conference=None, year=None)

Delete embeddings that match the given conference and/or year filter.

Only embeddings whose metadata matches all supplied criteria are removed. At least one of conference or year must be provided; calling this method with both set to None raises ValueError to prevent accidental deletion of the entire collection.

Parameters:
  • conference (str, optional) – Conference name to match (exact, case-sensitive, as stored in ChromaDB metadata).

  • year (int, optional) – Publication year to match.

Returns:

Number of embeddings deleted.

Return type:

int

Raises:

ValueError – If both conference and year are None.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> deleted = em.delete_embeddings_by_filter(conference="NeurIPS", year=2024)
>>> print(f"Deleted {deleted} embeddings")
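
The filter for this method could be assembled as a ChromaDB where condition. The helper build_where is hypothetical, as are the metadata keys conference and year; the ValueError mirrors the guard described above.

```python
def build_where(conference=None, year=None):
    """Build a ChromaDB-style metadata filter from the supplied criteria."""
    conditions = []
    if conference is not None:
        conditions.append({"conference": conference})
    if year is not None:
        conditions.append({"year": year})

    # Refuse an empty filter so the whole collection is never matched.
    if not conditions:
        raise ValueError("Provide at least one of conference or year")

    # A single condition stands alone; several are combined with $and.
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}
```

With both criteria supplied, the result is an $and condition, so only embeddings matching every criterion are selected.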

export_embeddings(conference, year)

Export embeddings for a given conference and year to a JSON-serializable dict.

Parameters:
  • conference (str) – Conference name to export.

  • year (int) – Year to export.

Returns:

Dictionary containing ids, documents, metadatas, and embeddings lists. Embedding vectors are converted to plain Python lists so the returned dict is always JSON-serializable.

Return type:

dict

Raises:

EmbeddingsError – If the export fails.

import_embeddings(data, conference, year, batch_size=100)

Import embeddings for a given conference and year from a dictionary.

Existing embeddings for the same conference and year are deleted before importing (replace semantics).

Parameters:
  • data (dict) – Dictionary with ids, documents, metadatas, and embeddings lists (as returned by export_embeddings()).

  • conference (str) – Conference name being imported.

  • year (int) – Year being imported.

  • batch_size (int) – Number of embeddings to add per batch.

Returns:

Number of embeddings imported.

Return type:

int

Raises:

EmbeddingsError – If the import fails.
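
Batched import amounts to slicing the four parallel lists in step. The generator below is a hypothetical sketch of that slicing, assuming the dictionary layout returned by export_embeddings(); it does not talk to ChromaDB.

```python
def iter_batches(data: dict, batch_size: int = 100):
    """Yield slices of an export dict, each holding at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    keys = ("ids", "documents", "metadatas", "embeddings")
    total = len(data["ids"])
    for start in range(0, total, batch_size):
        end = start + batch_size
        # Slice all four lists with the same bounds so entries stay aligned.
        yield {key: data[key][start:end] for key in keys}
```

Because the export is JSON-serializable, a typical round trip would write it with json.dump, read it back with json.load, and pass the result to import_embeddings().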

update_paper_metadata(updates)

Update metadata fields for existing papers without changing their embeddings.

Fetches the current metadata for each UID, merges the supplied field updates, re-serialises the result and calls collection.update. Papers whose UIDs are not found in the collection are silently skipped.

Parameters:

updates (dict) – Mapping of paper UID → dict of metadata field → new value. Only the keys present in each inner dict are modified; all other metadata fields are preserved.

Returns:

Number of papers whose metadata was actually updated.

Return type:

int

Raises:

EmbeddingsError – If fetching or updating the collection fails.

Examples

>>> em = EmbeddingsManager()
>>> em.update_paper_metadata({
...     "abc123": {"paper_pdf_url": "https://example.com/paper.pdf"}
... })
1
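
The merge semantics can be shown on plain dictionaries. The function merge_metadata is a hypothetical illustration that models the collection as a dict of UID to metadata; the real method reads from and writes to ChromaDB.

```python
def merge_metadata(current: dict, updates: dict) -> dict:
    """Return merged metadata for every UID in updates that exists in current."""
    merged = {}
    for uid, fields in updates.items():
        if uid not in current:
            continue  # unknown UIDs are skipped silently
        # Keys in fields override; all other existing keys are preserved.
        merged[uid] = {**current[uid], **fields}
    return merged
```

The length of the returned dictionary plays the role of the method's return value, the number of papers actually updated.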

Usage Examples

Basic Setup

from abstracts_explorer.embeddings import EmbeddingsManager

# Arguments left unset fall back to values from the environment/config
em = EmbeddingsManager(
    collection_name="papers"
)

Creating Embeddings

# Create embeddings for all papers in the database
em.embed_from_database()

# Create embeddings for specific papers
# (each dict must follow the paper database schema)
papers = [
    {
        'id': 1,
        'title': 'Example Paper',
        'abstract': 'This is the abstract...',
        'year': 2025
    }
]
for paper in papers:
    em.add_paper(paper)

Embedding Models

The module works with any embedding model served through LM Studio or another OpenAI-compatible API.

Configuring Model

# Via configuration
from abstracts_explorer.config import get_config
config = get_config()
# Set EMBEDDING_MODEL in .env file

# Or directly, using the model_name argument
em = EmbeddingsManager(
    model_name="text-embedding-nomic-embed-text-v1.5"
)

ChromaDB Integration

The module uses ChromaDB for vector storage.

Collection Structure

  • Documents: Paper abstracts

  • Metadata: Paper ID, title, year, etc.

  • Embeddings: Vector representations

  • IDs: Unique identifiers (paper_id)

Collection Management

# Get collection info
collection = em.collection
print(f"Total papers: {collection.count()}")

# Clear the collection by deleting and recreating it
em.create_collection(reset=True)

# Check if paper exists
exists = em.paper_exists(paper_id=123)

Search Results Format

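search_papers_semantic() returns a list of paper dictionaries such as the one shown here, whereas search_similar() returns a single dictionary containing ids, distances, documents, and metadatas: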

[
    {
        'id': 'paper_123',
        'title': 'Paper Title',
        'abstract': 'Abstract text...',
        'year': 2025,
        'distance': 0.234,  # Lower is more similar
    },
    # ... more results
]

Performance Considerations

Batch Processing

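Of the methods documented above, only import_embeddings() accepts a batch_size argument. It controls how many embeddings are added to ChromaDB per batch:

# Export first, then re-import with the default batch size of 100
data = em.export_embeddings("NeurIPS", 2024)
em.import_embeddings(data, conference="NeurIPS", year=2024, batch_size=100)

# Larger batches use more memory per call
em.import_embeddings(data, conference="NeurIPS", year=2024, batch_size=500)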

Caching

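ChromaDB persists embeddings so they can be reused:

  • Location: Specified by embedding_db (a URL for an HTTP service or a path for local storage)

  • Persistence: Embeddings persist across sessions

  • Updates: embed_from_database() checks for existing embeddings unless force_recreate=True is passed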

Memory Usage

Embedding models can be memory-intensive:

  • Smaller models: ~1-2 GB RAM

  • Larger models: 4-8 GB RAM

  • Batch size affects peak memory usage

Error Handling

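from abstracts_explorer.embeddings import EmbeddingsError

# Check the embedding API first; this returns False instead of raising
if not em.test_lm_studio_connection():
    print("LM Studio connection failed")

try:
    em.embed_from_database()
except EmbeddingsError as e:
    # Raised when reading the database or generating embeddings fails
    print(f"Embedding error: {e}")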

Best Practices

  1. Create embeddings once - They’re cached and reused

  2. Use appropriate batch sizes - Balance speed and memory

  3. Filter searches - Use metadata filters to narrow results

  4. Choose good models - Larger models are more accurate but slower

  5. Monitor LM Studio - Ensure it’s running and model is loaded