Embeddings Module

The embeddings module provides vector embeddings functionality for semantic search using ChromaDB.

Overview

The EmbeddingsManager class handles:

  • Creating vector embeddings from paper abstracts

  • Storing embeddings in ChromaDB

  • Semantic similarity search

  • Integration with LM Studio embedding models

Class Reference

This module provides functionality to generate text embeddings for paper abstracts and store them in a vector database with paper metadata.

The module uses an OpenAI-compatible API (such as LM Studio or blablador) to generate embeddings and stores them in ChromaDB for efficient similarity search.

class abstracts_explorer.embeddings.RateLimitedTransport(transport, requests_per_minute)[source]

Bases: BaseTransport

An httpx transport that enforces a maximum requests-per-minute rate.

Wraps an existing transport and sleeps between requests to stay within the configured rate limit.

Parameters:
  • transport (httpx.BaseTransport) – The underlying transport to delegate requests to.

  • requests_per_minute (int) – Maximum number of requests per minute. Must be > 0.

__init__(transport, requests_per_minute)[source]
handle_request(request)[source]

Send request after enforcing the minimum inter-request interval.

Return type:

Response

close()[source]

Close the underlying transport.

Return type:

None
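The throttling rule used by handle_request() can be sketched with a minimal stand-in (illustrative only; the real class wraps an httpx.BaseTransport and applies the same minimum-interval logic around each request):

```python
import time

class MinIntervalLimiter:
    """Stand-in for the rate-limiting idea behind RateLimitedTransport:
    enforce a minimum interval of 60 / requests_per_minute seconds
    between consecutive calls."""

    def __init__(self, requests_per_minute: int) -> None:
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be > 0")
        self.min_interval = 60.0 / requests_per_minute
        self._last_request = 0.0

    def wait(self) -> None:
        # Sleep just long enough to honor the configured rate.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```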

exception abstracts_explorer.embeddings.EmbeddingsError[source]

Bases: Exception

Exception raised for embedding operations.

class abstracts_explorer.embeddings.EmbeddingsManager(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None, requests_per_minute=None)[source]

Bases: object

Manager for generating and storing text embeddings.

This class handles:

  • Connecting to an OpenAI-compatible API for embedding generation

  • Creating and managing a ChromaDB collection

  • Embedding paper abstracts with metadata

  • Similarity search operations

Parameters:
  • lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint, by default "http://localhost:1234"

  • auth_token (str, optional) – Authentication token for the OpenAI-compatible API endpoint. If None, uses the config value.

  • model_name (str, optional) – Name of the embedding model, by default "text-embedding-qwen3-embedding-4b"

  • collection_name (str, optional) – Name of the ChromaDB collection, by default "papers"

  • requests_per_minute (int, optional) – Maximum number of API requests per minute. Set to 0 to disable rate limiting. If None, uses the value from config (default: 60).

lm_studio_url

OpenAI-compatible API endpoint URL.

Type:

str

model_name

Embedding model name.

Type:

str

embedding_db

ChromaDB configuration - URL for HTTP service or path for local storage.

Type:

str

collection_name

ChromaDB collection name.

Type:

str

client

ChromaDB client instance. Connected automatically on first access.

Type:

chromadb.Client

collection

Active ChromaDB collection. Created automatically on first access (which also connects the client if not yet connected).

Type:

chromadb.Collection

Examples

>>> em = EmbeddingsManager()
>>> em.add_paper(paper_dict)  # connect() and create_collection() called automatically
>>> results = em.search_similar("machine learning", n_results=5)
>>> em.close()
__init__(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None, requests_per_minute=None)[source]

Initialize the EmbeddingsManager.

Parameters are optional and will use values from environment/config if not provided.

Parameters:
  • lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint. If None, uses config value.

  • auth_token (str, optional) – Authentication token for the API endpoint. If None, uses config value.
  • model_name (str, optional) – Name of the embedding model. If None, uses config value.

  • collection_name (str, optional) – Name of the ChromaDB collection. If None, uses config value.

  • requests_per_minute (int, optional) – Maximum number of API requests per minute. Set to 0 to disable rate limiting. If None, uses the value from config (default: 60).

property client: Any

Get the ChromaDB client, connecting automatically on first access.

Returns:

Initialized ChromaDB client instance.

Return type:

chromadb.Client

Raises:

EmbeddingsError – If connecting to ChromaDB fails.

property collection: Any

Get the ChromaDB collection, creating it automatically on first access.

Calling this property for the first time also triggers connect() if the client has not been initialized yet.

Returns:

Initialized ChromaDB collection.

Return type:

chromadb.Collection

Raises:

EmbeddingsError – If connecting to ChromaDB or creating the collection fails.
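The lazy-initialization pattern behind the client and collection properties can be illustrated with a small stand-in (hypothetical class; the real properties call connect() and create_collection() on first access):

```python
class LazyCollection:
    """Stand-in showing the lazy pattern: expensive setup runs only on
    first access, and the result is cached for subsequent accesses."""

    def __init__(self) -> None:
        self._collection = None

    @property
    def collection(self):
        if self._collection is None:
            # Stand-in for connect() + create_collection() in the real class.
            self._collection = {"name": "papers", "count": 0}
        return self._collection
```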

property openai_client: OpenAI

Get the OpenAI client, creating it lazily on first access.

When requests_per_minute is greater than 0 a RateLimitedTransport is wrapped around the default httpx transport and passed as the http_client argument so that every HTTP request is automatically throttled.

This lazy loading prevents API calls during test collection.

Returns:

Initialized OpenAI client instance.

Return type:

OpenAI

connect()[source]

Connect to ChromaDB.

Uses HTTP client if embedding_db is a URL, otherwise uses persistent client with local storage directory.

Raises:

EmbeddingsError – If connection fails.

Return type:

None
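The URL-versus-path dispatch described above can be sketched as follows (illustrative helper, not part of the module; the real connect() then builds either an HTTP or a persistent ChromaDB client accordingly):

```python
from urllib.parse import urlparse

def is_http_endpoint(embedding_db: str) -> bool:
    """Sketch of the dispatch rule: treat the embedding_db setting as a
    ChromaDB HTTP endpoint when it parses as an http(s) URL, otherwise
    as a local storage directory for the persistent client."""
    return urlparse(embedding_db).scheme in ("http", "https")
```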

close()[source]

Close the ChromaDB connection.

Does nothing if not connected.

Return type:

None

__enter__()[source]

Context manager entry.

__exit__(exc_type, exc_val, exc_tb)[source]

Context manager exit.

test_lm_studio_connection()[source]

Test connection to OpenAI-compatible API endpoint.

Returns:

True if connection is successful, False otherwise.

Return type:

bool

Examples

>>> em = EmbeddingsManager()
>>> if em.test_lm_studio_connection():
...     print("API is accessible")
generate_embedding(text)[source]

Generate embedding for a given text using OpenAI-compatible API.

Rate limiting (if configured via requests_per_minute) is handled transparently by the underlying httpx transport.

Parameters:

text (str) – Text to generate embedding for.

Returns:

Embedding vector.

Return type:

List[float]

Raises:

EmbeddingsError – If embedding generation fails.

Examples

>>> em = EmbeddingsManager()
>>> embedding = em.generate_embedding("Sample text")
>>> len(embedding)
4096
create_collection(reset=False)[source]

Create or get ChromaDB collection.

Parameters:

reset (bool, optional) – If True, delete existing collection and create new one, by default False

Raises:

EmbeddingsError – If collection creation fails or not connected.

Return type:

None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.create_collection(reset=True)  # Reset existing collection
paper_exists(paper_id)[source]

Check if a paper already exists in the collection.

Parameters:

paper_id (int or str) – Unique identifier for the paper.

Returns:

True if paper exists in collection, False otherwise.

Return type:

bool

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_exists("uid1")
False
>>> em.add_paper(paper_dict)
>>> em.paper_exists("uid1")
True
paper_needs_update(paper)[source]

Check if a paper needs to be updated in the collection.

Parameters:

paper (dict) – Dictionary containing paper information.

Returns:

True if the paper needs to be updated, False otherwise.

Return type:

bool

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_needs_update({"id": 1, "abstract": "Updated abstract"})
True
>>> em.paper_needs_update({"id": 1, "abstract": "This paper presents..."})
False
static embedding_text_from_paper(paper)[source]

Extract text for embedding from a paper dictionary.

Parameters:

paper (dict) – Dictionary containing paper information.

Returns:

Text to be used for embedding.

Return type:

str

static parse_chromadb_metadata(metadata)[source]

Parse a raw ChromaDB metadata dict through the LightweightPaper model.

ChromaDB stores all values as strings (see add_paper()). This method converts a raw metadata dict into one with properly typed values by running it through prepare_chroma_db_paper_data() and then validating via LightweightPaper.

Parameters:

metadata (dict) – Raw metadata dictionary from ChromaDB.

Returns:

Metadata dictionary with values converted to their canonical types. Authors will be a list[str] and keywords a list[str].

Return type:

dict

Examples

>>> raw = {"title": "My Paper", "year": "2024", "original_id": "42",
...        "authors": "Alice;Bob", "abstract": "An abstract",
...        "session": "ML", "poster_position": "1",
...        "conference": "NeurIPS"}
>>> parsed = EmbeddingsManager.parse_chromadb_metadata(raw)
>>> parsed["year"]
2024
>>> parsed["authors"]
['Alice', 'Bob']

See also

LightweightPaper

Pydantic model used for validation.

prepare_chroma_db_paper_data

Converts ChromaDB string fields to proper types before validation.

add_paper(paper)[source]

Add a paper to the vector database.

Parameters:

paper (dict) – Dictionary containing paper information. Must follow the paper database schema.

Raises:

EmbeddingsError – If adding paper fails or collection not initialized.

Return type:

None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)
search_similar(query, n_results=10, where=None)[source]

Search for similar papers using semantic similarity.

Parameters:
  • query (str) – Query text to search for.

  • n_results (int, optional) – Number of results to return, by default 10

  • where (dict, optional) – Metadata filter conditions.

Returns:

Search results containing ids, distances, documents, and metadatas.

Return type:

dict

Raises:

EmbeddingsError – If search fails or collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> results = em.search_similar("deep learning transformers", n_results=5, where={"year": 2025})
>>> for i, paper_id in enumerate(results['ids'][0]):
...     print(f"{i+1}. Paper {paper_id}: {results['metadatas'][0][i]}")
get_collection_stats()[source]

Get statistics about the collection.

Returns:

Statistics including count, name, and metadata.

Return type:

dict

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> stats = em.get_collection_stats()
>>> print(f"Collection has {stats['count']} papers")
check_model_compatibility()[source]

Check if the current embedding model matches the one stored in the database.

Returns:

  • compatible: True if models match or no model is stored, False if they differ

  • stored_model: Name of the model stored in the database, or None if not set

  • current_model: Name of the current model

Return type:

tuple of (bool, str or None, str or None)

Raises:

EmbeddingsError – If database operations fail.

Examples

>>> em = EmbeddingsManager()
>>> compatible, stored, current = em.check_model_compatibility()
>>> if not compatible:
...     print(f"Model mismatch: stored={stored}, current={current}")
embed_from_database(where_clause=None, progress_callback=None, force_recreate=False)[source]

Embed papers from the database.

Reads papers from the database and generates embeddings for their abstracts.

Parameters:
  • where_clause (str, optional) – SQL WHERE clause to filter papers (e.g., "decision = 'Accept'")

  • progress_callback (callable, optional) – Callback function to report progress. Called with (current, total) number of papers processed.

  • force_recreate (bool, optional) – If True, skip checking for existing embeddings and recreate all, by default False

Returns:

Number of papers successfully embedded.

Return type:

int

Raises:

EmbeddingsError – If database reading or embedding fails.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> count = em.embed_from_database()
>>> print(f"Embedded {count} papers")
>>> # Only embed accepted papers
>>> count = em.embed_from_database(where_clause="decision = 'Accept'")
search_papers_semantic(query, database, limit=10, sessions=None, years=None, conferences=None)[source]

Perform semantic search for papers using embeddings.

This function combines embedding-based similarity search with metadata filtering and retrieves complete paper information from the database.

Parameters:
  • query (str) – Search query text

  • database (DatabaseManager) – Database manager for retrieving full paper details

  • limit (int, optional) – Maximum number of results to return, by default 10

  • sessions (list of str, optional) – Filter by paper sessions

  • years (list of int, optional) – Filter by publication years

  • conferences (list of str, optional) – Filter by conference names

Returns:

List of paper dictionaries with complete information

Return type:

list of dict

Raises:

EmbeddingsError – If search fails

Examples

>>> papers = em.search_papers_semantic(
...     "transformers in vision",
...     database=db,
...     limit=5,
...     years=[2024, 2025]
... )
find_papers_within_distance(database, query, distance_threshold=1.1, conferences=None, years=None)[source]

Find papers within a specified distance from a custom search query.

This method treats the search query as a clustering center and returns papers within the specified Euclidean distance radius in embedding space.

Parameters:
  • database (DatabaseManager) – Database manager instance for retrieving paper details

  • query (str) – The search query text

  • distance_threshold (float, optional) – Euclidean distance radius, by default 1.1

  • conferences (list[str], optional) – Filter results to only include papers from these conferences

  • years (list[int], optional) – Filter results to only include papers from these years

Returns:

Dictionary containing:

  • query: str – The search query

  • query_embedding: list[float] – The generated embedding for the query

  • distance: float – The distance threshold used

  • papers: list[dict] – Papers within the distance radius with their distances

  • count: int – Number of papers found within the distance threshold

  • total_considered: int – Total number of papers matching the conference/year filters (before distance filtering)

Return type:

dict

Raises:

EmbeddingsError – If embeddings collection is empty or operation fails

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> db = DatabaseManager()
>>> db.connect()
>>> results = em.find_papers_within_distance(db, "machine learning", 1.1)
>>> print(f"Found {results['count']} papers")
>>>
>>> # With filters
>>> results = em.find_papers_within_distance(
...     db, "deep learning", 1.1,
...     conferences=["NeurIPS"],
...     years=[2023, 2024]
... )
export_embeddings(conference, year)[source]

Export embeddings for a given conference and year to a JSON-serializable dict.

Parameters:
  • conference (str) – Conference name to export.

  • year (int) – Year to export.

Returns:

Dictionary containing ids, documents, metadatas, and embeddings lists. Embedding vectors are converted to plain Python lists so the returned dict is always JSON-serializable.

Return type:

dict

Raises:

EmbeddingsError – If the export fails.

import_embeddings(data, conference, year, batch_size=100)[source]

Import embeddings for a given conference and year from a dictionary.

Existing embeddings for the same conference and year are deleted before importing (replace semantics).

Parameters:
  • data (dict) – Dictionary with ids, documents, metadatas, and embeddings lists (as returned by export_embeddings()).

  • conference (str) – Conference name being imported.

  • year (int) – Year being imported.

  • batch_size (int) – Number of embeddings to add per batch.

Returns:

Number of embeddings imported.

Return type:

int

Raises:

EmbeddingsError – If the import fails.
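Since export_embeddings() returns a JSON-serializable dict, a typical export/import round trip goes through a JSON file. The field names below follow the documentation above; the values are made up for illustration:

```python
import json

# Hypothetical export in the documented shape
# (in practice: exported = em.export_embeddings("NeurIPS", 2025)).
exported = {
    "ids": ["uid1"],
    "documents": ["An abstract"],
    "metadatas": [{"title": "My Paper", "year": "2024"}],
    "embeddings": [[0.1, 0.2, 0.3]],
}

# Serialize to JSON, read back, then hand to import_embeddings()
# (which replaces any existing embeddings for that conference/year).
payload = json.dumps(exported)
restored = json.loads(payload)
# count = em.import_embeddings(restored, "NeurIPS", 2025)
```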

Usage Examples

Basic Setup

from abstracts_explorer.embeddings import EmbeddingsManager

# Initialize with a custom collection name
em = EmbeddingsManager(
    collection_name="papers"
)

Creating Embeddings

# Create embeddings for all papers in the database
em.embed_from_database()

# Add a single paper manually
paper = {
    'id': 1,
    'title': 'Example Paper',
    'abstract': 'This is the abstract...',
    'year': 2025
}
em.add_paper(paper)

Embedding Models

The module supports any embedding model exposed by the configured OpenAI-compatible endpoint (LM Studio, blablador, etc.):

Configuring Model

# Via configuration: set EMBEDDING_MODEL in the .env file
from abstracts_explorer.config import get_config
config = get_config()

# Or directly
em = EmbeddingsManager(
    model_name="text-embedding-nomic-embed-text-v1.5"
)

ChromaDB Integration

The module uses ChromaDB for vector storage:

Collection Structure

  • Documents: Paper abstracts

  • Metadata: Paper ID, title, year, etc.

  • Embeddings: Vector representations

  • IDs: Unique identifiers (paper_id)

Collection Management

# Get collection info
collection = em.collection
print(f"Total papers: {collection.count()}")

# Reset (clear) the collection
em.create_collection(reset=True)

# Check if a paper exists
exists = em.paper_exists(123)

Search Results Format

The high-level search_papers_semantic() returns a list of dictionaries (the lower-level search_similar() instead returns ChromaDB's raw result dict with ids, distances, documents, and metadatas):

[
    {
        'id': 'paper_123',
        'title': 'Paper Title',
        'abstract': 'Abstract text...',
        'year': 2025,
        'distance': 0.234,  # Lower is more similar
    },
    # ... more results
]
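Because lower distances mean higher similarity, results in this format can be re-ranked or filtered client-side (values below are made up for illustration):

```python
# Hypothetical results in the format shown above.
results = [
    {"id": "paper_7", "title": "Survey B", "distance": 0.51},
    {"id": "paper_123", "title": "Paper Title", "distance": 0.234},
]

# Sort ascending by distance: most similar papers first.
ranked = sorted(results, key=lambda r: r["distance"])

# Keep only close matches under a chosen threshold.
close = [r for r in ranked if r["distance"] < 0.4]
```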

Performance Considerations

Batch Processing

Embedding imports are batched to balance speed and memory:

# Default batch size: 100
count = em.import_embeddings(data, "NeurIPS", 2025, batch_size=100)

# Larger batches use more memory but fewer round trips
count = em.import_embeddings(data, "NeurIPS", 2025, batch_size=500)

Caching

ChromaDB caches embeddings on disk:

  • Location: Specified by the embedding_db setting (URL or local path)

  • Persistence: Embeddings persist across sessions

  • Updates: Only new papers are embedded

Memory Usage

Embedding models can be memory-intensive:

  • Smaller models: ~1-2 GB RAM

  • Larger models: 4-8 GB RAM

  • Batch size affects peak memory usage

Error Handling

from abstracts_explorer.embeddings import EmbeddingsError

try:
    em.embed_from_database()
except EmbeddingsError as e:
    print(f"Embedding error: {e}")

Best Practices

  1. Create embeddings once - They’re cached and reused

  2. Use appropriate batch sizes - Balance speed and memory

  3. Filter searches - Use metadata filters to narrow results

  4. Choose good models - Larger models are more accurate but slower

  5. Monitor LM Studio - Ensure it’s running and model is loaded