Embeddings Module

The embeddings module provides vector embeddings functionality for semantic search using ChromaDB.

Overview

The EmbeddingsManager class handles:

  • Creating vector embeddings from paper abstracts

  • Storing embeddings in ChromaDB

  • Semantic similarity search

  • Integration with LM Studio embedding models

Class Reference

This module provides functionality to generate text embeddings for paper abstracts and store them in a vector database with paper metadata.

The module uses an OpenAI-compatible API (such as LM Studio or blablador) to generate embeddings and stores them in ChromaDB for efficient similarity search.

exception abstracts_explorer.embeddings.EmbeddingsError[source]

Bases: Exception

Exception raised for embedding operations.

class abstracts_explorer.embeddings.EmbeddingsManager(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None)[source]

Bases: object

Manager for generating and storing text embeddings.

This class handles:

  • Connecting to an OpenAI-compatible API for embedding generation

  • Creating and managing a ChromaDB collection

  • Embedding paper abstracts with metadata

  • Similarity search operations

Parameters:
  • lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint, by default "http://localhost:1234"

  • auth_token (str, optional) – Authentication token for the API endpoint, if one is required. If None, uses config value.

  • model_name (str, optional) – Name of the embedding model, by default “text-embedding-qwen3-embedding-4b”

  • collection_name (str, optional) – Name of the ChromaDB collection, by default “papers”

lm_studio_url

OpenAI-compatible API endpoint URL.

Type:

str

model_name

Embedding model name.

Type:

str

embedding_db

ChromaDB configuration - URL for HTTP service or path for local storage.

Type:

str

collection_name

ChromaDB collection name.

Type:

str

client

ChromaDB client instance.

Type:

chromadb.Client or None

collection

Active ChromaDB collection.

Type:

chromadb.Collection or None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)
>>> results = em.search_similar("machine learning", n_results=5)
>>> em.close()
__init__(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None)[source]

Initialize the EmbeddingsManager.

Parameters are optional and will use values from environment/config if not provided.

Parameters:
  • lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint. If None, uses config value.

  • auth_token (str, optional) – Authentication token for the API endpoint. If None, uses config value.

  • model_name (str, optional) – Name of the embedding model. If None, uses config value.

  • collection_name (str, optional) – Name of the ChromaDB collection. If None, uses config value.

property openai_client: OpenAI

Get the OpenAI client, creating it lazily on first access.

This lazy loading prevents API calls during test collection.

Returns:

Initialized OpenAI client instance.

Return type:

OpenAI

connect()[source]

Connect to ChromaDB.

Uses HTTP client if embedding_db is a URL, otherwise uses persistent client with local storage directory.

Raises:

EmbeddingsError – If connection fails.

Return type:

None

close()[source]

Close the ChromaDB connection.

Does nothing if not connected.

Return type:

None

__enter__()[source]

Context manager entry.

__exit__(exc_type, exc_val, exc_tb)[source]

Context manager exit.
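
The entry/exit hooks above are not detailed here; the conventional pattern (an assumption, not documented behavior) is for __enter__ to return the manager and __exit__ to release the connection via close(). A minimal stand-alone sketch of that pattern, using a hypothetical ManagerSketch class:

```python
class ManagerSketch:
    """Hypothetical stand-in illustrating the usual context-manager pattern."""

    def __init__(self):
        self.connected = False

    def connect(self):
        self.connected = True

    def close(self):
        self.connected = False

    def __enter__(self):
        return self            # makes `with ManagerSketch() as em:` bind the manager

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()           # connection is released even if an error occurred
        return False           # do not suppress exceptions


with ManagerSketch() as em:
    em.connect()
    # ... use em while connected ...
# on exit, em.connected is False again
```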

test_lm_studio_connection()[source]

Test connection to OpenAI-compatible API endpoint.

Returns:

True if connection is successful, False otherwise.

Return type:

bool

Examples

>>> em = EmbeddingsManager()
>>> if em.test_lm_studio_connection():
...     print("API is accessible")
generate_embedding(text)[source]

Generate embedding for a given text using OpenAI-compatible API.

Parameters:

text (str) – Text to generate embedding for.

Returns:

Embedding vector.

Return type:

List[float]

Raises:

EmbeddingsError – If embedding generation fails.

Examples

>>> em = EmbeddingsManager()
>>> embedding = em.generate_embedding("Sample text")
>>> len(embedding)
4096
create_collection(reset=False)[source]

Create or get ChromaDB collection.

Parameters:

reset (bool, optional) – If True, delete existing collection and create new one, by default False

Raises:

EmbeddingsError – If collection creation fails or not connected.

Return type:

None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.create_collection(reset=True)  # Reset existing collection
paper_exists(paper_id)[source]

Check if a paper already exists in the collection.

Parameters:

paper_id (int or str) – Unique identifier for the paper.

Returns:

True if paper exists in collection, False otherwise.

Return type:

bool

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_exists("uid1")
False
>>> em.add_paper(paper_dict)
>>> em.paper_exists("uid1")
True
paper_needs_update(paper)[source]

Check if a paper needs to be updated in the collection.

Parameters:

paper (dict) – Dictionary containing paper information.

Returns:

True if the paper needs to be updated, False otherwise.

Return type:

bool

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_needs_update({"id": 1, "abstract": "Updated abstract"})
True
>>> em.paper_needs_update({"id": 1, "abstract": "This paper presents..."})
False
static embedding_text_from_paper(paper)[source]

Extract text for embedding from a paper dictionary.

Parameters:

paper (dict) – Dictionary containing paper information.

Returns:

Text to be used for embedding.

Return type:

str

add_paper(paper)[source]

Add a paper to the vector database.

Parameters:

paper (dict) – Dictionary containing paper information. Must follow the paper database schema.

Raises:

EmbeddingsError – If adding paper fails or collection not initialized.

Return type:

None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)
search_similar(query, n_results=10, where=None)[source]

Search for similar papers using semantic similarity.

Parameters:
  • query (str) – Query text to search for.

  • n_results (int, optional) – Number of results to return, by default 10

  • where (dict, optional) – Metadata filter conditions.

Returns:

Search results containing ids, distances, documents, and metadatas.

Return type:

dict

Raises:

EmbeddingsError – If search fails or collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> results = em.search_similar("deep learning transformers", n_results=5, where={"year": 2025})
>>> for i, paper_id in enumerate(results['ids'][0]):
...     print(f"{i+1}. Paper {paper_id}: {results['metadatas'][0][i]}")
get_collection_stats()[source]

Get statistics about the collection.

Returns:

Statistics including count, name, and metadata.

Return type:

dict

Raises:

EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> stats = em.get_collection_stats()
>>> print(f"Collection has {stats['count']} papers")
check_model_compatibility()[source]

Check if the current embedding model matches the one stored in the database.

Returns:

  • compatible: True if models match or no model is stored, False if they differ

  • stored_model: Name of the model stored in the database, or None if not set

  • current_model: Name of the current model

Return type:

tuple of (bool, str or None, str or None)

Raises:

EmbeddingsError – If database operations fail.

Examples

>>> em = EmbeddingsManager()
>>> compatible, stored, current = em.check_model_compatibility()
>>> if not compatible:
...     print(f"Model mismatch: stored={stored}, current={current}")
embed_from_database(where_clause=None, progress_callback=None, force_recreate=False)[source]

Embed papers from the database.

Reads papers from the database and generates embeddings for their abstracts.

Parameters:
  • where_clause (str, optional) – SQL WHERE clause to filter papers (e.g., "decision = 'Accept'")

  • progress_callback (callable, optional) – Callback function to report progress. Called with (current, total) number of papers processed.

  • force_recreate (bool, optional) – If True, skip checking for existing embeddings and recreate all, by default False

Returns:

Number of papers successfully embedded.

Return type:

int

Raises:

EmbeddingsError – If database reading or embedding fails.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> count = em.embed_from_database()
>>> print(f"Embedded {count} papers")
>>> # Only embed accepted papers
>>> count = em.embed_from_database(where_clause="decision = 'Accept'")
search_papers_semantic(query, database, limit=10, sessions=None, years=None, conferences=None)[source]

Perform semantic search for papers using embeddings.

This function combines embedding-based similarity search with metadata filtering and retrieves complete paper information from the database.

Parameters:
  • query (str) – Search query text

  • database (DatabaseManager) – Database manager for retrieving full paper details

  • limit (int, optional) – Maximum number of results to return, by default 10

  • sessions (list of str, optional) – Filter by paper sessions

  • years (list of int, optional) – Filter by publication years

  • conferences (list of str, optional) – Filter by conference names

Returns:

List of paper dictionaries with complete information

Return type:

list of dict

Raises:

EmbeddingsError – If search fails

Examples

>>> papers = em.search_papers_semantic(
...     "transformers in vision",
...     database=db,
...     limit=5,
...     years=[2024, 2025]
... )
find_papers_within_distance(database, query, distance_threshold=1.1, conferences=None, years=None)[source]

Find papers within a specified distance from a custom search query.

This method treats the search query as a clustering center and returns papers within the specified Euclidean distance radius in embedding space.

Parameters:
  • database (DatabaseManager) – Database manager instance for retrieving paper details

  • query (str) – The search query text

  • distance_threshold (float, optional) – Euclidean distance radius, by default 1.1

  • conferences (list[str], optional) – Filter results to only include papers from these conferences

  • years (list[int], optional) – Filter results to only include papers from these years

Returns:

Dictionary containing:

  • query (str) – The search query

  • query_embedding (list[float]) – The generated embedding for the query

  • distance (float) – The distance threshold used

  • papers (list[dict]) – Papers within the distance radius with their distances

  • count (int) – Number of papers found

Return type:

dict

Raises:

EmbeddingsError – If embeddings collection is empty or operation fails

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> db = DatabaseManager()
>>> db.connect()
>>> results = em.find_papers_within_distance(db, "machine learning", 1.1)
>>> print(f"Found {results['count']} papers")
>>>
>>> # With filters
>>> results = em.find_papers_within_distance(
...     db, "deep learning", 1.1,
...     conferences=["NeurIPS"],
...     years=[2023, 2024]
... )
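
The radius test above amounts to a Euclidean distance check in embedding space. A self-contained sketch of that computation, using toy 3-dimensional vectors rather than real embeddings:

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors standing in for query and paper embeddings
query_vec = [0.1, 0.2, 0.3]
paper_vecs = {
    "paper_1": [0.1, 0.2, 0.35],  # close to the query
    "paper_2": [1.2, 1.0, 0.9],   # far from the query
}

# Keep only papers whose embedding lies within the distance radius
threshold = 1.1
within = {pid: euclidean(query_vec, vec)
          for pid, vec in paper_vecs.items()
          if euclidean(query_vec, vec) <= threshold}
print(sorted(within))  # → ['paper_1']
```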

Usage Examples

Basic Setup

from abstracts_explorer.embeddings import EmbeddingsManager

# Initialize with a named collection
em = EmbeddingsManager(
    collection_name="papers"
)

Creating Embeddings

# Create embeddings for all papers in the database
em.embed_from_database()

# Create embeddings for specific papers
papers = [
    {
        'id': 1,
        'title': 'Example Paper',
        'abstract': 'This is the abstract...',
        'year': 2025
    }
]
for paper in papers:
    em.add_paper(paper)

Embedding Models

The module supports any embedding model available through LM Studio:

Configuring Model

# Via configuration
from abstracts_explorer.config import get_config
config = get_config()
# Set EMBEDDING_MODEL in .env file

# Or directly
em = EmbeddingsManager(
    model_name="text-embedding-nomic-embed-text-v1.5"
)

ChromaDB Integration

The module uses ChromaDB for vector storage:

Collection Structure

  • Documents: Paper abstracts

  • Metadata: Paper ID, title, year, etc.

  • Embeddings: Vector representations

  • IDs: Unique identifiers (paper_id)
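
Conceptually, add_paper maps each paper onto these parallel fields. A sketch of that mapping (the field names follow ChromaDB's collection API; the exact metadata keys stored by add_paper are an assumption):

```python
paper = {
    "id": 123,
    "title": "Example Paper",
    "abstract": "This is the abstract...",
    "year": 2025,
}

# How a paper plausibly maps onto a ChromaDB record:
record = {
    "ids": [str(paper["id"])],                 # unique identifier
    "documents": [paper["abstract"]],          # the text that gets embedded
    "metadatas": [{"title": paper["title"],    # filterable metadata
                   "year": paper["year"]}],
    # "embeddings" would hold the vector returned by generate_embedding()
}
```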

Collection Management

# Get collection info
collection = em.collection
print(f"Total papers: {collection.count()}")

# Reset (clear and recreate) the collection
em.create_collection(reset=True)

# Check if paper exists
exists = em.paper_exists(paper_id=123)

Search Results Format

The low-level search_similar method returns ChromaDB's raw result dictionary (ids, distances, documents, metadatas); the higher-level search_papers_semantic returns a list of paper dictionaries:

[
    {
        'id': 'paper_123',
        'title': 'Paper Title',
        'abstract': 'Abstract text...',
        'year': 2025,
        'distance': 0.234,  # Lower is more similar
    },
    # ... more results
]
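
Because lower distance means higher similarity, results in this format can be ranked or thresholded with plain list operations. A self-contained sketch using mock results in the shape shown above:

```python
# Mock results in the documented format (real results come from the search methods)
results = [
    {"id": "paper_123", "title": "Paper A", "distance": 0.234},
    {"id": "paper_456", "title": "Paper B", "distance": 0.812},
    {"id": "paper_789", "title": "Paper C", "distance": 1.450},
]

# Rank by similarity (ascending distance) and keep only close matches
ranked = sorted(results, key=lambda p: p["distance"])
close = [p for p in ranked if p["distance"] < 1.1]

print([p["id"] for p in close])  # → ['paper_123', 'paper_456']
```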

Performance Considerations

Batch Processing

embed_from_database() processes every matching paper in a single call and skips papers that are already embedded. For long runs, pass a progress_callback to monitor throughput:

# Report progress as papers are embedded
def report_progress(current, total):
    print(f"Embedded {current}/{total} papers")

em.embed_from_database(progress_callback=report_progress)

Caching

ChromaDB caches embeddings on disk:

  • Location: Specified by the embedding_db setting (URL for HTTP service or local storage path)

  • Persistence: Embeddings persist across sessions

  • Updates: Only new papers are embedded

Memory Usage

Embedding models can be memory-intensive:

  • Smaller models: ~1-2 GB RAM

  • Larger models: 4-8 GB RAM

  • Batch size affects peak memory usage

Error Handling

from abstracts_explorer.embeddings import EmbeddingsError

# Verify the embedding endpoint is reachable before a long run
if not em.test_lm_studio_connection():
    print("LM Studio connection failed")

try:
    em.embed_from_database()
except EmbeddingsError as e:
    print(f"Embedding error: {e}")

Best Practices

  1. Create embeddings once - They’re cached and reused

  2. Use appropriate batch sizes - Balance speed and memory

  3. Filter searches - Use metadata filters to narrow results

  4. Choose good models - Larger models are more accurate but slower

  5. Monitor LM Studio - Ensure it’s running and model is loaded