Embeddings Module

The embeddings module generates vector embeddings for paper abstracts and stores them in ChromaDB for semantic search.

Overview

The EmbeddingsManager class handles:

Creating vector embeddings from paper abstracts
Storing embeddings in ChromaDB
Semantic similarity search
Integration with LM Studio embedding models

Class Reference

This module provides functionality to generate text embeddings for paper abstracts and store them in a vector database with paper metadata.

The module uses an OpenAI-compatible API (such as LM Studio or blablador) to generate embeddings and stores them in ChromaDB for efficient similarity search.

exception abstracts_explorer.embeddings.EmbeddingsError[source]

Bases: Exception

Exception raised for embedding operations.

class abstracts_explorer.embeddings.EmbeddingsManager(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None)[source]

Bases: object

Manager for generating and storing text embeddings.

This class handles:

Connecting to an OpenAI-compatible API for embedding generation
Creating and managing a ChromaDB collection
Embedding paper abstracts with metadata
Similarity search operations

Parameters:

lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint, by default “http://localhost:1234”
model_name (str, optional) – Name of the embedding model, by default “text-embedding-qwen3-embedding-4b”
collection_name (str, optional) – Name of the ChromaDB collection, by default “papers”

lm_studio_url

OpenAI-compatible API endpoint URL.

Type:: str

model_name

Embedding model name.

Type:: str

embedding_db

ChromaDB configuration - URL for HTTP service or path for local storage.

Type:: str

collection_name

ChromaDB collection name.

Type:: str

client

ChromaDB client instance.

Type:: chromadb.Client or None

collection

Active ChromaDB collection.

Type:: chromadb.Collection or None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)
>>> results = em.search_similar("machine learning", n_results=5)
>>> em.close()

__init__(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None)[source]

Initialize the EmbeddingsManager.

Parameters are optional and will use values from environment/config if not provided.

Parameters:

lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint. If None, uses config value.
model_name (str, optional) – Name of the embedding model. If None, uses config value.
collection_name (str, optional) – Name of the ChromaDB collection. If None, uses config value.

property openai_client: OpenAI

Get the OpenAI client, creating it lazily on first access.

This lazy loading prevents API calls during test collection.

Returns:: Initialized OpenAI client instance.
Return type:: OpenAI

connect()[source]

Connect to ChromaDB.

Uses HTTP client if embedding_db is a URL, otherwise uses persistent client with local storage directory.

Raises:: EmbeddingsError – If connection fails.
Return type:: None
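
The choice between the two client types depends only on whether embedding_db looks like a URL. A minimal sketch of such a check, using only the standard library, might look like the following. The helper name is_http_url is hypothetical and the module's actual test may differ.

```python
from urllib.parse import urlparse


def is_http_url(embedding_db: str) -> bool:
    """Return True if the value looks like an HTTP(S) URL rather than a path.

    Hypothetical helper for illustration; not part of abstracts_explorer.
    """
    parsed = urlparse(embedding_db)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for value in ("http://localhost:8000", "./chroma_db", "/var/lib/chroma"):
    kind = "HTTP client" if is_http_url(value) else "persistent client"
    print(f"{value!r} -> {kind}")
```

Requiring both a scheme and a network location keeps plain paths, including Windows paths with a drive letter, on the persistent-client side.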

close()[source]

Close the ChromaDB connection.

Does nothing if not connected.

Return type:: None

__enter__()[source]: Context manager entry.

__exit__(exc_type, exc_val, exc_tb)[source]: Context manager exit.

test_lm_studio_connection()[source]

Test connection to OpenAI-compatible API endpoint.

Returns:: True if connection is successful, False otherwise.
Return type:: bool

Examples

>>> em = EmbeddingsManager()
>>> if em.test_lm_studio_connection():
...     print("API is accessible")

generate_embedding(text)[source]

Generate embedding for a given text using OpenAI-compatible API.

Parameters:: text (str) – Text to generate embedding for.
Returns:: Embedding vector.
Return type:: List[float]
Raises:: EmbeddingsError – If embedding generation fails.

Examples

>>> em = EmbeddingsManager()
>>> embedding = em.generate_embedding("Sample text")
>>> len(embedding)
4096
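
For orientation, the request that an OpenAI-compatible embeddings endpoint expects can be sketched without any client library. The block below only builds the request and parses a hand-written sample response; it sends nothing. The helper names and the /v1/embeddings path appended to the base URL are assumptions based on the OpenAI API convention, not taken from this module's source.

```python
import json
from urllib.request import Request


def build_embedding_request(base_url: str, model_name: str, text: str) -> Request:
    """Build (but do not send) a POST request for an embeddings endpoint.

    Hypothetical helper; EmbeddingsManager itself goes through its OpenAI client.
    """
    payload = {"model": model_name, "input": text}
    return Request(
        url=base_url.rstrip("/") + "/v1/embeddings",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_embedding(response_body: dict) -> list:
    """Pull the first embedding vector out of a decoded JSON response."""
    return response_body["data"][0]["embedding"]


request = build_embedding_request(
    "http://localhost:1234", "text-embedding-qwen3-embedding-4b", "Sample text"
)
print(request.full_url)

# Hand-written response in the OpenAI format, shortened to three dimensions.
sample_response = {
    "object": "list",
    "data": [{"object": "embedding", "index": 0, "embedding": [0.1, -0.2, 0.3]}],
}
print(extract_embedding(sample_response))
```

The length of the returned vector depends on the model, so the 4096 shown above applies to the default model only.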

create_collection(reset=False)[source]

Create or get ChromaDB collection.

Parameters:: reset (bool, optional) – If True, delete existing collection and create new one, by default False
Raises:: EmbeddingsError – If collection creation fails or not connected.
Return type:: None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.create_collection(reset=True)  # Reset existing collection

paper_exists(paper_id)[source]

Check if a paper already exists in the collection.

Parameters:: paper_id (int or str) – Unique identifier for the paper.
Returns:: True if paper exists in collection, False otherwise.
Return type:: bool
Raises:: EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_exists("uid1")
False
>>> em.add_paper(paper_dict)
>>> em.paper_exists("uid1")
True

paper_needs_update(paper)[source]

Check if a paper needs to be updated in the collection.

Parameters:: paper (dict) – Dictionary containing paper information.
Returns:: True if the paper needs to be updated, False otherwise.
Return type:: bool
Raises:: EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_needs_update({"id": 1, "abstract": "Updated abstract"})
True
>>> em.paper_needs_update({"id": 1, "abstract": "This paper presents..."})
False

static embedding_text_from_paper(paper)[source]

Extract text for embedding from a paper dictionary.

Parameters:: paper (dict) – Dictionary containing paper information.
Returns:: Text to be used for embedding.
Return type:: str

add_paper(paper)[source]

Add a paper to the vector database.

Parameters:: paper (dict) – Dictionary containing paper information. Must follow the paper database schema.
Raises:: EmbeddingsError – If adding paper fails or collection not initialized.
Return type:: None

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)

search_similar(query, n_results=10, where=None)[source]

Search for similar papers using semantic similarity.

Parameters:

query (str) – Query text to search for.
n_results (int, optional) – Number of results to return, by default 10
where (dict, optional) – Metadata filter conditions.

Returns:

Search results containing ids, distances, documents, and metadatas.

Return type:

dict

Raises:

EmbeddingsError – If search fails or collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> results = em.search_similar("deep learning transformers", n_results=5, where={"year": 2025})
>>> for i, paper_id in enumerate(results['ids'][0]):
...     print(f"{i+1}. Paper {paper_id}: {results['metadatas'][0][i]}")
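
Because the result dictionary holds parallel nested lists (one inner list per query), it is often easier to work with one record per hit. A minimal sketch of such a conversion follows. The helper flatten_results and the sample data are illustrative and not part of the module.

```python
def flatten_results(results: dict) -> list:
    """Convert nested query results into one dict per hit.

    Assumes a single query, so only the first inner list of each field is read.
    """
    ids = results["ids"][0]
    distances = results["distances"][0]
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    return [
        # Metadata goes first so the explicit fields below always win.
        {**(meta or {}), "id": pid, "distance": dist, "document": doc}
        for pid, dist, doc, meta in zip(ids, distances, documents, metadatas)
    ]


# Hand-written sample in the shape described above (not real search output).
sample = {
    "ids": [["12", "7"]],
    "distances": [[0.21, 0.48]],
    "documents": [["Abstract A...", "Abstract B..."]],
    "metadatas": [[{"title": "Paper A", "year": 2025}, {"title": "Paper B", "year": 2024}]],
}

for row in flatten_results(sample):
    print(f"{row['id']}: {row['title']} (distance {row['distance']:.2f})")
```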

get_collection_stats()[source]

Get statistics about the collection.

Returns:: Statistics including count, name, and metadata.
Return type:: dict
Raises:: EmbeddingsError – If collection not initialized.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> stats = em.get_collection_stats()
>>> print(f"Collection has {stats['count']} papers")

check_model_compatibility()[source]

Check if the current embedding model matches the one stored in the database.

Returns:

compatible: True if models match or no model is stored, False if they differ
stored_model: Name of the model stored in the database, or None if not set
current_model: Name of the current model

Return type:

tuple of (bool, str or None, str or None)

Raises:

EmbeddingsError – If database operations fail.

Examples

>>> em = EmbeddingsManager()
>>> compatible, stored, current = em.check_model_compatibility()
>>> if not compatible:
...     print(f"Model mismatch: stored={stored}, current={current}")
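
The rule for the compatible flag can be restated as a small pure function. This is a sketch of the documented behaviour only; the function name models_compatible is hypothetical, and how the stored model name is read from the database is not shown.

```python
from typing import Optional, Tuple


def models_compatible(
    stored_model: Optional[str], current_model: str
) -> Tuple[bool, Optional[str], str]:
    """Return (compatible, stored_model, current_model).

    Compatible when no model has been stored yet or when both names match.
    """
    compatible = stored_model is None or stored_model == current_model
    return compatible, stored_model, current_model


print(models_compatible(None, "text-embedding-qwen3-embedding-4b"))
print(models_compatible("all-MiniLM-L6-v2", "text-embedding-qwen3-embedding-4b"))
```

A mismatch matters because vectors produced by different models are not comparable, so a collection embedded with one model should not be queried with another.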

embed_from_database(where_clause=None, progress_callback=None, force_recreate=False)[source]

Embed papers from the database.

Reads papers from the database and generates embeddings for their abstracts.

Parameters:

where_clause (str, optional) – SQL WHERE clause to filter papers (e.g., "decision = 'Accept'")
progress_callback (callable, optional) – Callback function to report progress. Called with (current, total) number of papers processed.
force_recreate (bool, optional) – If True, skip checking for existing embeddings and recreate all, by default False

Returns:

Number of papers successfully embedded.

Return type:

int

Raises:

EmbeddingsError – If database reading or embedding fails.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> count = em.embed_from_database()
>>> print(f"Embedded {count} papers")
>>> # Only embed accepted papers
>>> count = em.embed_from_database(where_clause="decision = 'Accept'")
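
A progress callback only needs to accept the two positional arguments described above. A minimal sketch follows; the function names format_progress and print_progress are illustrative.

```python
def format_progress(current: int, total: int) -> str:
    """Format a progress line; guards against an empty paper set."""
    percent = 100 * current / total if total else 0.0
    return f"Embedded {current}/{total} papers ({percent:.0f}%)"


def print_progress(current: int, total: int) -> None:
    """Callback with the (current, total) signature described above."""
    print(format_progress(current, total))


# Simulated calls for a run over four papers.
for done in range(1, 5):
    print_progress(done, 4)
```

Such a function would then be passed as em.embed_from_database(progress_callback=print_progress).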

search_papers_semantic(query, database, limit=10, sessions=None, years=None, conferences=None)[source]

Perform semantic search for papers using embeddings.

This function combines embedding-based similarity search with metadata filtering and retrieves complete paper information from the database.

Parameters:

query (str) – Search query text
database (DatabaseManager) – Database manager for retrieving full paper details
limit (int, optional) – Maximum number of results to return, by default 10
sessions (list of str, optional) – Filter by paper sessions
years (list of int, optional) – Filter by publication years
conferences (list of str, optional) – Filter by conference names

Returns:

List of paper dictionaries with complete information

Return type:

list of dict

Raises:

EmbeddingsError – If search fails

Examples

>>> papers = em.search_papers_semantic(
...     "transformers in vision",
...     database=db,
...     limit=5,
...     years=[2024, 2025]
... )
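
One way the list-valued filters could be translated into a single ChromaDB metadata filter is sketched below, using the $in and $and operators. The helper build_where and the metadata keys "session", "year", and "conference" are assumptions for illustration; the keys actually stored by this module may be named differently.

```python
from typing import Optional, Sequence


def build_where(
    sessions: Optional[Sequence[str]] = None,
    years: Optional[Sequence[int]] = None,
    conferences: Optional[Sequence[str]] = None,
) -> Optional[dict]:
    """Combine optional list filters into one ChromaDB-style where dict."""
    clauses = []
    if sessions:
        clauses.append({"session": {"$in": list(sessions)}})
    if years:
        clauses.append({"year": {"$in": list(years)}})
    if conferences:
        clauses.append({"conference": {"$in": list(conferences)}})

    if not clauses:
        return None          # no filtering requested
    if len(clauses) == 1:
        return clauses[0]    # a single condition needs no $and wrapper
    return {"$and": clauses}


print(build_where(years=[2024, 2025]))
print(build_where(years=[2024, 2025], conferences=["NeurIPS"]))
```

Returning None when no filter is given lets the result be passed straight through as an optional where argument.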

find_papers_within_distance(database, query, distance_threshold=1.1, conferences=None, years=None)[source]

Find papers within a specified distance from a custom search query.

This method treats the search query as a clustering center and returns papers within the specified Euclidean distance radius in embedding space.

Parameters:

database (DatabaseManager) – Database manager instance for retrieving paper details
query (str) – The search query text
distance_threshold (float, optional) – Euclidean distance radius, by default 1.1
conferences (list[str], optional) – Filter results to only include papers from these conferences
years (list[int], optional) – Filter results to only include papers from these years

Returns:

Dictionary containing:

query: str - The search query
query_embedding: list[float] - The generated embedding for the query
distance: float - The distance threshold used
papers: list[dict] - Papers within the distance radius with their distances
count: int - Number of papers found

Return type:

dict

Raises:

EmbeddingsError – If embeddings collection is empty or operation fails

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> db = DatabaseManager()
>>> db.connect()
>>> results = em.find_papers_within_distance(db, "machine learning", 1.1)
>>> print(f"Found {results['count']} papers")
>>>
>>> # With filters
>>> results = em.find_papers_within_distance(
...     db, "deep learning", 1.1,
...     conferences=["NeurIPS"],
...     years=[2023, 2024]
... )
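
The radius idea can be illustrated with plain Python on toy two-dimensional vectors. This is a sketch of the concept, not the method's implementation: the function within_distance is hypothetical, and treating the threshold as inclusive is an assumption.

```python
import math


def within_distance(query_vec, paper_vecs: dict, threshold: float = 1.1) -> list:
    """Return (paper_id, distance) pairs inside the radius, nearest first."""
    hits = [(pid, math.dist(query_vec, vec)) for pid, vec in paper_vecs.items()]
    return sorted((h for h in hits if h[1] <= threshold), key=lambda h: h[1])


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
query = [0.0, 0.0]
papers = {"a": [0.3, 0.4], "b": [1.0, 1.0], "c": [0.6, 0.8]}

for paper_id, distance in within_distance(query, papers):
    print(f"{paper_id}: {distance:.2f}")
```

Paper "b" lies at a distance of about 1.41 from the query and is therefore excluded by the default radius of 1.1.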

Usage Examples

Basic Setup

from abstracts_explorer.embeddings import EmbeddingsManager

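# Initialize; arguments left out fall back to the configured values
em = EmbeddingsManager(
    collection_name="papers"
)
em.connect()            # open the ChromaDB connection
em.create_collection()  # create or get the collection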

Creating Embeddings

# Create embeddings for all papers in the database
em.embed_from_database()

# Add individual papers (each dict must follow the paper database schema)
papers = [
    {
        'id': 1,
        'title': 'Example Paper',
        'abstract': 'This is the abstract...',
        'year': 2025
    }
]
for paper in papers:
    em.add_paper(paper)

Semantic Search

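# Search by semantic similarity
results = em.search_similar(
    query="transformer architecture",
    n_results=10
)

# Results are nested lists with one inner list per query
for paper_id, meta, distance in zip(
    results['ids'][0], results['metadatas'][0], results['distances'][0]
):
    print(f"{paper_id}: {meta.get('title')}")
    print(f"Distance: {distance:.3f}")  # lower is more similar
    print()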

Filtered Search

# Search with metadata filters
results = em.search_similar(
    query="deep learning",
    n_results=5,
    where={"year": 2025}
)

# Multiple filters: ChromaDB expects several conditions to be combined with $and or $or
results = em.search_similar(
    query="neural networks",
    n_results=10,
    where={
        "$and": [
            {"year": {"$gte": 2023}},
            {"year": {"$lte": 2025}}
        ]
    }
)

Embedding Models

The module supports any embedding model served through LM Studio or another OpenAI-compatible API:

Popular Models

text-embedding-qwen3-embedding-4b (default)
text-embedding-nomic-embed-text-v1.5
all-MiniLM-L6-v2

Configuring Model

# Via configuration
from abstracts_explorer.config import get_config
config = get_config()
# Set EMBEDDING_MODEL in .env file

# Or directly
em = EmbeddingsManager(
    model_name="text-embedding-nomic-embed-text-v1.5"
)

ChromaDB Integration

The module uses ChromaDB for vector storage:

Collection Structure

Documents: Paper abstracts
Metadata: Paper ID, title, year, etc.
Embeddings: Vector representations
IDs: Unique identifiers (paper_id)

Collection Management

# Get collection info
collection = em.collection
print(f"Total papers: {collection.count()}")

# Clear collection (deletes the existing collection and creates a new one)
em.create_collection(reset=True)

# Check if paper exists
exists = em.paper_exists(paper_id=123)

Search Results Format

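search_similar() returns a single dictionary with ids, distances, documents, and metadatas. search_papers_semantic() returns a list of paper dictionaries, for example: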

[
    {
        'id': 'paper_123',
        'title': 'Paper Title',
        'abstract': 'Abstract text...',
        'year': 2025,
        'distance': 0.234,  # Lower is more similar
    },
    # ... more results
]

Performance Considerations

Batch Processing

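To process papers in smaller batches, restrict each run with where_clause and track it with progress_callback:

# Embed one subset of papers per run
em.embed_from_database(where_clause="decision = 'Accept'")

# Report progress while embedding
em.embed_from_database(
    progress_callback=lambda current, total: print(f"{current}/{total}")
)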

Caching

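ChromaDB persists embeddings so they can be reused:

Location: Specified by embedding_db (a URL for an HTTP service or a path for local storage)
Persistence: Embeddings persist across sessions
Updates: Papers that already have embeddings are skipped unless force_recreate=True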

Memory Usage

Embedding models can be memory-intensive:

Smaller models: ~1-2 GB RAM
Larger models: 4-8 GB RAM
Batch size affects peak memory usage

Error Handling

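from abstracts_explorer.embeddings import EmbeddingsError

try:
    em.embed_from_database()
except EmbeddingsError as e:
    # Raised when database reading or embedding fails
    print(f"Embedding error: {e}")
    # Check whether the API endpoint itself is reachable
    if not em.test_lm_studio_connection():
        print("LM Studio connection failed")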

Best Practices

Create embeddings once - They’re cached and reused
Use appropriate batch sizes - Balance speed and memory
Filter searches - Use metadata filters to narrow results
Choose good models - Larger models are more accurate but slower
Monitor LM Studio - Ensure it’s running and model is loaded