Embeddings Module
The embeddings module provides vector embeddings functionality for semantic search using ChromaDB.
Overview
The EmbeddingsManager class handles:
Creating vector embeddings from paper abstracts
Storing embeddings in ChromaDB
Semantic similarity search
Integration with LM Studio embedding models
Class Reference
This module provides functionality to generate text embeddings for paper abstracts and store them in a vector database with paper metadata.
The module uses an OpenAI-compatible API (such as LM Studio or blablador) to generate embeddings and stores them in ChromaDB for efficient similarity search.
- exception abstracts_explorer.embeddings.EmbeddingsError
Bases: Exception
Exception raised for embedding operations.
- class abstracts_explorer.embeddings.EmbeddingsManager(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None)
Bases: object
Manager for generating and storing text embeddings.
This class handles:
- Connecting to an OpenAI-compatible API for embedding generation
- Creating and managing a ChromaDB collection
- Embedding paper abstracts with metadata
- Similarity search operations
- Parameters:
lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint, by default "http://localhost:1234"
auth_token (str, optional) – Authentication token for the API endpoint, if one is required
model_name (str, optional) – Name of the embedding model, by default "text-embedding-qwen3-embedding-4b"
collection_name (str, optional) – Name of the ChromaDB collection, by default "papers"
- client
ChromaDB client instance.
- Type:
chromadb.Client or None
- collection
Active ChromaDB collection.
- Type:
chromadb.Collection or None
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)
>>> results = em.search_similar("machine learning", n_results=5)
>>> em.close()
- __init__(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None)
Initialize the EmbeddingsManager.
Parameters are optional and will use values from environment/config if not provided.
- property openai_client: OpenAI
Get the OpenAI client, creating it lazily on first access.
This lazy loading prevents API calls during test collection.
- Returns:
Initialized OpenAI client instance.
- Return type:
OpenAI
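For illustration, the lazy pattern described above looks roughly like this (a sketch; the class and attribute names are illustrative, not the real implementation):
from openai import OpenAI

class LazyClientSketch:
    """Illustrates lazy client creation; not the real EmbeddingsManager."""

    def __init__(self, base_url="http://localhost:1234", auth_token=None):
        self.base_url = base_url
        self.auth_token = auth_token
        self._client = None  # not created yet

    @property
    def openai_client(self) -> OpenAI:
        # Created on first access, so importing the module (e.g. during
        # test collection) triggers no API setup.
        if self._client is None:
            self._client = OpenAI(
                base_url=self.base_url + "/v1",  # '/v1' suffix is an assumption
                api_key=self.auth_token or "not-needed",
            )
        return self._client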
- connect()
Connect to ChromaDB.
Uses an HTTP client if embedding_db is a URL; otherwise uses a persistent client with a local storage directory.
- Raises:
EmbeddingsError – If connection fails.
- Return type:
None
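The URL-versus-directory dispatch described above could look roughly like this (a sketch assuming the chromadb client API; the helper name is illustrative):
import chromadb
from urllib.parse import urlparse

def make_chroma_client(embedding_db: str):
    # HTTP client for server URLs, persistent client for local directories
    if embedding_db.startswith(("http://", "https://")):
        parsed = urlparse(embedding_db)
        return chromadb.HttpClient(host=parsed.hostname, port=parsed.port or 8000)
    return chromadb.PersistentClient(path=embedding_db)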
- test_lm_studio_connection()
Test connection to OpenAI-compatible API endpoint.
- Returns:
True if connection is successful, False otherwise.
- Return type:
bool
Examples
>>> em = EmbeddingsManager()
>>> if em.test_lm_studio_connection():
...     print("API is accessible")
- generate_embedding(text)
Generate embedding for a given text using OpenAI-compatible API.
- Parameters:
text (str) – Text to generate embedding for.
- Returns:
Embedding vector.
- Return type:
List[float]
- Raises:
EmbeddingsError – If embedding generation fails.
Examples
>>> em = EmbeddingsManager()
>>> embedding = em.generate_embedding("Sample text")
>>> len(embedding)
4096
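Under the hood this corresponds to a single embeddings request against the OpenAI-compatible endpoint. A minimal sketch using the documented defaults (the '/v1' suffix and dummy API key are assumptions for a local LM Studio server):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
response = client.embeddings.create(
    model="text-embedding-qwen3-embedding-4b",
    input="Sample text",
)
vector = response.data[0].embedding  # list[float], e.g. 4096 values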
- create_collection(reset=False)
Create or get ChromaDB collection.
- Parameters:
reset (bool, optional) – If True, delete existing collection and create new one, by default False
- Raises:
EmbeddingsError – If collection creation fails or not connected.
- Return type:
None
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.create_collection(reset=True)  # Reset existing collection
- paper_exists(paper_id)
Check if a paper already exists in the collection.
- Parameters:
paper_id (str) – Unique identifier of the paper.
- Returns:
True if paper exists in collection, False otherwise.
- Return type:
bool
- Raises:
EmbeddingsError – If collection not initialized.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_exists("uid1")
False
>>> em.add_paper(paper_dict)
>>> em.paper_exists("uid1")
True
- paper_needs_update(paper)
Check if a paper needs to be updated in the collection.
- Parameters:
paper (dict) – Dictionary containing paper information.
- Returns:
True if the paper needs to be updated, False otherwise.
- Return type:
bool
- Raises:
EmbeddingsError – If collection not initialized.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_needs_update({"id": 1, "abstract": "Updated abstract"})
True
>>> em.paper_needs_update({"id": 1, "abstract": "This paper presents..."})
False
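One plausible shape for this check, assuming the collection stores each paper's embedding text as its document (a sketch, not the library's actual internals):
def needs_update(collection, paper_id: str, new_text: str) -> bool:
    # Fetch any existing entry for this id from ChromaDB
    existing = collection.get(ids=[paper_id])
    if not existing["ids"]:
        return True  # never embedded
    # Re-embed only if the stored text differs from the new text
    return existing["documents"][0] != new_text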
- static embedding_text_from_paper(paper)
Extract text for embedding from a paper dictionary.
- add_paper(paper)
Add a paper to the vector database.
- Parameters:
paper (dict) – Dictionary containing paper information. Must follow the paper database schema.
- Raises:
EmbeddingsError – If adding paper fails or collection not initialized.
- Return type:
None
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)
- search_similar(query, n_results=10, where=None)
Search for similar papers using semantic similarity.
- Parameters:
query (str) – Search query text.
n_results (int, optional) – Maximum number of results to return, by default 10
where (dict, optional) – ChromaDB metadata filter, e.g. {"year": 2025}
- Returns:
Search results containing ids, distances, documents, and metadatas.
- Return type:
dict
- Raises:
EmbeddingsError – If search fails or collection not initialized.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> results = em.search_similar("deep learning transformers", n_results=5, where={"year": 2025})
>>> for i, paper_id in enumerate(results['ids'][0]):
...     print(f"{i+1}. Paper {paper_id}: {results['metadatas'][0][i]}")
- get_collection_stats()
Get statistics about the collection.
- Returns:
Statistics including count, name, and metadata.
- Return type:
dict
- Raises:
EmbeddingsError – If collection not initialized.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> stats = em.get_collection_stats()
>>> print(f"Collection has {stats['count']} papers")
- check_model_compatibility()
Check if the current embedding model matches the one stored in the database.
- Returns:
compatible: True if models match or no model is stored, False if they differ
stored_model: Name of the model stored in the database, or None if not set
current_model: Name of the current model
- Return type:
tuple
- Raises:
EmbeddingsError – If database operations fail.
Examples
>>> em = EmbeddingsManager()
>>> compatible, stored, current = em.check_model_compatibility()
>>> if not compatible:
...     print(f"Model mismatch: stored={stored}, current={current}")
- embed_from_database(where_clause=None, progress_callback=None, force_recreate=False)
Embed papers from the database.
Reads papers from the database and generates embeddings for their abstracts.
- Parameters:
where_clause (str, optional) – SQL WHERE clause to filter papers (e.g., "decision = 'Accept'")
progress_callback (callable, optional) – Callback function to report progress. Called with (current, total) number of papers processed.
force_recreate (bool, optional) – If True, skip checking for existing embeddings and recreate all, by default False
- Returns:
Number of papers successfully embedded.
- Return type:
int
- Raises:
EmbeddingsError – If database reading or embedding fails.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> count = em.embed_from_database()
>>> print(f"Embedded {count} papers")
>>> # Only embed accepted papers
>>> count = em.embed_from_database(where_clause="decision = 'Accept'")
- search_papers_semantic(query, database, limit=10, sessions=None, years=None, conferences=None)
Perform semantic search for papers using embeddings.
This function combines embedding-based similarity search with metadata filtering and retrieves complete paper information from the database.
- Parameters:
query (str) – Search query text
database (DatabaseManager) – Database manager for retrieving full paper details
limit (int, optional) – Maximum number of results to return, by default 10
sessions (list of str, optional) – Filter by session names
years (list of int, optional) – Filter by publication years
conferences (list of str, optional) – Filter by conference names
- Returns:
List of paper dictionaries with complete information
- Return type:
list of dict
- Raises:
EmbeddingsError – If search fails
Examples
>>> papers = em.search_papers_semantic(
...     "transformers in vision",
...     database=db,
...     limit=5,
...     years=[2024, 2025]
... )
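Conceptually the method chains two documented steps: a filtered vector search, then a database lookup for full records. A sketch of that flow (the get_paper helper on DatabaseManager is hypothetical):
results = em.search_similar(
    "transformers in vision",
    n_results=5,
    where={"year": {"$in": [2024, 2025]}},
)
papers = [database.get_paper(pid) for pid in results["ids"][0]]  # hypothetical helper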
- find_papers_within_distance(database, query, distance_threshold=1.1, conferences=None, years=None)
Find papers within a specified distance from a custom search query.
This method treats the search query as a clustering center and returns papers within the specified Euclidean distance radius in embedding space.
- Parameters:
database (DatabaseManager) – Database manager instance for retrieving paper details
query (str) – The search query text
distance_threshold (float, optional) – Euclidean distance radius, by default 1.1
conferences (list[str], optional) – Filter results to only include papers from these conferences
years (list[int], optional) – Filter results to only include papers from these years
- Returns:
Dictionary containing:
- query: str – The search query
- query_embedding: list[float] – The generated embedding for the query
- distance: float – The distance threshold used
- papers: list[dict] – Papers within the distance radius with their distances
- count: int – Number of papers found
- Return type:
dict
- Raises:
EmbeddingsError – If embeddings collection is empty or operation fails
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> db = DatabaseManager()
>>> db.connect()
>>> results = em.find_papers_within_distance(db, "machine learning", 1.1)
>>> print(f"Found {results['count']} papers")
>>>
>>> # With filters
>>> results = em.find_papers_within_distance(
...     db, "deep learning", 1.1,
...     conferences=["NeurIPS"],
...     years=[2023, 2024]
... )
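A radius query like this can be approximated on top of search_similar() by over-fetching and keeping only hits inside the threshold (an assumed approach, not necessarily the method's implementation):
results = em.search_similar("machine learning", n_results=100)
within = [
    (pid, dist)
    for pid, dist in zip(results["ids"][0], results["distances"][0])
    if dist <= 1.1  # the distance_threshold radius
]
print(f"Found {len(within)} papers")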
Usage Examples
Basic Setup
from abstracts_explorer.embeddings import EmbeddingsManager
# Initialize with an explicit collection name; connection settings
# come from environment/config when not provided
em = EmbeddingsManager(collection_name="papers")
em.connect()
em.create_collection()
Creating Embeddings
# Create embeddings for all papers in the database
em.embed_from_database()

# Add an individual paper
paper = {
    'id': 1,
    'title': 'Example Paper',
    'abstract': 'This is the abstract...',
    'year': 2025
}
em.add_paper(paper)
Semantic Search
# Search by semantic similarity
results = em.search_similar(
    query="transformer architecture",
    n_results=10
)

for i, paper_id in enumerate(results['ids'][0]):
    meta = results['metadatas'][0][i]
    dist = results['distances'][0][i]
    print(f"{meta['title']}")
    print(f"Distance: {dist:.3f}")  # lower is more similar
    print()
Filtered Search
# Search with a metadata filter
results = em.search_similar(
    query="deep learning",
    n_results=5,
    where={"year": 2025}
)

# Multiple metadata conditions (combined with $and)
results = em.search_similar(
    query="neural networks",
    n_results=10,
    where={"$and": [
        {"year": {"$gte": 2023}},
        {"conference": {"$eq": "NeurIPS"}},  # assumes a 'conference' metadata field
    ]}
)
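Note that ChromaDB's $contains operator matches document text rather than metadata fields, and lives in a separate where_document filter. Matching words in the abstract therefore means querying the underlying collection directly (illustrative; search_similar() does not document a where_document parameter):
raw = em.collection.query(
    query_embeddings=[em.generate_embedding("neural networks")],
    n_results=10,
    where={"year": {"$gte": 2023}},
    where_document={"$contains": "transformer"},  # matches abstract text
)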
Embedding Models
The module supports any embedding model available through LM Studio:
Popular Models
text-embedding-qwen3-embedding-4b (default)
text-embedding-nomic-embed-text-v1.5
all-MiniLM-L6-v2
Configuring Model
# Via configuration: set EMBEDDING_MODEL in the .env file
from abstracts_explorer.config import get_config
config = get_config()

# Or pass the model name directly
em = EmbeddingsManager(
    model_name="text-embedding-nomic-embed-text-v1.5"
)
ChromaDB Integration
The module uses ChromaDB for vector storage:
Collection Structure
Documents: Paper abstracts
Metadata: Paper ID, title, year, etc.
Embeddings: Vector representations
IDs: Unique identifiers (paper_id)
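These four fields map one-to-one onto a ChromaDB add() call (illustrative values; the keyword names follow the chromadb API):
em.collection.add(
    ids=["paper_123"],                                     # unique identifier
    documents=["This is the abstract..."],                 # paper abstract
    metadatas=[{"title": "Example Paper", "year": 2025}],  # paper metadata
    embeddings=[em.generate_embedding("This is the abstract...")],
)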
Collection Management
# Get collection info
collection = em.collection
print(f"Total papers: {collection.count()}")
# Reset (clear) the collection
em.create_collection(reset=True)

# Check if a paper exists
exists = em.paper_exists("paper_123")
Search Results Format
search_similar() returns ChromaDB-style results: parallel lists, nested one level per query:
{
    'ids': [['paper_123', ...]],
    'distances': [[0.234, ...]],  # lower is more similar
    'documents': [['Abstract text...', ...]],
    'metadatas': [[{'title': 'Paper Title', 'year': 2025}, ...]],
}
search_papers_semantic() instead returns a flat list of complete paper dictionaries drawn from the database.
Performance Considerations
Progress Reporting
embed_from_database() processes papers sequentially; pass a progress_callback to monitor long runs:
# Called with (current, total) as papers are processed
def report(current, total):
    print(f"Embedded {current}/{total} papers")

count = em.embed_from_database(progress_callback=report)
Caching
ChromaDB caches embeddings on disk:
Location: Specified by embedding_db_path
Persistence: Embeddings persist across sessions
Updates: Only new papers are embedded
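Persistence is easy to verify by opening a second manager against the same store (sketch using only the documented API):
# A fresh manager pointed at the same local store sees the same papers
em2 = EmbeddingsManager(collection_name="papers")
em2.connect()
em2.create_collection()
print(em2.get_collection_stats()["count"])  # unchanged across sessions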
Memory Usage
Embedding models can be memory-intensive:
Smaller models: ~1-2 GB RAM
Larger models: 4-8 GB RAM
Batch size affects peak memory usage
Error Handling
from abstracts_explorer.embeddings import EmbeddingsError

if not em.test_lm_studio_connection():
    print("LM Studio connection failed")

try:
    em.embed_from_database()
except EmbeddingsError as e:
    print(f"Embedding error: {e}")
Best Practices
Create embeddings once - They’re cached and reused
Use appropriate batch sizes - Balance speed and memory
Filter searches - Use metadata filters to narrow results
Choose good models - Larger models are more accurate but slower
Monitor LM Studio - Ensure it’s running and model is loaded