Embeddings Module
The embeddings module provides vector-embedding functionality for semantic search, backed by ChromaDB.
Overview
The EmbeddingsManager class handles:
Creating vector embeddings from paper abstracts
Storing embeddings in ChromaDB
Semantic similarity search
Integration with LM Studio embedding models
Class Reference
Embeddings Module
This module provides functionality to generate text embeddings for paper abstracts and store them in a vector database with paper metadata.
The module uses an OpenAI-compatible API (such as LM Studio or blablador) to generate embeddings and stores them in ChromaDB for efficient similarity search.
- class abstracts_explorer.embeddings.RateLimitedTransport(transport, requests_per_minute)[source]
Bases: BaseTransport
An httpx transport that enforces a maximum requests-per-minute rate.
Wraps an existing transport and sleeps between requests to stay within the configured rate limit.
- Parameters:
transport (httpx.BaseTransport) – The underlying transport to delegate requests to.
requests_per_minute (int) – Maximum number of requests per minute. Must be > 0.
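The throttling described above can be sketched in isolation; a minimal, sleep-based sketch of the same idea (the class and method names below are illustrative, not the module's internals):

```python
import time

class MinimalRateLimiter:
    """Sleep-based limiter allowing at most requests_per_minute calls per minute."""

    def __init__(self, requests_per_minute):
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be > 0")
        # Minimum spacing between consecutive requests, in seconds.
        self.min_interval = 60.0 / requests_per_minute
        self.last_request = None

    def wait(self):
        """Sleep just long enough to honor the configured rate, then record the call."""
        now = time.monotonic()
        if self.last_request is not None:
            elapsed = now - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

limiter = MinimalRateLimiter(requests_per_minute=120)
print(limiter.min_interval)  # 0.5 seconds between requests
```

RateLimitedTransport applies the same spacing, but inside httpx's transport layer so callers never see it.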
- exception abstracts_explorer.embeddings.EmbeddingsError[source]
Bases: Exception
Exception raised for embedding operations.
- class abstracts_explorer.embeddings.EmbeddingsManager(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None, requests_per_minute=None)[source]
Bases: object
Manager for generating and storing text embeddings.
This class handles:
- Connecting to an OpenAI-compatible API for embedding generation
- Creating and managing a ChromaDB collection
- Embedding paper abstracts with metadata
- Similarity search operations
- Parameters:
lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint, by default “http://localhost:1234”
auth_token (str, optional) – Authentication token for the API endpoint. If None, uses the config value.
model_name (str, optional) – Name of the embedding model, by default “text-embedding-qwen3-embedding-4b”
collection_name (str, optional) – Name of the ChromaDB collection, by default “papers”
requests_per_minute (int, optional) – Maximum number of API requests per minute. Set to 0 to disable rate limiting. If None, uses the value from config (default: 60).
- client
ChromaDB client instance. Connected automatically on first access.
- Type:
chromadb.Client
- collection
Active ChromaDB collection. Created automatically on first access (which also connects the client if not yet connected).
- Type:
chromadb.Collection
Examples
>>> em = EmbeddingsManager()
>>> em.add_paper(paper_dict)  # connect() and create_collection() called automatically
>>> results = em.search_similar("machine learning", n_results=5)
>>> em.close()
- __init__(lm_studio_url=None, auth_token=None, model_name=None, collection_name=None, requests_per_minute=None)[source]
Initialize the EmbeddingsManager.
Parameters are optional and will use values from environment/config if not provided.
- Parameters:
lm_studio_url (str, optional) – URL of the OpenAI-compatible API endpoint. If None, uses config value.
auth_token (str, optional) – Authentication token for the API endpoint. If None, uses config value.
model_name (str, optional) – Name of the embedding model. If None, uses config value.
collection_name (str, optional) – Name of the ChromaDB collection. If None, uses config value.
requests_per_minute (int, optional) – Maximum number of API requests per minute. Set to 0 to disable rate limiting. If None, uses the value from config (default: 60).
- property client: Any
Get the ChromaDB client, connecting automatically on first access.
- Returns:
Initialized ChromaDB client instance.
- Return type:
chromadb.Client
- Raises:
EmbeddingsError – If connecting to ChromaDB fails.
- property collection: Any
Get the ChromaDB collection, creating it automatically on first access.
Calling this property for the first time also triggers connect() if the client has not been initialized yet.
- Returns:
Initialized ChromaDB collection.
- Return type:
chromadb.Collection
- Raises:
EmbeddingsError – If connecting to ChromaDB or creating the collection fails.
- property openai_client: OpenAI
Get the OpenAI client, creating it lazily on first access.
When requests_per_minute is greater than 0, a RateLimitedTransport is wrapped around the default httpx transport and passed as the http_client argument so that every HTTP request is automatically throttled.
This lazy loading prevents API calls during test collection.
- Returns:
Initialized OpenAI client instance.
- Return type:
OpenAI
- connect()[source]
Connect to ChromaDB.
Uses an HTTP client if embedding_db is a URL; otherwise uses a persistent client with a local storage directory.
- Raises:
EmbeddingsError – If connection fails.
- Return type:
None
- test_lm_studio_connection()[source]
Test connection to OpenAI-compatible API endpoint.
- Returns:
True if connection is successful, False otherwise.
- Return type:
bool
Examples
>>> em = EmbeddingsManager()
>>> if em.test_lm_studio_connection():
...     print("API is accessible")
- generate_embedding(text)[source]
Generate embedding for a given text using OpenAI-compatible API.
Rate limiting (if configured via requests_per_minute) is handled transparently by the underlying httpx transport.
- Parameters:
text (str) – Text to generate embedding for.
- Returns:
Embedding vector.
- Return type:
List[float]
- Raises:
EmbeddingsError – If embedding generation fails.
Examples
>>> em = EmbeddingsManager()
>>> embedding = em.generate_embedding("Sample text")
>>> len(embedding)
4096
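Once you have embedding vectors, comparing them is simple arithmetic; a minimal sketch of cosine similarity over plain Python lists (the helper is illustrative; ChromaDB performs this kind of comparison internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```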
- create_collection(reset=False)[source]
Create or get ChromaDB collection.
- Parameters:
reset (bool, optional) – If True, delete the existing collection and create a new one, by default False
- Raises:
EmbeddingsError – If collection creation fails or not connected.
- Return type:
None
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.create_collection(reset=True)  # Reset existing collection
- paper_exists(paper_id)[source]
Check if a paper already exists in the collection.
- Parameters:
paper_id (str) – Unique identifier of the paper.
- Returns:
True if paper exists in collection, False otherwise.
- Return type:
bool
- Raises:
EmbeddingsError – If collection not initialized.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_exists("uid1")
False
>>> em.add_paper(paper_dict)
>>> em.paper_exists("uid1")
True
- paper_needs_update(paper)[source]
Check if a paper needs to be updated in the collection.
- Parameters:
paper (dict) – Dictionary containing paper information.
- Returns:
True if the paper needs to be updated, False otherwise.
- Return type:
bool
- Raises:
EmbeddingsError – If collection not initialized.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.paper_needs_update({"id": 1, "abstract": "Updated abstract"})
True
>>> em.paper_needs_update({"id": 1, "abstract": "This paper presents..."})
False
- static embedding_text_from_paper(paper)[source]
Extract text for embedding from a paper dictionary.
- static parse_chromadb_metadata(metadata)[source]
Parse a raw ChromaDB metadata dict through the LightweightPaper model.
ChromaDB stores all values as strings (see add_paper()). This method converts a raw metadata dict into one with properly typed values by running it through prepare_chroma_db_paper_data() and then validating via LightweightPaper.
- Parameters:
metadata (dict) – Raw metadata dictionary from ChromaDB.
- Returns:
Metadata dictionary with values converted to their canonical types. Authors will be a list[str] and keywords a list[str].
- Return type:
dict
Examples
>>> raw = {"title": "My Paper", "year": "2024", "original_id": "42",
...        "authors": "Alice;Bob", "abstract": "An abstract",
...        "session": "ML", "poster_position": "1",
...        "conference": "NeurIPS"}
>>> parsed = EmbeddingsManager.parse_chromadb_metadata(raw)
>>> parsed["year"]
2024
>>> parsed["authors"]
['Alice', 'Bob']
See also
LightweightPaper – Pydantic model used for validation.
prepare_chroma_db_paper_data – Converts ChromaDB string fields to proper types before validation.
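Conceptually, the string-to-typed conversion works like this; a minimal sketch under the ';'-separated-authors convention shown in the example above (convert_chroma_metadata is an illustrative name, not the actual prepare_chroma_db_paper_data()):

```python
def convert_chroma_metadata(raw):
    """Convert ChromaDB's all-string metadata into typed values.

    Assumes year/original_id are numeric strings and authors/keywords are
    ';'-separated, matching the storage convention described above.
    """
    typed = dict(raw)
    for field in ("year", "original_id"):
        if field in typed:
            typed[field] = int(typed[field])
    for field in ("authors", "keywords"):
        if field in typed and isinstance(typed[field], str):
            # Split and drop empty pieces left by trailing separators.
            typed[field] = [part for part in typed[field].split(";") if part]
    return typed

raw = {"title": "My Paper", "year": "2024", "authors": "Alice;Bob"}
print(convert_chroma_metadata(raw))
# {'title': 'My Paper', 'year': 2024, 'authors': ['Alice', 'Bob']}
```

The real method additionally validates the result through the LightweightPaper model.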
- add_paper(paper)[source]
Add a paper to the vector database.
- Parameters:
paper (dict) – Dictionary containing paper information. Must follow the paper database schema.
- Raises:
EmbeddingsError – If adding paper fails or collection not initialized.
- Return type:
None
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> em.add_paper(paper_dict)
- search_similar(query, n_results=10, where=None)[source]
Search for similar papers using semantic similarity.
- Parameters:
query (str) – Search query text.
n_results (int, optional) – Maximum number of results to return, by default 10
where (dict, optional) – Metadata filter in ChromaDB where-clause format, by default None
- Returns:
Search results containing ids, distances, documents, and metadatas.
- Return type:
dict
- Raises:
EmbeddingsError – If search fails or collection not initialized.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> results = em.search_similar("deep learning transformers", n_results=5, where={"year": 2025})
>>> for i, paper_id in enumerate(results['ids'][0]):
...     print(f"{i+1}. Paper {paper_id}: {results['metadatas'][0][i]}")
- get_collection_stats()[source]
Get statistics about the collection.
- Returns:
Statistics including count, name, and metadata.
- Return type:
dict
- Raises:
EmbeddingsError – If collection not initialized.
Examples
>>> em = EmbeddingsManager() >>> em.connect() >>> em.create_collection() >>> stats = em.get_collection_stats() >>> print(f"Collection has {stats['count']} papers")
- check_model_compatibility()[source]
Check if the current embedding model matches the one stored in the database.
- Returns:
compatible (bool) – True if models match or no model is stored, False if they differ.
stored_model (str or None) – Name of the model stored in the database, or None if not set.
current_model (str) – Name of the current model.
- Return type:
tuple
- Raises:
EmbeddingsError – If database operations fail.
Examples
>>> em = EmbeddingsManager()
>>> compatible, stored, current = em.check_model_compatibility()
>>> if not compatible:
...     print(f"Model mismatch: stored={stored}, current={current}")
- embed_from_database(where_clause=None, progress_callback=None, force_recreate=False)[source]
Embed papers from the database.
Reads papers from the database and generates embeddings for their abstracts.
- Parameters:
where_clause (str, optional) – SQL WHERE clause to filter papers (e.g., “decision = ‘Accept’”)
progress_callback (callable, optional) – Callback function to report progress. Called with (current, total) number of papers processed.
force_recreate (bool, optional) – If True, skip checking for existing embeddings and recreate all, by default False
- Returns:
Number of papers successfully embedded.
- Return type:
int
- Raises:
EmbeddingsError – If database reading or embedding fails.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> count = em.embed_from_database()
>>> print(f"Embedded {count} papers")
>>> # Only embed accepted papers
>>> count = em.embed_from_database(where_clause="decision = 'Accept'")
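The progress_callback is any callable taking (current, total); a minimal sketch of one that reports a percentage (the message format is illustrative):

```python
def print_progress(current, total):
    """Progress callback for embed_from_database: called with (current, total)."""
    msg = f"Embedded {current}/{total} papers ({100.0 * current / total:.0f}%)"
    print(msg)
    return msg

# Simulated loop standing in for embed_from_database(progress_callback=print_progress)
total = 4
for current in range(1, total + 1):
    print_progress(current, total)
```

In real use you would pass it directly: em.embed_from_database(progress_callback=print_progress).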
- search_papers_semantic(query, database, limit=10, sessions=None, years=None, conferences=None)[source]
Perform semantic search for papers using embeddings.
This function combines embedding-based similarity search with metadata filtering and retrieves complete paper information from the database.
- Parameters:
query (str) – Search query text
database (DatabaseManager) – Database manager for retrieving full paper details
limit (int, optional) – Maximum number of results to return, by default 10
sessions (list of str, optional) – Filter by session names
years (list of int, optional) – Filter by publication years
conferences (list of str, optional) – Filter by conference names
- Returns:
List of paper dictionaries with complete information
- Return type:
list
- Raises:
EmbeddingsError – If search fails
Examples
>>> papers = em.search_papers_semantic(
...     "transformers in vision",
...     database=db,
...     limit=5,
...     years=[2024, 2025]
... )
- find_papers_within_distance(database, query, distance_threshold=1.1, conferences=None, years=None)[source]
Find papers within a specified distance from a custom search query.
This method treats the search query as a clustering center and returns papers within the specified Euclidean distance radius in embedding space.
- Parameters:
database (DatabaseManager) – Database manager instance for retrieving paper details
query (str) – The search query text
distance_threshold (float, optional) – Euclidean distance radius, by default 1.1
conferences (list[str], optional) – Filter results to only include papers from these conferences
years (list[int], optional) – Filter results to only include papers from these years
- Returns:
Dictionary containing:
- query (str) – The search query
- query_embedding (list[float]) – The generated embedding for the query
- distance (float) – The distance threshold used
- papers (list[dict]) – Papers within the distance radius, with their distances
- count (int) – Number of papers found within the distance threshold
- total_considered (int) – Total number of papers matching the conference/year filters (before distance filtering)
- Return type:
dict
- Raises:
EmbeddingsError – If embeddings collection is empty or operation fails
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> db = DatabaseManager()
>>> db.connect()
>>> results = em.find_papers_within_distance(db, "machine learning", 1.1)
>>> print(f"Found {results['count']} papers")
>>>
>>> # With filters
>>> results = em.find_papers_within_distance(
...     db, "deep learning", 1.1,
...     conferences=["NeurIPS"],
...     years=[2023, 2024]
... )
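The radius filtering itself reduces to Euclidean distances plus a threshold; a minimal sketch with toy 2-D vectors standing in for real embeddings (which have thousands of dimensions; helper names are illustrative):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def within_distance(query_vec, candidates, threshold):
    """Return (paper_id, distance) pairs inside the radius, nearest first."""
    hits = [(pid, euclidean(query_vec, vec)) for pid, vec in candidates.items()]
    hits = [(pid, d) for pid, d in hits if d <= threshold]
    return sorted(hits, key=lambda h: h[1])

candidates = {"p1": [0.0, 0.1], "p2": [3.0, 4.0], "p3": [0.5, 0.5]}
# Keeps p1 (distance 0.1) and p3 (~0.71); p2 sits at distance 5.0.
print(within_distance([0.0, 0.0], candidates, threshold=1.1))
```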
- export_embeddings(conference, year)[source]
Export embeddings for a given conference and year to a JSON-serializable dict.
- Parameters:
conference (str) – Conference name to export.
year (int) – Year to export.
- Returns:
Dictionary containing ids, documents, metadatas, and embeddings lists. Embedding vectors are converted to plain Python lists so the returned dict is always JSON-serializable.
- Return type:
dict
- Raises:
EmbeddingsError – If the export fails.
- import_embeddings(data, conference, year, batch_size=100)[source]
Import embeddings for a given conference and year from a dictionary.
Existing embeddings for the same conference and year are deleted before importing (replace semantics).
- Parameters:
data (dict) – Dictionary with ids, documents, metadatas, and embeddings lists (as returned by export_embeddings()).
conference (str) – Conference name being imported.
year (int) – Year being imported.
batch_size (int) – Number of embeddings to add per batch.
- Returns:
Number of embeddings imported.
- Return type:
int
- Raises:
EmbeddingsError – If the import fails.
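The batch_size parameter controls how the parallel lists are split during import; a minimal sketch of the chunking step (chunked is an illustrative helper, not part of the module):

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 250 ids split into batches of 100 gives batch sizes 100, 100, 50.
ids = [f"paper_{i}" for i in range(250)]
batches = list(chunked(ids, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```

In the real import, the ids, documents, metadatas, and embeddings lists are sliced in lockstep so each batch stays aligned.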
Usage Examples
Basic Setup
from abstracts_explorer.embeddings import EmbeddingsManager
# Initialize with a custom collection name
em = EmbeddingsManager(
    collection_name="papers"
)
Creating Embeddings
# Create embeddings from all papers in the database
em.embed_from_database()

# Add an individual paper
paper = {
    'id': 1,
    'title': 'Example Paper',
    'abstract': 'This is the abstract...',
    'year': 2025
}
em.add_paper(paper)
Semantic Search
# Search by semantic similarity
results = em.search_similar(
    query="transformer architecture",
    n_results=10
)

for i, paper_id in enumerate(results['ids'][0]):
    meta = results['metadatas'][0][i]
    distance = results['distances'][0][i]
    print(f"{meta['title']}")
    print(f"Distance: {distance:.3f}")
    print()
Filtered Search
# Search with metadata filters
results = em.search_similar(
    query="deep learning",
    n_results=5,
    where={"year": 2025}
)

# Multiple conditions use ChromaDB's operator syntax
results = em.search_similar(
    query="neural networks",
    n_results=10,
    where={
        "$and": [
            {"year": {"$gte": 2023}},
            {"conference": "NeurIPS"}
        ]
    }
)
Embedding Models
The module supports any embedding model available through LM Studio:
Popular Models
text-embedding-qwen3-embedding-4b (default)
text-embedding-nomic-embed-text-v1.5
all-MiniLM-L6-v2
Configuring Model
# Via configuration
from abstracts_explorer.config import get_config
config = get_config()
# Set EMBEDDING_MODEL in .env file
# Or directly
em = EmbeddingsManager(
    model_name="text-embedding-nomic-embed-text-v1.5"
)
ChromaDB Integration
The module uses ChromaDB for vector storage:
Collection Structure
Documents: Paper abstracts
Metadata: Paper ID, title, year, etc.
Embeddings: Vector representations
IDs: Unique identifiers (paper_id)
Collection Management
# Get collection info
collection = em.collection
print(f"Total papers: {collection.count()}")

# Reset (clear) the collection
em.create_collection(reset=True)

# Check if paper exists
exists = em.paper_exists("123")
Search Results Format
search_similar() returns a ChromaDB-style dictionary of parallel lists, grouped by query:
{
    'ids': [['paper_123', ...]],
    'documents': [['Abstract text...', ...]],
    'metadatas': [[{'title': 'Paper Title', 'year': 2025}, ...]],
    'distances': [[0.234, ...]],  # Lower is more similar
}
search_papers_semantic() instead returns a flat list of complete paper dictionaries drawn from the database.
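Iterating the ChromaDB-style structure returned by search_similar() looks like this; a minimal sketch using a mock results dict (the values are made up for illustration):

```python
# Mock of a ChromaDB query result: the outer lists are indexed by query,
# and search_similar() issues a single query, hence the [0] below.
results = {
    "ids": [["paper_1", "paper_2"]],
    "distances": [[0.21, 0.48]],
    "metadatas": [[{"title": "First"}, {"title": "Second"}]],
}

rows = [
    f"{pid}: {meta['title']} (distance {dist:.2f})"
    for pid, dist, meta in zip(
        results["ids"][0], results["distances"][0], results["metadatas"][0]
    )
]
print("\n".join(rows))
```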
Performance Considerations
Batch Processing
Embedding generation processes papers one at a time; batching applies when importing previously exported embeddings:
# Default batch size: 100
count = em.import_embeddings(data, conference="NeurIPS", year=2025, batch_size=100)
# Larger batches use more memory but make fewer ChromaDB calls
count = em.import_embeddings(data, conference="NeurIPS", year=2025, batch_size=500)
Caching
ChromaDB caches embeddings on disk:
Location: specified by embedding_db_path
Persistence: embeddings persist across sessions
Updates: only new papers are embedded
Memory Usage
Embedding models can be memory-intensive:
Smaller models: ~1-2 GB RAM
Larger models: 4-8 GB RAM
Batch size affects peak memory usage
Error Handling
from abstracts_explorer.embeddings import EmbeddingsError

try:
    em.embed_from_database()
except EmbeddingsError as e:
    print(f"Embedding error: {e}")
Best Practices
Create embeddings once - They’re cached and reused
Use appropriate batch sizes - Balance speed and memory
Filter searches - Use metadata filters to narrow results
Choose good models - Larger models are more accurate but slower
Monitor LM Studio - Ensure it’s running and model is loaded