Database Module
The database module provides an interface for storing and retrieving conference papers in SQLite.
Overview
The DatabaseManager class handles all database operations including:
Creating and managing database schema
Adding papers and authors
Searching and filtering papers
Managing paper-author relationships
Class Reference
Database Module
This module provides functionality to load JSON data into a SQL database. Supports both SQLite and PostgreSQL backends via SQLAlchemy.
- exception abstracts_explorer.database.DatabaseError[source]
Bases:
ExceptionException raised for database operations.
- class abstracts_explorer.database.DatabaseManager[source]
Bases:
objectManager for SQL database operations using SQLAlchemy.
Supports SQLite and PostgreSQL backends through SQLAlchemy connection URLs. Database configuration is read from the config file (PAPER_DB variable).
- engine
SQLAlchemy engine instance.
- Type:
Engine or None
- SessionLocal
SQLAlchemy session factory.
- Type:
sessionmaker or None
- _session
Active database session if connected.
- Type:
Session or None
Examples
>>> # Database configuration comes from config file >>> db = DatabaseManager() >>> db.connect() >>> db.create_tables() >>> db.close()
- __init__()[source]
Initialize the DatabaseManager.
Reads database configuration from the config file.
- connect()[source]
Connect to the database.
Creates the database file if it doesn’t exist (SQLite only).
- Raises:
DatabaseError – If connection fails.
- Return type:
- create_tables()[source]
Create database tables for papers and embeddings metadata.
Creates the following tables: - papers: Main table for paper information with lightweight ML4PS schema - embeddings_metadata: Metadata about embeddings (model used, creation date) - clustering_cache: Cache for clustering results
This method is idempotent - it can be called multiple times without error. Tables are only created if they don’t already exist.
- Raises:
DatabaseError – If table creation fails.
- Return type:
- add_paper(paper)[source]
Add a single paper to the database.
- Parameters:
paper (LightweightPaper) – Validated paper object to insert.
- Returns:
The UID of the inserted paper, or None if paper was skipped (duplicate).
- Return type:
str or None
- Raises:
DatabaseError – If insertion fails.
Examples
>>> from abstracts_explorer.plugin import LightweightPaper >>> db = DatabaseManager() >>> with db: ... db.create_tables() ... paper = LightweightPaper( ... title="Test Paper", ... authors=["John Doe"], ... abstract="Test abstract", ... session="Session 1", ... poster_position="P1", ... year=2025, ... conference="NeurIPS" ... ) ... paper_uid = db.add_paper(paper) >>> print(f"Inserted paper with UID: {paper_uid}")
- add_papers(papers)[source]
Add multiple papers to the database in a batch.
- Parameters:
papers (list of LightweightPaper) – List of validated paper objects to insert.
- Returns:
Number of papers successfully inserted (excludes duplicates).
- Return type:
- Raises:
DatabaseError – If batch insertion fails.
Examples
>>> from abstracts_explorer.plugin import LightweightPaper >>> db = DatabaseManager() >>> with db: ... db.create_tables() ... papers = [ ... LightweightPaper( ... title="Paper 1", ... authors=["Author 1"], ... abstract="Abstract 1", ... session="Session 1", ... poster_position="P1", ... year=2025, ... conference="NeurIPS" ... ), ... LightweightPaper( ... title="Paper 2", ... authors=["Author 2"], ... abstract="Abstract 2", ... session="Session 2", ... poster_position="P2", ... year=2025, ... conference="NeurIPS" ... ) ... ] ... count = db.add_papers(papers) >>> print(f"Inserted {count} papers")
- donate_validation_data(paper_priorities)[source]
Store donated paper rating data for validation purposes.
This method accepts anonymized paper ratings from users and stores them in the validation_data table for improving the service.
- Parameters:
paper_priorities (Dict[str, Dict[str, Any]]) – Dictionary mapping paper UIDs to priority data. Each priority data dict must contain: - priority (int): Rating value - searchTerm (str, optional): Search term associated with the rating
- Returns:
Number of papers successfully donated
- Return type:
- Raises:
ValueError – If paper_priorities is empty or contains invalid data format
DatabaseError – If database operation fails
Examples
>>> db = DatabaseManager() >>> with db: ... priorities = { ... "abc123": {"priority": 5, "searchTerm": "machine learning"}, ... "def456": {"priority": 4, "searchTerm": "deep learning"} ... } ... count = db.donate_validation_data(priorities) ... print(f"Donated {count} papers")
- query(sql, parameters=())[source]
Execute a SQL query and return results.
Note: This method provides backward compatibility with raw SQL queries. For new code, prefer using SQLAlchemy ORM methods.
- Parameters:
- Returns:
Query results as list of dictionaries.
- Return type:
- Raises:
DatabaseError – If query execution fails.
Examples
>>> db = DatabaseManager() >>> with db: ... results = db.query("SELECT * FROM papers WHERE session = ?", ("Poster",)) >>> for row in results: ... print(row['title'])
- get_paper_count()[source]
Get the total number of papers in the database.
- Returns:
Number of papers.
- Return type:
- Raises:
DatabaseError – If query fails.
- search_papers(keyword=None, session=None, sessions=None, year=None, years=None, conference=None, conferences=None, limit=100)[source]
Search for papers by various criteria (lightweight schema).
- Parameters:
keyword (str, optional) – Keyword to search in title, abstract, or keywords fields.
session (str, optional) – Single session to filter by (deprecated, use sessions instead).
sessions (list[str], optional) – List of sessions to filter by (matches ANY).
year (int, optional) – Single year to filter by (deprecated, use years instead).
years (list[int], optional) – List of years to filter by (matches ANY).
conference (str, optional) – Single conference to filter by (deprecated, use conferences instead).
conferences (list[str], optional) – List of conferences to filter by (matches ANY).
limit (int, default=100) – Maximum number of results to return.
- Returns:
Matching papers as dictionaries.
- Return type:
- Raises:
DatabaseError – If search fails.
Examples
>>> db = DatabaseManager("neurips.db") >>> with db: ... papers = db.search_papers(keyword="neural network", limit=10) >>> for paper in papers: ... print(paper['title'])
>>> # Search with multiple sessions >>> papers = db.search_papers(sessions=["Session 1", "Session 2"])
>>> # Search with years >>> papers = db.search_papers(years=[2024, 2025])
- search_papers_keyword(query, limit=10, sessions=None, years=None, conferences=None)[source]
Perform keyword-based search with filtering and author parsing.
This is a convenience method that wraps search_papers and formats the results for web API consumption, including author parsing.
- Parameters:
- Returns:
List of paper dictionaries with parsed authors
- Return type:
Examples
>>> papers = db.search_papers_keyword( ... "neural networks", ... limit=5, ... years=[2024, 2025] ... )
- get_stats(year=None, conference=None)[source]
Get database statistics, optionally filtered by year and conference.
- Parameters:
- Returns:
Statistics dictionary with: - total_papers: int - Number of papers matching filters - year: int or None - Filter year if provided - conference: str or None - Filter conference if provided
- Return type:
Examples
>>> stats = db.get_stats() >>> print(f"Total papers: {stats['total_papers']}")
>>> stats_2024 = db.get_stats(year=2024) >>> print(f"Papers in 2024: {stats_2024['total_papers']}")
- search_authors_in_papers(name=None, limit=100)[source]
Search for authors by name within the papers’ authors field.
- Parameters:
- Returns:
Unique authors found in papers with fields: name.
- Return type:
- Raises:
DatabaseError – If search fails.
Examples
>>> db = DatabaseManager() >>> with db: ... authors = db.search_authors_in_papers(name="Huang") >>> for author in authors: ... print(author['name'])
- get_author_count()[source]
Get the approximate number of unique authors in the database.
Note: This provides an estimate by counting unique author names across all papers. The actual count may vary.
- Returns:
Approximate number of unique authors.
- Return type:
- Raises:
DatabaseError – If query fails.
- get_filter_options(year=None, conference=None)[source]
Get distinct values for filterable fields (lightweight schema).
Returns a dictionary with lists of distinct values for session, year, and conference fields that can be used to populate filter dropdowns. Optionally filters by year and/or conference.
- Parameters:
- Returns:
Dictionary with keys ‘sessions’, ‘years’, ‘conferences’ containing lists of distinct non-null values sorted alphabetically (or numerically for years).
- Return type:
- Raises:
DatabaseError – If query fails.
Examples
>>> db = DatabaseManager() >>> with db: ... filters = db.get_filter_options() >>> print(filters['sessions']) ['Session 1', 'Session 2', ...] >>> print(filters['years']) [2023, 2024, 2025] >>> # Get filters for specific year >>> filters = db.get_filter_options(year=2025)
- get_embedding_model()[source]
Get the embedding model used for the current embeddings.
- Returns:
Name of the embedding model, or None if not set.
- Return type:
str or None
- Raises:
DatabaseError – If query fails.
Examples
>>> db = DatabaseManager() >>> with db: ... model = db.get_embedding_model() >>> print(model) 'text-embedding-qwen3-embedding-4b'
- set_embedding_model(model_name)[source]
Set the embedding model used for embeddings.
This stores or updates the embedding model metadata. If a record exists, it updates the model and timestamp. Otherwise, it creates a new record.
- Parameters:
model_name (str) – Name of the embedding model.
- Raises:
DatabaseError – If update fails.
- Return type:
Examples
>>> db = DatabaseManager() >>> with db: ... db.set_embedding_model("text-embedding-qwen3-embedding-4b")
- get_clustering_cache(embedding_model, reduction_method, n_components, clustering_method, n_clusters=None, clustering_params=None)[source]
Get cached clustering results matching the parameters.
- Parameters:
embedding_model (str) – Name of the embedding model.
reduction_method (str) – Dimensionality reduction method.
n_components (int) – Number of components after reduction.
clustering_method (str) – Clustering algorithm used.
n_clusters (int, optional) – Number of clusters (for kmeans/agglomerative).
clustering_params (dict, optional) – Additional clustering parameters (e.g., distance_threshold, eps).
- Returns:
Cached clustering results as dictionary, or None if not found.
- Return type:
dict or None
- Raises:
DatabaseError – If query fails.
- save_clustering_cache(embedding_model, reduction_method, n_components, clustering_method, results, n_clusters=None, clustering_params=None)[source]
Save clustering results to cache.
- Parameters:
embedding_model (str) – Name of the embedding model.
reduction_method (str) – Dimensionality reduction method.
n_components (int) – Number of components after reduction.
clustering_method (str) – Clustering algorithm used.
results (dict) – Clustering results to cache.
n_clusters (int, optional) – Number of clusters (for kmeans/agglomerative).
clustering_params (dict, optional) – Additional clustering parameters.
- Raises:
DatabaseError – If save fails.
- Return type:
- clear_clustering_cache(embedding_model=None)[source]
Clear clustering cache, optionally filtered by embedding model.
This is useful when embeddings change or cache becomes stale.
- Parameters:
embedding_model (str, optional) – If provided, only clear cache for this embedding model. If None, clear all cache entries.
- Returns:
Number of cache entries deleted.
- Return type:
- Raises:
DatabaseError – If deletion fails.
Usage Examples
Basic Operations
from abstracts_explorer.database import DatabaseManager
# Initialize database
db = DatabaseManager()
# Add a paper
paper_id = db.add_paper({
'openreview_id': 'abc123',
'title': 'Example Paper',
'abstract': 'This is an example abstract.',
'year': 2025,
'pdf_url': 'https://example.com/paper.pdf'
})
# Add authors
db.add_author(paper_id, 'John Doe', 0)
db.add_author(paper_id, 'Jane Smith', 1)
Searching Papers
# Search by title
results = db.search_papers(title="transformer")
# Search by abstract content
results = db.search_papers(abstract="attention mechanism")
# Filter by year
results = db.get_papers_by_year(2025)
# Get papers by author
results = db.get_papers_by_author("John Doe")
Retrieving Data
# Get all papers
all_papers = db.get_all_papers()
# Get specific paper
paper = db.get_paper_by_id(paper_id)
# Get authors for a paper
authors = db.get_authors_for_paper(paper_id)
Database Schema
papers Table
Column |
Type |
Description |
|---|---|---|
id |
INTEGER |
Primary key (auto-increment) |
openreview_id |
TEXT |
Unique OpenReview ID |
title |
TEXT |
Paper title |
abstract |
TEXT |
Paper abstract |
year |
INTEGER |
Conference year |
pdf_url |
TEXT |
URL to PDF |
created_at |
TIMESTAMP |
Creation timestamp |
Error Handling
The database module raises standard SQLite exceptions. Wrap operations in try-except blocks:
try:
db.add_paper(paper_data)
except sqlite3.IntegrityError:
print("Paper already exists")
except Exception as e:
print(f"Database error: {e}")