Database Module

The database module provides an interface for storing and retrieving conference papers in SQLite.

Overview

The DatabaseManager class handles all database operations including:

Creating and managing database schema
Adding papers and authors
Searching and filtering papers
Managing paper-author relationships

Class Reference

Database Module

This module provides functionality to load JSON data into a SQL database. Supports both SQLite and PostgreSQL backends via SQLAlchemy.

exception abstracts_explorer.database.DatabaseError[source]

Bases: Exception

Exception raised for database operations.

class abstracts_explorer.database.DatabaseManager[source]

Bases: object

Manager for SQL database operations using SQLAlchemy.

Supports SQLite and PostgreSQL backends through SQLAlchemy connection URLs. Database configuration is read from the config file (PAPER_DB variable).

database_url

SQLAlchemy database URL from configuration.

Type:: str

engine

SQLAlchemy engine instance.

Type:: Engine or None

SessionLocal

SQLAlchemy session factory.

Type:: sessionmaker or None

_session

Active database session if connected.

Type:: Session or None

Examples

>>> # Database configuration comes from config file
>>> db = DatabaseManager()
>>> db.connect()
>>> db.create_tables()
>>> db.close()

__init__()[source]

Initialize the DatabaseManager.

Reads database configuration from the config file.

connect()[source]

Connect to the database.

Creates the database file if it doesn’t exist (SQLite only).

Raises:: DatabaseError – If connection fails.
Return type:: None

close()[source]

Close the database connection.

Does nothing if not connected.

Return type:: None

__enter__()[source]: Context manager entry.

__exit__(exc_type, exc_val, exc_tb)[source]: Context manager exit.

create_tables()[source]

Create database tables for papers and embeddings metadata.

Creates the following tables: - papers: Main table for paper information with lightweight ML4PS schema - embeddings_metadata: Metadata about embeddings (model used, creation date) - clustering_cache: Cache for clustering results

This method is idempotent - it can be called multiple times without error. Tables are only created if they don’t already exist.

Raises:: DatabaseError – If table creation fails.
Return type:: None

add_paper(paper)[source]

Add a single paper to the database.

Parameters:: paper (LightweightPaper) – Validated paper object to insert.
Returns:: The UID of the inserted paper, or None if paper was skipped (duplicate).
Return type:: str or None
Raises:: DatabaseError – If insertion fails.

Examples

>>> from abstracts_explorer.plugin import LightweightPaper
>>> db = DatabaseManager()
>>> with db:
...     db.create_tables()
...     paper = LightweightPaper(
...         title="Test Paper",
...         authors=["John Doe"],
...         abstract="Test abstract",
...         session="Session 1",
...         poster_position="P1",
...         year=2025,
...         conference="NeurIPS"
...     )
...     paper_uid = db.add_paper(paper)
>>> print(f"Inserted paper with UID: {paper_uid}")

add_papers(papers)[source]

Add multiple papers to the database in a batch.

Parameters:: papers (list of LightweightPaper) – List of validated paper objects to insert.
Returns:: Number of papers successfully inserted (excludes duplicates).
Return type:: int
Raises:: DatabaseError – If batch insertion fails.

Examples

>>> from abstracts_explorer.plugin import LightweightPaper
>>> db = DatabaseManager()
>>> with db:
...     db.create_tables()
...     papers = [
...         LightweightPaper(
...             title="Paper 1",
...             authors=["Author 1"],
...             abstract="Abstract 1",
...             session="Session 1",
...             poster_position="P1",
...             year=2025,
...             conference="NeurIPS"
...         ),
...         LightweightPaper(
...             title="Paper 2",
...             authors=["Author 2"],
...             abstract="Abstract 2",
...             session="Session 2",
...             poster_position="P2",
...             year=2025,
...             conference="NeurIPS"
...         )
...     ]
...     count = db.add_papers(papers)
>>> print(f"Inserted {count} papers")

donate_validation_data(paper_priorities)[source]

Store donated paper rating data for validation purposes.

This method accepts anonymized paper ratings from users and stores them in the validation_data table for improving the service.

Parameters:

paper_priorities (Dict[str, Dict[str, Any]]) – Dictionary mapping paper UIDs to priority data. Each priority data dict must contain: - priority (int): Rating value - searchTerm (str, optional): Search term associated with the rating

Returns:

Number of papers successfully donated

Return type:

int

Raises:

ValueError – If paper_priorities is empty or contains invalid data format
DatabaseError – If database operation fails

Examples

>>> db = DatabaseManager()
>>> with db:
...     priorities = {
...         "abc123": {"priority": 5, "searchTerm": "machine learning"},
...         "def456": {"priority": 4, "searchTerm": "deep learning"}
...     }
...     count = db.donate_validation_data(priorities)
...     print(f"Donated {count} papers")

query(sql, parameters=())[source]

Execute a SQL query and return results.

Note: This method provides backward compatibility with raw SQL queries. For new code, prefer using SQLAlchemy ORM methods.

Parameters:

sql (str) – SQL query to execute (use named parameters like :param1, :param2).
parameters (tuple, optional) – Query parameters for parameterized queries.

Returns:

Query results as list of dictionaries.

Return type:

list of dict

Raises:

DatabaseError – If query execution fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     results = db.query("SELECT * FROM papers WHERE session = ?", ("Poster",))
>>> for row in results:
...     print(row['title'])

get_paper_count()[source]

Get the total number of papers in the database.

Returns:: Number of papers.
Return type:: int
Raises:: DatabaseError – If query fails.

search_papers(keyword=None, session=None, sessions=None, year=None, years=None, conference=None, conferences=None, limit=100)[source]

Search for papers by various criteria (lightweight schema).

Parameters:

keyword (str, optional) – Keyword to search in title, abstract, or keywords fields.
session (str, optional) – Single session to filter by (deprecated, use sessions instead).
sessions (list[str], optional) – List of sessions to filter by (matches ANY).
year (int, optional) – Single year to filter by (deprecated, use years instead).
years (list[int], optional) – List of years to filter by (matches ANY).
conference (str, optional) – Single conference to filter by (deprecated, use conferences instead).
conferences (list[str], optional) – List of conferences to filter by (matches ANY).
limit (int, default=100) – Maximum number of results to return.

Returns:

Matching papers as dictionaries.

Return type:

list of dict

Raises:

DatabaseError – If search fails.

Examples

>>> db = DatabaseManager("neurips.db")
>>> with db:
...     papers = db.search_papers(keyword="neural network", limit=10)
>>> for paper in papers:
...     print(paper['title'])

>>> # Search with multiple sessions
>>> papers = db.search_papers(sessions=["Session 1", "Session 2"])

>>> # Search with years
>>> papers = db.search_papers(years=[2024, 2025])

search_papers_keyword(query, limit=10, sessions=None, years=None, conferences=None)[source]

Perform keyword-based search with filtering and author parsing.

This is a convenience method that wraps search_papers and formats the results for web API consumption, including author parsing.

Parameters:

query (str) – Keyword to search in title, abstract, or keywords fields
limit (int, optional) – Maximum number of results, by default 10
sessions (list of str, optional) – Filter by paper sessions
years (list of int, optional) – Filter by publication years
conferences (list of str, optional) – Filter by conference names

Returns:

List of paper dictionaries with parsed authors

Return type:

list of dict

Examples

>>> papers = db.search_papers_keyword(
...     "neural networks",
...     limit=5,
...     years=[2024, 2025]
... )

get_stats(year=None, conference=None)[source]

Get database statistics, optionally filtered by year and conference.

Parameters:

year (int, optional) – Filter by specific year
conference (str, optional) – Filter by specific conference

Returns:

Statistics dictionary with: - total_papers: int - Number of papers matching filters - year: int or None - Filter year if provided - conference: str or None - Filter conference if provided

Return type:

dict

Examples

>>> stats = db.get_stats()
>>> print(f"Total papers: {stats['total_papers']}")

>>> stats_2024 = db.get_stats(year=2024)
>>> print(f"Papers in 2024: {stats_2024['total_papers']}")

search_authors_in_papers(name=None, limit=100)[source]

Search for authors by name within the papers’ authors field.

Parameters:

name (str, optional) – Name to search for (partial match).
limit (int, default=100) – Maximum number of results to return.

Returns:

Unique authors found in papers with fields: name.

Return type:

list of dict

Raises:

DatabaseError – If search fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     authors = db.search_authors_in_papers(name="Huang")
>>> for author in authors:
...     print(author['name'])

get_author_count()[source]

Get the approximate number of unique authors in the database.

Note: This provides an estimate by counting unique author names across all papers. The actual count may vary.

Returns:: Approximate number of unique authors.
Return type:: int
Raises:: DatabaseError – If query fails.

get_filter_options(year=None, conference=None)[source]

Get distinct values for filterable fields (lightweight schema).

Returns a dictionary with lists of distinct values for session, year, and conference fields that can be used to populate filter dropdowns. Optionally filters by year and/or conference.

Parameters:

year (int, optional) – Filter results to only show options for this year
conference (str, optional) – Filter results to only show options for this conference

Returns:

Dictionary with keys ‘sessions’, ‘years’, ‘conferences’ containing lists of distinct non-null values sorted alphabetically (or numerically for years).

Return type:

dict

Raises:

DatabaseError – If query fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     filters = db.get_filter_options()
>>> print(filters['sessions'])
['Session 1', 'Session 2', ...]
>>> print(filters['years'])
[2023, 2024, 2025]
>>> # Get filters for specific year
>>> filters = db.get_filter_options(year=2025)

get_embedding_model()[source]

Get the embedding model used for the current embeddings.

Returns:: Name of the embedding model, or None if not set.
Return type:: str or None
Raises:: DatabaseError – If query fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     model = db.get_embedding_model()
>>> print(model)
'text-embedding-qwen3-embedding-4b'

set_embedding_model(model_name)[source]

Set the embedding model used for embeddings.

This stores or updates the embedding model metadata. If a record exists, it updates the model and timestamp. Otherwise, it creates a new record.

Parameters:: model_name (str) – Name of the embedding model.
Raises:: DatabaseError – If update fails.
Return type:: None

Examples

>>> db = DatabaseManager()
>>> with db:
...     db.set_embedding_model("text-embedding-qwen3-embedding-4b")

get_clustering_cache(embedding_model, reduction_method, n_components, clustering_method, n_clusters=None, clustering_params=None)[source]

Get cached clustering results matching the parameters.

Parameters:

embedding_model (str) – Name of the embedding model.
reduction_method (str) – Dimensionality reduction method.
n_components (int) – Number of components after reduction.
clustering_method (str) – Clustering algorithm used.
n_clusters (int, optional) – Number of clusters (for kmeans/agglomerative).
clustering_params (dict, optional) – Additional clustering parameters (e.g., distance_threshold, eps).

Returns:

Cached clustering results as dictionary, or None if not found.

Return type:

dict or None

Raises:

DatabaseError – If query fails.

save_clustering_cache(embedding_model, reduction_method, n_components, clustering_method, results, n_clusters=None, clustering_params=None)[source]

Save clustering results to cache.

Parameters:

embedding_model (str) – Name of the embedding model.
reduction_method (str) – Dimensionality reduction method.
n_components (int) – Number of components after reduction.
clustering_method (str) – Clustering algorithm used.
results (dict) – Clustering results to cache.
n_clusters (int, optional) – Number of clusters (for kmeans/agglomerative).
clustering_params (dict, optional) – Additional clustering parameters.

Raises:

DatabaseError – If save fails.

Return type:

None

clear_clustering_cache(embedding_model=None)[source]

Clear clustering cache, optionally filtered by embedding model.

This is useful when embeddings change or cache becomes stale.

Parameters:: embedding_model (str, optional) – If provided, only clear cache for this embedding model. If None, clear all cache entries.
Returns:: Number of cache entries deleted.
Return type:: int
Raises:: DatabaseError – If deletion fails.

Usage Examples

Basic Operations

from abstracts_explorer.database import DatabaseManager

# Initialize database
db = DatabaseManager()

# Add a paper
paper_id = db.add_paper({
    'openreview_id': 'abc123',
    'title': 'Example Paper',
    'abstract': 'This is an example abstract.',
    'year': 2025,
    'pdf_url': 'https://example.com/paper.pdf'
})

# Add authors
db.add_author(paper_id, 'John Doe', 0)
db.add_author(paper_id, 'Jane Smith', 1)

Searching Papers

# Search by title
results = db.search_papers(title="transformer")

# Search by abstract content
results = db.search_papers(abstract="attention mechanism")

# Filter by year
results = db.get_papers_by_year(2025)

# Get papers by author
results = db.get_papers_by_author("John Doe")

Retrieving Data

# Get all papers
all_papers = db.get_all_papers()

# Get specific paper
paper = db.get_paper_by_id(paper_id)

# Get authors for a paper
authors = db.get_authors_for_paper(paper_id)

Database Schema

papers Table

Column	Type	Description
id	INTEGER	Primary key (auto-increment)
openreview_id	TEXT	Unique OpenReview ID
title	TEXT	Paper title
abstract	TEXT	Paper abstract
year	INTEGER	Conference year
pdf_url	TEXT	URL to PDF
created_at	TIMESTAMP	Creation timestamp

authors Table

Column	Type	Description
id	INTEGER	Primary key (auto-increment)
paper_id	INTEGER	Foreign key to papers
name	TEXT	Author name
position	INTEGER	Author position in list

Error Handling

The database module raises standard SQLite exceptions. Wrap operations in try-except blocks:

try:
    db.add_paper(paper_data)
except sqlite3.IntegrityError:
    print("Paper already exists")
except Exception as e:
    print(f"Database error: {e}")