Database Module

The database module provides an interface for storing and retrieving conference papers in a SQL database, supporting SQLite and PostgreSQL backends via SQLAlchemy.

Overview

The DatabaseManager class handles all database operations, including the following (a minimal end-to-end sketch follows this list):

  • Creating and managing the database schema

  • Adding papers (author lists are stored as a field of each paper)

  • Searching and filtering papers, including author search within papers

  • Managing embeddings metadata and the clustering cache
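
A minimal end-to-end sketch (fuller examples appear under Usage Examples below):

from abstracts_explorer.database import DatabaseManager

db = DatabaseManager()  # configuration is read from the config file (PAPER_DB)
with db:                # connects on entry, closes on exit
    db.create_tables()  # idempotent; creates tables only if they don't exist
    print(db.get_paper_count())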

Class Reference

This module provides functionality to load JSON data into a SQL database. Supports both SQLite and PostgreSQL backends via SQLAlchemy.

exception abstracts_explorer.database.DatabaseError[source]

Bases: Exception

Exception raised for database operations.

class abstracts_explorer.database.DatabaseManager[source]

Bases: object

Manager for SQL database operations using SQLAlchemy.

Supports SQLite and PostgreSQL backends through SQLAlchemy connection URLs. Database configuration is read from the config file (PAPER_DB variable).
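
The config file format itself is not documented here; as a sketch, assuming PAPER_DB simply holds a SQLAlchemy connection URL, the two supported backends would be configured with values like these:

# Hypothetical PAPER_DB values (standard SQLAlchemy URL formats):
PAPER_DB = "sqlite:///abstracts.db"                              # file-backed SQLite
# PAPER_DB = "postgresql://user:password@localhost:5432/papers"  # PostgreSQL server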

database_url

SQLAlchemy database URL from configuration.

Type:

str

engine

SQLAlchemy engine instance.

Type:

Engine or None

SessionLocal

SQLAlchemy session factory.

Type:

sessionmaker or None

_session

Active database session if connected.

Type:

Session or None

Examples

>>> # Database configuration comes from config file
>>> db = DatabaseManager()
>>> db.connect()
>>> db.create_tables()
>>> db.close()
__init__()[source]

Initialize the DatabaseManager.

Reads database configuration from the config file.

connect()[source]

Connect to the database.

Creates the database file if it doesn’t exist (SQLite only).

Raises:

DatabaseError – If connection fails.

Return type:

None

close()[source]

Close the database connection.

Does nothing if not connected.

Return type:

None

__enter__()[source]

Context manager entry.

__exit__(exc_type, exc_val, exc_tb)[source]

Context manager exit.
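
Examples

A minimal sketch of the context-manager protocol, consistent with the other examples in this reference (entering the context connects, exiting closes the connection):

>>> db = DatabaseManager()
>>> with db:
...     db.create_tables()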

create_tables()[source]

Create database tables for papers and embeddings metadata.

Creates the following tables:

  • papers: Main table for paper information with the lightweight ML4PS schema

  • embeddings_metadata: Metadata about embeddings (model used, creation date)

  • clustering_cache: Cache for clustering results

This method is idempotent - it can be called multiple times without error. Tables are only created if they don’t already exist.

Raises:

DatabaseError – If table creation fails.

Return type:

None

add_paper(paper)[source]

Add a single paper to the database.

Parameters:

paper (LightweightPaper) – Validated paper object to insert.

Returns:

The UID of the inserted paper, or None if the paper was skipped as a duplicate.

Return type:

str or None

Raises:

DatabaseError – If insertion fails.

Examples

>>> from abstracts_explorer.plugin import LightweightPaper
>>> db = DatabaseManager()
>>> with db:
...     db.create_tables()
...     paper = LightweightPaper(
...         title="Test Paper",
...         authors=["John Doe"],
...         abstract="Test abstract",
...         session="Session 1",
...         poster_position="P1",
...         year=2025,
...         conference="NeurIPS"
...     )
...     paper_uid = db.add_paper(paper)
>>> print(f"Inserted paper with UID: {paper_uid}")
add_papers(papers)[source]

Add multiple papers to the database in a batch.

Parameters:

papers (list of LightweightPaper) – List of validated paper objects to insert.

Returns:

Number of papers successfully inserted (excludes duplicates).

Return type:

int

Raises:

DatabaseError – If batch insertion fails.

Examples

>>> from abstracts_explorer.plugin import LightweightPaper
>>> db = DatabaseManager()
>>> with db:
...     db.create_tables()
...     papers = [
...         LightweightPaper(
...             title="Paper 1",
...             authors=["Author 1"],
...             abstract="Abstract 1",
...             session="Session 1",
...             poster_position="P1",
...             year=2025,
...             conference="NeurIPS"
...         ),
...         LightweightPaper(
...             title="Paper 2",
...             authors=["Author 2"],
...             abstract="Abstract 2",
...             session="Session 2",
...             poster_position="P2",
...             year=2025,
...             conference="NeurIPS"
...         )
...     ]
...     count = db.add_papers(papers)
>>> print(f"Inserted {count} papers")
donate_validation_data(paper_priorities)[source]

Store donated paper rating data for validation purposes.

This method accepts anonymized paper ratings from users and stores them in the validation_data table for improving the service.

Parameters:

paper_priorities (Dict[str, Dict[str, Any]]) – Dictionary mapping paper UIDs to priority data. Each priority data dict must contain:

  • priority (int): Rating value

  • searchTerm (str, optional): Search term associated with the rating

Returns:

Number of papers successfully donated

Return type:

int

Raises:
  • ValueError – If paper_priorities is empty or contains invalid data format

  • DatabaseError – If database operation fails

Examples

>>> db = DatabaseManager()
>>> with db:
...     priorities = {
...         "abc123": {"priority": 5, "searchTerm": "machine learning"},
...         "def456": {"priority": 4, "searchTerm": "deep learning"}
...     }
...     count = db.donate_validation_data(priorities)
...     print(f"Donated {count} papers")
query(sql, parameters=())[source]

Execute a SQL query and return results.

Note: This method provides backward compatibility with raw SQL queries. For new code, prefer using SQLAlchemy ORM methods.

Parameters:
  • sql (str) – SQL query to execute (use named parameters like :param1, :param2).

  • parameters (tuple, optional) – Query parameters for parameterized queries.

Returns:

Query results as list of dictionaries.

Return type:

list of dict

Raises:

DatabaseError – If query execution fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     results = db.query("SELECT * FROM papers WHERE session = :param1", ("Poster",))
>>> for row in results:
...     print(row['title'])
get_paper_count()[source]

Get the total number of papers in the database.

Returns:

Number of papers.

Return type:

int

Raises:

DatabaseError – If query fails.
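
Examples

A minimal usage sketch, following the pattern of the other examples in this reference:

>>> db = DatabaseManager()
>>> with db:
...     count = db.get_paper_count()
>>> print(f"Total papers: {count}")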

search_papers(keyword=None, session=None, sessions=None, year=None, years=None, conference=None, conferences=None, limit=100)[source]

Search for papers by various criteria (lightweight schema).

Parameters:
  • keyword (str, optional) – Keyword to search in title, abstract, or keywords fields.

  • session (str, optional) – Single session to filter by (deprecated, use sessions instead).

  • sessions (list[str], optional) – List of sessions to filter by (matches ANY).

  • year (int, optional) – Single year to filter by (deprecated, use years instead).

  • years (list[int], optional) – List of years to filter by (matches ANY).

  • conference (str, optional) – Single conference to filter by (deprecated, use conferences instead).

  • conferences (list[str], optional) – List of conferences to filter by (matches ANY).

  • limit (int, default=100) – Maximum number of results to return.

Returns:

Matching papers as dictionaries.

Return type:

list of dict

Raises:

DatabaseError – If search fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     papers = db.search_papers(keyword="neural network", limit=10)
...     # Search with multiple sessions
...     papers = db.search_papers(sessions=["Session 1", "Session 2"])
...     # Search with years
...     papers = db.search_papers(years=[2024, 2025])
>>> for paper in papers:
...     print(paper['title'])
search_papers_keyword(query, limit=10, sessions=None, years=None, conferences=None)[source]

Perform keyword-based search with filtering and author parsing.

This is a convenience method that wraps search_papers and formats the results for web API consumption, including author parsing.

Parameters:
  • query (str) – Keyword to search in title, abstract, or keywords fields

  • limit (int, optional) – Maximum number of results, by default 10

  • sessions (list of str, optional) – Filter by paper sessions

  • years (list of int, optional) – Filter by publication years

  • conferences (list of str, optional) – Filter by conference names

Returns:

List of paper dictionaries with parsed authors

Return type:

list of dict

Examples

>>> papers = db.search_papers_keyword(
...     "neural networks",
...     limit=5,
...     years=[2024, 2025]
... )
get_stats(year=None, conference=None)[source]

Get database statistics, optionally filtered by year and conference.

Parameters:
  • year (int, optional) – Filter by specific year

  • conference (str, optional) – Filter by specific conference

Returns:

Statistics dictionary with:

  • total_papers (int): Number of papers matching filters

  • year (int or None): Filter year if provided

  • conference (str or None): Filter conference if provided

Return type:

dict

Examples

>>> stats = db.get_stats()
>>> print(f"Total papers: {stats['total_papers']}")
>>> stats_2024 = db.get_stats(year=2024)
>>> print(f"Papers in 2024: {stats_2024['total_papers']}")
search_authors_in_papers(name=None, limit=100)[source]

Search for authors by name within the papers’ authors field.

Parameters:
  • name (str, optional) – Name to search for (partial match).

  • limit (int, default=100) – Maximum number of results to return.

Returns:

Unique authors found in papers. Each result dict contains a single field, name.

Return type:

list of dict

Raises:

DatabaseError – If search fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     authors = db.search_authors_in_papers(name="Huang")
>>> for author in authors:
...     print(author['name'])
get_author_count()[source]

Get the approximate number of unique authors in the database.

Note: This provides an estimate by counting unique author names across all papers. The actual count may vary.

Returns:

Approximate number of unique authors.

Return type:

int

Raises:

DatabaseError – If query fails.
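
Examples

A minimal usage sketch, following the pattern of the other examples in this reference:

>>> db = DatabaseManager()
>>> with db:
...     approx = db.get_author_count()
>>> print(f"Approximately {approx} unique authors")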

get_filter_options(year=None, conference=None)[source]

Get distinct values for filterable fields (lightweight schema).

Returns a dictionary with lists of distinct values for session, year, and conference fields that can be used to populate filter dropdowns. Optionally filters by year and/or conference.

Parameters:
  • year (int, optional) – Filter results to only show options for this year

  • conference (str, optional) – Filter results to only show options for this conference

Returns:

Dictionary with keys ‘sessions’, ‘years’, ‘conferences’ containing lists of distinct non-null values sorted alphabetically (or numerically for years).

Return type:

dict

Raises:

DatabaseError – If query fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     filters = db.get_filter_options()
>>> print(filters['sessions'])
['Session 1', 'Session 2', ...]
>>> print(filters['years'])
[2023, 2024, 2025]
>>> # Get filters for specific year
>>> filters = db.get_filter_options(year=2025)
get_embedding_model()[source]

Get the embedding model used for the current embeddings.

Returns:

Name of the embedding model, or None if not set.

Return type:

str or None

Raises:

DatabaseError – If query fails.

Examples

>>> db = DatabaseManager()
>>> with db:
...     model = db.get_embedding_model()
>>> print(model)
'text-embedding-qwen3-embedding-4b'
set_embedding_model(model_name)[source]

Set the embedding model used for embeddings.

This stores or updates the embedding model metadata. If a record exists, it updates the model and timestamp. Otherwise, it creates a new record.

Parameters:

model_name (str) – Name of the embedding model.

Raises:

DatabaseError – If update fails.

Return type:

None

Examples

>>> db = DatabaseManager()
>>> with db:
...     db.set_embedding_model("text-embedding-qwen3-embedding-4b")
get_clustering_cache(embedding_model, reduction_method, n_components, clustering_method, n_clusters=None, clustering_params=None)[source]

Get cached clustering results matching the parameters.

Parameters:
  • embedding_model (str) – Name of the embedding model.

  • reduction_method (str) – Dimensionality reduction method.

  • n_components (int) – Number of components after reduction.

  • clustering_method (str) – Clustering algorithm used.

  • n_clusters (int, optional) – Number of clusters (for kmeans/agglomerative).

  • clustering_params (dict, optional) – Additional clustering parameters (e.g., distance_threshold, eps).

Returns:

Cached clustering results as dictionary, or None if not found.

Return type:

dict or None

Raises:

DatabaseError – If query fails.

save_clustering_cache(embedding_model, reduction_method, n_components, clustering_method, results, n_clusters=None, clustering_params=None)[source]

Save clustering results to cache.

Parameters:
  • embedding_model (str) – Name of the embedding model.

  • reduction_method (str) – Dimensionality reduction method.

  • n_components (int) – Number of components after reduction.

  • clustering_method (str) – Clustering algorithm used.

  • results (dict) – Clustering results to cache.

  • n_clusters (int, optional) – Number of clusters (for kmeans/agglomerative).

  • clustering_params (dict, optional) – Additional clustering parameters.

Raises:

DatabaseError – If save fails.

Return type:

None

clear_clustering_cache(embedding_model=None)[source]

Clear clustering cache, optionally filtered by embedding model.

This is useful when the embeddings change or the cache becomes stale.

Parameters:

embedding_model (str, optional) – If provided, only clear cache for this embedding model. If None, clear all cache entries.

Returns:

Number of cache entries deleted.

Return type:

int

Raises:

DatabaseError – If deletion fails.
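
Examples

A sketch of the cache round-trip using the documented parameters; the parameter values and the shape of the results dict are illustrative assumptions, not part of the API:

>>> db = DatabaseManager()
>>> with db:
...     params = dict(
...         embedding_model="text-embedding-qwen3-embedding-4b",
...         reduction_method="umap",
...         n_components=2,
...         clustering_method="kmeans",
...         n_clusters=10,
...     )
...     cached = db.get_clustering_cache(**params)
...     if cached is None:
...         results = {"labels": [0, 1, 0]}  # placeholder clustering output
...         db.save_clustering_cache(results=results, **params)
...     # Drop stale entries when the embeddings change:
...     deleted = db.clear_clustering_cache(embedding_model=params["embedding_model"])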

Usage Examples

Basic Operations

from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.plugin import LightweightPaper

# Initialize database (configuration is read from the config file)
db = DatabaseManager()

with db:
    # Create the tables if they don't exist (idempotent)
    db.create_tables()

    # Add a paper; the author list is stored as a field of the paper
    paper = LightweightPaper(
        title="Example Paper",
        authors=["John Doe", "Jane Smith"],
        abstract="This is an example abstract.",
        session="Session 1",
        poster_position="P1",
        year=2025,
        conference="NeurIPS",
    )
    paper_uid = db.add_paper(paper)  # returns None if the paper is a duplicate

Searching Papers

with db:
    # Keyword search across title, abstract, and keywords
    results = db.search_papers(keyword="transformer", limit=10)

    # Filter by sessions, years, or conferences
    results = db.search_papers(sessions=["Session 1"], years=[2025])

    # Keyword search formatted for the web API (authors parsed into lists)
    results = db.search_papers_keyword("attention mechanism", limit=5)

    # Find authors by (partial) name within papers
    authors = db.search_authors_in_papers(name="John Doe")

Retrieving Data

with db:
    # Count papers and (approximately) unique authors
    n_papers = db.get_paper_count()
    n_authors = db.get_author_count()

    # Database statistics and available filter values
    stats = db.get_stats(year=2025)
    filters = db.get_filter_options()

    # Raw SQL for anything not covered above (prefer the ORM methods for new code)
    rows = db.query("SELECT title FROM papers LIMIT 5")

Database Schema

create_tables() creates a lightweight schema with three tables. The summary below is drawn from the class reference above; consult the source for exact column definitions.

papers Table

Main table for paper information (lightweight ML4PS schema). Its columns follow the LightweightPaper fields (title, authors, abstract, session, poster_position, year, conference) plus a unique paper UID (returned by add_paper()) and the keywords field used by keyword search.

embeddings_metadata Table

Metadata about the stored embeddings: the embedding model name (see get_embedding_model() / set_embedding_model()) and the creation date.

clustering_cache Table

Cached clustering results, keyed by embedding model, reduction method, number of components, clustering method, and any additional clustering parameters (see get_clustering_cache(), save_clustering_cache(), and clear_clustering_cache()).
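
To see which tables were actually created on a given installation, a quick sketch using the query() helper (the introspection query below is SQLite-specific; a PostgreSQL deployment would query information_schema.tables instead):

with db:
    rows = db.query("SELECT name FROM sqlite_master WHERE type = 'table'")
    print([row['name'] for row in rows])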

Error Handling

The database module wraps failures in DatabaseError. Duplicate papers are not treated as errors: add_paper() skips them and returns None. Wrap operations in try-except blocks:

from abstracts_explorer.database import DatabaseError

try:
    with db:
        uid = db.add_paper(paper)
        if uid is None:
            print("Paper already exists (skipped)")
except DatabaseError as e:
    print(f"Database error: {e}")