Plugin System
Abstracts Explorer includes an extensible plugin system that allows you to download papers from multiple sources beyond the official NeurIPS conference.
Overview
The plugin system provides two APIs:
Full Schema API (DownloaderPlugin) - For complex data sources with rich metadata
Lightweight API (LightweightDownloaderPlugin) - For simple workshops and conferences
Available Plugins
neurips
Official NeurIPS conference data downloader.
uv run abstracts-explorer download --plugin neurips --year 2025 --db-path data/neurips.db
Years: 2020-2025
Source: neurips.cc
Fields: Full schema (~40+ fields)
ml4ps
ML4PS (Machine Learning for Physical Sciences) workshop downloader.
uv run abstracts-explorer download --plugin ml4ps --year 2025 --db-path data/ml4ps.db
Years: 2025
Source: ML4PS Workshop
Fields: Full schema with abstracts from NeurIPS virtual site
Using Plugins via CLI
List Available Plugins
uv run abstracts-explorer download --list-plugins
Download with a Plugin
# Basic usage
uv run abstracts-explorer download --plugin ml4ps --year 2025 --db-path data/output.db
# With options
uv run abstracts-explorer download \
    --plugin ml4ps \
    --year 2025 \
    --db-path data/ml4ps_2025.db \
    --fetch-abstracts \
    --max-workers 10
Creating Your Own Plugin
When to Use Each API
Use Full Schema API when:
Downloading from official sources with rich metadata
You need precise control over all ~40+ schema fields
Handling complex data transformations
Source provides detailed information (PDFs, posters, event times, etc.)
Use Lightweight API when:
Scraping workshops or small conferences
Only essential paper information is available
You want quick plugin development
Minimal boilerplate code
Lightweight API (Recommended)
The Lightweight API requires only 5 fields per paper; every other field is optional.
Required Fields
title (str): Paper title
authors (list): List of author names (strings or dicts with a 'name' key)
abstract (str): Paper abstract
session (str): Session/workshop name
poster_position (str): Poster identifier or position
Optional Fields
id (int): Paper ID (auto-generated if not provided)
paper_pdf_url (str): URL to paper PDF
poster_image_url (str): URL to poster image
url (str): General URL (OpenReview, ArXiv, etc.)
room_name (str): Presentation room
keywords (list): Keywords/tags
starttime (str): Start time
endtime (str): End time
award (str): Award name (e.g., "Best Paper Award", "Outstanding Paper")
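Before handing scraped papers to the converter, it can help to check each dict against the required fields listed above. The following is a minimal sketch of such a check; `check_lightweight_paper` and `REQUIRED_FIELDS` are hypothetical names for illustration and are not part of abstracts_explorer, whose own validation may behave differently.

```python
# Hypothetical pre-flight check for lightweight paper dicts.
# Not part of abstracts_explorer; the field names come from the lists above.
REQUIRED_FIELDS = ("title", "authors", "abstract", "session", "poster_position")


def check_lightweight_paper(paper):
    """Return a list of human-readable problems found in one paper dict."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in paper
    ]
    for author in paper.get("authors", []):
        # Authors may be plain strings or dicts carrying a 'name' key.
        if isinstance(author, str):
            continue
        if not (isinstance(author, dict) and "name" in author):
            problems.append(f"bad author entry: {author!r}")
    return problems


good = {
    "title": "Paper Title",
    "authors": ["John Doe", {"name": "Jane Smith"}],
    "abstract": "Paper abstract...",
    "session": "Morning Session",
    "poster_position": "A1",
}
bad = {"title": "Only a title", "authors": [{"affiliation": "Nowhere"}]}

print(check_lightweight_paper(good))  # []
for problem in check_lightweight_paper(bad):
    print(problem)
```

Running a check like this inside the scraping loop makes it easier to spot a selector that silently stopped matching.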
Example: Simple Workshop Plugin
Create my_workshop_plugin.py:
from abstracts_explorer.plugins import (
    LightweightDownloaderPlugin,
    convert_lightweight_to_neurips_schema,
    register_plugin,
)
import requests
from bs4 import BeautifulSoup


class MyWorkshopPlugin(LightweightDownloaderPlugin):
    """Downloader for My Workshop."""

    plugin_name = "myworkshop"
    plugin_description = "My Workshop downloader"
    supported_years = [2024, 2025]

    def download(self, year=None, output_path=None, force_download=False, **kwargs):
        """Download papers from My Workshop website."""
        self.validate_year(year)

        # Scrape workshop website (fail fast on timeouts and HTTP errors)
        url = f'https://myworkshop.com/{year}/papers'
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        papers = []
        for paper_elem in soup.find_all('div', class_='paper'):
            # The PDF link is optional, so look it up once and reuse it
            pdf_link = paper_elem.find('a', class_='pdf')
            paper = {
                # Required fields
                'title': paper_elem.find('h2', class_='title').text.strip(),
                'authors': [
                    a.text.strip()
                    for a in paper_elem.find_all('span', class_='author')
                ],
                'abstract': paper_elem.find('p', class_='abstract').text.strip(),
                'session': paper_elem.get('data-session', 'Main Track'),
                'poster_position': paper_elem.get('data-poster', 'TBD'),
                # Optional fields
                'paper_pdf_url': pdf_link['href'] if pdf_link else None,
                'keywords': [
                    tag.text.strip()
                    for tag in paper_elem.find_all('span', class_='tag')
                ],
            }
            papers.append(paper)

        # Convert to full NeurIPS schema
        return convert_lightweight_to_neurips_schema(
            papers,
            session_default=f'My Workshop {year}',
            event_type='Workshop Poster',
            source_url=url,
        )

    def get_metadata(self):
        """Return plugin metadata."""
        return {
            'name': self.plugin_name,
            'description': self.plugin_description,
            'supported_years': self.supported_years,
        }


# Auto-register the plugin
def _register():
    register_plugin(MyWorkshopPlugin())


_register()
Using Your Plugin
# Import to register
from my_workshop_plugin import MyWorkshopPlugin
# Or use via CLI
# uv run abstracts-explorer download --plugin myworkshop --year 2025 --db-path data/workshop.db
Full Schema API
For more complex cases where you need complete control over all fields.
Example: Advanced Plugin
from abstracts_explorer.plugins import DownloaderPlugin, register_plugin
import requests


class AdvancedPlugin(DownloaderPlugin):
    """Advanced plugin with full schema control."""

    plugin_name = "advanced"
    plugin_description = "Advanced conference downloader"
    supported_years = [2024, 2025]

    def download(self, year=None, output_path=None, force_download=False, **kwargs):
        """Download papers with full schema."""
        self.validate_year(year)

        # Your complex data fetching logic
        papers_data = self._fetch_from_api(year)

        # Build full schema
        results = []
        for raw_paper in papers_data:
            paper = {
                # Core fields
                'id': raw_paper['id'],
                'name': raw_paper['title'],
                'abstract': raw_paper['abstract'],
                # Author information
                'authors': [
                    {
                        'name': author['full_name'],
                        'institution': author.get('affiliation', ''),
                        'email': author.get('email', ''),
                    }
                    for author in raw_paper['authors']
                ],
                # Event information
                'session': raw_paper.get('session_name', f'Conference {year}'),
                'event_type': raw_paper.get('presentation_type', 'Poster'),
                'poster_position': raw_paper.get('poster_id', 'TBD'),
                'room_name': raw_paper.get('room', ''),
                # URLs and media
                'paper_pdf_url': raw_paper.get('pdf_link'),
                'poster_image_url': raw_paper.get('poster_image'),
                'openreview_url': raw_paper.get('openreview_link'),
                # Metadata
                'keywords': raw_paper.get('keywords', []),
                'tldr': raw_paper.get('summary', ''),
                # Timestamps
                'starttime': raw_paper.get('start_time'),
                'endtime': raw_paper.get('end_time'),
                # Additional fields...
            }
            results.append(paper)

        return {
            'count': len(results),
            'next': None,
            'previous': None,
            'results': results,
        }

    def get_metadata(self):
        """Return plugin metadata."""
        return {
            'name': self.plugin_name,
            'description': self.plugin_description,
            'supported_years': self.supported_years,
        }

    def _fetch_from_api(self, year):
        """Fetch data from external API."""
        response = requests.get(
            f'https://api.example.com/papers?year={year}', timeout=30
        )
        response.raise_for_status()  # Surface HTTP errors instead of parsing an error page
        return response.json()['papers']


def _register():
    register_plugin(AdvancedPlugin())


_register()
Schema Conversion
The convert_lightweight_to_neurips_schema() function automatically converts lightweight format to the full NeurIPS schema:
from abstracts_explorer.plugins import convert_lightweight_to_neurips_schema
papers = [
    {
        'title': 'Paper Title',
        'authors': ['John Doe', 'Jane Smith'],
        'abstract': 'Paper abstract...',
        'session': 'Morning Session',
        'poster_position': 'A1',
        'paper_pdf_url': 'https://example.com/paper.pdf',
    }
]
full_schema_data = convert_lightweight_to_neurips_schema(
    papers,
    session_default='Workshop 2025',
    event_type='Workshop Poster',
    source_url='https://workshop.com/2025'
)
# Result has full schema with ~40+ fields
print(full_schema_data['count']) # 1
print(full_schema_data['results'][0]['name']) # 'Paper Title'
print(full_schema_data['results'][0]['eventmedia']) # Generated from URLs
Converter Parameters
papers (list): List of lightweight format papers
session_default (str): Default session name if not provided
event_type (str): Default event type (e.g., 'Workshop Poster', 'Conference Talk')
source_url (str, optional): Source URL for reference
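To make the role of each parameter concrete, here is a deliberately simplified stand-in for the converter. It is not the library's implementation: `toy_convert` is a hypothetical name, the sequential ID numbering and the `source_url` key are assumptions, and the real function produces many more fields (including `eventmedia`). It only illustrates how `session_default` and `event_type` might be applied and what the result envelope looks like.

```python
def toy_convert(papers, session_default, event_type, source_url=None):
    """Simplified illustration only; NOT abstracts_explorer's converter."""
    results = []
    for index, paper in enumerate(papers, start=1):
        results.append({
            # Assumption: missing IDs are numbered sequentially.
            "id": paper.get("id", index),
            # The lightweight 'title' field becomes 'name' in the full schema.
            "name": paper["title"],
            "abstract": paper["abstract"],
            "authors": paper["authors"],
            # session_default fills in when a paper carries no session.
            "session": paper.get("session") or session_default,
            # event_type is applied to every paper.
            "event_type": event_type,
            "poster_position": paper["poster_position"],
            # Assumption: the key name used for the source URL is made up here.
            "source_url": source_url,
        })
    # Same envelope as the Full Schema example: count/next/previous/results.
    return {"count": len(results), "next": None, "previous": None, "results": results}


converted = toy_convert(
    [{
        "title": "Paper Title",
        "authors": ["John Doe"],
        "abstract": "Paper abstract...",
        "session": "",
        "poster_position": "A1",
    }],
    session_default="Workshop 2025",
    event_type="Workshop Poster",
)
print(converted["results"][0]["session"])  # Workshop 2025
```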
Plugin Installation
From Package
If your plugin is in the package’s src/abstracts_explorer/plugins/ directory, it will be auto-discovered when the package loads.
External Plugin
For plugins outside the package:
# In your code
from abstracts_explorer.plugins import register_plugin
from my_external_plugin import MyPlugin
register_plugin(MyPlugin())
Or set PYTHONPATH:
export PYTHONPATH=/path/to/plugins:$PYTHONPATH
uv run abstracts-explorer download --plugin myplugin --year 2025
Testing Your Plugin
Unit Test Example
import pytest
from my_plugin import MyPlugin


def test_plugin_metadata():
    """Test plugin metadata."""
    plugin = MyPlugin()
    metadata = plugin.get_metadata()
    assert metadata['name'] == 'myplugin'
    assert 2025 in metadata['supported_years']


def test_plugin_download():
    """Test plugin download."""
    plugin = MyPlugin()
    data = plugin.download(year=2025)
    # Check schema
    assert 'count' in data
    assert 'results' in data
    assert data['count'] == len(data['results'])
    # Check paper structure
    if data['results']:
        paper = data['results'][0]
        assert 'name' in paper
        assert 'abstract' in paper
        assert 'authors' in paper


def test_invalid_year():
    """Test invalid year handling."""
    plugin = MyPlugin()
    with pytest.raises(ValueError):
        plugin.download(year=2099)
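The download test above contacts the real source, which can make it slow or flaky. One common alternative is to replace the fetching method with canned data using `unittest.mock`. The sketch below uses a hypothetical `FakePlugin` class shaped like the Full Schema example so that it runs on its own; with a real plugin you would patch whichever method performs the network call.

```python
from unittest.mock import patch


class FakePlugin:
    """Hypothetical stand-in shaped like the Full Schema example."""

    supported_years = [2024, 2025]

    def _fetch_from_api(self, year):
        # A real plugin would perform an HTTP request here.
        raise RuntimeError("network access is not expected in unit tests")

    def download(self, year=None):
        if year not in self.supported_years:
            raise ValueError(f"Unsupported year: {year}")
        results = [
            {"name": raw["title"], "abstract": raw["abstract"], "authors": raw["authors"]}
            for raw in self._fetch_from_api(year)
        ]
        return {"count": len(results), "next": None, "previous": None, "results": results}


def test_download_without_network():
    """Patch the fetch method so the test never touches the network."""
    canned = [{"title": "T", "abstract": "A", "authors": [{"name": "Ada"}]}]
    with patch.object(FakePlugin, "_fetch_from_api", return_value=canned) as fake_fetch:
        data = FakePlugin().download(year=2025)
    fake_fetch.assert_called_once_with(2025)
    assert data["count"] == 1
    assert data["results"][0]["name"] == "T"
```

Keeping one network-backed test for occasional manual runs and mocking the rest is a reasonable split, though the right balance depends on how stable the source site is.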
Manual Testing
from my_plugin import MyPlugin
# Create instance
plugin = MyPlugin()
# Test metadata
print(plugin.get_metadata())
# Test download
data = plugin.download(year=2025)
print(f"Downloaded {data['count']} papers")
# Verify first paper
if data['results']:
    paper = data['results'][0]
    print(f"Title: {paper['name']}")
    print(f"Authors: {len(paper['authors'])} authors")
    print(f"Abstract length: {len(paper['abstract'])} chars")
Best Practices
1. Error Handling
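import requests


def download(self, year=None, **kwargs):
    self.validate_year(year)
    # Placeholder URL, as in the Full Schema example
    url = f'https://api.example.com/papers?year={year}'
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        # Chain the original exception so the traceback is preserved
        raise RuntimeError(f"Failed to fetch data: {e}") from e
    try:
        data = response.json()
    except ValueError as e:
        raise RuntimeError(f"Invalid JSON response: {e}") from e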
2. Caching
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def download(self, year=None, output_path=None, force_download=False, **kwargs):
    # Check if already downloaded
    if output_path and Path(output_path).exists() and not force_download:
        logger.info(f"Data already exists at {output_path}")
        # Load and return cached data
        return self._load_cached_data(output_path)
    # Download fresh data
    return self._fetch_from_source(year)
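The caching snippet above leaves `_load_cached_data` undefined. One way to fill it in, assuming the cached file is plain JSON holding the same dict that `download()` returns, is sketched below. The function names are hypothetical and the JSON format is an assumption; the package may store its data differently.

```python
import json
from pathlib import Path


def load_cached_data(output_path):
    """Read a previously saved result dict (assumed to be JSON)."""
    with Path(output_path).open(encoding="utf-8") as handle:
        return json.load(handle)


def save_cached_data(data, output_path):
    """Write the result dict as JSON, creating parent folders if needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


def download_with_cache(fetch, output_path, force_download=False):
    """Return cached data when present; otherwise call fetch() and cache it.

    `fetch` is any zero-argument callable that returns the result dict.
    """
    path = Path(output_path)
    if path.exists() and not force_download:
        return load_cached_data(path)
    data = fetch()
    save_cached_data(data, path)
    return data
```

Passing the fetch step in as a callable keeps the cache logic independent of any particular plugin, so it can be exercised in tests without network access.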
3. Logging
import logging
from abstracts_explorer.plugins import LightweightDownloaderPlugin, convert_lightweight_to_neurips_schema

logger = logging.getLogger(__name__)


class MyPlugin(LightweightDownloaderPlugin):
    def download(self, year=None, **kwargs):
        logger.info(f"Starting download for {year}")
        papers = self._scrape_papers(year)
        logger.info(f"Found {len(papers)} papers")
        return convert_lightweight_to_neurips_schema(papers, ...)
4. Progress Indication
from tqdm import tqdm


def _scrape_papers(self, year):
    paper_links = self._get_paper_links(year)
    papers = []
    for link in tqdm(paper_links, desc="Scraping papers"):
        paper = self._scrape_single_paper(link)
        papers.append(paper)
    return papers
5. Rate Limiting
import time


def _scrape_papers(self, year, delay=1.0):
    paper_links = self._get_paper_links(year)
    papers = []
    for link in paper_links:
        paper = self._scrape_single_paper(link)
        papers.append(paper)
        time.sleep(delay)  # Be nice to the server
    return papers
API Comparison
| Feature | Lightweight API | Full Schema API |
|---|---|---|
| Required Fields | 5 fields | ~15 fields |
| Total Fields | 13 fields | ~40+ fields |
| Complexity | Low | High |
| Setup Time | Minutes | Hours |
| Use Case | Workshops, small conferences | Official conferences, rich data |
| Auto-conversion | Yes | N/A |
| Author Format | Flexible (strings/dicts) | Strict dict format |
| Validation | Automatic | Manual |
See Also
CLI Reference - Command-line interface documentation
Usage Guide - General package usage
Plugin README - Technical plugin guide