# Plugin System

Abstracts Explorer includes an extensible plugin system that allows you to download papers from multiple sources beyond the official NeurIPS conference.

## Overview

The plugin system provides two APIs:

1. **Full Schema API** (`DownloaderPlugin`) - For complex data sources with rich metadata
2. **Lightweight API** (`LightweightDownloaderPlugin`) - For simple workshops and conferences

## Available Plugins

### neurips

Official NeurIPS conference data downloader.

```bash
uv run abstracts-explorer download --plugin neurips --year 2025 --db-path data/neurips.db
```

- **Years**: 2020-2025
- **Source**: [neurips.cc](https://neurips.cc/)
- **Fields**: Full schema (~40+ fields)

### ml4ps

ML4PS (Machine Learning for Physical Sciences) workshop downloader.

```bash
uv run abstracts-explorer download --plugin ml4ps --year 2025 --db-path data/ml4ps.db
```

- **Years**: 2025
- **Source**: [ML4PS Workshop](https://ml4physicalsciences.github.io/)
- **Fields**: Full schema with abstracts from NeurIPS virtual site

## Using Plugins via CLI

### List Available Plugins

```bash
uv run abstracts-explorer download --list-plugins
```

### Download with a Plugin

```bash
# Basic usage
uv run abstracts-explorer download --plugin ml4ps --year 2025 --db-path data/output.db

# With options
uv run abstracts-explorer download \
  --plugin ml4ps \
  --year 2025 \
  --db-path data/ml4ps_2025.db \
  --fetch-abstracts \
  --max-workers 10
```

## Creating Your Own Plugin

### When to Use Each API

**Use Full Schema API when:**

- Downloading from official sources with rich metadata
- You need precise control over all ~40+ schema fields
- Handling complex data transformations
- Source provides detailed information (PDFs, posters, event times, etc.)

**Use Lightweight API when:**

- Scraping workshops or small conferences
- Only essential paper information is available
- You want quick plugin development
- You want minimal boilerplate code

### Lightweight API (Recommended)

The Lightweight API requires only 5 fields per paper and supports 9 optional ones.

#### Required Fields

- `title` (str): Paper title
- `authors` (list): List of author names (strings or dicts with a `'name'` key)
- `abstract` (str): Paper abstract
- `session` (str): Session/workshop name
- `poster_position` (str): Poster identifier or position

#### Optional Fields

- `id` (int): Paper ID (auto-generated if not provided)
- `paper_pdf_url` (str): URL to paper PDF
- `poster_image_url` (str): URL to poster image
- `url` (str): General URL (OpenReview, ArXiv, etc.)
- `room_name` (str): Presentation room
- `keywords` (list): Keywords/tags
- `starttime` (str): Start time
- `endtime` (str): End time
- `award` (str): Award name (e.g., "Best Paper Award", "Outstanding Paper")

#### Example: Simple Workshop Plugin

Create `my_workshop_plugin.py`:

```python
from abstracts_explorer.plugins import (
    LightweightDownloaderPlugin,
    convert_lightweight_to_neurips_schema,
    register_plugin,
)
import requests
from bs4 import BeautifulSoup


class MyWorkshopPlugin(LightweightDownloaderPlugin):
    """Downloader for My Workshop."""

    plugin_name = "myworkshop"
    plugin_description = "My Workshop downloader"
    supported_years = [2024, 2025]

    def download(self, year=None, output_path=None, force_download=False, **kwargs):
        """Download papers from My Workshop website."""
        self.validate_year(year)

        # Scrape workshop website
        url = f'https://myworkshop.com/{year}/papers'
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Fail early on HTTP errors
        soup = BeautifulSoup(response.content, 'html.parser')

        papers = []
        for paper_elem in soup.find_all('div', class_='paper'):
            # Look up the optional PDF link once; it may be missing
            pdf_link = paper_elem.find('a', class_='pdf')
            paper = {
                # Required fields
                'title': paper_elem.find('h2', class_='title').text.strip(),
                'authors': [
                    a.text.strip()
                    for a in paper_elem.find_all('span', class_='author')
                ],
                'abstract': paper_elem.find('p', class_='abstract').text.strip(),
                'session': paper_elem.get('data-session', 'Main Track'),
                'poster_position': paper_elem.get('data-poster', 'TBD'),
                # Optional fields
                'paper_pdf_url': pdf_link['href'] if pdf_link else None,
                'keywords': [
                    tag.text.strip()
                    for tag in paper_elem.find_all('span', class_='tag')
                ],
            }
            papers.append(paper)

        # Convert to full NeurIPS schema
        return convert_lightweight_to_neurips_schema(
            papers,
            session_default=f'My Workshop {year}',
            event_type='Workshop Poster',
            source_url=url,
        )

    def get_metadata(self):
        """Return plugin metadata."""
        return {
            'name': self.plugin_name,
            'description': self.plugin_description,
            'supported_years': self.supported_years,
        }


# Auto-register the plugin when the module is imported
def _register():
    register_plugin(MyWorkshopPlugin())


_register()
```

#### Using Your Plugin

```python
# Import to register
from my_workshop_plugin import MyWorkshopPlugin

# Or use via CLI
# uv run abstracts-explorer download --plugin myworkshop --year 2025 --db-path data/workshop.db
```

#### Flexible Author Format

The Lightweight API accepts authors in multiple formats:

```python
# Simple strings
'authors': ['John Doe', 'Jane Smith']

# Dicts with name
'authors': [
    {'name': 'John Doe', 'affiliation': 'MIT'},
    {'name': 'Jane Smith', 'affiliation': 'Stanford'}
]

# Mixed (will be converted to strings)
'authors': [
    'John Doe',
    {'name': 'Jane Smith'}
]
```

### Full Schema API

Use the Full Schema API for more complex cases where you need complete control over all fields.

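The Full Schema API's author format is described in the comparison table as strict dicts, whereas scraped data often mixes plain strings and dicts. The following is a minimal sketch of coercing such mixed entries into dicts before building full-schema papers. `normalize_authors` is a hypothetical helper and not part of `abstracts_explorer`; the `name`, `institution`, and `email` keys are assumed from the advanced example in this guide, and the fallback to `affiliation` is an assumption for illustration.

```python
from typing import Any


def normalize_authors(authors: list[Any]) -> list[dict[str, str]]:
    """Hypothetical helper: coerce mixed author entries into dicts."""
    normalized = []
    for entry in authors:
        if isinstance(entry, str):
            # A bare name: fill the remaining keys with empty strings
            record = {"name": entry.strip(), "institution": "", "email": ""}
        elif isinstance(entry, dict) and entry.get("name"):
            record = {
                "name": str(entry["name"]).strip(),
                # Assumption: accept 'affiliation' as a fallback for 'institution'
                "institution": entry.get("institution") or entry.get("affiliation", ""),
                "email": entry.get("email", ""),
            }
        else:
            raise ValueError(f"Unsupported author entry: {entry!r}")
        normalized.append(record)
    return normalized
```

Raising on unrecognized entries, rather than skipping them silently, keeps malformed author data from reaching the database unnoticed.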
#### Example: Advanced Plugin

```python
from abstracts_explorer.plugins import DownloaderPlugin, register_plugin
import requests


class AdvancedPlugin(DownloaderPlugin):
    """Advanced plugin with full schema control."""

    plugin_name = "advanced"
    plugin_description = "Advanced conference downloader"
    supported_years = [2024, 2025]

    def download(self, year=None, output_path=None, force_download=False, **kwargs):
        """Download papers with full schema."""
        self.validate_year(year)

        # Your complex data fetching logic
        papers_data = self._fetch_from_api(year)

        # Build full schema
        results = []
        for raw_paper in papers_data:
            paper = {
                # Core fields
                'id': raw_paper['id'],
                'name': raw_paper['title'],
                'abstract': raw_paper['abstract'],
                # Author information
                'authors': [
                    {
                        'name': author['full_name'],
                        'institution': author.get('affiliation', ''),
                        'email': author.get('email', ''),
                    }
                    for author in raw_paper['authors']
                ],
                # Event information
                'session': raw_paper.get('session_name', f'Conference {year}'),
                'event_type': raw_paper.get('presentation_type', 'Poster'),
                'poster_position': raw_paper.get('poster_id', 'TBD'),
                'room_name': raw_paper.get('room', ''),
                # URLs and media
                'paper_pdf_url': raw_paper.get('pdf_link'),
                'poster_image_url': raw_paper.get('poster_image'),
                'openreview_url': raw_paper.get('openreview_link'),
                # Metadata
                'keywords': raw_paper.get('keywords', []),
                'tldr': raw_paper.get('summary', ''),
                # Timestamps
                'starttime': raw_paper.get('start_time'),
                'endtime': raw_paper.get('end_time'),
                # Additional fields...
            }
            results.append(paper)

        return {
            'count': len(results),
            'next': None,
            'previous': None,
            'results': results,
        }

    def get_metadata(self):
        """Return plugin metadata."""
        return {
            'name': self.plugin_name,
            'description': self.plugin_description,
            'supported_years': self.supported_years,
        }

    def _fetch_from_api(self, year):
        """Fetch data from external API."""
        response = requests.get(
            f'https://api.example.com/papers?year={year}', timeout=30
        )
        response.raise_for_status()  # Fail early on HTTP errors
        return response.json()['papers']


def _register():
    register_plugin(AdvancedPlugin())


_register()
```

## Schema Conversion

The `convert_lightweight_to_neurips_schema()` function automatically converts the lightweight format to the full NeurIPS schema:

```python
from abstracts_explorer.plugins import convert_lightweight_to_neurips_schema

papers = [
    {
        'title': 'Paper Title',
        'authors': ['John Doe', 'Jane Smith'],
        'abstract': 'Paper abstract...',
        'session': 'Morning Session',
        'poster_position': 'A1',
        'paper_pdf_url': 'https://example.com/paper.pdf',
    }
]

full_schema_data = convert_lightweight_to_neurips_schema(
    papers,
    session_default='Workshop 2025',
    event_type='Workshop Poster',
    source_url='https://workshop.com/2025'
)

# Result has full schema with ~40+ fields
print(full_schema_data['count'])  # 1
print(full_schema_data['results'][0]['name'])  # 'Paper Title'
print(full_schema_data['results'][0]['eventmedia'])  # Generated from URLs
```

### Converter Parameters

- `papers` (list): List of lightweight format papers
- `session_default` (str): Default session name if not provided
- `event_type` (str): Default event type (e.g., 'Workshop Poster', 'Conference Talk')
- `source_url` (str, optional): Source URL for reference

## Plugin Installation

### From Package

If your plugin is in the package's `src/abstracts_explorer/plugins/` directory, it will be auto-discovered when the package loads.

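This guide does not show how the package performs auto-discovery. As a rough illustration only, the sketch below assumes that discovery amounts to importing every module in a plugins package so that each module's top-level `_register()` call runs. `discover_plugin_modules` is a hypothetical name, and the real mechanism in `abstracts_explorer` may differ.

```python
import importlib
import pkgutil
from types import ModuleType


def discover_plugin_modules(package_name: str) -> list[ModuleType]:
    """Import every non-private submodule of a package and return the modules.

    Importing a module executes its top-level code, which is where the
    plugin examples in this guide call `_register()`.
    """
    package = importlib.import_module(package_name)
    modules = []
    for info in pkgutil.iter_modules(package.__path__, prefix=f"{package_name}."):
        short_name = info.name.rsplit(".", 1)[-1]
        if short_name.startswith("_"):
            continue  # Skip private helpers such as _utils or __main__
        modules.append(importlib.import_module(info.name))
    return modules
```

Under this assumption, a call such as `discover_plugin_modules("abstracts_explorer.plugins")` at package load time would be enough to register every bundled plugin that follows the `_register()` pattern.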
### External Plugin

For plugins outside the package, register them explicitly:

```python
# In your code
from abstracts_explorer.plugins import register_plugin
from my_external_plugin import MyPlugin

register_plugin(MyPlugin())
```

Or add the plugin directory to `PYTHONPATH` so the module can be imported:

```bash
export PYTHONPATH=/path/to/plugins:$PYTHONPATH
uv run abstracts-explorer download --plugin myplugin --year 2025
```

## Testing Your Plugin

### Unit Test Example

```python
import pytest
from my_plugin import MyPlugin


def test_plugin_metadata():
    """Test plugin metadata."""
    plugin = MyPlugin()
    metadata = plugin.get_metadata()

    assert metadata['name'] == 'myplugin'
    assert 2025 in metadata['supported_years']


def test_plugin_download():
    """Test plugin download."""
    plugin = MyPlugin()
    data = plugin.download(year=2025)

    # Check schema
    assert 'count' in data
    assert 'results' in data
    assert data['count'] == len(data['results'])

    # Check paper structure
    if data['results']:
        paper = data['results'][0]
        assert 'name' in paper
        assert 'abstract' in paper
        assert 'authors' in paper


def test_invalid_year():
    """Test invalid year handling."""
    plugin = MyPlugin()
    with pytest.raises(ValueError):
        plugin.download(year=2099)
```

### Manual Testing

```python
from my_plugin import MyPlugin

# Create instance
plugin = MyPlugin()

# Test metadata
print(plugin.get_metadata())

# Test download
data = plugin.download(year=2025)
print(f"Downloaded {data['count']} papers")

# Verify first paper
if data['results']:
    paper = data['results'][0]
    print(f"Title: {paper['name']}")
    print(f"Authors: {len(paper['authors'])} authors")
    print(f"Abstract length: {len(paper['abstract'])} chars")
```

## Best Practices

### 1. Error Handling

```python
import requests


def download(self, year=None, **kwargs):
    self.validate_year(year)
    url = f'https://api.example.com/papers?year={year}'  # Example endpoint

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        # Chain the original exception so the traceback is preserved
        raise RuntimeError(f"Failed to fetch data: {e}") from e

    try:
        data = response.json()
    except ValueError as e:
        raise RuntimeError(f"Invalid JSON response: {e}") from e
```

### 2. Caching

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def download(self, year=None, output_path=None, force_download=False, **kwargs):
    # Check if already downloaded
    if output_path and Path(output_path).exists() and not force_download:
        logger.info(f"Data already exists at {output_path}")
        # Load and return cached data
        return self._load_cached_data(output_path)

    # Download fresh data
    return self._fetch_from_source(year)
```

### 3. Logging

```python
import logging

from abstracts_explorer.plugins import (
    LightweightDownloaderPlugin,
    convert_lightweight_to_neurips_schema,
)

logger = logging.getLogger(__name__)


class MyPlugin(LightweightDownloaderPlugin):
    def download(self, year=None, **kwargs):
        logger.info(f"Starting download for {year}")

        papers = self._scrape_papers(year)
        logger.info(f"Found {len(papers)} papers")

        return convert_lightweight_to_neurips_schema(papers, ...)
```

### 4. Progress Indication

```python
from tqdm import tqdm


def _scrape_papers(self, year):
    paper_links = self._get_paper_links(year)

    papers = []
    for link in tqdm(paper_links, desc="Scraping papers"):
        paper = self._scrape_single_paper(link)
        papers.append(paper)

    return papers
```

### 5. Rate Limiting

```python
import time


def _scrape_papers(self, year, delay=1.0):
    paper_links = self._get_paper_links(year)

    papers = []
    for link in paper_links:
        paper = self._scrape_single_paper(link)
        papers.append(paper)
        time.sleep(delay)  # Be nice to the server

    return papers
```

## API Comparison

| Feature             | Lightweight API                       | Full Schema API                 |
| ------------------- | ------------------------------------- | ------------------------------- |
| **Required Fields** | 5 fields                              | ~15 fields                      |
| **Total Fields**    | 14 fields (5 required + 9 optional)   | ~40+ fields                     |
| **Complexity**      | Low                                   | High                            |
| **Setup Time**      | Minutes                               | Hours                           |
| **Use Case**        | Workshops, small conferences          | Official conferences, rich data |
| **Auto-conversion** | Yes                                   | N/A                             |
| **Author Format**   | Flexible (strings/dicts)              | Strict dict format              |
| **Validation**      | Automatic                             | Manual                          |

## See Also

- [CLI Reference](cli_reference.md) - Command-line interface documentation
- [Usage Guide](usage.md) - General package usage
- [Plugin README](../src/abstracts_explorer/plugins/README.md) - Technical plugin guide