RAG Module
The RAG (Retrieval-Augmented Generation) module provides a chat interface for querying papers using LLMs.
Overview
The RAGChat class implements:
Retrieval-Augmented Generation for paper queries
Conversation history management
Integration with LM Studio LLM backend
Context building from relevant papers
NEW: MCP Clustering Tools Integration - Automatic tool calling for topic analysis
Class Reference
Retrieval Augmented Generation (RAG) for NeurIPS abstracts.
This module provides RAG functionality to query papers and generate contextual responses using OpenAI-compatible language model APIs.
- exception abstracts_explorer.rag.RAGError[source]
Bases: Exception
Exception raised for RAG-related errors.
- class abstracts_explorer.rag.RAGChat(embeddings_manager, database, lm_studio_url=None, model=None, max_context_papers=None, temperature=None, enable_mcp_tools=True)[source]
Bases: object
RAG chat interface for querying NeurIPS papers.
Uses embeddings for semantic search and OpenAI-compatible API for response generation. Optionally integrates with MCP clustering tools to answer questions about conference topics, trends, and developments.
- Parameters:
embeddings_manager (EmbeddingsManager) – Manager for embeddings and vector search.
database (DatabaseManager) – Database instance for querying paper details. REQUIRED.
lm_studio_url (str, optional) – URL for OpenAI-compatible API, by default “http://localhost:1234”
model (str, optional) – Name of the language model, by default “auto”
max_context_papers (int, optional) – Maximum number of papers to include in context, by default 5
temperature (float, optional) – Sampling temperature for generation, by default 0.7
enable_mcp_tools (bool, optional) – Enable MCP clustering tools for topic analysis, by default True
Examples
>>> from abstracts_explorer.embeddings import EmbeddingsManager
>>> from abstracts_explorer.database import DatabaseManager
>>> em = EmbeddingsManager()
>>> em.connect()
>>> db = DatabaseManager()
>>> db.connect()
>>> chat = RAGChat(em, db)
>>>
>>> # Ask about specific papers
>>> response = chat.query("What are the latest advances in deep learning?")
>>> print(response)
>>>
>>> # Ask about conference topics (uses MCP tools automatically)
>>> response = chat.query("What are the main research topics at NeurIPS?")
>>> print(response)
- __init__(embeddings_manager, database, lm_studio_url=None, model=None, max_context_papers=None, temperature=None, enable_mcp_tools=True)[source]
Initialize RAG chat.
Parameters are optional and will use values from environment/config if not provided.
- Parameters:
embeddings_manager (EmbeddingsManager) – Manager for embeddings and vector search.
database (DatabaseManager) – Database instance for querying paper details. REQUIRED - no fallback allowed.
lm_studio_url (str, optional) – URL for OpenAI-compatible API. If None, uses config value.
model (str, optional) – Name of the language model. If None, uses config value.
max_context_papers (int, optional) – Maximum number of papers to include in context. If None, uses config value.
temperature (float, optional) – Sampling temperature for generation. If None, uses config value.
enable_mcp_tools (bool, optional) – Whether to enable MCP clustering tools for the LLM. Default is True.
- Raises:
RAGError – If required parameters are missing or invalid.
- property openai_client: OpenAI
Get the OpenAI client, creating it lazily on first access.
This lazy loading prevents API calls during test collection.
- Returns:
Initialized OpenAI client instance.
- Return type:
OpenAI
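A minimal sketch of this lazy-initialization pattern (the class and attribute names below are illustrative, not the module's internals):
from functools import cached_property
from openai import OpenAI
class LazyClientHolder:
    """Illustrative only: defer OpenAI client creation until first use."""
    def __init__(self, base_url: str, api_key: str = "not-needed"):
        self._base_url = base_url
        self._api_key = api_key
    @cached_property
    def openai_client(self) -> OpenAI:
        # No client object exists until this property is first read,
        # so constructing the holder (e.g. during test collection) stays side-effect free.
        return OpenAI(base_url=self._base_url, api_key=self._api_key)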
- query(question, n_results=None, metadata_filter=None, system_prompt=None)[source]
Query the RAG system with a question.
- Parameters:
question (str) – The question to ask about the papers.
n_results (int, optional) – Number of papers to retrieve for context. If None, uses the configured default.
metadata_filter (dict, optional) – Metadata filter applied to the paper search (e.g. by year).
system_prompt (str, optional) – Custom system prompt. If None, uses the default research-assistant prompt.
- Returns:
Dictionary containing:
- response: str – Generated response
- papers: list – Retrieved papers used as context
- metadata: dict – Additional metadata
- Return type:
dict
- Raises:
RAGError – If query fails.
Examples
>>> response = chat.query("What is attention mechanism?")
>>> print(response["response"])
>>> print(f"Based on {len(response['papers'])} papers")
- chat(message, use_context=True, n_results=None)[source]
Continue conversation with context awareness.
- Parameters:
message (str) – The next user message in the conversation.
use_context (bool, optional) – Whether to retrieve papers as context for this message. Default is True.
n_results (int, optional) – Number of papers to retrieve. If None, uses the configured default.
- Returns:
Dictionary containing response and metadata.
- Return type:
dict
Examples
>>> response = chat.chat("Tell me more about transformers")
>>> print(response["response"])
Usage Examples
Basic Setup
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.rag import RAGChat
# Initialize embeddings manager and database (the database is required)
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()
# Initialize RAG chat
chat = RAGChat(
    embeddings_manager=em,
    database=db,
    lm_studio_url="http://localhost:1234",
    model="gemma-3-4b-it-qat"
)
Simple Query
# Ask a question
response = chat.query(
"What are the latest developments in transformer architectures?"
)
print(response)
Conversation
# Start a conversation
response1 = chat.query("Tell me about vision transformers")
print(response1)
# Continue the conversation (maintains context)
response2 = chat.chat("What are their main advantages?")
print(response2)
response3 = chat.chat("Can you explain the first paper in more detail?")
print(response3)
Custom Parameters
# Temperature is configured on the RAGChat instance, not per query
chat = RAGChat(em, db, temperature=0.8)  # more creative responses
# Query with custom settings
response = chat.query(
    "Explain self-attention mechanisms",
    n_results=10,  # Use 10 papers for context
    system_prompt="You are a helpful research assistant."
)
MCP Clustering Tools Integration
New in v0.3.0: RAGChat integrates with MCP clustering tools, allowing the LLM to automatically analyze conference topics, trends, and developments.
What Are MCP Tools?
MCP (Model Context Protocol) tools are specialized functions that the LLM can call to perform specific tasks. The RAG system includes four clustering tools:
get_cluster_topics - Analyze overall conference topics
get_topic_evolution - Track how topics evolve over time
get_recent_developments - Find recent papers in specific areas
get_cluster_visualization - Generate cluster visualizations
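For orientation, this is roughly how such a tool is advertised to an OpenAI-compatible backend using the standard function-calling schema; the exact schema bundled with the module may differ:
# Hypothetical tool definition in OpenAI function-calling format
cluster_topics_tool = {
    "type": "function",
    "function": {
        "name": "get_cluster_topics",
        "description": "Cluster paper embeddings and summarize the main conference topics.",
        "parameters": {
            "type": "object",
            "properties": {
                "n_clusters": {
                    "type": "integer",
                    "description": "Number of topic clusters to compute.",
                },
            },
            "required": [],
        },
    },
}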
How It Works
When MCP tools are enabled (default), the LLM automatically decides when to use clustering tools based on the user’s question:
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.rag import RAGChat
# Initialize components
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()
# Create RAG chat with MCP tools enabled (default)
chat = RAGChat(em, db, enable_mcp_tools=True)
# Ask about general topics - LLM will use clustering tools
response = chat.query("What are the main research topics at NeurIPS?")
print(response)
# The LLM automatically calls get_cluster_topics() and analyzes the results
# Ask about trends - LLM uses topic evolution tool
response = chat.query("How have transformers evolved at NeurIPS over the years?")
print(response)
# The LLM calls get_topic_evolution(topic_keywords="transformers")
# Ask about recent work - LLM uses recent developments tool
response = chat.query("What are the latest papers on reinforcement learning?")
print(response)
# The LLM calls get_recent_developments(topic_keywords="reinforcement learning")
Tool Selection
The LLM automatically decides which tool(s) to use based on the question:
Questions about “main topics”, “themes”, “areas” → uses get_cluster_topics
Questions about “evolution”, “trends”, “over time” → uses get_topic_evolution
Questions about “recent”, “latest”, “new” → uses get_recent_developments
Questions about specific papers → uses standard RAG (no tools)
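Behind the scenes, an OpenAI-compatible backend expresses its choice as tool_calls on the returned message, which the client then dispatches. A minimal, hypothetical dispatch loop (not the module's actual code) looks like this:
import json
def dispatch_tool_calls(message, available_tools):
    """Hypothetical dispatcher: run each tool the model requested."""
    tool_messages = []
    for call in message.tool_calls or []:
        func = available_tools[call.function.name]  # e.g. get_cluster_topics
        kwargs = json.loads(call.function.arguments or "{}")
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(func(**kwargs)),
        })
    return tool_messages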
Disabling MCP Tools
If you want to disable MCP tools and use only standard RAG:
# Disable MCP tools
chat = RAGChat(em, db, enable_mcp_tools=False)
# Now all queries use only paper embeddings search
response = chat.query("What are the main topics?")
# Will search for relevant papers instead of clustering
Combining Tools with RAG
The LLM can use both tools AND paper retrieval in the same query:
# Complex query that might use both
response = chat.query(
"What are the main topics at NeurIPS, and can you explain "
"the attention mechanism paper in detail?"
)
# The LLM might:
# 1. Call get_cluster_topics() to identify main topics
# 2. Search for "attention mechanism" papers
# 3. Combine both to generate comprehensive answer
Tool Call Examples
Here are example questions that trigger different tools:
# Cluster analysis
chat.query("What topics are covered in the conference?")
chat.query("Show me the main research areas")
chat.query("Analyze the distribution of papers by topic")
# Topic evolution
chat.query("How has deep learning research changed over time?")
chat.query("Show me the trend for transformer papers from 2020-2025")
chat.query("Has interest in GANs increased or decreased?")
# Recent developments
chat.query("What are the latest papers on LLMs?")
chat.query("Show me recent work in computer vision")
chat.query("What's new in the last 2 years for neural architecture search?")
# Specific papers (no tools, standard RAG)
chat.query("Explain the Vision Transformer paper")
chat.query("What does the BERT paper say about pre-training?")
chat.query("Find papers about attention mechanisms")
Advanced: Tool Call Debugging
To see which tools the LLM is calling, check the logs:
import logging
# Enable debug logging
logging.basicConfig(level=logging.INFO)
# Now tool calls will be logged
chat = RAGChat(em, db, enable_mcp_tools=True)
response = chat.query("What are the main topics?")
# Logs: "LLM requested tool: get_cluster_topics with args: {'n_clusters': 8}"
Requirements
MCP tools require:
Embeddings created: Run abstracts-explorer create-embeddings first
Papers downloaded: Have conference data in the database
Tool-capable LLM: Model must support function calling (e.g., GPT-3.5+, Gemma 3, Claude)
If your LLM doesn’t support function calling, disable MCP tools:
chat = RAGChat(em, db, enable_mcp_tools=False)
Metadata Filtering
# Query papers from a specific year
response = chat.query(
    "What are the main themes in 2025?",
    metadata_filter={"year": 2025}
)
# Multiple filters
response = chat.query(
    "Explain recent attention mechanisms",
    metadata_filter={
        "year": {"$gte": 2024},
    },
    n_results=5
)
Conversation Management
Reset Conversation
# Clear conversation history
chat.reset_conversation()
# Start fresh conversation
response = chat.query("New topic...")
Export Conversation
# Export to JSON file
chat.export_conversation("conversation.json")
# Export returns the conversation data
conversation_data = chat.export_conversation("chat_history.json")
Conversation Format
Exported conversations include:
{
"timestamp": "2025-11-26T10:00:00",
"model": "gemma-3-4b-it-qat",
"messages": [
{
"role": "user",
"content": "What is a transformer?",
"timestamp": "2025-11-26T10:00:00"
},
{
"role": "assistant",
"content": "A transformer is...",
"papers_used": [
{"id": "123", "title": "Attention Is All You Need"}
],
"timestamp": "2025-11-26T10:00:05"
}
]
}
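Assuming the layout shown above, an exported conversation can be read back with a few lines of standard-library code:
import json
# Load a previously exported conversation (structure as shown above)
with open("conversation.json", "r", encoding="utf-8") as fh:
    conversation = json.load(fh)
print(f"Model: {conversation['model']}")
for msg in conversation["messages"]:
    print(f"[{msg['role']}] {msg['content'][:80]}")
    for paper in msg.get("papers_used", []):
        print(f"  cites: {paper['title']}")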
LLM Backend Configuration
Supported Backends
The module is designed for LM Studio but works with any OpenAI-compatible API:
LM Studio (default)
OpenAI API
LocalAI
Ollama (with OpenAI compatibility)
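For example, pointing the chat at Ollama's OpenAI-compatible endpoint might look like the following; the URL and model name are illustrative, and whether the /v1 suffix is needed depends on how the module builds its client:
# Hypothetical: Ollama exposing its OpenAI-compatible API on the default port
chat = RAGChat(
    em,
    db,
    lm_studio_url="http://localhost:11434/v1",
    model="llama3.1",
)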
Authentication
# With authentication token
chat = RAGChat(
    em,
    db,
    lm_studio_url="https://api.example.com",
    model="gpt-4",
    auth_token="sk-..."
)
Custom Endpoints
# Different backend URL
chat = RAGChat(
    em,
    db,
    lm_studio_url="http://localhost:8080",
    model="custom-model"
)
Response Generation
How RAG Works
Retrieve: Search for relevant papers using embeddings
Augment: Build context from paper abstracts
Generate: Send context + query to LLM
Return: Get AI-generated response with citations
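Conceptually, the loop looks like this sketch; the function and search API names are illustrative, not the module's actual implementation:
def rag_answer(question, embeddings_manager, llm_client, n_results=5):
    # 1. Retrieve: semantic search over paper embeddings (illustrative search API)
    papers = embeddings_manager.search(question, n_results=n_results)
    # 2. Augment: build a context block from the retrieved abstracts
    context = "\n\n".join(
        f"Title: {p['title']} ({p['year']})\nAbstract: {p['abstract']}" for p in papers
    )
    # 3. Generate: send context plus the question to the LLM
    completion = llm_client.chat.completions.create(
        model="auto",
        messages=[
            {"role": "system", "content": "Answer using only the provided papers and cite them by title."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    # 4. Return: generated answer plus the papers used as context
    return {"response": completion.choices[0].message.content, "papers": papers}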
Context Building
The RAG system builds context from retrieved papers:
Context includes:
- Paper titles
- Paper abstracts
- Paper years
- Relevance scores
The context is formatted for optimal LLM comprehension.
System Prompts
Default system prompt:
You are a helpful research assistant specializing in NeurIPS papers.
Answer questions based on the provided paper abstracts.
Cite papers by title when referencing them.
Custom system prompt:
chat.query(
"Your question",
system_prompt="You are an expert in computer vision..."
)
Error Handling
from abstracts_explorer.rag import RAGError
try:
    response = chat.query("What is deep learning?")
except RAGError as e:
    print(f"RAG query failed (backend unreachable or invalid response): {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Performance Considerations
Response Time
Factors affecting response time:
Number of papers retrieved (n_results)
LLM model size and speed
Token generation length (max_tokens)
Network latency to LLM backend
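A quick way to see how these factors add up is to time a query directly:
import time
start = time.perf_counter()
response = chat.query("What is deep learning?", n_results=5)
elapsed = time.perf_counter() - start
print(f"Answered in {elapsed:.1f}s using {len(response['papers'])} papers")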
Memory Usage
Conversation history stored in memory
Each message adds to context
Use reset_conversation() for long sessions
Optimization Tips
# Faster responses: retrieve fewer papers
chat.query(question, n_results=3)
# More comprehensive but slower
chat.query(question, n_results=10)
# Balance quality and speed
chat.query(question, n_results=5)
Best Practices
Start specific - Focused queries get better results
Use filters - Narrow search space with metadata filters
Manage history - Reset conversation when changing topics
Export important conversations - Save valuable interactions
Adjust parameters - Tune n_results and temperature for your needs
Monitor backend - Ensure LM Studio/LLM is running and responsive
Handle errors - Wrap calls in try-except for production use