RAG Module

The RAG (Retrieval-Augmented Generation) module provides a chat interface for querying papers using LLMs.

Overview

The RAGChat class implements:

  • Retrieval-Augmented Generation for paper queries

  • Conversation history management

  • Integration with LM Studio LLM backend

  • Context building from relevant papers

  • NEW: MCP Clustering Tools Integration - Automatic tool calling for topic analysis

Class Reference

Retrieval-Augmented Generation (RAG) for NeurIPS abstracts.

This module provides RAG functionality to query papers and generate contextual responses using OpenAI-compatible language model APIs.

exception abstracts_explorer.rag.RAGError[source]

Bases: Exception

Exception raised for RAG-related errors.

class abstracts_explorer.rag.RAGChat(embeddings_manager, database, lm_studio_url=None, model=None, max_context_papers=None, temperature=None, enable_mcp_tools=True)[source]

Bases: object

RAG chat interface for querying NeurIPS papers.

Uses embeddings for semantic search and OpenAI-compatible API for response generation. Optionally integrates with MCP clustering tools to answer questions about conference topics, trends, and developments.

Parameters:
  • embeddings_manager (EmbeddingsManager) – Manager for embeddings and vector search.

  • database (DatabaseManager) – Database instance for querying paper details. REQUIRED.

  • lm_studio_url (str, optional) – URL for OpenAI-compatible API, by default “http://localhost:1234”

  • model (str, optional) – Name of the language model, by default “auto”

  • max_context_papers (int, optional) – Maximum number of papers to include in context, by default 5

  • temperature (float, optional) – Sampling temperature for generation, by default 0.7

  • enable_mcp_tools (bool, optional) – Enable MCP clustering tools for topic analysis, by default True

Examples

>>> from abstracts_explorer.embeddings import EmbeddingsManager
>>> from abstracts_explorer.database import DatabaseManager
>>> em = EmbeddingsManager()
>>> em.connect()
>>> db = DatabaseManager()
>>> db.connect()
>>> chat = RAGChat(em, db)
>>>
>>> # Ask about specific papers
>>> response = chat.query("What are the latest advances in deep learning?")
>>> print(response["response"])
>>>
>>> # Ask about conference topics (uses MCP tools automatically)
>>> response = chat.query("What are the main research topics at NeurIPS?")
>>> print(response["response"])
__init__(embeddings_manager, database, lm_studio_url=None, model=None, max_context_papers=None, temperature=None, enable_mcp_tools=True)[source]

Initialize RAG chat.

Parameters are optional and will use values from environment/config if not provided.

Parameters:
  • embeddings_manager (EmbeddingsManager) – Manager for embeddings and vector search.

  • database (DatabaseManager) – Database instance for querying paper details. REQUIRED - no fallback allowed.

  • lm_studio_url (str, optional) – URL for OpenAI-compatible API. If None, uses config value.

  • model (str, optional) – Name of the language model. If None, uses config value.

  • max_context_papers (int, optional) – Maximum number of papers to include in context. If None, uses config value.

  • temperature (float, optional) – Sampling temperature for generation. If None, uses config value.

  • enable_mcp_tools (bool, optional) – Whether to enable MCP clustering tools for the LLM. Default is True.

Raises:

RAGError – If required parameters are missing or invalid.

property openai_client: OpenAI

Get the OpenAI client, creating it lazily on first access.

This lazy loading prevents API calls during test collection.

Returns:

Initialized OpenAI client instance.

Return type:

OpenAI
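For illustration, a minimal sketch of the lazy-initialization pattern this property describes (the module's actual implementation may differ):

from functools import cached_property
from openai import OpenAI

class _LazyClientExample:
    """Illustrative only: defer client construction to first access."""

    def __init__(self, base_url: str):
        self._base_url = base_url

    @cached_property
    def openai_client(self) -> OpenAI:
        # No connection or API call happens until this property is
        # first read, so test collection stays side-effect free.
        return OpenAI(base_url=self._base_url, api_key="not-needed")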

query(question, n_results=None, metadata_filter=None, system_prompt=None)[source]

Query the RAG system with a question.

Parameters:
  • question (str) – User’s question.

  • n_results (int, optional) – Number of papers to retrieve for context.

  • metadata_filter (dict, optional) – Metadata filter for paper search.

  • system_prompt (str, optional) – Custom system prompt for the model.

Returns:

Dictionary containing:

  • response: str – Generated response

  • papers: list – Retrieved papers used as context

  • metadata: dict – Additional metadata

Return type:

dict

Raises:

RAGError – If query fails.

Examples

>>> response = chat.query("What is attention mechanism?")
>>> print(response["response"])
>>> print(f"Based on {len(response['papers'])} papers")
chat(message, use_context=True, n_results=None)[source]

Continue conversation with context awareness.

Parameters:
  • message (str) – User’s message.

  • use_context (bool, optional) – Whether to retrieve papers as context, by default True

  • n_results (int, optional) – Number of papers to retrieve.

Returns:

Dictionary containing response and metadata.

Return type:

dict

Examples

>>> response = chat.chat("Tell me more about transformers")
>>> print(response["response"])
reset_conversation()[source]

Reset conversation history and cached context.

Examples

>>> chat.reset_conversation()
export_conversation(output_path)[source]

Export conversation history to JSON file.

Parameters:

output_path (Path) – Path to output JSON file.

Return type:

None

Examples

>>> chat.export_conversation("conversation.json")

Usage Examples

Basic Setup

from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.rag import RAGChat

# Initialize embeddings manager and database (both are required)
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()

# Initialize RAG chat
chat = RAGChat(
    embeddings_manager=em,
    database=db,
    lm_studio_url="http://localhost:1234",
    model="gemma-3-4b-it-qat"
)

Simple Query

# Ask a question
response = chat.query(
    "What are the latest developments in transformer architectures?"
)

print(response["response"])

Conversation

# Start a conversation
response1 = chat.query("Tell me about vision transformers")
print(response1["response"])

# Continue the conversation (maintains context)
response2 = chat.chat("What are their main advantages?")
print(response2["response"])

response3 = chat.chat("Can you explain the first paper in more detail?")
print(response3["response"])

Custom Parameters

# Query with custom settings
# (temperature is set on the RAGChat constructor, not per query)
response = chat.query(
    question="Explain self-attention mechanisms",
    n_results=10,              # Use 10 papers for context
    system_prompt="You are a helpful research assistant."
)

MCP Clustering Tools Integration

New in v0.3.0: RAGChat integrates with MCP clustering tools, allowing the LLM to automatically analyze conference topics, trends, and developments.

What Are MCP Tools?

MCP (Model Context Protocol) tools are specialized functions that the LLM can call to perform specific tasks. The RAG system includes four clustering tools:

  1. get_cluster_topics - Analyze overall conference topics

  2. get_topic_evolution - Track how topics evolve over time

  3. get_recent_developments - Find recent papers in specific areas

  4. get_cluster_visualization - Generate cluster visualizations
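For reference, here is a hedged sketch of what an OpenAI-style function-calling definition for the first tool might look like. The schema below is illustrative, not the module's actual definition; only the n_clusters argument is attested by the log example later in this section.

# Illustrative OpenAI-style tool schema; the module's real schema may differ.
get_cluster_topics_tool = {
    "type": "function",
    "function": {
        "name": "get_cluster_topics",
        "description": "Analyze overall conference topics by clustering papers.",
        "parameters": {
            "type": "object",
            "properties": {
                "n_clusters": {
                    "type": "integer",
                    "description": "Number of topic clusters to compute.",
                },
            },
        },
    },
}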

How It Works

When MCP tools are enabled (default), the LLM automatically decides when to use clustering tools based on the user’s question:

from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.rag import RAGChat

# Initialize components
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()

# Create RAG chat with MCP tools enabled (default)
chat = RAGChat(em, db, enable_mcp_tools=True)

# Ask about general topics - LLM will use clustering tools
response = chat.query("What are the main research topics at NeurIPS?")
print(response["response"])
# The LLM automatically calls get_cluster_topics() and analyzes the results

# Ask about trends - LLM uses topic evolution tool
response = chat.query("How have transformers evolved at NeurIPS over the years?")
print(response["response"])
# The LLM calls get_topic_evolution(topic_keywords="transformers")

# Ask about recent work - LLM uses recent developments tool
response = chat.query("What are the latest papers on reinforcement learning?")
print(response["response"])
# The LLM calls get_recent_developments(topic_keywords="reinforcement learning")

Tool Selection

The LLM automatically decides which tool(s) to use based on the question:

  • Questions about “main topics”, “themes”, “areas” → Uses get_cluster_topics

  • Questions about “evolution”, “trends”, “over time” → Uses get_topic_evolution

  • Questions about “recent”, “latest”, “new” → Uses get_recent_developments

  • Questions about specific papers → Uses standard RAG (no tools)

Disabling MCP Tools

If you want to disable MCP tools and use only standard RAG:

# Disable MCP tools
chat = RAGChat(em, db, enable_mcp_tools=False)

# Now all queries use only paper embeddings search
response = chat.query("What are the main topics?")
# Will search for relevant papers instead of clustering

Combining Tools with RAG

The LLM can use both tools AND paper retrieval in the same query:

# Complex query that might use both
response = chat.query(
    "What are the main topics at NeurIPS, and can you explain "
    "the attention mechanism paper in detail?"
)
# The LLM might:
# 1. Call get_cluster_topics() to identify main topics
# 2. Search for "attention mechanism" papers
# 3. Combine both to generate comprehensive answer

Tool Call Examples

Here are example questions that trigger different tools:

# Cluster analysis
chat.query("What topics are covered in the conference?")
chat.query("Show me the main research areas")
chat.query("Analyze the distribution of papers by topic")

# Topic evolution
chat.query("How has deep learning research changed over time?")
chat.query("Show me the trend for transformer papers from 2020-2025")
chat.query("Has interest in GANs increased or decreased?")

# Recent developments
chat.query("What are the latest papers on LLMs?")
chat.query("Show me recent work in computer vision")
chat.query("What's new in the last 2 years for neural architecture search?")

# Specific papers (no tools, standard RAG)
chat.query("Explain the Vision Transformer paper")
chat.query("What does the BERT paper say about pre-training?")
chat.query("Find papers about attention mechanisms")

Advanced: Tool Call Debugging

To see which tools the LLM is calling, check the logs:

import logging

# Enable debug logging
logging.basicConfig(level=logging.INFO)

# Now tool calls will be logged
chat = RAGChat(em, db, enable_mcp_tools=True)
response = chat.query("What are the main topics?")
# Logs: "LLM requested tool: get_cluster_topics with args: {'n_clusters': 8}"
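To keep the output focused, you can quiet the root logger and lower the level only for this module's logger (assuming it follows the standard __name__ naming convention, which is common but not confirmed here):

import logging

logging.basicConfig(level=logging.WARNING)
# Assumes the module logger is named after the module path.
logging.getLogger("abstracts_explorer.rag").setLevel(logging.DEBUG)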

Requirements

MCP tools require:

  • Embeddings created: Run abstracts-explorer create-embeddings first

  • Papers downloaded: Have conference data in database

  • Tool-capable LLM: Model must support function calling (e.g., GPT-3.5+, Gemma 3, Claude)

If your LLM doesn’t support function calling, disable MCP tools:

chat = RAGChat(em, db, enable_mcp_tools=False)

Metadata Filtering

# Query papers from a specific year
response = chat.query(
    "What are the main themes in 2025?",
    metadata_filter={"year": 2025}
)

# Operator-based filter
response = chat.query(
    "Explain recent attention mechanisms",
    metadata_filter={
        "year": {"$gte": 2024},
    },
    n_results=5
)

Conversation Management

Reset Conversation

# Clear conversation history
chat.reset_conversation()

# Start fresh conversation
response = chat.query("New topic...")

Export Conversation

# Export to JSON file (writes the history to disk and returns None)
chat.export_conversation("conversation.json")

Conversation Format

Exported conversations include:

{
    "timestamp": "2025-11-26T10:00:00",
    "model": "gemma-3-4b-it-qat",
    "messages": [
        {
            "role": "user",
            "content": "What is a transformer?",
            "timestamp": "2025-11-26T10:00:00"
        },
        {
            "role": "assistant",
            "content": "A transformer is...",
            "papers_used": [
                {"id": "123", "title": "Attention Is All You Need"}
            ],
            "timestamp": "2025-11-26T10:00:05"
        }
    ]
}
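Because the export is plain JSON, the history can be loaded back with the standard library:

import json
from pathlib import Path

data = json.loads(Path("conversation.json").read_text())
for message in data["messages"]:
    print(f"[{message['role']}] {message['content'][:80]}")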

LLM Backend Configuration

Supported Backends

The module is designed for LM Studio but works with any OpenAI-compatible API:

  • LM Studio (default)

  • OpenAI API

  • LocalAI

  • Ollama (with OpenAI compatibility)

Authentication

# With authentication token
chat = RAGChat(
    em,
    db,
    lm_studio_url="https://api.example.com",
    model="gpt-4",
    auth_token="sk-..."
)

Custom Endpoints

# Different backend URL
chat = RAGChat(
    em,
    db,
    lm_studio_url="http://localhost:8080",
    model="custom-model"
)
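For example, Ollama serves an OpenAI-compatible API under the /v1 path, so (assuming a locally running Ollama server and a pulled model) the setup might look like:

# Ollama's OpenAI-compatible endpoint listens on port 11434 under /v1
chat = RAGChat(
    em,
    db,
    lm_studio_url="http://localhost:11434/v1",
    model="llama3"  # any model available in your Ollama install
)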

Response Generation

How RAG Works

  1. Retrieve: Search for relevant papers using embeddings

  2. Augment: Build context from paper abstracts

  3. Generate: Send context + query to LLM

  4. Return: Get AI-generated response with citations
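Conceptually, the four steps look like the sketch below. The search method on the embeddings manager and the exact prompt format are assumptions for illustration; only the OpenAI-style completion call reflects a real client API.

# Illustrative pipeline sketch, not the module's actual internals.
def rag_answer(question, em, client, n_results=5):
    # 1. Retrieve: semantic search over paper embeddings (hypothetical API)
    papers = em.search(question, n_results=n_results)
    # 2. Augment: build a context block from titles, years, and abstracts
    context = "\n\n".join(
        f"{p['title']} ({p['year']})\n{p['abstract']}" for p in papers
    )
    # 3. Generate: send context plus the question to the LLM
    completion = client.chat.completions.create(
        model="auto",
        messages=[
            {"role": "system", "content": "Answer using the provided abstracts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    # 4. Return: the generated response (citations come from the context)
    return completion.choices[0].message.content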

Context Building

The RAG system builds context from retrieved papers:

Context includes:

  • Paper titles

  • Paper abstracts

  • Paper years

  • Relevance scores

The context is formatted for optimal LLM comprehension.

System Prompts

Default system prompt:

You are a helpful research assistant specializing in NeurIPS papers.
Answer questions based on the provided paper abstracts.
Cite papers by title when referencing them.

Custom system prompt:

chat.query(
    "Your question",
    system_prompt="You are an expert in computer vision..."
)

Error Handling

import requests

from abstracts_explorer.rag import RAGError

try:
    response = chat.query("What is deep learning?")
except RAGError as e:
    # Documented exception for RAG-related failures
    print(f"RAG query failed: {e}")
except requests.RequestException:
    print("LLM backend connection failed")

Performance Considerations

Response Time

Factors affecting response time:

  • Number of papers retrieved (n_results)

  • LLM model size and speed

  • Token generation length (max_tokens)

  • Network latency to LLM backend
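A quick way to measure end-to-end latency for your own setup:

import time

start = time.perf_counter()
response = chat.query("What is deep learning?", n_results=5)
elapsed = time.perf_counter() - start
print(f"Answered in {elapsed:.1f}s using {len(response['papers'])} papers")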

Memory Usage

  • Conversation history stored in memory

  • Each message adds to context

  • Use reset_conversation() for long sessions
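A minimal pattern for keeping long sessions bounded, using only the documented chat() and reset_conversation() methods:

questions = ["What is attention?", "Explain diffusion models"]  # example input

for i, question in enumerate(questions):
    # Clear accumulated history every 20 turns to cap context growth
    if i > 0 and i % 20 == 0:
        chat.reset_conversation()
    answer = chat.chat(question)
    print(answer["response"])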

Optimization Tips

question = "What is deep learning?"

# Faster responses: retrieve fewer papers
response = chat.query(question, n_results=3)

# More comprehensive but slower
response = chat.query(question, n_results=10)

# Balance quality and speed
response = chat.query(question, n_results=5)

Best Practices

  1. Start specific - Focused queries get better results

  2. Use filters - Narrow search space with metadata filters

  3. Manage history - Reset conversation when changing topics

  4. Export important conversations - Save valuable interactions

  5. Adjust parameters - Tune n_results and temperature for your needs

  6. Monitor backend - Ensure LM Studio/LLM is running and responsive

  7. Handle errors - Wrap calls in try-except for production use