RAG Module

The RAG (Retrieval-Augmented Generation) module provides a chat interface for querying papers using LLMs.

Overview

The RAGChat class implements:

  • Retrieval-Augmented Generation for paper queries

  • Conversation history management

  • Integration with LM Studio LLM backend

  • Context building from relevant papers

  • NEW: MCP Clustering Tools Integration - Automatic tool calling for topic analysis

Class Reference

Retrieval-Augmented Generation (RAG) for NeurIPS abstracts.

This module provides RAG functionality to query papers and generate contextual responses using OpenAI-compatible language model APIs.

exception abstracts_explorer.rag.RAGError[source]

Bases: Exception

Exception raised for RAG-related errors.

class abstracts_explorer.rag.RAGChat(embeddings_manager, database, lm_studio_url=None, model=None, max_context_papers=None, temperature=None, enable_mcp_tools=True)[source]

Bases: object

RAG chat interface for querying NeurIPS papers.

Uses embeddings for semantic search and OpenAI-compatible API for response generation. Optionally integrates with MCP clustering tools to answer questions about conference topics, trends, and developments.

Parameters:
  • embeddings_manager (EmbeddingsManager) – Manager for embeddings and vector search.

  • database (DatabaseManager) – Database instance for querying paper details. REQUIRED.

  • lm_studio_url (str, optional) – URL for OpenAI-compatible API, by default “http://localhost:1234”

  • model (str, optional) – Name of the language model, by default “auto”

  • max_context_papers (int, optional) – Maximum number of papers to include in context, by default 5

  • temperature (float, optional) – Sampling temperature for generation, by default 0.7

  • enable_mcp_tools (bool, optional) – Enable MCP clustering tools for topic analysis, by default True

Examples

>>> from abstracts_explorer.embeddings import EmbeddingsManager
>>> from abstracts_explorer.database import DatabaseManager
>>> em = EmbeddingsManager()
>>> em.connect()
>>> db = DatabaseManager()
>>> db.connect()
>>> chat = RAGChat(em, db)
>>>
>>> # Ask about specific papers
>>> response = chat.query("What are the latest advances in deep learning?")
>>> print(response["response"])
>>>
>>> # Ask about conference topics (uses MCP tools automatically)
>>> response = chat.query("What are the main research topics at NeurIPS?")
>>> print(response["response"])
__init__(embeddings_manager, database, lm_studio_url=None, model=None, max_context_papers=None, temperature=None, enable_mcp_tools=True)[source]

Initialize RAG chat.

Parameters are optional and will use values from environment/config if not provided.

Parameters:
  • embeddings_manager (EmbeddingsManager) – Manager for embeddings and vector search.

  • database (DatabaseManager) – Database instance for querying paper details. REQUIRED - no fallback allowed.

  • lm_studio_url (str, optional) – URL for OpenAI-compatible API. If None, uses config value.

  • model (str, optional) – Name of the language model. If None, uses config value.

  • max_context_papers (int, optional) – Maximum number of papers to include in context. If None, uses config value.

  • temperature (float, optional) – Sampling temperature for generation. If None, uses config value.

  • enable_mcp_tools (bool, optional) – Whether to enable MCP clustering tools for the LLM. Default is True.

Raises:

RAGError – If required parameters are missing or invalid.

property openai_client: OpenAI

Get the OpenAI client, creating it lazily on first access.

This lazy loading prevents API calls during test collection.

Returns:

Initialized OpenAI client instance.

Return type:

OpenAI
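For illustration, a minimal sketch of the lazy-initialization pattern this property describes (the module's actual implementation may differ):

from functools import cached_property
from openai import OpenAI

class _LazyClientExample:
    """Illustrative only: defer client construction to first access."""

    def __init__(self, base_url: str):
        self._base_url = base_url

    @cached_property
    def openai_client(self) -> OpenAI:
        # No connection or API call happens until this property is
        # first read, so test collection stays side-effect free.
        return OpenAI(base_url=self._base_url, api_key="not-needed")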

query(question, n_results=None, metadata_filter=None, system_prompt=None)[source]

Query the RAG system with a question.

Parameters:
  • question (str) – User’s question.

  • n_results (int, optional) – Number of papers to retrieve for context.

  • metadata_filter (dict, optional) – Metadata filter for paper search.

  • system_prompt (str, optional) – Custom system prompt for the model.

Returns:

Dictionary containing:

  • response: str – Generated response

  • papers: list – Retrieved papers used as context

  • metadata: dict – Additional metadata

Return type:

dict

Raises:

RAGError – If query fails.

Examples

>>> response = chat.query("What is attention mechanism?")
>>> print(response["response"])
>>> print(f"Based on {len(response['papers'])} papers")
chat(message, use_context=True, n_results=None)[source]

Continue conversation with context awareness.

Parameters:
  • message (str) – User’s message.

  • use_context (bool, optional) – Whether to retrieve papers as context, by default True

  • n_results (int, optional) – Number of papers to retrieve.

Returns:

Dictionary containing response and metadata.

Return type:

dict

Examples

>>> response = chat.chat("Tell me more about transformers")
>>> print(response["response"])
reset_conversation()[source]

Reset conversation history and cached context.

Examples

>>> chat.reset_conversation()
export_conversation(output_path)[source]

Export conversation history to JSON file.

Parameters:

output_path (Path) – Path to output JSON file.

Return type:

None

Examples

>>> chat.export_conversation("conversation.json")

Usage Examples

Basic Setup

from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.rag import RAGChat

# Initialize embeddings manager and database (both are required)
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()

# Initialize RAG chat
chat = RAGChat(
    embeddings_manager=em,
    database=db,
    lm_studio_url="http://localhost:1234",
    model="gemma-3-4b-it-qat"
)

Simple Query

# Ask a question
response = chat.query(
    "What are the latest developments in transformer architectures?"
)

print(response["response"])

Conversation

# Start a conversation
response1 = chat.query("Tell me about vision transformers")
print(response1["response"])

# Continue the conversation (maintains context)
response2 = chat.chat("What are their main advantages?")
print(response2["response"])

response3 = chat.chat("Can you explain the first paper in more detail?")
print(response3["response"])

Custom Parameters

# Query with custom settings
# (temperature is set on the RAGChat constructor, not per query)
response = chat.query(
    question="Explain self-attention mechanisms",
    n_results=10,              # Use 10 papers for context
    system_prompt="You are a helpful research assistant."
)

MCP Clustering Tools Integration

New in v0.3.0: RAGChat integrates with MCP clustering tools, allowing the LLM to automatically analyze conference topics, trends, and developments.

What Are MCP Tools?

MCP (Model Context Protocol) tools are specialized functions that the LLM can call to perform specific tasks. The RAG system includes four clustering tools:

  1. get_cluster_topics - Analyze overall conference topics

  2. get_topic_evolution - Track how topics evolve over time

  3. get_recent_developments - Find recent papers in specific areas

  4. get_cluster_visualization - Generate cluster visualizations
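For reference, here is a hedged sketch of what an OpenAI-style function-calling definition for the first tool might look like. The schema below is illustrative, not the module's actual definition; only the n_clusters argument is attested by the log example later in this section.

# Illustrative OpenAI-style tool schema; the module's real schema may differ.
get_cluster_topics_tool = {
    "type": "function",
    "function": {
        "name": "get_cluster_topics",
        "description": "Analyze overall conference topics by clustering papers.",
        "parameters": {
            "type": "object",
            "properties": {
                "n_clusters": {
                    "type": "integer",
                    "description": "Number of topic clusters to compute.",
                },
            },
        },
    },
}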

How It Works

When MCP tools are enabled (default), the LLM automatically decides when to use clustering tools based on the user’s question:

from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.rag import RAGChat

# Initialize components
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()

# Create RAG chat with MCP tools enabled (default)
chat = RAGChat(em, db, enable_mcp_tools=True)

# Ask about general topics - LLM will use clustering tools
response = chat.query("What are the main research topics at NeurIPS?")
print(response["response"])
# The LLM automatically calls get_cluster_topics() and analyzes the results

# Ask about trends - LLM uses topic evolution tool
response = chat.query("How have transformers evolved at NeurIPS over the years?")
print(response["response"])
# The LLM calls get_topic_evolution(topic_keywords="transformers")

# Ask about recent work - LLM uses recent developments tool
response = chat.query("What are the latest papers on reinforcement learning?")
print(response["response"])
# The LLM calls get_recent_developments(topic_keywords="reinforcement learning")

Tool Selection

The LLM automatically decides which tool(s) to use based on the question:

  • Questions about “main topics”, “themes”, “areas” → Uses get_cluster_topics

  • Questions about “evolution”, “trends”, “over time” → Uses get_topic_evolution

  • Questions about “recent”, “latest”, “new” → Uses get_recent_developments

  • Questions about specific papers → Uses standard RAG (no tools)

Disabling MCP Tools

If you want to disable MCP tools and use only standard RAG:

# Disable MCP tools
chat = RAGChat(em, db, enable_mcp_tools=False)

# Now all queries use only paper embeddings search
response = chat.query("What are the main topics?")
# Will search for relevant papers instead of clustering

Combining Tools with RAG

The LLM can use both tools AND paper retrieval in the same query:

# Complex query that might use both
response = chat.query(
    "What are the main topics at NeurIPS, and can you explain "
    "the attention mechanism paper in detail?"
)
# The LLM might:
# 1. Call get_cluster_topics() to identify main topics
# 2. Search for "attention mechanism" papers
# 3. Combine both to generate comprehensive answer

Tool Call Examples

Here are example questions that trigger different tools:

# Cluster analysis
chat.query("What topics are covered in the conference?")
chat.query("Show me the main research areas")
chat.query("Analyze the distribution of papers by topic")

# Topic evolution
chat.query("How has deep learning research changed over time?")
chat.query("Show me the trend for transformer papers from 2020-2025")
chat.query("Has interest in GANs increased or decreased?")

# Recent developments
chat.query("What are the latest papers on LLMs?")
chat.query("Show me recent work in computer vision")
chat.query("What's new in the last 2 years for neural architecture search?")

# Specific papers (no tools, standard RAG)
chat.query("Explain the Vision Transformer paper")
chat.query("What does the BERT paper say about pre-training?")
chat.query("Find papers about attention mechanisms")

Advanced: Tool Call Debugging

To see which tools the LLM is calling, check the logs:

import logging

# Enable debug logging
logging.basicConfig(level=logging.INFO)

# Now tool calls will be logged
chat = RAGChat(em, db, enable_mcp_tools=True)
response = chat.query("What are the main topics?")
# Logs: "LLM requested tool: get_cluster_topics with args: {'n_clusters': 8}"
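To keep the output focused, you can quiet the root logger and lower the level only for this module's logger (assuming it follows the standard __name__ naming convention, which is common but not confirmed here):

import logging

logging.basicConfig(level=logging.WARNING)
# Assumes the module logger is named after the module path.
logging.getLogger("abstracts_explorer.rag").setLevel(logging.DEBUG)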

Requirements

MCP tools require:

  • Embeddings created: Run abstracts-explorer create-embeddings first

  • Papers downloaded: Have conference data in database

  • Tool-capable LLM: Model must support function calling (e.g., GPT-3.5+, Gemma 3, Claude)

If your LLM doesn’t support function calling, disable MCP tools:

chat = RAGChat(em, db, enable_mcp_tools=False)

Metadata Filtering

# Query papers from a specific year
response = chat.query(
    "What are the main themes in 2025?",
    metadata_filter={"year": 2025}
)

# Operator-based filter
response = chat.query(
    "Explain recent attention mechanisms",
    metadata_filter={
        "year": {"$gte": 2024},
    },
    n_results=5
)

Conversation Management

Reset Conversation

# Clear conversation history
chat.reset_conversation()

# Start fresh conversation
response = chat.query("New topic...")

Export Conversation

# Export to JSON file (writes the history to disk and returns None)
chat.export_conversation("conversation.json")

Conversation Format

Exported conversations include:

{
    "timestamp": "2025-11-26T10:00:00",
    "model": "gemma-3-4b-it-qat",
    "messages": [
        {
            "role": "user",
            "content": "What is a transformer?",
            "timestamp": "2025-11-26T10:00:00"
        },
        {
            "role": "assistant",
            "content": "A transformer is...",
            "papers_used": [
                {"id": "123", "title": "Attention Is All You Need"}
            ],
            "timestamp": "2025-11-26T10:00:05"
        }
    ]
}
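Because the export is plain JSON, the history can be loaded back with the standard library:

import json
from pathlib import Path

data = json.loads(Path("conversation.json").read_text())
for message in data["messages"]:
    print(f"[{message['role']}] {message['content'][:80]}")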

LLM Backend Configuration

Supported Backends

The module is designed for LM Studio but works with any OpenAI-compatible API:

  • LM Studio (default)

  • OpenAI API

  • LocalAI

  • Ollama (with OpenAI compatibility)

Authentication

# With authentication token
chat = RAGChat(
    em,
    db,
    lm_studio_url="https://api.example.com",
    model="gpt-4",
    auth_token="sk-..."
)

Custom Endpoints

# Different backend URL
chat = RAGChat(
    em,
    db,
    lm_studio_url="http://localhost:8080",
    model="custom-model"
)
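For example, Ollama serves an OpenAI-compatible API under the /v1 path, so (assuming a locally running Ollama server and a pulled model) the setup might look like:

# Ollama's OpenAI-compatible endpoint listens on port 11434 under /v1
chat = RAGChat(
    em,
    db,
    lm_studio_url="http://localhost:11434/v1",
    model="llama3"  # any model available in your Ollama install
)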

Response Generation

How RAG Works

  1. Retrieve: Search for relevant papers using embeddings

  2. Augment: Build context from paper abstracts

  3. Generate: Send context + query to LLM

  4. Return: Get AI-generated response with citations
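Conceptually, the four steps look like the sketch below. The search method on the embeddings manager and the exact prompt format are assumptions for illustration; only the OpenAI-style completion call reflects a real client API.

# Illustrative pipeline sketch, not the module's actual internals.
def rag_answer(question, em, client, n_results=5):
    # 1. Retrieve: semantic search over paper embeddings (hypothetical API)
    papers = em.search(question, n_results=n_results)
    # 2. Augment: build a context block from titles, years, and abstracts
    context = "\n\n".join(
        f"{p['title']} ({p['year']})\n{p['abstract']}" for p in papers
    )
    # 3. Generate: send context plus the question to the LLM
    completion = client.chat.completions.create(
        model="auto",
        messages=[
            {"role": "system", "content": "Answer using the provided abstracts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    # 4. Return: the generated response (citations come from the context)
    return completion.choices[0].message.content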

Context Building

The RAG system builds context from retrieved papers:

Context includes:

  • Paper titles

  • Paper abstracts

  • Paper years

  • Relevance scores

The context is formatted for optimal LLM comprehension.

System Prompts

Default system prompt:

You are a helpful research assistant specializing in NeurIPS papers.
Answer questions based on the provided paper abstracts.
Cite papers by title when referencing them.

Custom system prompt:

chat.query(
    "Your question",
    system_prompt="You are an expert in computer vision..."
)

Error Handling

import requests

from abstracts_explorer.rag import RAGError

try:
    response = chat.query("What is deep learning?")
except RAGError as e:
    # Documented exception for RAG-related failures
    print(f"RAG query failed: {e}")
except requests.RequestException:
    print("LLM backend connection failed")

Performance Considerations

Response Time

Factors affecting response time:

  • Number of papers retrieved (n_results)

  • LLM model size and speed

  • Token generation length (max_tokens)

  • Network latency to LLM backend
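A quick way to measure end-to-end latency for your own setup:

import time

start = time.perf_counter()
response = chat.query("What is deep learning?", n_results=5)
elapsed = time.perf_counter() - start
print(f"Answered in {elapsed:.1f}s using {len(response['papers'])} papers")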

Memory Usage

  • Conversation history stored in memory

  • Each message adds to context

  • Use reset_conversation() for long sessions
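A minimal pattern for keeping long sessions bounded, using only the documented chat() and reset_conversation() methods:

questions = ["What is attention?", "Explain diffusion models"]  # example input

for i, question in enumerate(questions):
    # Clear accumulated history every 20 turns to cap context growth
    if i > 0 and i % 20 == 0:
        chat.reset_conversation()
    answer = chat.chat(question)
    print(answer["response"])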

Optimization Tips

question = "What is deep learning?"

# Faster responses: retrieve fewer papers
response = chat.query(question, n_results=3)

# More comprehensive but slower
response = chat.query(question, n_results=10)

# Balance quality and speed
response = chat.query(question, n_results=5)

Best Practices

  1. Start specific - Focused queries get better results

  2. Use filters - Narrow search space with metadata filters

  3. Manage history - Reset conversation when changing topics

  4. Export important conversations - Save valuable interactions

  5. Adjust parameters - Tune n_results and temperature for your needs

  6. Monitor backend - Ensure LM Studio/LLM is running and responsive

  7. Handle errors - Wrap calls in try-except for production use