MCP Server for Cluster Analysis

The Abstracts Explorer includes a Model Context Protocol (MCP) server that provides tools for analyzing clustered paper embeddings. This enables LLM-based assistants to answer sophisticated questions about research topics, trends, and developments.

NEW: MCP tools are automatically integrated into the RAG chat system, allowing the LLM to decide when to use clustering analysis versus paper retrieval.

What is MCP?

The Model Context Protocol (MCP) is a protocol through which servers provide context and callable tools to LLM-based applications. The Abstracts Explorer MCP server exposes tools that LLM assistants can call to perform specific analysis tasks.

Features

The MCP server provides four main tools:

1. `get_cluster_topics`

Analyzes clustered embeddings to identify the most frequently mentioned topics in each cluster.

Parameters:

n_clusters (int): Number of clusters to create (default: 8)
reduction_method (str): Dimensionality reduction method - ‘pca’ or ‘tsne’ (default: ‘pca’)
clustering_method (str): Clustering algorithm - ‘kmeans’, ‘dbscan’, ‘agglomerative’, ‘spectral’, or ‘fuzzy_cmeans’ (default: ‘kmeans’)
embeddings_path (str, optional): Path to ChromaDB embeddings database
collection_name (str, optional): Name of ChromaDB collection
db_path (str, optional): Path to SQLite database

Returns: JSON with cluster statistics and topics for each cluster, including:

Keywords and their frequencies
Common sessions
Year distribution
Sample paper titles

Example use case: “What are the most frequently mentioned topics in the conference papers?”
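
A client can check arguments before sending a call. The sketch below is a minimal, hypothetical helper (`cluster_topics_call` is not part of Abstracts Explorer). It relies only on the parameter names, allowed values, and defaults listed above, and wraps them in the `tool`/`arguments` shape shown in the Example Tool Call section.

```python
import json

# Allowed values and defaults as listed in the parameter description above.
REDUCTION_METHODS = {"pca", "tsne"}
CLUSTERING_METHODS = {"kmeans", "dbscan", "agglomerative", "spectral", "fuzzy_cmeans"}


def cluster_topics_call(n_clusters=8, reduction_method="pca", clustering_method="kmeans"):
    """Hypothetical helper: build a get_cluster_topics call after basic checks."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be a positive integer")
    if reduction_method not in REDUCTION_METHODS:
        raise ValueError(f"unknown reduction_method: {reduction_method!r}")
    if clustering_method not in CLUSTERING_METHODS:
        raise ValueError(f"unknown clustering_method: {clustering_method!r}")
    return {
        "tool": "get_cluster_topics",
        "arguments": {
            "n_clusters": n_clusters,
            "reduction_method": reduction_method,
            "clustering_method": clustering_method,
        },
    }


print(json.dumps(cluster_topics_call(n_clusters=12), indent=2))
```

Rejecting a misspelled method on the client side gives a clearer message than waiting for the server to return an error.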

2. `get_topic_evolution`

Analyzes how specific topics have evolved over the years.

Parameters:

topic_keywords (str): Keywords describing the topic (e.g., “transformers attention”)
conference (str, optional): Filter by conference name
start_year (int, optional): Start year for analysis
end_year (int, optional): End year for analysis
embeddings_path (str, optional): Path to ChromaDB embeddings database
collection_name (str, optional): Name of ChromaDB collection
db_path (str, optional): Path to SQLite database

Returns: JSON with topic evolution data including:

Year-by-year paper counts
Sample papers from each year
Relevance scores

Example use case: “How have topics related to ‘transformer architectures’ evolved over the years at NeurIPS?”
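
Year-by-year counts are often easiest to read as changes between consecutive years. The sketch below assumes the counts arrive as a mapping from year strings to paper counts, like the `years` field in the Tool Response Format section. The actual field layout of the `get_topic_evolution` response is not specified here, and `yearly_changes` is a hypothetical helper.

```python
def yearly_changes(counts):
    """Return (year, count, change from the previous listed year) tuples, sorted by year."""
    rows = []
    previous = None
    for year in sorted(counts, key=int):
        count = counts[year]
        change = None if previous is None else count - previous
        rows.append((int(year), count, change))
        previous = count
    return rows


# Made-up counts, for illustration only.
sample = {"2021": 12, "2022": 30, "2023": 41}
for year, count, change in yearly_changes(sample):
    print(year, count, "n/a" if change is None else f"{change:+d}")
```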

3. `get_recent_developments`

Finds the most important recent developments in a specific topic.

Parameters:

topic_keywords (str): Keywords describing the topic
n_years (int): Number of recent years to consider (default: 2)
n_results (int): Number of papers to return (default: 10)
conference (str, optional): Filter by conference name
embeddings_path (str, optional): Path to ChromaDB embeddings database
collection_name (str, optional): Name of ChromaDB collection
db_path (str, optional): Path to SQLite database

Returns: JSON with recent papers including:

Paper titles and abstracts
Years and conferences
Relevance scores

Example use case: “What are the most important recent developments in large language models?”

4. `get_cluster_visualization`

Generates visualization data for clustered embeddings.

Parameters:

n_clusters (int): Number of clusters (default: 8)
reduction_method (str): Reduction method - ‘pca’ or ‘tsne’ (default: ‘tsne’)
clustering_method (str): Clustering method - ‘kmeans’, ‘dbscan’, ‘agglomerative’, ‘spectral’, or ‘fuzzy_cmeans’ (default: ‘kmeans’)
n_components (int): Number of dimensions - 2 or 3 (default: 2)
output_path (str, optional): Path to save visualization JSON
embeddings_path (str, optional): Path to ChromaDB embeddings database
collection_name (str, optional): Name of ChromaDB collection
db_path (str, optional): Path to SQLite database

Returns: JSON with visualization data including:

Point coordinates (x, y, optionally z)
Cluster assignments
Paper metadata
Statistics

Example use case: “Display a graphical representation of the paper clusters.”

Starting the MCP Server

Basic Usage

Start the server with default settings:

abstracts-explorer mcp-server

This starts the server on http://127.0.0.1:8000 with SSE transport.

Custom Host and Port

Start on a custom host and port:

abstracts-explorer mcp-server --host 0.0.0.0 --port 8080

STDIO Transport

For local CLI integration, use stdio transport:

abstracts-explorer mcp-server --transport stdio
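
To exercise the stdio transport without an LLM assistant, a small client script can launch the server and call a tool. This is a sketch that assumes the official `mcp` Python SDK (`ClientSession`, `StdioServerParameters`, `stdio_client`) is installed and that `abstracts-explorer` is on the `PATH`. It also assumes the tool result carries its JSON string in the first text content item, which has not been verified for this server.

```python
import asyncio

SERVER_COMMAND = "abstracts-explorer"
SERVER_ARGS = ["mcp-server", "--transport", "stdio"]


async def call_cluster_topics(n_clusters=8):
    """Launch the server over stdio, list its tools, and call get_cluster_topics."""
    # Imported here so this module can be imported without the SDK installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command=SERVER_COMMAND, args=SERVER_ARGS)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listing = await session.list_tools()
            print("Available tools:", [tool.name for tool in listing.tools])
            result = await session.call_tool("get_cluster_topics", {"n_clusters": n_clusters})
            # Assumption: the JSON string is the first text content item.
            return result.content[0].text
```

Run it with `asyncio.run(call_cluster_topics())` once the database and embeddings exist.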

Configuration

The MCP server uses the same configuration as the rest of Abstracts Explorer. Configure via .env file:

# Database Configuration
PAPER_DB=abstracts.db
EMBEDDING_DB_PATH=chroma_db
COLLECTION_NAME=papers

# LLM Backend (for embeddings in tools)
LLM_BACKEND_URL=http://localhost:1234
EMBEDDING_MODEL=text-embedding-qwen3-embedding-4b
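
Scripts that need the same settings can read them from the environment. The sketch below is not the project's configuration loader: `load_settings` is a hypothetical helper, and the fallback values are simply copied from the example above rather than confirmed defaults. It reads variables that are already set and does not parse the .env file itself.

```python
import os

# Fallbacks copied from the example .env above (assumed, not confirmed defaults).
EXAMPLE_VALUES = {
    "PAPER_DB": "abstracts.db",
    "EMBEDDING_DB_PATH": "chroma_db",
    "COLLECTION_NAME": "papers",
    "LLM_BACKEND_URL": "http://localhost:1234",
    "EMBEDDING_MODEL": "text-embedding-qwen3-embedding-4b",
}


def load_settings(environ=None):
    """Return each setting from the environment, falling back to the example value."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, fallback) for key, fallback in EXAMPLE_VALUES.items()}


# Pass a plain dict to see the override behaviour without touching os.environ.
settings = load_settings({"PAPER_DB": "data/abstracts.db"})
print(settings["PAPER_DB"], settings["COLLECTION_NAME"])
```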

Integration with LLM Assistants

The MCP server can be integrated with any MCP-compatible LLM assistant or client. It’s now automatically integrated into the RAG chat system.

RAG Chat Integration (Recommended)

The easiest way to use MCP tools is through the RAG chat system, which automatically calls the appropriate tools:

from abstracts_explorer.rag import RAGChat
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager

# Initialize components
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()

# Create RAG chat (MCP tools enabled by default)
chat = RAGChat(em, db, enable_mcp_tools=True)

# The LLM automatically uses clustering tools when appropriate
response = chat.query("What are the main topics at NeurIPS?")
# Internally calls get_cluster_topics() and uses results

response = chat.query("How have transformers evolved over time?")
# Internally calls get_topic_evolution(topic_keywords="transformers")

See the RAG API documentation for more details.

Claude Desktop Integration

Add to Claude Desktop MCP configuration:

{
  "mcpServers": {
    "abstracts-explorer": {
      "command": "abstracts-explorer",
      "args": ["mcp-server", "--transport", "stdio"]
    }
  }
}

Example Tool Call

When an LLM assistant needs to analyze topics, it can call:

{
  "tool": "get_cluster_topics",
  "arguments": {
    "n_clusters": 8,
    "reduction_method": "tsne",
    "clustering_method": "kmeans"
  }
}

The server will:

Load embeddings from ChromaDB
Perform clustering
Analyze topics in each cluster
Return structured JSON results
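
The "analyze topics" step can be pictured as counting words within each cluster. The sketch below is a simplified illustration, not the server's implementation: cluster labels are assigned by hand instead of being computed from embeddings, keywords are plain lowercase title words, and `STOPWORDS` and `cluster_keywords` are hypothetical names.

```python
import json
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}


def cluster_keywords(titles, labels, top_n=2):
    """Count title words per cluster and return the top_n words for each cluster."""
    counters = defaultdict(Counter)
    for title, label in zip(titles, labels):
        words = [word for word in title.lower().split() if word not in STOPWORDS]
        counters[label].update(words)
    return [
        {
            "cluster_id": label,
            "paper_count": labels.count(label),
            "keywords": [
                {"keyword": word, "count": count}
                for word, count in counters[label].most_common(top_n)
            ],
        }
        for label in sorted(counters)
    ]


titles = [
    "Attention in Transformers",
    "Sparse Attention for Transformers",
    "Diffusion Models for Images",
]
labels = [0, 0, 1]  # hand-assigned cluster labels, for illustration only
print(json.dumps(cluster_keywords(titles, labels), indent=2))
```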

API Reference

Tool Response Format

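All tools return JSON strings. The example below shows the structure of a `get_cluster_topics` response; the other tools return the fields listed in their descriptions above: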

{
  "statistics": {
    "n_clusters": 8,
    "total_papers": 1000
  },
  "clusters": [
    {
      "cluster_id": 0,
      "paper_count": 150,
      "keywords": [
        {"keyword": "transformer", "count": 45},
        {"keyword": "attention", "count": 38}
      ],
      "sessions": [
        {"session": "Deep Learning", "count": 100}
      ],
      "years": {"2023": 80, "2024": 70},
      "sample_titles": ["Paper 1", "Paper 2", "Paper 3"]
    }
  ]
}
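
Because the result is a JSON string, a client has to parse it before use. The sketch below turns a shortened copy of the example response into one summary line per cluster. `summarize_clusters` is a hypothetical helper, and only field names from the example are used.

```python
import json

# Shortened copy of the example response above.
response_text = """
{
  "statistics": {"n_clusters": 8, "total_papers": 1000},
  "clusters": [
    {
      "cluster_id": 0,
      "paper_count": 150,
      "keywords": [
        {"keyword": "transformer", "count": 45},
        {"keyword": "attention", "count": 38}
      ],
      "years": {"2023": 80, "2024": 70}
    }
  ]
}
"""


def summarize_clusters(text):
    """Return one summary line per cluster: id, size, share of papers, keywords."""
    data = json.loads(text)
    total = data["statistics"]["total_papers"]
    lines = []
    for cluster in data["clusters"]:
        keywords = ", ".join(item["keyword"] for item in cluster["keywords"])
        share = 100 * cluster["paper_count"] / total
        lines.append(
            f"cluster {cluster['cluster_id']}: {cluster['paper_count']} papers "
            f"({share:.1f}%), keywords: {keywords}"
        )
    return lines


for line in summarize_clusters(response_text):
    print(line)
```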

Error Handling

If an error occurs, tools return JSON with an error field:

{
  "error": "Failed to load clustering data: Database not found"
}
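
Since failures are reported inside the JSON payload, callers need to check for the error field explicitly. A minimal sketch, in which `ToolError` and `parse_tool_response` are hypothetical names rather than part of Abstracts Explorer:

```python
import json


class ToolError(RuntimeError):
    """Raised when a tool response contains an error field."""


def parse_tool_response(text):
    """Parse a tool's JSON string and raise ToolError if it reports an error."""
    data = json.loads(text)
    if isinstance(data, dict) and "error" in data:
        raise ToolError(data["error"])
    return data


try:
    parse_tool_response('{"error": "Failed to load clustering data: Database not found"}')
except ToolError as exc:
    print("Tool failed:", exc)
```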

Requirements

Before using the MCP server, ensure:

Embeddings are created: Run abstracts-explorer create-embeddings first
Database exists: Download papers with abstracts-explorer download
MCP package installed: uv sync or pip install "mcp>=1.0.0" (quote the requirement so the shell does not treat > as a redirect)

Troubleshooting

“No embeddings found”

Make sure to create embeddings first:

abstracts-explorer create-embeddings

“Failed to connect to database”

Check that the database paths in .env are correct:

# .env
PAPER_DB=data/abstracts.db
EMBEDDING_DB_PATH=chroma_db

Port already in use

Change the port:

abstracts-explorer mcp-server --port 8001
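
Before picking a port, it can help to check whether something is already listening on it. The sketch below uses only the Python standard library; `port_in_use` is a hypothetical helper, and the host and port are the server defaults mentioned earlier.

```python
import socket


def port_in_use(host, port, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


print("8000 in use:", port_in_use("127.0.0.1", 8000))
```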

Advanced Usage

Custom Clustering Parameters

Tools that take a clustering_method argument also accept parameters specific to the chosen algorithm. Examples:

For DBSCAN:

{
  "clustering_method": "dbscan",
  "eps": 0.5,
  "min_samples": 5
}

For Agglomerative Clustering:

{
  "clustering_method": "agglomerative",
  "n_clusters": 10,
  "linkage": "ward"
}

Or with distance threshold:

{
  "clustering_method": "agglomerative",
  "distance_threshold": 5.0,
  "linkage": "average"
}

For Spectral Clustering:

{
  "clustering_method": "spectral",
  "n_clusters": 8,
  "affinity": "nearest_neighbors",
  "n_neighbors": 10
}

For Fuzzy C-Means (m is the fuzziness parameter):

{
  "clustering_method": "fuzzy_cmeans",
  "n_clusters": 8,
  "m": 2.0
}
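
Mixing parameters from different algorithms is an easy mistake, for example passing linkage together with dbscan. The sketch below flags keys that do not appear for the chosen method in the examples of this section. `METHOD_PARAMS` and `unexpected_params` are hypothetical names, and the server may accept more (or different) keys than the ones shown here.

```python
# Extra keys per clustering method, taken only from the examples in this section
# (kmeans uses n_clusters from the tool parameter lists).
# Assumption: the server may accept more (or different) keys than these.
METHOD_PARAMS = {
    "kmeans": {"n_clusters"},
    "dbscan": {"eps", "min_samples"},
    "agglomerative": {"n_clusters", "linkage", "distance_threshold"},
    "spectral": {"n_clusters", "affinity", "n_neighbors"},
    "fuzzy_cmeans": {"n_clusters", "m"},
}


def unexpected_params(arguments):
    """Return the keys that are not shown for the chosen clustering_method."""
    method = arguments.get("clustering_method")
    if method not in METHOD_PARAMS:
        raise ValueError(f"unknown clustering_method: {method!r}")
    allowed = METHOD_PARAMS[method] | {"clustering_method"}
    return sorted(set(arguments) - allowed)


print(unexpected_params({"clustering_method": "dbscan", "eps": 0.5, "linkage": "ward"}))
```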

Filtering by Conference

Restrict the analysis to a single conference by passing the conference argument, for example to `get_topic_evolution`:

{
  "topic_keywords": "neural networks",
  "conference": "neurips",
  "start_year": 2020,
  "end_year": 2024
}
