# MCP Server for Cluster Analysis

The Abstracts Explorer includes a Model Context Protocol (MCP) server that provides tools for analyzing clustered paper embeddings. This enables LLM-based assistants to answer sophisticated questions about research topics, trends, and developments.

**NEW**: MCP tools are automatically integrated into the RAG chat system, allowing the LLM to decide when to use clustering analysis versus paper retrieval.

## What is MCP?

The Model Context Protocol (MCP) allows tools and servers to provide context and capabilities to LLM-based applications. This MCP server exposes tools that LLM assistants can call to perform specific tasks.

## Features

The MCP server provides four main tools:

### 1. `get_cluster_topics`

Analyzes clustered embeddings to identify the most frequently mentioned topics in each cluster.

**Parameters:**

- `n_clusters` (int): Number of clusters to create (default: 8)
- `reduction_method` (str): Dimensionality reduction method - 'pca' or 'tsne' (default: 'pca')
- `clustering_method` (str): Clustering algorithm - 'kmeans', 'dbscan', 'agglomerative', 'spectral', or 'fuzzy_cmeans' (default: 'kmeans')
- `embeddings_path` (str, optional): Path to ChromaDB embeddings database
- `collection_name` (str, optional): Name of ChromaDB collection
- `db_path` (str, optional): Path to SQLite database

**Returns:** JSON with cluster statistics and topics for each cluster, including:

- Keywords and their frequencies
- Common sessions
- Year distribution
- Sample paper titles

**Example use case:** "What are the most frequently mentioned topics in the conference papers?"

### 2. `get_topic_evolution`

Analyzes how specific topics have evolved over the years.

**Parameters:**

- `topic_keywords` (str): Keywords describing the topic (e.g., "transformers attention")
- `conference` (str, optional): Filter by conference name
- `start_year` (int, optional): Start year for analysis
- `end_year` (int, optional): End year for analysis
- `embeddings_path` (str, optional): Path to ChromaDB embeddings database
- `collection_name` (str, optional): Name of ChromaDB collection
- `db_path` (str, optional): Path to SQLite database

**Returns:** JSON with topic evolution data including:

- Year-by-year paper counts
- Sample papers from each year
- Relevance scores

**Example use case:** "How have topics related to 'transformer architectures' evolved over the years at NeurIPS?"

### 3. `get_recent_developments`

Finds the most important recent developments in a specific topic.

**Parameters:**

- `topic_keywords` (str): Keywords describing the topic
- `n_years` (int): Number of recent years to consider (default: 2)
- `n_results` (int): Number of papers to return (default: 10)
- `conference` (str, optional): Filter by conference name
- `embeddings_path` (str, optional): Path to ChromaDB embeddings database
- `collection_name` (str, optional): Name of ChromaDB collection
- `db_path` (str, optional): Path to SQLite database

**Returns:** JSON with recent papers including:

- Paper titles and abstracts
- Years and conferences
- Relevance scores

**Example use case:** "What are the most important recent developments in large language models?"
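Tool arguments are plain JSON objects whose keys match the parameter lists above. As a minimal illustration, the sketch below builds payloads for `get_topic_evolution` and `get_recent_developments`; the keyword strings, years, and conference value are placeholders, and how a payload actually reaches the server is shown under "Example Tool Call" below.

```python
import json

# Illustrative argument payloads for the analysis tools documented above.
# Only `topic_keywords` is required; the remaining keys are optional filters.
topic_evolution_args = {
    "topic_keywords": "transformers attention",  # placeholder topic
    "conference": "neurips",                     # optional conference filter
    "start_year": 2018,                          # placeholder year range
    "end_year": 2024,
}

recent_developments_args = {
    "topic_keywords": "large language models",   # placeholder topic
    "n_years": 2,                                # consider the last two years
    "n_results": 10,                             # return up to ten papers
}

# Arguments are transmitted as JSON in the MCP tool call.
print(json.dumps(topic_evolution_args, indent=2))
print(json.dumps(recent_developments_args, indent=2))
```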
### 4. `get_cluster_visualization`

Generates visualization data for clustered embeddings.

**Parameters:**

- `n_clusters` (int): Number of clusters (default: 8)
- `reduction_method` (str): Reduction method - 'pca' or 'tsne' (default: 'tsne')
- `clustering_method` (str): Clustering method - 'kmeans', 'dbscan', 'agglomerative', 'spectral', or 'fuzzy_cmeans' (default: 'kmeans')
- `n_components` (int): Number of dimensions - 2 or 3 (default: 2)
- `output_path` (str, optional): Path to save visualization JSON
- `embeddings_path` (str, optional): Path to ChromaDB embeddings database
- `collection_name` (str, optional): Name of ChromaDB collection
- `db_path` (str, optional): Path to SQLite database

**Returns:** JSON with visualization data including:

- Point coordinates (x, y, optionally z)
- Cluster assignments
- Paper metadata
- Statistics

**Example use case:** "Display a graphical representation of the paper clusters."

## Starting the MCP Server

### Basic Usage

Start the server with default settings:

```bash
abstracts-explorer mcp-server
```

This starts the server on `http://127.0.0.1:8000` with SSE transport.

### Custom Host and Port

Start on a custom host and port:

```bash
abstracts-explorer mcp-server --host 0.0.0.0 --port 8080
```

### STDIO Transport

For local CLI integration, use stdio transport:

```bash
abstracts-explorer mcp-server --transport stdio
```

## Configuration

The MCP server uses the same configuration as the rest of Abstracts Explorer. Configure via `.env` file:

```bash
# Database Configuration
PAPER_DB=abstracts.db
EMBEDDING_DB_PATH=chroma_db
COLLECTION_NAME=papers

# LLM Backend (for embeddings in tools)
LLM_BACKEND_URL=http://localhost:1234
EMBEDDING_MODEL=text-embedding-qwen3-embedding-4b
```

## Integration with LLM Assistants

The MCP server can be integrated with any MCP-compatible LLM assistant or client. **It's now automatically integrated into the RAG chat system.**

### RAG Chat Integration (Recommended)

The easiest way to use MCP tools is through the RAG chat system, which automatically calls the appropriate tools:

```python
from abstracts_explorer.rag import RAGChat
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager

# Initialize components
em = EmbeddingsManager()
em.connect()
db = DatabaseManager()
db.connect()

# Create RAG chat (MCP tools enabled by default)
chat = RAGChat(em, db, enable_mcp_tools=True)

# The LLM automatically uses clustering tools when appropriate
response = chat.query("What are the main topics at NeurIPS?")
# Internally calls get_cluster_topics() and uses results

response = chat.query("How have transformers evolved over time?")
# Internally calls get_topic_evolution(topic_keywords="transformers")
```

See the [RAG API documentation](api/rag.md#mcp-clustering-tools-integration) for more details.

### Claude Desktop Integration

Add to Claude Desktop MCP configuration:

```json
{
  "mcpServers": {
    "abstracts-explorer": {
      "command": "abstracts-explorer",
      "args": ["mcp-server", "--transport", "stdio"]
    }
  }
}
```

### Example Tool Call

When an LLM assistant needs to analyze topics, it can call:

```json
{
  "tool": "get_cluster_topics",
  "arguments": {
    "n_clusters": 8,
    "reduction_method": "tsne",
    "clustering_method": "kmeans"
  }
}
```

The server will:

1. Load embeddings from ChromaDB
2. Perform clustering
3. Analyze topics in each cluster
4. Return structured JSON results
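Outside of Claude Desktop or the RAG chat, the server can also be called programmatically. The sketch below is a minimal client, assuming the stdio transport and the official `mcp` Python SDK's client API (`StdioServerParameters`, `stdio_client`, `ClientSession`); adapt the imports and result handling to the SDK version you have installed.

```python
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the server as a subprocess over stdio, using the same command
# as the Claude Desktop configuration above.
server_params = StdioServerParameters(
    command="abstracts-explorer",
    args=["mcp-server", "--transport", "stdio"],
)


async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Same tool call as the JSON example above.
            result = await session.call_tool(
                "get_cluster_topics",
                arguments={
                    "n_clusters": 8,
                    "reduction_method": "tsne",
                    "clustering_method": "kmeans",
                },
            )

            # Tools return their results as a JSON string in the first
            # text content block.
            data = json.loads(result.content[0].text)
            print(data.get("statistics", data))


asyncio.run(main())
```

The same call works over the default SSE transport with the SDK's SSE client; only the connection setup differs.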
## API Reference

### Tool Response Format

All tools return JSON strings with the following general structure:

```json
{
  "statistics": {
    "n_clusters": 8,
    "total_papers": 1000
  },
  "clusters": [
    {
      "cluster_id": 0,
      "paper_count": 150,
      "keywords": [
        {"keyword": "transformer", "count": 45},
        {"keyword": "attention", "count": 38}
      ],
      "sessions": [
        {"session": "Deep Learning", "count": 100}
      ],
      "years": {"2023": 80, "2024": 70},
      "sample_titles": ["Paper 1", "Paper 2", "Paper 3"]
    }
  ]
}
```

### Error Handling

If an error occurs, tools return JSON with an error field:

```json
{
  "error": "Failed to load clustering data: Database not found"
}
```

## Requirements

Before using the MCP server, ensure:

1. **Embeddings are created**: Run `abstracts-explorer create-embeddings` first
2. **Database exists**: Download papers with `abstracts-explorer download`
3. **MCP package installed**: `uv sync` or `pip install "mcp>=1.0.0"`

## Troubleshooting

### "No embeddings found"

Make sure to create embeddings first:

```bash
abstracts-explorer create-embeddings
```

### "Failed to connect to database"

Check that the database paths in `.env` are correct:

```bash
# .env
PAPER_DB=data/abstracts.db
EMBEDDING_DB_PATH=chroma_db
```

### Port already in use

Change the port:

```bash
abstracts-explorer mcp-server --port 8001
```

## Advanced Usage

### Custom Clustering Parameters

Each tool accepts additional method-specific clustering parameters. Examples:

**For DBSCAN:**

```python
{
    "clustering_method": "dbscan",
    "eps": 0.5,
    "min_samples": 5
}
```

**For Agglomerative Clustering:**

```python
{
    "clustering_method": "agglomerative",
    "n_clusters": 10,
    "linkage": "ward"
}
```

Or with distance threshold:

```python
{
    "clustering_method": "agglomerative",
    "distance_threshold": 5.0,
    "linkage": "average"
}
```

**For Spectral Clustering:**

```python
{
    "clustering_method": "spectral",
    "n_clusters": 8,
    "affinity": "nearest_neighbors",
    "n_neighbors": 10
}
```

**For Fuzzy C-Means:**

```python
{
    "clustering_method": "fuzzy_cmeans",
    "n_clusters": 8,
    "m": 2.0  # Fuzziness parameter
}
```

### Filtering by Conference

Analyze specific conferences:

```python
# get_topic_evolution arguments
{
    "topic_keywords": "neural networks",
    "conference": "neurips",
    "start_year": 2020,
    "end_year": 2024
}
```

## See Also

- [CLI Reference](cli_reference.md) - Command-line interface documentation
- [Clustering Guide](clustering.md) - Clustering and visualization guide
- [RAG Chat](rag.md) - Using RAG chat with MCP tools
- [Configuration](configuration.md) - Environment configuration options