Clustering Module
The clustering module provides functionality to cluster and visualize paper embeddings using dimensionality reduction and clustering algorithms.
Features
Dimensionality reduction using PCA and t-SNE
Clustering using K-Means, DBSCAN, Agglomerative, Spectral, and Fuzzy C-Means
Automatic cluster labeling using TF-IDF and LLM-based methods
Keyword extraction for each cluster
Representative paper selection based on cluster centroids
Hierarchical cluster structure for agglomerative clustering
Export clustering results to JSON for visualization
Quick Start
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.clustering import ClusteringManager
# Initialize managers
em = EmbeddingsManager()
db = DatabaseManager()
# Create clustering manager
cm = ClusteringManager(em, db)
# Load embeddings
n_papers = cm.load_embeddings()
print(f"Loaded {n_papers} papers")
# Reduce dimensions and cluster
cm.reduce_dimensions(method="pca", n_components=2)
cm.cluster(method="kmeans", n_clusters=8)
# Generate labels and get results
cm.extract_cluster_keywords()
cm.generate_cluster_labels(use_llm=True)
results = cm.get_clustering_results()
API Reference
This module provides functionality to cluster and visualize paper embeddings using dimensionality reduction and clustering algorithms from scikit-learn.
Features:
- Dimensionality reduction using PCA and t-SNE
- Clustering using K-Means, DBSCAN, Agglomerative, Fuzzy C-Means, and Spectral clustering
- NEW: Automatic cluster labeling using TF-IDF and LLM-based methods
- NEW: Keyword extraction for each cluster
- NEW: Representative paper selection based on cluster centroids
- NEW: Hierarchical cluster structure for agglomerative clustering
- Export clustering results to JSON for visualization
Cluster Labeling
The module includes cluster labeling functionality that:
1. Extracts distinctive keywords for each cluster using TF-IDF analysis
2. Generates human-readable labels using LLM (Large Language Model) integration
3. Identifies representative papers closest to each cluster’s centroid
Hierarchical Clustering
When using agglomerative clustering with distance_threshold, the module tracks the hierarchical structure of clusters, allowing exploration of sub-clusters.
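To make the role of distance_threshold concrete, here is a minimal standard-library sketch of agglomerative clustering that stops merging once the closest pair of clusters is at or beyond the threshold, and records each merge so sub-clusters can be explored afterwards. It is an illustration only: the module delegates to scikit-learn, and the function name `agglomerate`, the single-linkage choice, and the merge-record layout are assumptions made for this example.

```python
import math


def agglomerate(points, distance_threshold):
    """Single-linkage agglomerative clustering that records every merge.

    Hypothetical helper for illustration; not part of abstracts_explorer.
    """
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member indices
    merges = []  # (id_a, id_b, distance, new_id), in merge order
    next_id = len(points)

    def linkage(a, b):
        # Single linkage: smallest pairwise distance between members.
        return min(
            math.dist(points[i], points[j])
            for i in clusters[a]
            for j in clusters[b]
        )

    while len(clusters) > 1:
        ids = sorted(clusters)
        dist, a, b = min(
            (linkage(a, b), a, b)
            for k, a in enumerate(ids)
            for b in ids[k + 1:]
        )
        if dist >= distance_threshold:
            break  # remaining clusters are too far apart to merge
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, dist, next_id))
        next_id += 1
    return clusters, merges


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (9.0, 0.0)]
clusters, merges = agglomerate(points, distance_threshold=0.5)
print(sorted(clusters.values()))  # final clusters as lists of point indices
print(merges)                     # merge history, usable as a hierarchy
```

The merge history is what makes sub-cluster exploration possible: each new cluster id points back at the two clusters it was built from.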
Example
>>> from abstracts_explorer.clustering import ClusteringManager
>>> from abstracts_explorer.embeddings import EmbeddingsManager
>>>
>>> # Initialize managers
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> cm = ClusteringManager(em)
>>>
>>> # Load and cluster embeddings
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> cm.reduce_dimensions(method='pca', n_components=2)
>>>
>>> # Generate cluster labels
>>> cm.extract_cluster_keywords(n_keywords=10)
>>> cm.generate_cluster_labels(use_llm=True)
>>>
>>> # Get results with labels
>>> results = cm.get_clustering_results()
>>> print(results['cluster_labels']) # Shows generated labels
>>> print(results['cluster_keywords']) # Shows extracted keywords
- exception abstracts_explorer.clustering.ClusteringError[source]
Bases: Exception
Exception raised for clustering operations.
- abstracts_explorer.clustering.calculate_default_clusters(n_papers, min_clusters=2, max_clusters=500)[source]
Calculate default number of clusters based on the number of papers.
Uses the rule: n_clusters = n_papers / 100, clamped to [min_clusters, max_clusters].
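- Parameters:
n_papers (int) – Number of papers
min_clusters (int, optional) – Minimum number of clusters, by default 2
max_clusters (int, optional) – Maximum number of clusters, by default 500
- Returns:
Recommended number of clusters
- Return type:
int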
Examples
>>> calculate_default_clusters(50)
2
>>> calculate_default_clusters(500)
5
>>> calculate_default_clusters(100000)
500
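The clamping rule can be sketched in a few lines. This is an illustrative re-implementation rather than the library's source; the name `default_clusters_sketch` is hypothetical, and rounding the division down is an assumption (it is consistent with the three documented examples).

```python
def default_clusters_sketch(n_papers, min_clusters=2, max_clusters=500):
    """Illustrative version of the documented rule; not the library's code."""
    # Assumption: n_papers / 100 is rounded down to an integer.
    raw = n_papers // 100
    # Clamp the result to [min_clusters, max_clusters].
    return max(min_clusters, min(max_clusters, raw))


for n in (50, 500, 100000):
    print(n, default_clusters_sketch(n))
```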
- class abstracts_explorer.clustering.ClusteringManager(embeddings_manager, database=None)[source]
Bases: object
Manager for clustering and dimensionality reduction of embeddings.
This class handles:
- Loading embeddings from ChromaDB
- Dimensionality reduction (PCA, t-SNE)
- Clustering (K-Means, DBSCAN, Agglomerative, Fuzzy C-Means, Spectral)
- Automatic cluster labeling using TF-IDF and LLM
- Keyword extraction for clusters
- Representative paper selection
- Hierarchical cluster structure tracking
- Export of results for visualization
- Parameters:
embeddings_manager (EmbeddingsManager) – Embeddings manager instance to load embeddings from
database (DatabaseManager, optional) – Database manager for fetching paper metadata
- Variables:
embeddings_manager (EmbeddingsManager) – The embeddings manager instance
database (DatabaseManager or None) – The database manager instance
embeddings (np.ndarray or None) – The loaded embeddings array
paper_ids (list or None) – The paper IDs corresponding to embeddings
metadatas (list or None) – The paper metadata corresponding to embeddings
reduced_embeddings (np.ndarray or None) – The reduced dimensionality embeddings
cluster_labels (np.ndarray or None) – The cluster assignment labels
cluster_label_names (dict or None) – Human-readable names for each cluster
cluster_keywords (dict or None) – Keywords extracted for each cluster
cluster_summaries (dict or None) – Summaries generated for each cluster
cluster_hierarchy (dict or None) – Hierarchical structure of clusters (for agglomerative)
fuzzy_memberships (np.ndarray or None) – Fuzzy membership values (for fuzzy c-means)
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> reduced = cm.reduce_dimensions(method='pca', n_components=2)
>>> labels = cm.cluster(method='kmeans', n_clusters=5)
>>> cm.extract_cluster_keywords()
>>> cm.generate_cluster_labels(use_llm=True)
>>> results = cm.get_clustering_results()
- __init__(embeddings_manager, database=None)[source]
Initialize the ClusteringManager.
- Parameters:
embeddings_manager (EmbeddingsManager) – Embeddings manager instance to load embeddings from
database (DatabaseManager, optional) – Database manager for fetching paper metadata
- load_embeddings(limit=None, conferences=None, years=None)[source]
Load embeddings from ChromaDB collection.
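- Parameters:
limit (int, optional) – Maximum number of embeddings to load. If None, load all.
conferences (list of str, optional) – Only load papers from these conferences.
years (list of int, optional) – Only load papers from these years.
- Returns:
Number of embeddings loaded
- Return type:
int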
- Raises:
ClusteringError – If loading fails or collection is empty
- reduce_dimensions(method='pca', n_components=2, random_state=42, **kwargs)[source]
Reduce dimensionality of embeddings.
- Parameters:
method (str, optional) – Dimensionality reduction method: ‘pca’ or ‘tsne’, by default ‘pca’
n_components (int, optional) – Number of components to reduce to, by default 2
random_state (int, optional) – Random state for reproducibility, by default 42
**kwargs – Additional arguments passed to the reduction algorithm
- Returns:
Reduced embeddings array of shape (n_samples, n_components)
- Return type:
np.ndarray
- Raises:
ClusteringError – If embeddings not loaded or reduction fails
- cluster(method='kmeans', n_clusters=None, random_state=42, use_reduced=False, **kwargs)[source]
Cluster embeddings using specified algorithm.
- Parameters:
method (str, optional) – Clustering method: ‘kmeans’, ‘dbscan’, ‘agglomerative’, ‘fuzzy_cmeans’, or ‘spectral’. By default ‘kmeans’.
n_clusters (int, optional) – Number of clusters (for kmeans, agglomerative, fuzzy_cmeans, and spectral). For agglomerative, can be None if distance_threshold is provided. If None, automatically calculated as n_papers / 100, clamped to [2, 500]. By default None.
random_state (int, optional) – Random state for reproducibility, by default 42
use_reduced (bool, optional) – Whether to cluster reduced embeddings or original embeddings, by default False
**kwargs – Additional arguments passed to the clustering algorithm:
For agglomerative: distance_threshold (float), linkage (str), affinity (str)
For dbscan: eps (float), min_samples (int)
For fuzzy_cmeans: m (float, fuzziness parameter), error (float), maxiter (int)
For spectral: affinity (str), n_neighbors (int)
- Returns:
Cluster labels array of shape (n_samples,)
- Return type:
np.ndarray
- Raises:
ClusteringError – If embeddings not loaded or clustering fails
Examples
>>> # Agglomerative with distance threshold
>>> cm.cluster(method='agglomerative', distance_threshold=0.5, n_clusters=None)
>>> # Fuzzy C-Means
>>> cm.cluster(method='fuzzy_cmeans', n_clusters=5, m=2.0)
>>> # Spectral clustering
>>> cm.cluster(method='spectral', n_clusters=5)
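For readers unfamiliar with the fuzziness parameter m, the sketch below computes fuzzy c-means membership degrees for fixed cluster centers using the standard update formula u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)). It only illustrates what m controls; the function name `fuzzy_memberships` is hypothetical, and how the module computes or stores its fuzzy_memberships array is not documented here.

```python
import math


def fuzzy_memberships(points, centers, m=2.0):
    """Return one row per point with a membership degree for each center."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point sitting exactly on a center belongs fully to it.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
            continue
        rows.append(
            [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        )
    return rows


centers = [(0.0, 0.0), (4.0, 0.0)]
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
rows = fuzzy_memberships(points, centers, m=2.0)
for p, row in zip(points, rows):
    print(p, [round(u, 3) for u in row])
```

Each row sums to 1. Values of m close to 1 push memberships towards hard 0/1 assignments, while larger values spread them more evenly across clusters.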
- get_hierarchy_level_clusters(level=0)[source]
Get clusters at a specific hierarchy level for agglomerative clustering.
- Parameters:
level (int, optional) – Hierarchy level (0 = leaf level, higher = more merged), by default 0
- Returns:
Dictionary containing:
- clusters: List of cluster information at the level
- level: The requested level
- max_level: Maximum available level
- Return type:
dict
- Raises:
ClusteringError – If hierarchy not available
- generate_hierarchical_labels(use_llm=True, max_keywords=5, llm_level=8)[source]
Generate labels for all levels of the hierarchy.
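Uses a tiered approach when use_llm is True:
- Levels below llm_level: Simple fallback (concatenation) for fast processing
- Level llm_level: Full keyword extraction + LLM label generation
- Levels above llm_level: LLM-based parent label generation from child labels
- Parameters:
use_llm (bool, optional) – Whether to use an LLM to generate labels, by default True
max_keywords (int, optional) – Maximum number of keywords to use per label, by default 5
llm_level (int, optional) – Hierarchy level at which full keyword extraction and LLM label generation are applied, by default 8
- Returns:
Dictionary mapping node IDs to labels
- Return type:
dict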
- Raises:
ClusteringError – If hierarchy not available
- get_cluster_statistics()[source]
Get statistics about the clustering results.
- Returns:
Dictionary containing cluster statistics:
- n_clusters: Number of clusters
- n_noise: Number of noise points (for DBSCAN)
- cluster_sizes: Dictionary mapping cluster labels to sizes
- cluster_centers: Cluster centers (if available)
- Return type:
dict
- Raises:
ClusteringError – If clustering has not been performed
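As an illustration of the statistics listed above, the following sketch derives n_clusters, n_noise, and cluster_sizes from a plain list of labels. It assumes the scikit-learn DBSCAN convention that noise points carry the label -1, and it assumes noise is excluded from cluster_sizes; the module's actual handling may differ. `cluster_statistics_sketch` is a hypothetical name.

```python
from collections import Counter


def cluster_statistics_sketch(labels, noise_label=-1):
    """Summarise cluster assignments; illustrative only."""
    sizes = Counter(labels)
    # Assumption: noise points are labelled -1 and are not counted as a cluster.
    n_noise = sizes.pop(noise_label, 0)
    return {
        "n_clusters": len(sizes),
        "n_noise": n_noise,
        "cluster_sizes": dict(sizes),
    }


stats = cluster_statistics_sketch([0, 0, 1, -1, 1, 1, -1, 2])
print(stats)
```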
- extract_cluster_keywords(n_keywords=10, min_df=2)[source]
Extract distinctive keywords for each cluster using TF-IDF.
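- Parameters:
n_keywords (int, optional) – Number of keywords to extract per cluster, by default 10
min_df (int, optional) – Minimum document frequency for a term to be considered, by default 2
- Returns:
Dictionary mapping cluster labels to lists of keywords
- Return type:
dict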
- Raises:
ClusteringError – If clustering has not been performed or metadata is missing
Examples
>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> keywords = cm.extract_cluster_keywords(n_keywords=10)
>>> print(f"Cluster 0 keywords: {keywords[0]}")
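One way to picture TF-IDF keyword extraction per cluster is to pool the text of each cluster into a single document and score terms by how frequent they are in that cluster and how rare they are across clusters. The standard-library sketch below does exactly that. It is an assumption about the general technique, not a description of how the module configures TF-IDF; the function name, the tokenizer, and the scoring variant are all hypothetical.

```python
import math
import re
from collections import Counter


def cluster_keywords_sketch(texts, labels, n_keywords=3, min_df=1):
    """Rank terms per cluster with a simple cluster-level TF-IDF score."""
    tokens = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    # Document frequency over individual papers, used for the min_df filter.
    paper_df = Counter(term for toks in tokens for term in set(toks))
    # Pool term counts per cluster so each cluster acts as one document.
    per_cluster = {}
    for toks, label in zip(tokens, labels):
        per_cluster.setdefault(label, Counter()).update(toks)
    # Number of clusters in which each term occurs.
    cluster_df = Counter(term for counts in per_cluster.values() for term in counts)

    keywords = {}
    for label, counts in per_cluster.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(len(per_cluster) / cluster_df[term])
            for term, count in counts.items()
            if paper_df[term] >= min_df
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[label] = ranked[:n_keywords]
    return keywords


texts = [
    "graph neural networks for molecules",
    "graph attention networks",
    "reinforcement learning for robots",
    "policy gradient reinforcement learning",
]
keywords = cluster_keywords_sketch(texts, [0, 0, 1, 1], n_keywords=2, min_df=1)
print(keywords)
```

Terms that occur in every cluster (such as "for" in this toy corpus) score zero, which is what makes the remaining keywords distinctive.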
- generate_cluster_labels(use_llm=True, max_keywords=5)[source]
Generate descriptive labels for clusters.
This method can either use an LLM to generate meaningful labels based on cluster keywords and representative papers, or simply concatenate keywords.
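- Parameters:
use_llm (bool, optional) – Whether to use an LLM to generate labels; if False, keywords are concatenated. By default True.
max_keywords (int, optional) – Maximum number of keywords to use per label, by default 5
- Returns:
Dictionary mapping cluster labels to descriptive names
- Return type:
dict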
- Raises:
ClusteringError – If clustering or keyword extraction has not been performed
Examples
>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> cm.extract_cluster_keywords()
>>> labels = cm.generate_cluster_labels(use_llm=True)
>>> print(f"Cluster 0 label: {labels[0]}")
- get_cluster_representative_papers(n_papers=5)[source]
Find representative papers for each cluster.
Representative papers are those closest to the cluster centroid in the embedding space.
- Parameters:
n_papers (int, optional) – Number of representative papers per cluster, by default 5
- Returns:
Dictionary mapping cluster labels to lists of representative paper metadata
- Return type:
dict
- Raises:
ClusteringError – If clustering has not been performed
Examples
>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> representatives = cm.get_cluster_representative_papers(n_papers=3)
>>> print(f"Cluster 0 representatives: {representatives[0]}")
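The idea of "closest to the cluster centroid" can be sketched as follows: average the embeddings of a cluster to get its centroid, then rank members by distance to it. The sketch assumes Euclidean distance, which the documentation does not specify, and returns indices rather than paper metadata. `representative_indices` is a hypothetical name.

```python
import math


def representative_indices(embeddings, labels, n_papers=5):
    """Return, per cluster, the indices of the points nearest its centroid."""
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    result = {}
    for label, idxs in members.items():
        dim = len(embeddings[idxs[0]])
        # Centroid = component-wise mean of the cluster's embeddings.
        centroid = [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Assumption: Euclidean distance; ties keep their original order.
        ranked = sorted(idxs, key=lambda i: math.dist(embeddings[i], centroid))
        result[label] = ranked[:n_papers]
    return result


embeddings = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 0, 1, 1]
reps = representative_indices(embeddings, labels, n_papers=2)
print(reps)
```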
- get_clustering_results(include_metadata=True, max_title_length=100)[source]
Get complete clustering results for visualization.
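- Parameters:
include_metadata (bool, optional) – Whether to include paper metadata in each point, by default True
max_title_length (int, optional) – Maximum length of paper titles in the results, by default 100
- Returns:
Dictionary containing:
- points: List of points with coordinates, cluster labels, and metadata
- statistics: Cluster statistics
- n_dimensions: Number of dimensions in reduced embeddings
- cluster_labels: Human-readable names for clusters (if generated)
- cluster_keywords: Keywords for each cluster (if extracted)
- Return type:
dict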
- Raises:
ClusteringError – If required data not available
- export_to_json(output_path, include_metadata=True)[source]
Export clustering results to JSON file.
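- Parameters:
output_path (str or Path) – Path of the JSON file to write
include_metadata (bool, optional) – Whether to include paper metadata, by default True
- Raises:
ClusteringError – If export fails
- Return type:
None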
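A rough picture of what an exported file might look like, using only the top-level keys named in get_clustering_results. The per-point field names (`x`, `y`, `cluster`, `title`), the `truncate_title` helper, and the ellipsis-based truncation are assumptions made for this example, not the module's documented schema.

```python
import json
import tempfile
from pathlib import Path


def truncate_title(title, max_title_length=100):
    """Hypothetical helper: shorten long titles, marking the cut with '...'."""
    if len(title) <= max_title_length:
        return title
    return title[: max_title_length - 3] + "..."


results = {
    "points": [
        {"x": 0.12, "y": -0.4, "cluster": 0, "title": truncate_title("A" * 150)},
        {"x": 1.5, "y": 2.25, "cluster": 1, "title": truncate_title("Short title")},
    ],
    "statistics": {"n_clusters": 2, "n_noise": 0, "cluster_sizes": {0: 1, 1: 1}},
    "n_dimensions": 2,
}

# Write to a temporary directory and read the file back.
out = Path(tempfile.mkdtemp()) / "clusters.json"
out.write_text(json.dumps(results, indent=2), encoding="utf-8")
loaded = json.loads(out.read_text(encoding="utf-8"))
print(loaded["statistics"]["cluster_sizes"])
```

JSON object keys are always strings, so integer cluster labels used as dictionary keys come back as "0", "1", and so on when the file is read again. Consumers of the exported file should convert them if they need integers.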
- abstracts_explorer.clustering.perform_clustering(collection_name='papers', reduction_method='pca', n_components=2, clustering_method='kmeans', n_clusters=None, output_path=None, random_state=42, limit=None, **kwargs)[source]
Perform complete clustering pipeline and optionally export results.
This is a convenience function that handles the full clustering workflow:
1. Load embeddings from ChromaDB
2. Cluster on full embeddings
3. Apply dimensionality reduction for visualization
4. Export results if requested
- Parameters:
collection_name (str, optional) – Name of the ChromaDB collection, by default “papers”
reduction_method (str, optional) – Dimensionality reduction method (‘pca’ or ‘tsne’) for visualization, by default ‘pca’
n_components (int, optional) – Number of components for dimensionality reduction, by default 2
clustering_method (str, optional) – Clustering method (‘kmeans’, ‘dbscan’, ‘agglomerative’, ‘fuzzy_cmeans’, or ‘spectral’), by default ‘kmeans’
n_clusters (int, optional) – Number of clusters (for kmeans, agglomerative, fuzzy_cmeans, and spectral). If None, automatically calculated as n_papers / 100, clamped to [2, 500]. By default None.
output_path (str or Path, optional) – Path to export JSON results. If None, don’t export.
random_state (int, optional) – Random state for reproducibility, by default 42
limit (int, optional) – Maximum number of embeddings to process. If None, process all.
**kwargs – Additional arguments passed to clustering algorithm
- Returns:
Clustering results dictionary
- Return type:
dict
- Raises:
ClusteringError – If any step fails
Examples
>>> results = perform_clustering(
...     reduction_method="tsne",
...     clustering_method="kmeans",
...     n_clusters=5,
...     output_path="clusters.json"
... )
>>> print(f"Found {results['statistics']['n_clusters']} clusters")
- abstracts_explorer.clustering.compute_clusters_with_cache(embeddings_manager, database, embedding_model, reduction_method='pca', n_components=2, clustering_method='kmeans', n_clusters=None, limit=None, force=False, conferences=None, years=None, **clustering_kwargs)[source]
Compute clusters with caching support.
This function checks the cache first and returns cached results if available. If cache miss or forced recompute, it performs clustering and saves to cache.
- Parameters:
embeddings_manager (EmbeddingsManager) – Embeddings manager instance
database (DatabaseManager) – Database manager for cache operations
embedding_model (str) – Current embedding model name
reduction_method (str, optional) – Dimensionality reduction method, by default “pca”
n_components (int, optional) – Number of components for reduction, by default 2
clustering_method (str, optional) – Clustering method to use, by default “kmeans”
n_clusters (int, optional) – Number of clusters. If None, auto-calculated based on data size
limit (int, optional) – Maximum number of embeddings to process
force (bool, optional) – Force recompute even if cache exists, by default False
conferences (list of str, optional) – Filter to only cluster papers from these conferences.
years (list of int, optional) – Filter to only cluster papers from these years.
**clustering_kwargs – Additional clustering parameters (e.g., eps, min_samples for DBSCAN)
- Returns:
Clustering results with points, statistics, and metadata
- Return type:
dict
- Raises:
ClusteringError – If clustering fails
Examples
>>> results = compute_clusters_with_cache(
...     em, db, "text-embedding-model",
...     clustering_method="kmeans",
...     n_clusters=5
... )
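The check-cache-then-compute pattern described for this function can be sketched with an in-memory dictionary. This is only an illustration of the control flow: the real function uses the database manager for cache operations, and the key derivation, the `_cache` dictionary, and the names `cache_key`, `compute_with_cache_sketch`, and `fake_compute` are all hypothetical.

```python
import hashlib
import json

_cache = {}  # stand-in for the database-backed cache


def cache_key(embedding_model, **params):
    """Derive a stable key from the model name and clustering parameters."""
    payload = json.dumps({"model": embedding_model, **params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compute_with_cache_sketch(compute, embedding_model, force=False, **params):
    key = cache_key(embedding_model, **params)
    if not force and key in _cache:
        return _cache[key]  # cache hit: skip the expensive computation
    result = compute(**params)  # cache miss or forced recompute
    _cache[key] = result
    return result


calls = []


def fake_compute(**params):
    # Placeholder for the real clustering pipeline.
    calls.append(params)
    return {"n_clusters": params["n_clusters"]}


args = {"clustering_method": "kmeans", "n_clusters": 5}
first = compute_with_cache_sketch(fake_compute, "model-a", **args)
second = compute_with_cache_sketch(fake_compute, "model-a", **args)
third = compute_with_cache_sketch(fake_compute, "model-a", force=True, **args)
print(len(calls))  # number of times the pipeline actually ran
```

Including the embedding model name in the key means that switching models never returns results computed from a different embedding space.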