Clustering Module

The clustering module provides functionality to cluster and visualize paper embeddings using dimensionality reduction and clustering algorithms.

Features

Dimensionality reduction using PCA, t-SNE, and UMAP
Clustering using K-Means, DBSCAN, Agglomerative, Spectral, and Fuzzy C-Means
Automatic cluster labeling using TF-IDF and LLM-based methods
Keyword extraction for each cluster
Representative paper selection based on cluster centroids
Hierarchical cluster structure for agglomerative clustering
Export clustering results to JSON for visualization

Quick Start

from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
from abstracts_explorer.clustering import ClusteringManager

# Initialize managers
em = EmbeddingsManager()
db = DatabaseManager()

# Create clustering manager
cm = ClusteringManager(em, db)

# Load embeddings
n_papers = cm.load_embeddings()
print(f"Loaded {n_papers} papers")

# Reduce dimensions and cluster
cm.reduce_dimensions(method="pca", n_components=2)
cm.cluster(method="kmeans", n_clusters=8)

# Generate labels and get results
cm.extract_cluster_keywords()
cm.generate_cluster_labels(use_llm=True)
results = cm.get_clustering_results()

API Reference

This module provides functionality to cluster and visualize paper embeddings using dimensionality reduction and clustering algorithms from scikit-learn.

Features:

- Dimensionality reduction using PCA and t-SNE
- Clustering using K-Means, DBSCAN, Agglomerative, Fuzzy C-Means, and Spectral clustering
- NEW: Automatic cluster labeling using TF-IDF and LLM-based methods
- NEW: Keyword extraction for each cluster
- NEW: Representative paper selection based on cluster centroids
- NEW: Hierarchical cluster structure for agglomerative clustering
- Export clustering results to JSON for visualization

Cluster Labeling

The module includes cluster labeling functionality that:

1. Extracts distinctive keywords for each cluster using TF-IDF analysis
2. Generates human-readable labels using an LLM (Large Language Model)
3. Identifies the representative papers closest to each cluster's centroid

Hierarchical Clustering

When using agglomerative clustering with distance_threshold, the module tracks the hierarchical structure of clusters, allowing exploration of sub-clusters.

Example

>>> from abstracts_explorer.clustering import ClusteringManager
>>> from abstracts_explorer.embeddings import EmbeddingsManager
>>>
>>> # Initialize managers
>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> cm = ClusteringManager(em)
>>>
>>> # Load and cluster embeddings
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> cm.reduce_dimensions(method='pca', n_components=2)
>>>
>>> # Generate cluster labels
>>> cm.extract_cluster_keywords(n_keywords=10)
>>> cm.generate_cluster_labels(use_llm=True)
>>>
>>> # Get results with labels
>>> results = cm.get_clustering_results()
>>> print(results['cluster_labels'])  # Shows generated labels
>>> print(results['cluster_keywords'])  # Shows extracted keywords

exception abstracts_explorer.clustering.ClusteringError

Bases: Exception

Exception raised for clustering operations.

abstracts_explorer.clustering.calculate_default_clusters(n_papers, min_clusters=2, max_clusters=500)

Calculate default number of clusters based on the number of papers.

Uses the rule: n_clusters = n_papers / 100, clamped to [min_clusters, max_clusters].

Parameters:

n_papers (int) – Number of papers to cluster
min_clusters (int, optional) – Minimum number of clusters, by default 2
max_clusters (int, optional) – Maximum number of clusters, by default 500

Returns:

Recommended number of clusters

Return type:

int

Examples

>>> calculate_default_clusters(50)
2
>>> calculate_default_clusters(500)
5
>>> calculate_default_clusters(100000)
500
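
The clamping rule above can be sketched in a few lines. This is an illustrative re-implementation, not the module's source: the function name default_clusters_sketch is hypothetical, and floor division is an assumption that happens to reproduce the three documented examples.

```python
def default_clusters_sketch(n_papers: int, min_clusters: int = 2, max_clusters: int = 500) -> int:
    """Illustrative version of the n_papers / 100 rule (hypothetical helper)."""
    # Assumption: the division is floored; the documented examples do not
    # distinguish flooring from rounding.
    raw = n_papers // 100
    # Clamp the raw value into [min_clusters, max_clusters].
    return max(min_clusters, min(max_clusters, raw))


for n in (50, 500, 100_000):
    print(n, default_clusters_sketch(n))
```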

class abstracts_explorer.clustering.ClusteringManager(embeddings_manager, database=None)

Bases: object

Manager for clustering and dimensionality reduction of embeddings.

This class handles:

- Loading embeddings from ChromaDB
- Dimensionality reduction (PCA, t-SNE)
- Clustering (K-Means, DBSCAN, Agglomerative, Fuzzy C-Means, Spectral)
- Automatic cluster labeling using TF-IDF and LLM
- Keyword extraction for clusters
- Representative paper selection
- Hierarchical cluster structure tracking
- Export of results for visualization

Parameters:

embeddings_manager (EmbeddingsManager) – Embeddings manager instance to load embeddings from
database (DatabaseManager, optional) – Database manager for fetching paper metadata

Variables:

embeddings_manager (EmbeddingsManager) – The embeddings manager instance
database (DatabaseManager or None) – The database manager instance
embeddings (np.ndarray or None) – The loaded embeddings array
paper_ids (list or None) – The paper IDs corresponding to embeddings
metadatas (list or None) – The paper metadata corresponding to embeddings
reduced_embeddings (np.ndarray or None) – The reduced dimensionality embeddings
cluster_labels (np.ndarray or None) – The cluster assignment labels
cluster_label_names (dict or None) – Human-readable names for each cluster
cluster_keywords (dict or None) – Keywords extracted for each cluster
cluster_summaries (dict or None) – Summaries generated for each cluster
cluster_hierarchy (dict or None) – Hierarchical structure of clusters (for agglomerative)
fuzzy_memberships (np.ndarray or None) – Fuzzy membership values (for fuzzy c-means)

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> em.create_collection()
>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> reduced = cm.reduce_dimensions(method='pca', n_components=2)
>>> labels = cm.cluster(method='kmeans', n_clusters=5)
>>> cm.extract_cluster_keywords()
>>> cm.generate_cluster_labels(use_llm=True)
>>> results = cm.get_clustering_results()

__init__(embeddings_manager, database=None)

Initialize the ClusteringManager.

Parameters:

embeddings_manager (EmbeddingsManager) – Embeddings manager instance to load embeddings from
database (DatabaseManager, optional) – Database manager for fetching paper metadata

load_embeddings(limit=None, conferences=None, years=None)

Load embeddings from ChromaDB collection.

Parameters:

limit (int, optional) – Maximum number of embeddings to load. If None, load all.
conferences (list of str, optional) – Filter to only load embeddings for these conferences.
years (list of int, optional) – Filter to only load embeddings for these years.

Returns:

Number of embeddings loaded

Return type:

int

Raises:

ClusteringError – If loading fails or collection is empty

reduce_dimensions(method='pca', n_components=2, random_state=42, **kwargs)

Reduce dimensionality of embeddings.

Parameters:

method (str, optional) – Dimensionality reduction method: ‘pca’ or ‘tsne’, by default ‘pca’
n_components (int, optional) – Number of components to reduce to, by default 2
random_state (int, optional) – Random state for reproducibility, by default 42
**kwargs – Additional arguments passed to the reduction algorithm

Returns:

Reduced embeddings array of shape (n_samples, n_components)

Return type:

np.ndarray

Raises:

ClusteringError – If embeddings not loaded or reduction fails

cluster(method='kmeans', n_clusters=None, random_state=42, use_reduced=False, **kwargs)

Cluster embeddings using specified algorithm.

Parameters:

method (str, optional) – Clustering method: ‘kmeans’, ‘dbscan’, ‘agglomerative’, ‘fuzzy_cmeans’, or ‘spectral’. By default ‘kmeans’.
n_clusters (int, optional) – Number of clusters (for kmeans, agglomerative, fuzzy_cmeans, and spectral). For agglomerative, can be None if distance_threshold is provided. If None, automatically calculated as n_papers / 100, clamped to [2, 500]. By default None.
random_state (int, optional) – Random state for reproducibility, by default 42
use_reduced (bool, optional) – Whether to cluster reduced embeddings or original embeddings, by default False
**kwargs – Additional arguments passed to the clustering algorithm:
For agglomerative: distance_threshold (float), linkage (str), affinity (str)
For dbscan: eps (float), min_samples (int)
For fuzzy_cmeans: m (float, fuzziness parameter), error (float), maxiter (int)
For spectral: affinity (str), n_neighbors (int)

Returns:

Cluster labels array of shape (n_samples,)

Return type:

np.ndarray

Raises:

ClusteringError – If embeddings not loaded or clustering fails

Examples

>>> # Agglomerative with distance threshold
>>> cm.cluster(method='agglomerative', distance_threshold=0.5, n_clusters=None)

>>> # Fuzzy C-Means
>>> cm.cluster(method='fuzzy_cmeans', n_clusters=5, m=2.0)

>>> # Spectral clustering
>>> cm.cluster(method='spectral', n_clusters=5)

get_hierarchy_level_clusters(level=0)

Get clusters at a specific hierarchy level for agglomerative clustering.

Parameters:

level (int, optional) – Hierarchy level (0 = leaf level, higher = more merged), by default 0

Returns:

Dictionary containing:
- clusters: List of cluster information at the level
- level: The requested level
- max_level: Maximum available level

Return type:

dict

Raises:

ClusteringError – If hierarchy not available

generate_hierarchical_labels(use_llm=True, max_keywords=5, llm_level=8)

Generate labels for all levels of the hierarchy.

Uses a tiered approach when use_llm is True:

- Levels below llm_level: Simple fallback (concatenation) for fast processing
- Level llm_level: Full keyword extraction + LLM label generation
- Levels above llm_level: LLM-based parent label generation from child labels

Parameters:

use_llm (bool, optional) – Whether to use LLM for label generation, by default True
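max_keywords (int, optional) – Maximum number of keywords to use in label generation, by default 5
llm_level (int, optional) – Hierarchy level at which full keyword extraction and LLM label generation are applied, by default 8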

Returns:

Dictionary mapping node IDs to labels

Return type:

Dict[int, str]

Raises:

ClusteringError – If hierarchy not available
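
The tiered approach can be pictured as a small dispatch on the hierarchy level. The sketch below is only one reading of the description: the function name and the returned strategy strings are hypothetical, and it assumes that levels strictly below llm_level use the concatenation fallback and that use_llm=False falls back to concatenation at every level.

```python
def labeling_strategy_sketch(level: int, llm_level: int = 8, use_llm: bool = True) -> str:
    """Pick a labeling strategy for one hierarchy level (hypothetical helper)."""
    # Assumption: without an LLM, every level uses the simple fallback.
    if not use_llm or level < llm_level:
        return "concatenate"
    # Exactly at llm_level: full keyword extraction plus an LLM-generated label.
    if level == llm_level:
        return "keywords_plus_llm"
    # Above llm_level: derive the parent label from child labels with the LLM.
    return "llm_from_child_labels"


for lvl in (0, 8, 12):
    print(lvl, labeling_strategy_sketch(lvl))
```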

get_cluster_statistics()

Get statistics about the clustering results.

Returns:

Dictionary containing cluster statistics:
- n_clusters: Number of clusters
- n_noise: Number of noise points (for DBSCAN)
- cluster_sizes: Dictionary mapping cluster labels to sizes
- cluster_centers: Cluster centers (if available)

Return type:

dict

Raises:

ClusteringError – If clustering has not been performed
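
To make the shape of these statistics concrete, here is a minimal sketch that derives the count fields from a plain list of labels. It is not the module's implementation: the helper name is hypothetical, it assumes noise points carry the label -1 (the convention scikit-learn's DBSCAN uses), it assumes noise is excluded from cluster_sizes, and it omits cluster_centers.

```python
from collections import Counter


def cluster_statistics_sketch(labels):
    """Summarize cluster assignments (hypothetical helper)."""
    counts = Counter(labels)
    # Assumption: -1 marks noise points, as in scikit-learn's DBSCAN.
    n_noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(counts),
    }


stats = cluster_statistics_sketch([0, 0, 1, -1, 1, 1, -1, 2])
print(stats)
```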

extract_cluster_keywords(n_keywords=10, min_df=2)

Extract distinctive keywords for each cluster using TF-IDF.

Parameters:

n_keywords (int, optional) – Number of top keywords to extract per cluster, by default 10
min_df (int, optional) – Minimum document frequency for a term to be considered, by default 2

Returns:

Dictionary mapping cluster labels to lists of keywords

Return type:

Dict[int, List[str]]

Raises:

ClusteringError – If clustering has not been performed or metadata is missing

Examples

>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> keywords = cm.extract_cluster_keywords(n_keywords=10)
>>> print(f"Cluster 0 keywords: {keywords[0]}")
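
For readers unfamiliar with TF-IDF keyword extraction, the idea can be sketched with the standard library alone. This is not the module's code, which may tokenize, weight, and aggregate differently: the helper name and the toy texts are hypothetical, and summing per-paper TF-IDF scores within each cluster is an assumption. The IDF term follows the smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
import re
from collections import Counter, defaultdict


def cluster_keywords_sketch(texts, labels, n_keywords=3, min_df=1):
    """Rank terms per cluster by summed TF-IDF (hypothetical helper)."""
    # Tokenize each text into lowercase alphabetic tokens.
    docs = [re.findall(r"[a-z]+", text.lower()) for text in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for term, count in Counter(doc).items():
            if df[term] < min_df:
                continue  # skip terms that are too rare
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            scores[label][term] += (count / len(doc)) * idf
    return {
        label: [term for term, _ in ranked.most_common(n_keywords)]
        for label, ranked in scores.items()
    }


texts = [
    "graph neural networks for molecules",
    "graph networks on molecules",
    "image segmentation with transformers",
    "transformers for image classification",
]
keywords = cluster_keywords_sketch(texts, labels=[0, 0, 1, 1])
print(keywords)
```

A real pipeline would normally also drop stop words such as "for" and "on"; the sketch leaves them in to stay short.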

generate_cluster_labels(use_llm=True, max_keywords=5)

Generate descriptive labels for clusters.

This method can either use an LLM to generate meaningful labels based on cluster keywords and representative papers, or simply concatenate keywords.

Parameters:

use_llm (bool, optional) – Whether to use LLM for label generation, by default True
max_keywords (int, optional) – Maximum number of keywords to use in label generation, by default 5

Returns:

Dictionary mapping cluster labels to descriptive names

Return type:

Dict[int, str]

Raises:

ClusteringError – If clustering or keyword extraction has not been performed

Examples

>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> cm.extract_cluster_keywords()
>>> labels = cm.generate_cluster_labels(use_llm=True)
>>> print(f"Cluster 0 label: {labels[0]}")

get_cluster_representative_papers(n_papers=5)

Find representative papers for each cluster.

Representative papers are those closest to the cluster centroid in the embedding space.

Parameters:

n_papers (int, optional) – Number of representative papers per cluster, by default 5

Returns:

Dictionary mapping cluster labels to lists of representative paper metadata

Return type:

Dict[int, List[Dict[str, Any]]]

Raises:

ClusteringError – If clustering has not been performed

Examples

>>> cm = ClusteringManager(em)
>>> cm.load_embeddings()
>>> cm.cluster(method='kmeans', n_clusters=5)
>>> representatives = cm.get_cluster_representative_papers(n_papers=3)
>>> print(f"Cluster 0 representatives: {representatives[0]}")
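
Selecting papers "closest to the cluster centroid" can be sketched as follows. The helper is hypothetical and simplified: it returns paper IDs rather than metadata dictionaries, computes the centroid as the mean of the member vectors, and assumes Euclidean distance, none of which is confirmed by the documentation above.

```python
import math


def representative_papers_sketch(embeddings, labels, paper_ids, n_papers=2):
    """Return the IDs nearest to each cluster centroid (hypothetical helper)."""
    # Group (paper_id, vector) pairs by cluster label.
    clusters = {}
    for vec, label, pid in zip(embeddings, labels, paper_ids):
        clusters.setdefault(label, []).append((pid, vec))

    result = {}
    for label, members in clusters.items():
        dim = len(members[0][1])
        # Centroid: component-wise mean of the member vectors.
        centroid = [sum(vec[i] for _, vec in members) / len(members) for i in range(dim)]
        # Assumption: Euclidean distance decides closeness.
        ranked = sorted(members, key=lambda m: math.dist(m[1], centroid))
        result[label] = [pid for pid, _ in ranked[:n_papers]]
    return result


reps = representative_papers_sketch(
    embeddings=[[0, 0], [1, 0], [4, 0], [10, 10], [10, 12]],
    labels=[0, 0, 0, 1, 1],
    paper_ids=["a", "b", "c", "d", "e"],
)
print(reps)
```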

get_clustering_results(include_metadata=True, max_title_length=100)

Get complete clustering results for visualization.

Parameters:

include_metadata (bool, optional) – Whether to include paper metadata, by default True
max_title_length (int, optional) – Maximum length for paper titles, by default 100

Returns:

Dictionary containing:
- points: List of points with coordinates, cluster labels, and metadata
- statistics: Cluster statistics
- n_dimensions: Number of dimensions in reduced embeddings
- cluster_labels: Human-readable names for clusters (if generated)
- cluster_keywords: Keywords for each cluster (if extracted)

Return type:

dict

Raises:

ClusteringError – If required data not available

export_to_json(output_path, include_metadata=True)

Export clustering results to JSON file.

Parameters:

output_path (str or Path) – Path to output JSON file
include_metadata (bool, optional) – Whether to include paper metadata, by default True

Raises:

ClusteringError – If export fails

Return type:

None
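
As a rough picture of what an export might look like, the sketch below writes a results dictionary with the top-level keys documented for get_clustering_results to a JSON file and reads it back. The helper name, the fields inside each point (x, y, cluster, title), and the sample values are all hypothetical. One detail worth knowing: JSON object keys are always strings, so integer cluster IDs used as dictionary keys come back as strings after a round trip.

```python
import json
import tempfile
from pathlib import Path


def export_results_sketch(results, output_path):
    """Write a results dictionary to a JSON file (hypothetical helper)."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2, ensure_ascii=False)


# Hypothetical results; only the top-level keys come from the documentation.
results = {
    "points": [{"x": 0.1, "y": -0.3, "cluster": 0, "title": "Example paper"}],
    "statistics": {"n_clusters": 1, "n_noise": 0, "cluster_sizes": {"0": 1}},
    "n_dimensions": 2,
    "cluster_labels": {"0": "Example topic"},
    "cluster_keywords": {"0": ["example", "topic"]},
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "clusters.json"
    export_results_sketch(results, out)
    loaded = json.loads(out.read_text(encoding="utf-8"))

print(loaded["cluster_labels"])
```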

abstracts_explorer.clustering.perform_clustering(collection_name='papers', reduction_method='pca', n_components=2, clustering_method='kmeans', n_clusters=None, output_path=None, random_state=42, limit=None, **kwargs)

Perform complete clustering pipeline and optionally export results.

This is a convenience function that handles the full clustering workflow:

1. Load embeddings from ChromaDB
2. Cluster on full embeddings
3. Apply dimensionality reduction for visualization
4. Export results if requested

Parameters:

collection_name (str, optional) – Name of the ChromaDB collection, by default “papers”
reduction_method (str, optional) – Dimensionality reduction method (‘pca’ or ‘tsne’) for visualization, by default ‘pca’
n_components (int, optional) – Number of components for dimensionality reduction, by default 2
clustering_method (str, optional) – Clustering method (‘kmeans’, ‘dbscan’, ‘agglomerative’, ‘fuzzy_cmeans’, or ‘spectral’), by default ‘kmeans’
n_clusters (int, optional) – Number of clusters (for kmeans and agglomerative). If None, automatically calculated as n_papers / 100, clamped to [2, 500]. By default None.
output_path (str or Path, optional) – Path to export JSON results. If None, don’t export.
random_state (int, optional) – Random state for reproducibility, by default 42
limit (int, optional) – Maximum number of embeddings to process. If None, process all.
**kwargs – Additional arguments passed to clustering algorithm

Returns:

Clustering results dictionary

Return type:

dict

Raises:

ClusteringError – If any step fails

Examples

>>> results = perform_clustering(
...     reduction_method="tsne",
...     clustering_method="kmeans",
...     n_clusters=5,
...     output_path="clusters.json"
... )
>>> print(f"Found {results['statistics']['n_clusters']} clusters")

abstracts_explorer.clustering.compute_clusters_with_cache(embeddings_manager, database, embedding_model, reduction_method='pca', n_components=2, clustering_method='kmeans', n_clusters=None, limit=None, force=False, conferences=None, years=None, **clustering_kwargs)

Compute clusters with caching support.

This function checks the cache first and returns cached results if available. If cache miss or forced recompute, it performs clustering and saves to cache.

Parameters:

embeddings_manager (EmbeddingsManager) – Embeddings manager instance
database (DatabaseManager) – Database manager for cache operations
embedding_model (str) – Current embedding model name
reduction_method (str, optional) – Dimensionality reduction method, by default “pca”
n_components (int, optional) – Number of components for reduction, by default 2
clustering_method (str, optional) – Clustering method to use, by default “kmeans”
n_clusters (int, optional) – Number of clusters. If None, auto-calculated based on data size
limit (int, optional) – Maximum number of embeddings to process
force (bool, optional) – Force recompute even if cache exists, by default False
conferences (list of str, optional) – Filter to only cluster papers from these conferences.
years (list of int, optional) – Filter to only cluster papers from these years.
**clustering_kwargs – Additional clustering parameters (e.g., eps, min_samples for DBSCAN)

Returns:

Clustering results with points, statistics, and metadata

Return type:

dict

Raises:

ClusteringError – If clustering fails

Examples

>>> results = compute_clusters_with_cache(
...     em, db, "text-embedding-model",
...     clustering_method="kmeans",
...     n_clusters=5
... )
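
The check-cache-then-compute behavior described for this function follows a common pattern, sketched below with an in-memory dictionary standing in for the database-backed cache. Everything here is hypothetical (the helper names, the key format, the fake compute function); it only illustrates how a force flag bypasses a cache hit, not how the module stores or keys its cache.

```python
import json


def cache_key_sketch(**params):
    """Build a deterministic key from the parameters (hypothetical helper)."""
    return json.dumps(params, sort_keys=True, default=str)


def compute_with_cache_sketch(cache, compute, force=False, **params):
    """Return a cached result unless it is missing or force is set."""
    key = cache_key_sketch(**params)
    if not force and key in cache:
        return cache[key]  # cache hit
    result = compute(**params)  # cache miss or forced recompute
    cache[key] = result
    return result


calls = []


def fake_compute(**params):
    # Stand-in for the real clustering work; records each invocation.
    calls.append(params)
    return {"statistics": {"n_clusters": params["n_clusters"]}}


cache = {}
first = compute_with_cache_sketch(cache, fake_compute, clustering_method="kmeans", n_clusters=5)
second = compute_with_cache_sketch(cache, fake_compute, clustering_method="kmeans", n_clusters=5)
third = compute_with_cache_sketch(cache, fake_compute, force=True, clustering_method="kmeans", n_clusters=5)
print(len(calls))
```

Here the second call is served from the cache, while the third recomputes because force=True.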