# Codebase Architecture & Flow-Charts
This document provides a comprehensive architectural overview of the Abstracts Explorer
codebase, including module relationships, data flow, and function-level flow-charts.
## Module Overview
The codebase comprises **~21,800 lines** of Python across 20 source files:
| Module | Lines | Responsibility |
|--------|------:|----------------|
| `cli.py` | 3,446 | Command-line interface (25 commands) |
| `database.py` | 2,943 | SQL database operations (SQLAlchemy ORM) |
| `clustering.py` | 2,094 | Clustering algorithms & visualization |
| `registry.py` | 1,664 | OCI registry upload/download |
| `embeddings.py` | 1,372 | ChromaDB vector embeddings |
| `mcp_server.py` | 1,328 | MCP server & tool implementations |
| `web_ui/app.py` | 1,120 | Flask web API (22 routes) |
| `plugin.py` | 976 | Plugin system & data models |
| `rag.py` | 874 | RAG chat with Pydantic AI |
| `evaluation.py` | 833 | RAG evaluation framework |
| `mcp_tools.py` | 824 | MCP tool schema & formatting |
| `export_utils.py` | 675 | Paper export (ZIP/Markdown) |
| `config.py` | 532 | Configuration management |
| `db_models.py` | 435 | SQLAlchemy ORM models |
| `paper_utils.py` | 247 | Paper formatting utilities |
| `plugins/*.py` | ~2,177 | Conference downloader plugins (7 plugins) |
## High-Level Module Dependency Graph
```{mermaid}
graph TD
CLI[cli.py
3446 lines
25 commands]
CFG[config.py
532 lines]
DB[database.py
2943 lines]
DBM[db_models.py
435 lines]
EMB[embeddings.py
1372 lines]
CLU[clustering.py
2094 lines]
RAG[rag.py
874 lines]
MCP[mcp_server.py
1328 lines]
MCT[mcp_tools.py
824 lines]
WEB[web_ui/app.py
1120 lines]
PLG[plugin.py
976 lines]
PLI[plugins/*
7 downloaders]
EXP[export_utils.py
675 lines]
PAP[paper_utils.py
247 lines]
REG[registry.py
1664 lines]
EVL[evaluation.py
833 lines]
CLI --> CFG
CLI --> DB
CLI --> EMB
CLI --> CLU
CLI --> RAG
CLI --> MCP
CLI --> PLG
CLI --> EVL
CLI --> WEB
CLI --> REG
WEB --> DB
WEB --> EMB
WEB --> RAG
WEB --> CFG
WEB --> PAP
WEB --> EXP
WEB --> CLU
RAG --> CFG
RAG --> MCP
RAG --> MCT
MCP --> EMB
MCP --> DB
MCP --> CLU
MCP --> CFG
MCT --> MCP
EMB --> CFG
EMB --> DB
EMB --> PLG
EMB --> PAP
CLU --> EMB
CLU --> DB
CLU --> CFG
DB --> CFG
DB --> PLG
DB --> DBM
REG --> DB
REG --> EMB
REG --> DBM
REG --> CFG
EVL --> CFG
EVL --> DB
EVL --> EMB
EVL --> MCT
EVL --> RAG
PLI --> PLG
style CLI fill:#ff9999,stroke:#cc0000
style DB fill:#99ccff,stroke:#0066cc
style EMB fill:#99ffcc,stroke:#00cc66
style CLU fill:#ffcc99,stroke:#cc6600
style WEB fill:#cc99ff,stroke:#6600cc
style RAG fill:#ffff99,stroke:#cccc00
style MCP fill:#ff99cc,stroke:#cc0066
style REG fill:#99ffff,stroke:#00cccc
```
## CLI Command Flow
The CLI (`cli.py`) is the primary entry point with 25 command functions. All commands
follow a repeated pattern: parse args → load config → initialize resources → execute → handle errors.
```{mermaid}
graph TD
MAIN["main()"] --> PARSE[Parse Arguments]
PARSE --> DISPATCH{Command?}
DISPATCH -->|download| DL[download_command]
DISPATCH -->|create-embeddings| CE[create_embeddings_command]
DISPATCH -->|pre-process| PP[pre_process_command]
DISPATCH -->|search| SR[search_command]
DISPATCH -->|chat| CH[chat_command]
DISPATCH -->|web-ui| WU[web_ui_command]
DISPATCH -->|clustering run| CL[cluster_embeddings_command]
DISPATCH -->|clustering clear-cache| CC[clear_clustering_cache_command]
DISPATCH -->|clustering pre-generate| PG[pre_generate_clustering_command]
DISPATCH -->|delete-data| DD[delete_data_command]
DISPATCH -->|mcp-server| MS[mcp_server_command]
DISPATCH -->|eval generate| EG[eval_generate_command]
DISPATCH -->|eval verify| EV[eval_verify_command]
DISPATCH -->|eval run| ER[eval_run_command]
DISPATCH -->|eval results| ERS[eval_results_command]
DISPATCH -->|eval clear| ECL[eval_clear_command]
DISPATCH -->|registry upload| RU[registry_upload_command]
DISPATCH -->|registry download| RD[registry_download_command]
DISPATCH -->|registry list| RL[registry_list_command]
DISPATCH -->|registry delete| RDEL[registry_delete_command]
subgraph "Repeated CLI Pattern (⚠ duplicated in each command)"
A1[Load config] --> A2[Resolve conference/year]
A2 --> A3[Print header banner]
A3 --> A4[Validate embeddings DB]
A4 --> A5[Init EmbeddingsManager]
A5 --> A6[Test LM Studio connection]
A6 --> A7[Init DatabaseManager]
A7 --> A8[Execute command logic]
A8 --> A9[Error handling + traceback]
end
DL --> A1
CE --> A1
SR --> A1
CH --> A1
CL --> A1
PG --> A1
style A1 fill:#ffcccc
style A2 fill:#ffcccc
style A3 fill:#ffcccc
style A4 fill:#ffcccc
style A5 fill:#ffcccc
style A6 fill:#ffcccc
style A7 fill:#ffcccc
style A9 fill:#ffcccc
```
## Data Pipeline Flow
```{mermaid}
graph LR
subgraph "Data Ingestion"
API[Conference APIs
OpenReview, etc.] -->|download| PLG[Plugin System]
PLG -->|LightweightPaper| DB[(SQL Database
SQLite/PostgreSQL)]
end
subgraph "Embedding Generation"
DB -->|paper text| EMB[EmbeddingsManager]
LLM1[LLM Backend
LM Studio/Blablador] -->|embedding vectors| EMB
EMB -->|store| CHR[(ChromaDB)]
end
subgraph "Analysis & Search"
CHR -->|embeddings| CLU[ClusteringManager]
CLU -->|cache results| DB
CHR -->|similarity search| SEM[Semantic Search]
DB -->|keyword search| KW[Keyword Search]
end
subgraph "User Interfaces"
SEM --> WEB[Web UI]
KW --> WEB
CLU --> WEB
SEM --> RAG[RAG Chat]
CLU --> RAG
RAG --> MCP[MCP Tools]
WEB --> EXP[Export ZIP/Markdown]
end
subgraph "Registry"
DB -->|export| REG[OCI Registry]
CHR -->|export| REG
REG -->|import| DB
REG -->|import| CHR
end
```
## Database Layer Flow
```{mermaid}
graph TD
subgraph "DatabaseManager - 2943 lines"
direction TB
CONN["Connection Management
__init__, connect, close, __enter__, __exit__"]
subgraph "⚠ Duplicate: Session Validation"
SV1["if not self._session: raise DatabaseError
(repeated in 30+ methods)"]
end
subgraph "Paper Operations"
ADD["add_paper / add_papers"]
SEARCH["search_papers
(flexible filter builder)"]
KWSEARCH["search_papers_keyword
(⚠ thin wrapper around search_papers)"]
AUTHSEARCH["search_authors_in_papers"]
end
subgraph "⚠ Duplicate: Faceting Methods"
SESS["get_sessions(conference, year)"]
CONF["get_conferences(year)"]
YEARS["get_years(conference)"]
YFC["get_years_for_conference(conference)
⚠ duplicate of get_years"]
CY["get_conference_years_from_db()"]
end
subgraph "Clustering Cache"
CGET["get_clustering_cache"]
CSAVE["save_clustering_cache"]
CDEL["delete_clustering_cache_by_conference_year"]
CCNT["count_clustering_cache_by_conference_year"]
CCLR["clear_clustering_cache"]
end
subgraph "⚠ Duplicate: CRUD Patterns"
direction LR
QA["Eval QA Pairs
add, get, count, update, delete"]
ER["Eval Results
add, get, delete"]
end
subgraph "Import/Export"
EXPS["export_papers_to_sqlite"]
IMPS["import_papers_from_sqlite"]
EXPC["export_clustering_cache_to_json"]
IMPC["import_clustering_cache_from_json"]
end
end
CONN --> SV1
SV1 --> ADD
SV1 --> SEARCH
SEARCH --> KWSEARCH
SV1 --> SESS
SV1 --> CONF
SV1 --> YEARS
YEARS -.->|"duplicate"| YFC
SV1 --> CGET
SV1 --> QA
SV1 --> ER
SV1 --> EXPS
style YFC fill:#ffcccc,stroke:#cc0000
style SV1 fill:#ffcccc,stroke:#cc0000
style KWSEARCH fill:#ffffcc,stroke:#cccc00
```
## Embeddings & Clustering Flow
```{mermaid}
graph TD
subgraph "EmbeddingsManager"
ECONN["connect() → ChromaDB"]
ECOL["create_collection()"]
EADD["add_paper → generate_embedding → store"]
ESRCH["search_similar / search_papers_semantic"]
EFIND["find_papers_within_distance"]
EIMP["import/export_embeddings"]
subgraph "⚠ Duplicate: Where-clause building"
W1["search_papers_semantic:
build $and/$in filter"]
W2["find_papers_within_distance:
build $and/$in filter"]
end
end
subgraph "ClusteringManager"
CLOAD["load_embeddings"]
CREDUC["reduce_dimensions
(PCA / t-SNE / UMAP)"]
CCLUST["cluster
(K-Means / DBSCAN / Agglom / Spectral / Fuzzy)"]
CKEYS["extract_cluster_keywords"]
CLBL["generate_cluster_labels"]
CHRCH["generate_hierarchical_labels"]
CRES["get_clustering_results"]
subgraph "⚠ Duplicate: TF-IDF extraction"
T1["extract_cluster_keywords"]
T2["_extract_keywords_for_samples
⚠ nearly identical to T1"]
end
subgraph "⚠ Duplicate: LLM label generation"
L1["_generate_llm_label"]
L2["_generate_parent_label_llm
⚠ same pattern as L1"]
L3["_generate_llm_label_from_keywords
⚠ same pattern as L1"]
end
subgraph "⚠ Duplicate: Where-clause building"
W3["load_embeddings:
build $and/$in filter"]
end
end
subgraph "Standalone Functions"
PF["perform_clustering
(convenience wrapper)"]
CWC["compute_clusters_with_cache
(multi-level caching)"]
W4["⚠ also builds where-clauses"]
end
ECONN --> ECOL --> EADD
ECOL --> ESRCH
ECOL --> EFIND
CLOAD --> CREDUC --> CCLUST --> CKEYS --> CLBL
CCLUST --> CHRCH
CLBL --> CRES
PF --> CLOAD
CWC --> CLOAD
W1 -.->|"duplicate logic"| W2
W1 -.->|"duplicate logic"| W3
W1 -.->|"duplicate logic"| W4
T1 -.->|"duplicate logic"| T2
L1 -.->|"duplicate logic"| L2
L1 -.->|"duplicate logic"| L3
style W1 fill:#ffcccc,stroke:#cc0000
style W2 fill:#ffcccc,stroke:#cc0000
style W3 fill:#ffcccc,stroke:#cc0000
style W4 fill:#ffcccc,stroke:#cc0000
style T1 fill:#ffcccc,stroke:#cc0000
style T2 fill:#ffcccc,stroke:#cc0000
style L1 fill:#ffcccc,stroke:#cc0000
style L2 fill:#ffcccc,stroke:#cc0000
style L3 fill:#ffcccc,stroke:#cc0000
```
## RAG Chat & MCP Tools Flow
```{mermaid}
graph TD
subgraph "RAG Chat (rag.py)"
RQUERY["query()"]
RCHAT["chat()"]
RBUILD["_build_agent()"]
subgraph "⚠ Duplicate: Tool wrappers (6 identical patterns)"
TW1["_tool_search_papers"]
TW2["_tool_get_conference_topics"]
TW3["_tool_get_topic_evolution"]
TW4["_tool_analyze_topic_relevance"]
TW5["_tool_get_cluster_visualization"]
TW6["_tool_get_paper_details"]
end
end
subgraph "MCP Tools (mcp_tools.py)"
EXEC["execute_mcp_tool
(dispatcher)"]
SCHEMA["get_mcp_tools_schema"]
subgraph "⚠ Duplicate: Normalization (4 similar functions)"
N1["_normalize_search_papers_args"]
N2["_normalize_get_topic_evolution_args"]
N3["_normalize_analyze_topic_relevance_args"]
N4["_normalize_get_paper_details_args"]
end
subgraph "⚠ Duplicate: Formatters (6 similar functions)"
F1["_format_topic_relevance_result"]
F2["_format_conference_topics_result"]
F3["_format_topic_evolution_result"]
F4["_format_search_papers_result"]
F5["_format_visualization_result"]
F6["_format_paper_details_result"]
end
end
subgraph "MCP Server (mcp_server.py)"
subgraph "Core Tool Functions"
S1["search_papers"]
S2["get_conference_topics"]
S3["get_topic_evolution"]
S4["analyze_topic_relevance"]
S5["get_cluster_visualization"]
S6["get_paper_details"]
end
subgraph "⚠ Duplicate: Resource init"
RI["get_config → EmbeddingsManager → DatabaseManager
(repeated in each tool function)"]
end
subgraph "⚠ Duplicate: Where-clause merging"
WM1["merge_where_clause_with_conference"]
WM2["merge_where_clause_with_years"]
end
end
RBUILD --> TW1 & TW2 & TW3 & TW4 & TW5 & TW6
TW1 --> S1
TW2 --> S2
TW3 --> S3
TW4 --> S4
TW5 --> S5
TW6 --> S6
EXEC --> N1 --> S1
EXEC --> N2 --> S3
EXEC --> N3 --> S4
EXEC --> N4 --> S6
S1 --> WM1
S1 --> WM2
style TW1 fill:#ffcccc,stroke:#cc0000
style TW2 fill:#ffcccc,stroke:#cc0000
style TW3 fill:#ffcccc,stroke:#cc0000
style TW4 fill:#ffcccc,stroke:#cc0000
style TW5 fill:#ffcccc,stroke:#cc0000
style TW6 fill:#ffcccc,stroke:#cc0000
style N1 fill:#ffcccc,stroke:#cc0000
style N2 fill:#ffcccc,stroke:#cc0000
style N3 fill:#ffcccc,stroke:#cc0000
style N4 fill:#ffcccc,stroke:#cc0000
style F1 fill:#ffffcc,stroke:#cccc00
style F2 fill:#ffffcc,stroke:#cccc00
style F3 fill:#ffffcc,stroke:#cccc00
style F4 fill:#ffffcc,stroke:#cccc00
style F5 fill:#ffffcc,stroke:#cccc00
style F6 fill:#ffffcc,stroke:#cccc00
style RI fill:#ffcccc,stroke:#cc0000
```
## Web UI Request Flow
```{mermaid}
graph TD
subgraph "Flask Web UI (app.py - 22 routes)"
REQ[HTTP Request] --> MW["Middleware
ProxyFix, CORS, teardown_db"]
MW --> IDX["GET / → index()"]
MW --> CIDX["GET / → conference_index()"]
MW --> HLTH["GET /health"]
MW --> STAT["GET /api/stats"]
MW --> EMOD["GET /api/embedding-model-check"]
MW --> FILT["GET /api/filters"]
MW --> AFILT["GET /api/available-filters"]
MW --> SRCH["POST /api/search"]
MW --> GPAP["GET /api/paper/"]
MW --> BATCH["POST /api/papers/batch"]
MW --> CHAT["POST /api/chat"]
MW --> RSET["POST /api/chat/reset"]
MW --> CCMP["POST /api/clusters/compute"]
MW --> CCCH["GET /api/clusters/cached"]
MW --> CDEF["GET /api/clusters/default-count"]
MW --> CSRCH["POST /api/clusters/search"]
MW --> GYRS["GET /api/years"]
MW --> EXPRT["POST /api/export/interesting-papers"]
MW --> DOND["POST /api/donate-data"]
MW --> DONC["POST /api/donate-chat"]
subgraph "⚠ Repeated Pattern in All Routes"
P1["db = get_database()"]
P2["try/except + logger.error + jsonify"]
P3["Parameter validation + type conversion"]
end
end
SRCH -->|keyword| DB[(DatabaseManager)]
SRCH -->|semantic| EMB[(EmbeddingsManager)]
CHAT --> RAG[RAGChat]
CCMP --> CLU[ClusteringManager]
EXPRT --> EXU[export_utils]
GPAP --> PAP[paper_utils]
style P1 fill:#ffcccc,stroke:#cc0000
style P2 fill:#ffcccc,stroke:#cc0000
style P3 fill:#ffcccc,stroke:#cc0000
```
## Export & Paper Utilities Flow
```{mermaid}
graph TD
subgraph "export_utils.py"
EXP["export_papers_to_zip"] --> GFS["generate_folder_structure_export"]
GFS --> GMR["generate_main_readme"]
GFS --> GAP["generate_all_papers_markdown"]
GFS --> GST["generate_search_term_markdown"]
subgraph "⚠ Duplicate: Paper formatting"
GAP_F["generate_all_papers_markdown:
group by session → format paper block"]
GST_F["generate_search_term_markdown:
group by session → format paper block
⚠ nearly identical to GAP"]
end
end
subgraph "paper_utils.py"
GPA["get_paper_with_authors"] --> DB[(DatabaseManager)]
FSR["format_search_results"] --> GPA
BCP["build_context_from_papers"]
end
style GAP_F fill:#ffcccc,stroke:#cc0000
style GST_F fill:#ffcccc,stroke:#cc0000
```
## Registry Upload/Download Flow
```{mermaid}
graph LR
subgraph "RegistryClient"
UP["upload()"] --> EY["_export_year()"]
EY --> PT["_push_tag()"]
DL["download()"] --> FM["_find_best_matching_tag()"]
FM --> IY["_import_year()"]
IY --> CE["_check_embedding_model()"]
subgraph "⚠ Duplicate: Progress callbacks"
PC1["upload: def _progress(...)"]
PC2["download: def _progress(...)"]
PC3["_export_year: def _progress(...)"]
PC4["_import_year: def _progress(...)"]
end
end
EY --> DB[(DatabaseManager
export_papers_to_sqlite
export_clustering_cache_to_json)]
EY --> EMB[(EmbeddingsManager
export_embeddings)]
IY --> DB2[(DatabaseManager
import_papers_from_sqlite
import_clustering_cache_from_json)]
IY --> EMB2[(EmbeddingsManager
import_embeddings)]
style PC1 fill:#ffcccc,stroke:#cc0000
style PC2 fill:#ffcccc,stroke:#cc0000
style PC3 fill:#ffcccc,stroke:#cc0000
style PC4 fill:#ffcccc,stroke:#cc0000
```
## Summary of Duplicate Code Paths
The following table summarizes all identified duplicate code patterns, ordered by
severity:
| # | Pattern | Occurrences | Modules | Severity |
|---|---------|-------------|---------|----------|
| 1 | Session validation boilerplate (`if not self._session`) | 30+ | database.py | High |
| 2 | ChromaDB where-clause construction (`$and/$in` filters) | 5+ | embeddings.py, clustering.py, mcp_server.py | High |
| 3 | EmbeddingsManager initialization sequence | 6 | cli.py | High |
| 4 | LM Studio connection test + error message | 4 | cli.py | Medium |
| 5 | Embeddings DB path validation | 4 | cli.py | Medium |
| 6 | CLI command header printing | 15+ | cli.py | Medium |
| 7 | TF-IDF keyword extraction | 2 | clustering.py | Medium |
| 8 | LLM label generation (OpenAI call + fallback) | 3 | clustering.py | Medium |
| 9 | RAG tool wrapper functions (identical pattern) | 6 | rag.py | Medium |
| 10 | MCP argument normalization functions | 4 | mcp_tools.py | Medium |
| 11 | Resource init in MCP tools (config → embed → db) | 6 | mcp_server.py | Medium |
| 12 | Conference/year argument resolution | 8+ | cli.py | Low |
| 13 | Paper markdown formatting (session grouping) | 2 | export_utils.py | Low |
| 14 | Error handling + traceback pattern | 15+ | cli.py | Low |
| 15 | Faceting query pattern | 4 | database.py | Low |
| 16 | Eval CRUD parallel structures | 2×5 | database.py | Low |
| 17 | Progress callback definitions | 4 | registry.py | Low |
| 18 | Web route error handling (try/except/jsonify) | 22 | web_ui/app.py | Low |
| 19 | `get_years_for_conference` duplicates `get_years` | 2 | database.py | Low |