Codebase Architecture & Flow-Charts
This document provides a comprehensive architectural overview of the Abstracts Explorer codebase, including module relationships, data flow, and function-level flow-charts.
Module Overview
The codebase comprises ~21,800 lines of Python across 20 source files:
Module |
Lines |
Responsibility |
|---|---|---|
|
3,446 |
Command-line interface (25 commands) |
|
2,943 |
SQL database operations (SQLAlchemy ORM) |
|
2,094 |
Clustering algorithms & visualization |
|
1,664 |
OCI registry upload/download |
|
1,372 |
ChromaDB vector embeddings |
|
1,328 |
MCP server & tool implementations |
|
1,120 |
Flask web API (22 routes) |
|
976 |
Plugin system & data models |
|
874 |
RAG chat with Pydantic AI |
|
833 |
RAG evaluation framework |
|
824 |
MCP tool schema & formatting |
|
675 |
Paper export (ZIP/Markdown) |
|
532 |
Configuration management |
|
435 |
SQLAlchemy ORM models |
|
247 |
Paper formatting utilities |
|
~2,177 |
Conference downloader plugins (7 plugins) |
High-Level Module Dependency Graph
graph TD
CLI[cli.py<br/>3446 lines<br/>25 commands]
CFG[config.py<br/>532 lines]
DB[database.py<br/>2943 lines]
DBM[db_models.py<br/>435 lines]
EMB[embeddings.py<br/>1372 lines]
CLU[clustering.py<br/>2094 lines]
RAG[rag.py<br/>874 lines]
MCP[mcp_server.py<br/>1328 lines]
MCT[mcp_tools.py<br/>824 lines]
WEB[web_ui/app.py<br/>1120 lines]
PLG[plugin.py<br/>976 lines]
PLI[plugins/*<br/>7 downloaders]
EXP[export_utils.py<br/>675 lines]
PAP[paper_utils.py<br/>247 lines]
REG[registry.py<br/>1664 lines]
EVL[evaluation.py<br/>833 lines]
CLI --> CFG
CLI --> DB
CLI --> EMB
CLI --> CLU
CLI --> RAG
CLI --> MCP
CLI --> PLG
CLI --> EVL
CLI --> WEB
CLI --> REG
WEB --> DB
WEB --> EMB
WEB --> RAG
WEB --> CFG
WEB --> PAP
WEB --> EXP
WEB --> CLU
RAG --> CFG
RAG --> MCP
RAG --> MCT
MCP --> EMB
MCP --> DB
MCP --> CLU
MCP --> CFG
MCT --> MCP
EMB --> CFG
EMB --> DB
EMB --> PLG
EMB --> PAP
CLU --> EMB
CLU --> DB
CLU --> CFG
DB --> CFG
DB --> PLG
DB --> DBM
REG --> DB
REG --> EMB
REG --> DBM
REG --> CFG
EVL --> CFG
EVL --> DB
EVL --> EMB
EVL --> MCT
EVL --> RAG
PLI --> PLG
style CLI fill:#ff9999,stroke:#cc0000
style DB fill:#99ccff,stroke:#0066cc
style EMB fill:#99ffcc,stroke:#00cc66
style CLU fill:#ffcc99,stroke:#cc6600
style WEB fill:#cc99ff,stroke:#6600cc
style RAG fill:#ffff99,stroke:#cccc00
style MCP fill:#ff99cc,stroke:#cc0066
style REG fill:#99ffff,stroke:#00cccc
CLI Command Flow
The CLI (cli.py) is the primary entry point with 25 command functions. All commands
follow a repeated pattern: parse args → load config → initialize resources → execute → handle errors.
graph TD
MAIN["main()"] --> PARSE[Parse Arguments]
PARSE --> DISPATCH{Command?}
DISPATCH -->|download| DL[download_command]
DISPATCH -->|create-embeddings| CE[create_embeddings_command]
DISPATCH -->|pre-process| PP[pre_process_command]
DISPATCH -->|search| SR[search_command]
DISPATCH -->|chat| CH[chat_command]
DISPATCH -->|web-ui| WU[web_ui_command]
DISPATCH -->|clustering run| CL[cluster_embeddings_command]
DISPATCH -->|clustering clear-cache| CC[clear_clustering_cache_command]
DISPATCH -->|clustering pre-generate| PG[pre_generate_clustering_command]
DISPATCH -->|delete-data| DD[delete_data_command]
DISPATCH -->|mcp-server| MS[mcp_server_command]
DISPATCH -->|eval generate| EG[eval_generate_command]
DISPATCH -->|eval verify| EV[eval_verify_command]
DISPATCH -->|eval run| ER[eval_run_command]
DISPATCH -->|eval results| ERS[eval_results_command]
DISPATCH -->|eval clear| ECL[eval_clear_command]
DISPATCH -->|registry upload| RU[registry_upload_command]
DISPATCH -->|registry download| RD[registry_download_command]
DISPATCH -->|registry list| RL[registry_list_command]
DISPATCH -->|registry delete| RDEL[registry_delete_command]
subgraph "Repeated CLI Pattern (⚠ duplicated in each command)"
A1[Load config] --> A2[Resolve conference/year]
A2 --> A3[Print header banner]
A3 --> A4[Validate embeddings DB]
A4 --> A5[Init EmbeddingsManager]
A5 --> A6[Test LM Studio connection]
A6 --> A7[Init DatabaseManager]
A7 --> A8[Execute command logic]
A8 --> A9[Error handling + traceback]
end
DL --> A1
CE --> A1
SR --> A1
CH --> A1
CL --> A1
PG --> A1
style A1 fill:#ffcccc
style A2 fill:#ffcccc
style A3 fill:#ffcccc
style A4 fill:#ffcccc
style A5 fill:#ffcccc
style A6 fill:#ffcccc
style A7 fill:#ffcccc
style A9 fill:#ffcccc
Data Pipeline Flow
graph LR
subgraph "Data Ingestion"
API[Conference APIs<br/>OpenReview, etc.] -->|download| PLG[Plugin System]
PLG -->|LightweightPaper| DB[(SQL Database<br/>SQLite/PostgreSQL)]
end
subgraph "Embedding Generation"
DB -->|paper text| EMB[EmbeddingsManager]
LLM1[LLM Backend<br/>LM Studio/Blablador] -->|embedding vectors| EMB
EMB -->|store| CHR[(ChromaDB)]
end
subgraph "Analysis & Search"
CHR -->|embeddings| CLU[ClusteringManager]
CLU -->|cache results| DB
CHR -->|similarity search| SEM[Semantic Search]
DB -->|keyword search| KW[Keyword Search]
end
subgraph "User Interfaces"
SEM --> WEB[Web UI]
KW --> WEB
CLU --> WEB
SEM --> RAG[RAG Chat]
CLU --> RAG
RAG --> MCP[MCP Tools]
WEB --> EXP[Export ZIP/Markdown]
end
subgraph "Registry"
DB -->|export| REG[OCI Registry]
CHR -->|export| REG
REG -->|import| DB
REG -->|import| CHR
end
Database Layer Flow
graph TD
subgraph "DatabaseManager - 2943 lines"
direction TB
CONN["Connection Management<br/>__init__, connect, close, __enter__, __exit__"]
subgraph "⚠ Duplicate: Session Validation"
SV1["if not self._session: raise DatabaseError<br/>(repeated in 30+ methods)"]
end
subgraph "Paper Operations"
ADD["add_paper / add_papers"]
SEARCH["search_papers<br/>(flexible filter builder)"]
KWSEARCH["search_papers_keyword<br/>(⚠ thin wrapper around search_papers)"]
AUTHSEARCH["search_authors_in_papers"]
end
subgraph "⚠ Duplicate: Faceting Methods"
SESS["get_sessions(conference, year)"]
CONF["get_conferences(year)"]
YEARS["get_years(conference)"]
YFC["get_years_for_conference(conference)<br/>⚠ duplicate of get_years"]
CY["get_conference_years_from_db()"]
end
subgraph "Clustering Cache"
CGET["get_clustering_cache"]
CSAVE["save_clustering_cache"]
CDEL["delete_clustering_cache_by_conference_year"]
CCNT["count_clustering_cache_by_conference_year"]
CCLR["clear_clustering_cache"]
end
subgraph "⚠ Duplicate: CRUD Patterns"
direction LR
QA["Eval QA Pairs<br/>add, get, count, update, delete"]
ER["Eval Results<br/>add, get, delete"]
end
subgraph "Import/Export"
EXPS["export_papers_to_sqlite"]
IMPS["import_papers_from_sqlite"]
EXPC["export_clustering_cache_to_json"]
IMPC["import_clustering_cache_from_json"]
end
end
CONN --> SV1
SV1 --> ADD
SV1 --> SEARCH
SEARCH --> KWSEARCH
SV1 --> SESS
SV1 --> CONF
SV1 --> YEARS
YEARS -.->|"duplicate"| YFC
SV1 --> CGET
SV1 --> QA
SV1 --> ER
SV1 --> EXPS
style YFC fill:#ffcccc,stroke:#cc0000
style SV1 fill:#ffcccc,stroke:#cc0000
style KWSEARCH fill:#ffffcc,stroke:#cccc00
Embeddings & Clustering Flow
graph TD
subgraph "EmbeddingsManager"
ECONN["connect() → ChromaDB"]
ECOL["create_collection()"]
EADD["add_paper → generate_embedding → store"]
ESRCH["search_similar / search_papers_semantic"]
EFIND["find_papers_within_distance"]
EIMP["import/export_embeddings"]
subgraph "⚠ Duplicate: Where-clause building"
W1["search_papers_semantic:<br/>build $and/$in filter"]
W2["find_papers_within_distance:<br/>build $and/$in filter"]
end
end
subgraph "ClusteringManager"
CLOAD["load_embeddings"]
CREDUC["reduce_dimensions<br/>(PCA / t-SNE / UMAP)"]
CCLUST["cluster<br/>(K-Means / DBSCAN / Agglom / Spectral / Fuzzy)"]
CKEYS["extract_cluster_keywords"]
CLBL["generate_cluster_labels"]
CHRCH["generate_hierarchical_labels"]
CRES["get_clustering_results"]
subgraph "⚠ Duplicate: TF-IDF extraction"
T1["extract_cluster_keywords"]
T2["_extract_keywords_for_samples<br/>⚠ nearly identical to T1"]
end
subgraph "⚠ Duplicate: LLM label generation"
L1["_generate_llm_label"]
L2["_generate_parent_label_llm<br/>⚠ same pattern as L1"]
L3["_generate_llm_label_from_keywords<br/>⚠ same pattern as L1"]
end
subgraph "⚠ Duplicate: Where-clause building"
W3["load_embeddings:<br/>build $and/$in filter"]
end
end
subgraph "Standalone Functions"
PF["perform_clustering<br/>(convenience wrapper)"]
CWC["compute_clusters_with_cache<br/>(multi-level caching)"]
W4["⚠ also builds where-clauses"]
end
ECONN --> ECOL --> EADD
ECOL --> ESRCH
ECOL --> EFIND
CLOAD --> CREDUC --> CCLUST --> CKEYS --> CLBL
CCLUST --> CHRCH
CLBL --> CRES
PF --> CLOAD
CWC --> CLOAD
W1 -.->|"duplicate logic"| W2
W1 -.->|"duplicate logic"| W3
W1 -.->|"duplicate logic"| W4
T1 -.->|"duplicate logic"| T2
L1 -.->|"duplicate logic"| L2
L1 -.->|"duplicate logic"| L3
style W1 fill:#ffcccc,stroke:#cc0000
style W2 fill:#ffcccc,stroke:#cc0000
style W3 fill:#ffcccc,stroke:#cc0000
style W4 fill:#ffcccc,stroke:#cc0000
style T1 fill:#ffcccc,stroke:#cc0000
style T2 fill:#ffcccc,stroke:#cc0000
style L1 fill:#ffcccc,stroke:#cc0000
style L2 fill:#ffcccc,stroke:#cc0000
style L3 fill:#ffcccc,stroke:#cc0000
RAG Chat & MCP Tools Flow
graph TD
subgraph "RAG Chat (rag.py)"
RQUERY["query()"]
RCHAT["chat()"]
RBUILD["_build_agent()"]
subgraph "⚠ Duplicate: Tool wrappers (6 identical patterns)"
TW1["_tool_search_papers"]
TW2["_tool_get_conference_topics"]
TW3["_tool_get_topic_evolution"]
TW4["_tool_analyze_topic_relevance"]
TW5["_tool_get_cluster_visualization"]
TW6["_tool_get_paper_details"]
end
end
subgraph "MCP Tools (mcp_tools.py)"
EXEC["execute_mcp_tool<br/>(dispatcher)"]
SCHEMA["get_mcp_tools_schema"]
subgraph "⚠ Duplicate: Normalization (4 similar functions)"
N1["_normalize_search_papers_args"]
N2["_normalize_get_topic_evolution_args"]
N3["_normalize_analyze_topic_relevance_args"]
N4["_normalize_get_paper_details_args"]
end
subgraph "⚠ Duplicate: Formatters (6 similar functions)"
F1["_format_topic_relevance_result"]
F2["_format_conference_topics_result"]
F3["_format_topic_evolution_result"]
F4["_format_search_papers_result"]
F5["_format_visualization_result"]
F6["_format_paper_details_result"]
end
end
subgraph "MCP Server (mcp_server.py)"
subgraph "Core Tool Functions"
S1["search_papers"]
S2["get_conference_topics"]
S3["get_topic_evolution"]
S4["analyze_topic_relevance"]
S5["get_cluster_visualization"]
S6["get_paper_details"]
end
subgraph "⚠ Duplicate: Resource init"
RI["get_config → EmbeddingsManager → DatabaseManager<br/>(repeated in each tool function)"]
end
subgraph "⚠ Duplicate: Where-clause merging"
WM1["merge_where_clause_with_conference"]
WM2["merge_where_clause_with_years"]
end
end
RBUILD --> TW1 & TW2 & TW3 & TW4 & TW5 & TW6
TW1 --> S1
TW2 --> S2
TW3 --> S3
TW4 --> S4
TW5 --> S5
TW6 --> S6
EXEC --> N1 --> S1
EXEC --> N2 --> S3
EXEC --> N3 --> S4
EXEC --> N4 --> S6
S1 --> WM1
S1 --> WM2
style TW1 fill:#ffcccc,stroke:#cc0000
style TW2 fill:#ffcccc,stroke:#cc0000
style TW3 fill:#ffcccc,stroke:#cc0000
style TW4 fill:#ffcccc,stroke:#cc0000
style TW5 fill:#ffcccc,stroke:#cc0000
style TW6 fill:#ffcccc,stroke:#cc0000
style N1 fill:#ffcccc,stroke:#cc0000
style N2 fill:#ffcccc,stroke:#cc0000
style N3 fill:#ffcccc,stroke:#cc0000
style N4 fill:#ffcccc,stroke:#cc0000
style F1 fill:#ffffcc,stroke:#cccc00
style F2 fill:#ffffcc,stroke:#cccc00
style F3 fill:#ffffcc,stroke:#cccc00
style F4 fill:#ffffcc,stroke:#cccc00
style F5 fill:#ffffcc,stroke:#cccc00
style F6 fill:#ffffcc,stroke:#cccc00
style RI fill:#ffcccc,stroke:#cc0000
Web UI Request Flow
graph TD
subgraph "Flask Web UI (app.py - 22 routes)"
REQ[HTTP Request] --> MW["Middleware<br/>ProxyFix, CORS, teardown_db"]
MW --> IDX["GET / → index()"]
MW --> CIDX["GET /<conference> → conference_index()"]
MW --> HLTH["GET /health"]
MW --> STAT["GET /api/stats"]
MW --> EMOD["GET /api/embedding-model-check"]
MW --> FILT["GET /api/filters"]
MW --> AFILT["GET /api/available-filters"]
MW --> SRCH["POST /api/search"]
MW --> GPAP["GET /api/paper/<uid>"]
MW --> BATCH["POST /api/papers/batch"]
MW --> CHAT["POST /api/chat"]
MW --> RSET["POST /api/chat/reset"]
MW --> CCMP["POST /api/clusters/compute"]
MW --> CCCH["GET /api/clusters/cached"]
MW --> CDEF["GET /api/clusters/default-count"]
MW --> CSRCH["POST /api/clusters/search"]
MW --> GYRS["GET /api/years"]
MW --> EXPRT["POST /api/export/interesting-papers"]
MW --> DOND["POST /api/donate-data"]
MW --> DONC["POST /api/donate-chat"]
subgraph "⚠ Repeated Pattern in All Routes"
P1["db = get_database()"]
P2["try/except + logger.error + jsonify"]
P3["Parameter validation + type conversion"]
end
end
SRCH -->|keyword| DB[(DatabaseManager)]
SRCH -->|semantic| EMB[(EmbeddingsManager)]
CHAT --> RAG[RAGChat]
CCMP --> CLU[ClusteringManager]
EXPRT --> EXU[export_utils]
GPAP --> PAP[paper_utils]
style P1 fill:#ffcccc,stroke:#cc0000
style P2 fill:#ffcccc,stroke:#cc0000
style P3 fill:#ffcccc,stroke:#cc0000
Export & Paper Utilities Flow
graph TD
subgraph "export_utils.py"
EXP["export_papers_to_zip"] --> GFS["generate_folder_structure_export"]
GFS --> GMR["generate_main_readme"]
GFS --> GAP["generate_all_papers_markdown"]
GFS --> GST["generate_search_term_markdown"]
subgraph "⚠ Duplicate: Paper formatting"
GAP_F["generate_all_papers_markdown:<br/>group by session → format paper block"]
GST_F["generate_search_term_markdown:<br/>group by session → format paper block<br/>⚠ nearly identical to GAP"]
end
end
subgraph "paper_utils.py"
GPA["get_paper_with_authors"] --> DB[(DatabaseManager)]
FSR["format_search_results"] --> GPA
BCP["build_context_from_papers"]
end
style GAP_F fill:#ffcccc,stroke:#cc0000
style GST_F fill:#ffcccc,stroke:#cc0000
Registry Upload/Download Flow
graph LR
subgraph "RegistryClient"
UP["upload()"] --> EY["_export_year()"]
EY --> PT["_push_tag()"]
DL["download()"] --> FM["_find_best_matching_tag()"]
FM --> IY["_import_year()"]
IY --> CE["_check_embedding_model()"]
subgraph "⚠ Duplicate: Progress callbacks"
PC1["upload: def _progress(...)"]
PC2["download: def _progress(...)"]
PC3["_export_year: def _progress(...)"]
PC4["_import_year: def _progress(...)"]
end
end
EY --> DB[(DatabaseManager<br/>export_papers_to_sqlite<br/>export_clustering_cache_to_json)]
EY --> EMB[(EmbeddingsManager<br/>export_embeddings)]
IY --> DB2[(DatabaseManager<br/>import_papers_from_sqlite<br/>import_clustering_cache_from_json)]
IY --> EMB2[(EmbeddingsManager<br/>import_embeddings)]
style PC1 fill:#ffcccc,stroke:#cc0000
style PC2 fill:#ffcccc,stroke:#cc0000
style PC3 fill:#ffcccc,stroke:#cc0000
style PC4 fill:#ffcccc,stroke:#cc0000
Summary of Duplicate Code Paths
The following table summarizes all identified duplicate code patterns, ordered by severity:
# |
Pattern |
Occurrences |
Modules |
Severity |
|---|---|---|---|---|
1 |
Session validation boilerplate ( |
30+ |
High |
|
2 |
ChromaDB where-clause construction ( |
5+ |
embeddings.py, clustering.py, mcp_server.py |
High |
3 |
EmbeddingsManager initialization sequence |
6 |
High |
|
4 |
LM Studio connection test + error message |
4 |
Medium |
|
5 |
Embeddings DB path validation |
4 |
Medium |
|
6 |
CLI command header printing |
15+ |
Medium |
|
7 |
TF-IDF keyword extraction |
2 |
Medium |
|
8 |
LLM label generation (OpenAI call + fallback) |
3 |
Medium |
|
9 |
RAG tool wrapper functions (identical pattern) |
6 |
Medium |
|
10 |
MCP argument normalization functions |
4 |
mcp_tools.py |
Medium |
11 |
Resource init in MCP tools (config → embed → db) |
6 |
mcp_server.py |
Medium |
12 |
Conference/year argument resolution |
8+ |
Low |
|
13 |
Paper markdown formatting (session grouping) |
2 |
export_utils.py |
Low |
14 |
Error handling + traceback pattern |
15+ |
Low |
|
15 |
Faceting query pattern |
4 |
Low |
|
16 |
Eval CRUD parallel structures |
2×5 |
Low |
|
17 |
Progress callback definitions |
4 |
Low |
|
18 |
Web route error handling (try/except/jsonify) |
22 |
web_ui/app.py |
Low |
19 |
|
2 |
Low |