Evaluation Module
The evaluation module implements automatic evaluation of the RAG system using an LLM-as-judge approach.
Features
Generate evaluation question-answer pairs from the paper database and MCP tools
Run the RAG pipeline on stored Q/A pairs and score output with an LLM judge
Compute summary statistics and format results for display
Evaluate follow-up questions
Quick Start
from abstracts_explorer.evaluation import Evaluator
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager
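
# The Evaluator expects connected managers
em = EmbeddingsManager()
em.connect()
with DatabaseManager() as db:
    evaluator = Evaluator(em, db)
    # Generate Q/A pairs using the LLM and store them
    pairs = evaluator.generate_qa_pairs(n_pairs_per_tool=2)
    evaluator.store_qa_pairs(pairs)
    # Run the evaluation; run_evaluation() returns the run_id
    run_id = evaluator.run_evaluation()
    print(evaluator.format_run_summary(run_id))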
API Reference
Automatic Evaluation
This module implements automatic evaluation of the RAG system via the Evaluator class. It provides methods for:
Generating evaluation Q/A pairs from the paper database and MCP tools using an LLM.
Running the RAG pipeline on stored Q/A pairs and scoring the output with an LLM-as-judge approach.
Computing summary statistics and formatting results for display.
- exception abstracts_explorer.evaluation.EvaluationError
Bases: Exception
Exception raised for evaluation-related errors.
- abstracts_explorer.evaluation.format_eval_summary(summary, run_id)
Format an evaluation run summary for display.
- abstracts_explorer.evaluation.format_eval_result_detail(result, qa_pair=None)
Format a single evaluation result for display.
- class abstracts_explorer.evaluation.Evaluator(embeddings_manager, db, model=None)
Bases: object
Automatic evaluation of the RAG system.
Wraps all evaluation operations (Q/A pair generation, evaluation execution, and result storage), sharing a single EmbeddingsManager and its openai_client for LLM calls.
- Parameters:
embeddings_manager (EmbeddingsManager) – A connected embeddings manager. Its openai_client property is used for all LLM calls (generation and judging).
db (DatabaseManager) – A connected database manager.
model (str, optional) – Chat model name. Falls back to the config default.
Examples
>>> em = EmbeddingsManager()
>>> em.connect()
>>> with DatabaseManager() as db:
...     evaluator = Evaluator(em, db)
...     pairs = evaluator.generate_qa_pairs(n_pairs_per_tool=2)
...     evaluator.store_qa_pairs(pairs)
...     run_id = evaluator.run_evaluation()
...     print(evaluator.format_run_summary(run_id))
- property openai_client
OpenAI client shared with the embeddings manager.
- Returns:
The lazily-initialised client from EmbeddingsManager.
- Return type:
OpenAI
- generate_qa_pairs(n_pairs_per_tool=2, tools=None, generate_followups=True, n_followups=1)
Generate evaluation Q/A pairs using the LLM.
For each requested MCP tool, a set of Q/A pairs is generated from papers sampled from the database. Optionally, follow-up questions are generated for each initial pair.
- Parameters:
n_pairs_per_tool (int) – Number of initial Q/A pairs to generate per tool.
tools (list of str, optional) – MCP tool names to generate pairs for. Defaults to all tools.
generate_followups (bool) – Whether to generate follow-up questions.
n_followups (int) – Number of follow-up turns per initial pair.
- Returns:
Generated pairs, each with keys: conversation_id, turn_number, query, expected_answer, tool_name, source_info.
- Return type:
list of dict
- Raises:
EvaluationError – If generation fails.
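The keys listed above suggest that pairs from one conversation share a conversation_id and are ordered by turn_number. As a minimal sketch under that assumption, the helper below groups a list of pair dicts into conversations. The helper name group_by_conversation, the sample values, and the 1-based turn numbering are hypothetical and not part of abstracts_explorer.

```python
from collections import defaultdict


def group_by_conversation(pairs):
    """Group Q/A pair dicts by conversation_id, sorted by turn_number."""
    conversations = defaultdict(list)
    for pair in pairs:
        conversations[pair["conversation_id"]].append(pair)
    # Sort each conversation so the initial question precedes its follow-ups
    return {
        cid: sorted(turns, key=lambda p: p["turn_number"])
        for cid, turns in conversations.items()
    }


# Hypothetical sample data using the documented keys; all values are made up
sample_pairs = [
    {"conversation_id": "conv-1", "turn_number": 2,
     "query": "Which dataset does it use?", "expected_answer": "Dataset A",
     "tool_name": "example_tool", "source_info": {}},
    {"conversation_id": "conv-1", "turn_number": 1,
     "query": "What does the paper propose?", "expected_answer": "Method X",
     "tool_name": "example_tool", "source_info": {}},
    {"conversation_id": "conv-2", "turn_number": 1,
     "query": "Which papers discuss topic Y?", "expected_answer": "Papers B and C",
     "tool_name": "example_tool", "source_info": {}},
]

grouped = group_by_conversation(sample_pairs)
for cid, turns in grouped.items():
    print(cid, [t["query"] for t in turns])
```

Grouping this way makes it easy to check that every follow-up has its initial question before the pairs are passed to store_qa_pairs().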
- store_qa_pairs(pairs)
Persist generated Q/A pairs into the database.
- Parameters:
pairs (list of dict) – Pairs as returned by generate_qa_pairs().
- Returns:
Number of pairs stored.
- Return type:
int
- run_evaluation(verified_only=True, limit=None)
Run evaluation on stored Q/A pairs and record results.
Executes each stored query through the RAG system, scores the output with an LLM judge, and stores the results in the database.
- Parameters:
- Returns:
The run_id for the evaluation run.
- Return type:
- Raises:
EvaluationError – If evaluation fails.
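The flow described for run_evaluation() (answer each stored query, score the answer with a judge, record the result) can be illustrated with a small, library-independent sketch. It is not the Evaluator implementation: answer_fn and judge_fn are hypothetical stand-ins for the RAG pipeline and the LLM judge, and the 0-to-1 score scale is an assumption.

```python
from statistics import mean


def evaluate_pairs(pairs, answer_fn, judge_fn):
    """Answer each query, score it against the expected answer, collect results."""
    results = []
    for pair in pairs:
        answer = answer_fn(pair["query"])
        score = judge_fn(pair["query"], pair["expected_answer"], answer)
        results.append({"query": pair["query"], "answer": answer, "score": score})
    return results


def summarize(results):
    """Compute simple summary statistics over the judged results."""
    scores = [r["score"] for r in results]
    return {"n": len(scores), "mean_score": mean(scores) if scores else None}


# Toy stand-in for the RAG pipeline: a lookup table of canned answers
def toy_rag(query):
    canned = {"What is RAG?": "Retrieval-augmented generation"}
    return canned.get(query, "I don't know")


# Toy stand-in for the LLM judge: case-insensitive exact match, scored 0 or 1
def toy_judge(query, expected, answer):
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


pairs = [
    {"query": "What is RAG?", "expected_answer": "retrieval-augmented generation"},
    {"query": "What is MCP?", "expected_answer": "Model Context Protocol"},
]

results = evaluate_pairs(pairs, toy_rag, toy_judge)
summary = summarize(results)
print(summary)
```

A real judge would prompt a chat model to compare the answer with the expected answer. Passing the judge in as a callable keeps the loop testable without any LLM calls.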