Evaluation Module

The evaluation module implements automatic evaluation of the RAG system using an LLM-as-judge approach.

Features

Generate evaluation question-answer pairs from the paper database and MCP tools
Run the RAG pipeline on stored Q/A pairs and score output with an LLM judge
Compute summary statistics and format results for display
Support evaluation of follow-up questions

Quick Start

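from abstracts_explorer.evaluation import Evaluator
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager

# The Evaluator expects a connected embeddings manager and database manager
em = EmbeddingsManager()
em.connect()

with DatabaseManager() as db:
    evaluator = Evaluator(em, db)

    # Generate Q/A pairs using the LLM and persist them
    pairs = evaluator.generate_qa_pairs(n_pairs_per_tool=2)
    evaluator.store_qa_pairs(pairs)

    # Run the evaluation (only verified pairs by default: verified_only=True).
    # run_evaluation() returns the run_id, not the summary itself.
    run_id = evaluator.run_evaluation()
    print(evaluator.format_run_summary(run_id))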

API Reference

Automatic Evaluation

This module implements automatic evaluation of the RAG system via the Evaluator class. It provides methods for:

Generating evaluation Q/A pairs from the paper database and MCP tools using an LLM.
Running the RAG pipeline on stored Q/A pairs and scoring the output with an LLM-as-judge approach.
Computing summary statistics and formatting results for display.

exception abstracts_explorer.evaluation.EvaluationError

Bases: Exception

Exception raised for evaluation-related errors.

abstracts_explorer.evaluation.format_eval_summary(summary, run_id)

Format an evaluation run summary for display.

Parameters:

summary (dict) – Summary from DatabaseManager.get_eval_run_summary().
run_id (str) – The evaluation run identifier.

Returns:

Human-readable summary string.

Return type:

str

abstracts_explorer.evaluation.format_eval_result_detail(result, qa_pair=None)

Format a single evaluation result for display.

Parameters:

result (dict) – Result row from DatabaseManager.get_eval_results().
qa_pair (dict, optional) – Corresponding Q/A pair for additional context.

Returns:

Human-readable detail string.

Return type:

str

class abstracts_explorer.evaluation.Evaluator(embeddings_manager, db, model=None)

Bases: object

Automatic evaluation of the RAG system.

Wraps all evaluation operations (Q/A pair generation, evaluation execution, and result storage) around a single EmbeddingsManager, whose openai_client is used for all LLM calls.

Parameters:

embeddings_manager (EmbeddingsManager) – A connected embeddings manager. Its openai_client property is used for all LLM calls (generation and judging).
db (DatabaseManager) – A connected database manager.
model (str, optional) – Chat model name. Falls back to config default.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> with DatabaseManager() as db:
...     evaluator = Evaluator(em, db)
...     pairs = evaluator.generate_qa_pairs(n_pairs_per_tool=2)
...     evaluator.store_qa_pairs(pairs)
...     run_id = evaluator.run_evaluation()
...     print(evaluator.format_run_summary(run_id))

__init__(embeddings_manager, db, model=None)

property openai_client

OpenAI client shared with the embeddings manager.

Returns:

The lazily-initialised client from EmbeddingsManager.

Return type:

OpenAI

generate_qa_pairs(n_pairs_per_tool=2, tools=None, generate_followups=True, n_followups=1)

Generate evaluation Q/A pairs using the LLM.

For each requested MCP tool a set of query/answer pairs is generated based on papers sampled from the database. Optionally, follow-up questions are generated for each initial pair.

Parameters:

n_pairs_per_tool (int) – Number of initial Q/A pairs to generate per tool.
tools (list of str, optional) – MCP tool names to generate pairs for. Defaults to all tools.
generate_followups (bool) – Whether to generate follow-up questions.
n_followups (int) – Number of follow-up turns per initial pair.

Returns:

Generated pairs, each with keys: conversation_id, turn_number, query, expected_answer, tool_name, source_info.

Return type:

list of dict

Raises:

EvaluationError – If generation fails.
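
Because follow-up turns share a conversation_id with their initial pair, the flat list returned by generate_qa_pairs() can be regrouped into conversations. The following is a minimal sketch that relies only on the documented keys. The helper name group_by_conversation, the tool name, and all sample values are hypothetical and not part of abstracts_explorer, and it assumes that turn_number sorts in conversation order.

```python
from collections import defaultdict


def group_by_conversation(pairs):
    """Group Q/A pair dicts by conversation_id, ordered by turn_number.

    Hypothetical helper for illustration; not part of abstracts_explorer.
    """
    conversations = defaultdict(list)
    for pair in pairs:
        conversations[pair["conversation_id"]].append(pair)
    # Sort each conversation so follow-up turns come after the initial turn.
    return {
        conv_id: sorted(turns, key=lambda p: p["turn_number"])
        for conv_id, turns in conversations.items()
    }


# Made-up sample data using the documented keys (all values are illustrative).
sample_pairs = [
    {"conversation_id": "c1", "turn_number": 2,
     "query": "Which of those is the most recent?",
     "expected_answer": "placeholder answer",
     "tool_name": "example_tool", "source_info": None},
    {"conversation_id": "c2", "turn_number": 1,
     "query": "How many papers mention retrieval?",
     "expected_answer": "placeholder answer",
     "tool_name": "example_tool", "source_info": None},
    {"conversation_id": "c1", "turn_number": 1,
     "query": "Which papers discuss evaluation?",
     "expected_answer": "placeholder answer",
     "tool_name": "example_tool", "source_info": None},
]

grouped = group_by_conversation(sample_pairs)
for conv_id, turns in grouped.items():
    print(conv_id, [t["turn_number"] for t in turns])
```

Grouping like this can make it easier to review an initial question together with its follow-ups before storing or verifying the pairs.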

store_qa_pairs(pairs)

Persist generated Q/A pairs into the database.

Parameters:

pairs (list of dict) – Pairs as returned by generate_qa_pairs().

Returns:

Number of pairs stored.

Return type:

int

run_evaluation(verified_only=True, limit=None)

Run evaluation on stored Q/A pairs and record results.

Executes each stored query through the RAG system, scores the output with an LLM judge, and stores the results in the database.

Parameters:

verified_only (bool) – If True, only evaluate verified pairs (default).
limit (int, optional) – Maximum number of pairs to evaluate.

Returns:

The run_id for the evaluation run.

Return type:

str

Raises:

EvaluationError – If evaluation fails.
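
The flow described for run_evaluation() (select stored pairs, run each query through the RAG system, score the output with a judge) can be sketched as follows. This is only an illustration under stated assumptions, not the library's implementation: select_pairs and evaluate_pairs are hypothetical helpers, the "verified" and "score" field names are assumed, and the RAG system and judge are stand-in callables rather than real LLM calls.

```python
def select_pairs(pairs, verified_only=True, limit=None):
    """Mimic the documented verified_only and limit parameters.

    Assumes each pair carries a boolean "verified" key (an assumed field name).
    """
    selected = [p for p in pairs if not verified_only or p.get("verified", False)]
    return selected if limit is None else selected[:limit]


def evaluate_pairs(pairs, rag_fn, judge_fn):
    """Run each query through rag_fn and score the answer with judge_fn."""
    results = []
    for pair in pairs:
        answer = rag_fn(pair["query"])
        score = judge_fn(pair["query"], pair["expected_answer"], answer)
        results.append({"query": pair["query"], "answer": answer, "score": score})
    return results


# Stand-ins for the RAG pipeline and the LLM judge (no network calls).
def fake_rag(query):
    return "answer to " + query


def exact_match_judge(query, expected, actual):
    # A real LLM judge would grade semantic agreement; this only compares strings.
    return 1.0 if expected == actual else 0.0


# Made-up stored pairs; only the first two are marked as verified.
stored = [
    {"query": "q1", "expected_answer": "answer to q1", "verified": True},
    {"query": "q2", "expected_answer": "something else", "verified": True},
    {"query": "q3", "expected_answer": "answer to q3", "verified": False},
]

results = evaluate_pairs(select_pairs(stored), fake_rag, exact_match_judge)
mean_score = sum(r["score"] for r in results) / len(results)
print(f"evaluated {len(results)} pairs, mean score {mean_score:.2f}")
```

In the real method the results are stored in the database and identified by the returned run_id; in this sketch they are simply kept in a list.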

format_run_summary(run_id)

Compute and format the summary for run_id.

Parameters:

run_id (str) – Evaluation run identifier.

Returns:

Human-readable summary.

Return type:

str