Evaluation Module

The evaluation module implements automatic evaluation of the RAG system using an LLM-as-judge approach.

Features

  • Generate evaluation question-answer pairs from the paper database and MCP tools

  • Run the RAG pipeline on stored Q/A pairs and score output with an LLM judge

  • Compute summary statistics and format results for display

  • Evaluate follow-up questions in multi-turn conversations

Quick Start

from abstracts_explorer.evaluation import Evaluator
from abstracts_explorer.embeddings import EmbeddingsManager
from abstracts_explorer.database import DatabaseManager

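em = EmbeddingsManager()
em.connect()  # Evaluator expects a connected embeddings manager

with DatabaseManager() as db:
    evaluator = Evaluator(em, db)

    # Generate Q/A pairs using the LLM and persist them
    pairs = evaluator.generate_qa_pairs(n_pairs_per_tool=2)
    evaluator.store_qa_pairs(pairs)

    # run_evaluation() returns a run_id, not the summary itself
    run_id = evaluator.run_evaluation()
    print(evaluator.format_run_summary(run_id))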

API Reference

Automatic Evaluation

This module implements automatic evaluation of the RAG system via the Evaluator class. It provides methods for:

  • Generating evaluation Q/A pairs from the paper database and MCP tools using an LLM.

  • Running the RAG pipeline on stored Q/A pairs and scoring the output with an LLM-as-judge approach.

  • Computing summary statistics and formatting results for display.

exception abstracts_explorer.evaluation.EvaluationError

Bases: Exception

Exception raised for evaluation-related errors.
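
Because generate_qa_pairs() and run_evaluation() are both documented to raise EvaluationError, a caller can treat it as the single failure type for an evaluation step. The following is a minimal sketch of that pattern. The `run_step` helper, `failing_step`, and the fallback class are illustrative assumptions and are not part of the package.

```python
try:
    # Use the real exception when the package is installed.
    from abstracts_explorer.evaluation import EvaluationError
except ImportError:
    # Stand-in so the sketch runs on its own (assumption: same base class).
    class EvaluationError(Exception):
        """Exception raised for evaluation-related errors."""


def run_step(step, *args, **kwargs):
    """Call an evaluation step; return (result, None) or (None, message).

    `run_step` is a hypothetical helper, not part of the package.
    """
    try:
        return step(*args, **kwargs), None
    except EvaluationError as exc:
        return None, f"evaluation step failed: {exc}"


def failing_step():
    # Simulates a step such as generate_qa_pairs() failing.
    raise EvaluationError("LLM returned no usable Q/A pairs")


result, error = run_step(failing_step)
print(error)
```

Returning the message instead of re-raising is only one option; a batch job might prefer to log the error and continue with the next step.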

abstracts_explorer.evaluation.format_eval_summary(summary, run_id)

Format an evaluation run summary for display.

Parameters:
  • summary (dict) – Summary from DatabaseManager.get_eval_run_summary().

  • run_id (str) – The evaluation run identifier.

Returns:

Human-readable summary string.

Return type:

str
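
The layout of the summary dict is not documented here, so the following is only a sketch of how per-result scores might be aggregated and rendered. The field names (`score`, `n_results`, `mean_score`, `min_score`, `max_score`) and both function names are assumptions, not the schema returned by DatabaseManager.get_eval_run_summary().

```python
from statistics import mean


def summarize_scores(results):
    """Aggregate per-result scores into a summary dict.

    Assumption: each result row carries a numeric `score`; the real
    rows stored by the package may use different fields.
    """
    scores = [row["score"] for row in results]
    if not scores:
        return {"n_results": 0, "mean_score": None,
                "min_score": None, "max_score": None}
    return {
        "n_results": len(scores),
        "mean_score": mean(scores),
        "min_score": min(scores),
        "max_score": max(scores),
    }


def format_summary(summary, run_id):
    """Render the summary dict as a short human-readable string."""
    lines = [f"Evaluation run {run_id}", f"  results: {summary['n_results']}"]
    if summary["n_results"]:
        lines.append(f"  mean score: {summary['mean_score']:.2f}")
        lines.append(
            f"  score range: {summary['min_score']} to {summary['max_score']}"
        )
    return "\n".join(lines)


rows = [{"score": 4}, {"score": 5}, {"score": 3}]
print(format_summary(summarize_scores(rows), "run-001"))
```

Keeping aggregation and formatting in separate functions mirrors the split suggested by the documented API, where the summary is computed by the database manager and only rendered here.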

abstracts_explorer.evaluation.format_eval_result_detail(result, qa_pair=None)

Format a single evaluation result for display.

Parameters:
  • result (dict) – Result row from DatabaseManager.get_eval_results().

  • qa_pair (dict, optional) – Corresponding Q/A pair for additional context.

Returns:

Human-readable detail string.

Return type:

str

class abstracts_explorer.evaluation.Evaluator(embeddings_manager, db, model=None)

Bases: object

Automatic evaluation of the RAG system.

Wraps all evaluation operations (Q/A pair generation, evaluation execution, and result storage) and shares a single EmbeddingsManager, whose openai_client is used for all LLM calls.

Parameters:
  • embeddings_manager (EmbeddingsManager) – A connected embeddings manager. Its openai_client property is used for all LLM calls (generation and judging).

  • db (DatabaseManager) – A connected database manager.

  • model (str, optional) – Chat model name. Falls back to config default.

Examples

>>> em = EmbeddingsManager()
>>> em.connect()
>>> with DatabaseManager() as db:
...     evaluator = Evaluator(em, db)
...     pairs = evaluator.generate_qa_pairs(n_pairs_per_tool=2)
...     evaluator.store_qa_pairs(pairs)
...     run_id = evaluator.run_evaluation()
...     print(evaluator.format_run_summary(run_id))

__init__(embeddings_manager, db, model=None)

property openai_client

OpenAI client shared with the embeddings manager.

Returns:

The lazily-initialised client from EmbeddingsManager.

Return type:

OpenAI

generate_qa_pairs(n_pairs_per_tool=2, tools=None, generate_followups=True, n_followups=1)

Generate evaluation Q/A pairs using the LLM.

For each requested MCP tool a set of query/answer pairs is generated based on papers sampled from the database. Optionally, follow-up questions are generated for each initial pair.

Parameters:
  • n_pairs_per_tool (int) – Number of initial Q/A pairs to generate per tool.

  • tools (list of str, optional) – MCP tool names to generate pairs for. Defaults to all tools.

  • generate_followups (bool) – Whether to generate follow-up questions.

  • n_followups (int) – Number of follow-up turns per initial pair.

Returns:

Generated pairs, each with keys: conversation_id, turn_number, query, expected_answer, tool_name, source_info.

Return type:

list of dict

Raises:

EvaluationError – If generation fails.
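
Since follow-up turns share a conversation_id with their initial pair, a consumer of the returned list will often want to regroup it into conversations. A minimal sketch under stated assumptions: the sample values, the tool name `search_papers`, the dict type of source_info, and turn numbering starting at 1 are all made up for illustration, and `group_conversations` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written sample in the documented shape; every value is invented.
pairs = [
    {"conversation_id": "c1", "turn_number": 2,
     "query": "Which of those use transformers?",
     "expected_answer": "(reference answer)",
     "tool_name": "search_papers", "source_info": {}},
    {"conversation_id": "c1", "turn_number": 1,
     "query": "Which papers study sparse attention?",
     "expected_answer": "(reference answer)",
     "tool_name": "search_papers", "source_info": {}},
    {"conversation_id": "c2", "turn_number": 1,
     "query": "What topics are most common?",
     "expected_answer": "(reference answer)",
     "tool_name": "search_papers", "source_info": {}},
]


def group_conversations(pairs):
    """Group pairs by conversation_id and order each group by turn_number."""
    conversations = defaultdict(list)
    for pair in pairs:
        conversations[pair["conversation_id"]].append(pair)
    for turns in conversations.values():
        turns.sort(key=lambda p: p["turn_number"])
    return dict(conversations)


for conv_id, turns in group_conversations(pairs).items():
    print(conv_id, [t["query"] for t in turns])
```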

store_qa_pairs(pairs)

Persist generated Q/A pairs into the database.

Parameters:

pairs (list of dict) – Pairs as returned by generate_qa_pairs().

Returns:

Number of pairs stored.

Return type:

int
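
To make the contract concrete (a list of pair dicts in, the number of stored rows out), here is a rough sketch using an in-memory SQLite database. The table name `eval_qa_pairs`, its columns, the JSON encoding of source_info, and the `store_pairs` function are assumptions; the actual DatabaseManager schema is not shown in this reference.

```python
import json
import sqlite3

# Keys documented for each generated pair.
QA_KEYS = ("conversation_id", "turn_number", "query",
           "expected_answer", "tool_name", "source_info")


def store_pairs(conn, pairs):
    """Insert pairs into a hypothetical table and return the count stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_qa_pairs ("
        "conversation_id TEXT, turn_number INTEGER, query TEXT, "
        "expected_answer TEXT, tool_name TEXT, source_info TEXT)"
    )
    rows = [
        # source_info is serialised as JSON (an assumption about its type).
        tuple(json.dumps(p[k]) if k == "source_info" else p[k] for k in QA_KEYS)
        for p in pairs
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO eval_qa_pairs VALUES (?, ?, ?, ?, ?, ?)", rows
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
sample = [{
    "conversation_id": "c1", "turn_number": 1,
    "query": "Which papers study sparse attention?",
    "expected_answer": "(reference answer)",
    "tool_name": "search_papers",  # hypothetical tool name
    "source_info": {"paper_ids": [1, 2]},  # invented content
}]
stored = store_pairs(conn, sample)
print(stored)
```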

run_evaluation(verified_only=True, limit=None)

Run evaluation on stored Q/A pairs and record results.

Executes each stored query through the RAG system, scores the output with an LLM judge, and stores the results in the database.

Parameters:
  • verified_only (bool) – If True, only evaluate verified pairs (default).

  • limit (int, optional) – Maximum number of pairs to evaluate.

Returns:

The run_id for the evaluation run.

Return type:

str

Raises:

EvaluationError – If evaluation fails.
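
The answer-then-judge loop described above can be sketched without an LLM by passing in stub callables. Everything here is illustrative: `run_eval_loop`, the `verified` flag on a pair, the result-row fields, and the exact-match judge are assumptions. Unlike run_evaluation(), which stores results in the database and returns only the run_id, this sketch returns the rows directly.

```python
import uuid


def run_eval_loop(pairs, answer_fn, judge_fn, verified_only=True, limit=None):
    """Answer each selected pair, score it with a judge, collect result rows.

    `answer_fn(query) -> str` stands in for the RAG pipeline and
    `judge_fn(query, expected, actual) -> float` for the LLM judge.
    The `verified` flag on a pair is an assumed field.
    """
    selected = [p for p in pairs if not verified_only or p.get("verified", False)]
    if limit is not None:
        selected = selected[:limit]

    run_id = uuid.uuid4().hex  # assumption: any unique string serves as run id
    results = []
    for pair in selected:
        actual = answer_fn(pair["query"])
        score = judge_fn(pair["query"], pair["expected_answer"], actual)
        results.append({"run_id": run_id, "query": pair["query"],
                        "actual_answer": actual, "score": score})
    return run_id, results


# Stubs so the sketch runs offline.
def stub_rag(query):
    return "42" if "answer" in query else "unknown"


def exact_match_judge(query, expected, actual):
    # A real LLM judge would grade semantic correctness, not string equality.
    return 1.0 if expected == actual else 0.0


pairs = [
    {"query": "What is the answer?", "expected_answer": "42", "verified": True},
    {"query": "Who wrote it?", "expected_answer": "Adams", "verified": True},
    {"query": "Unreviewed question", "expected_answer": "n/a", "verified": False},
]
run_id, results = run_eval_loop(pairs, stub_rag, exact_match_judge)
print([r["score"] for r in results])
```

Injecting the pipeline and the judge as callables keeps the loop testable; in the real class both roles are filled by LLM calls through the shared openai_client.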

format_run_summary(run_id)

Compute and format the summary for run_id.

Parameters:

run_id (str) – Evaluation run identifier.

Returns:

Human-readable summary.

Return type:

str