Evaluations

Overview

Evaluation types, metrics, and how scoring works

Last updated: August 20, 2025
Category: evaluations

An evaluation runs your RAG pipeline against a benchmark and computes quantitative metrics. Vecta supports three evaluation types, each measuring different aspects of your system.

Evaluation Types

Type                   | Function Signature        | Metrics
Retrieval only         | query → list[str]         | Precision, Recall, F1 at chunk / page / document levels
Generation only        | query → str               | Accuracy, Groundedness
Retrieval + Generation | query → (list[str], str)  | All of the above
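The three evaluation types correspond to three callable shapes. A minimal sketch of what each signature looks like in Python (the type aliases and the placeholder pipeline are illustrative, not part of the Vecta API):

```python
from typing import Callable

# Illustrative aliases for the three evaluation signatures.
RetrievalFn = Callable[[str], list[str]]            # query -> retrieved chunk IDs
GenerationFn = Callable[[str], str]                 # query -> generated answer
RagFn = Callable[[str], tuple[list[str], str]]      # query -> (chunk IDs, answer)

def my_rag_pipeline(query: str) -> tuple[list[str], str]:
    """Toy Retrieval + Generation pipeline with placeholder logic."""
    retrieved = ["chunk-1", "chunk-2"]      # pretend retrieval step
    answer = f"Answer to: {query}"          # pretend generation step
    return retrieved, answer
```

A Retrieval-only evaluation would return just the chunk-ID list; a Generation-only evaluation just the answer string.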

Retrieval Metrics

Retrieval metrics are computed at three semantic granularities:

Chunk-level

Compares the exact set of chunk IDs your system retrieved against the ground-truth chunk IDs in the benchmark.

  • Precision — What fraction of retrieved chunks are actually relevant?
  • Recall — What fraction of relevant chunks did the system find?
  • F1 Score — Harmonic mean of precision and recall
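These three metrics follow directly from set intersection between retrieved and expected chunk IDs. A minimal sketch (function name is illustrative):

```python
def retrieval_prf(retrieved: list[str], expected: list[str]) -> tuple[float, float, float]:
    """Chunk-level precision, recall, and F1 via set intersection."""
    r, e = set(retrieved), set(expected)
    hits = len(r & e)                                   # chunks that match ground truth
    precision = hits / len(r) if r else 0.0             # fraction of retrieved that are relevant
    recall = hits / len(e) if e else 0.0                # fraction of relevant that were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, retrieving `["a", "b", "c"]` when the ground truth is `["a", "b", "d"]` yields precision, recall, and F1 of 2/3 each.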

Page-level

Compares the page numbers of retrieved chunks against ground-truth pages (when available), using overlap logic: a retrieved chunk earns credit if any of its pages matches an expected page for the same document.

Page-level metrics are only computed when benchmark entries have page_nums values.
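The overlap rule can be sketched as follows, assuming a mapping from document name to its set of expected page numbers (the function name and data shapes are illustrative):

```python
def chunk_has_page_credit(chunk_doc: str,
                          chunk_pages: list[int],
                          expected: dict[str, set[int]]) -> bool:
    """A retrieved chunk earns page-level credit if any of its pages
    appears in the expected page set for the same document.

    `expected` maps document name -> ground-truth page numbers."""
    return any(page in expected.get(chunk_doc, set()) for page in chunk_pages)
```

A chunk spanning pages 2-3 of `report.pdf` earns credit when page 3 is expected for that document, but a matching page number in a different document does not count.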

Document-level

Compares the set of source documents (file names) represented by retrieved chunks against ground-truth documents.
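Because several chunks often come from the same file, retrieved chunks are first collapsed to their unique source documents before comparison. A minimal sketch (function name is illustrative):

```python
def doc_level_f1(retrieved_chunk_docs: list[str], expected_docs: list[str]) -> float:
    """Document-level F1: collapse chunks to unique source documents,
    then compare against the ground-truth document set."""
    retrieved = set(retrieved_chunk_docs)   # duplicates from multiple chunks collapse
    expected = set(expected_docs)
    hits = len(retrieved & expected)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)
```

For example, three chunks drawn from `a.pdf`, `a.pdf`, and `b.pdf` collapse to two documents; if only `a.pdf` is expected, precision is 1/2, recall is 1, and F1 is 2/3.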

Generation Metrics

Generation quality is assessed using LLM-as-a-judge scoring:

  • Accuracy (0–1) — How well does the generated answer match the expected answer? Measures semantic equivalence, not exact string match.
  • Groundedness (0–1) — Is the generated answer self-consistent and free from invented facts? Measures whether the answer stays grounded in retrievable information.

Each question is scored independently, and the final metrics are averages across all questions.
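The final generation metrics are plain averages of the per-question judge scores. A minimal sketch of that aggregation step, assuming each per-question record carries 0-1 `accuracy` and `groundedness` values (names are illustrative):

```python
def average_generation_scores(per_question: list[dict[str, float]]) -> dict[str, float]:
    """Average independent per-question LLM-judge scores into final metrics."""
    n = len(per_question)
    return {
        "accuracy": sum(q["accuracy"] for q in per_question) / n,
        "groundedness": sum(q["groundedness"] for q in per_question) / n,
    }
```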

How Scoring Works

  1. Your function is called once per benchmark question
  2. Retrieval metrics are computed by set intersection between retrieved and expected IDs
  3. Generation metrics use a structured LLM call that returns JSON scores
  4. Per-question detailed results are stored for drill-down analysis
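The scoring loop above can be sketched end to end. This is a simplified illustration (the pipeline and benchmark shapes are assumptions, not the Vecta API), showing one call per question, set-intersection retrieval metrics, and stored per-question results:

```python
def run_evaluation(pipeline, benchmark):
    """Call the pipeline once per benchmark question, compute retrieval
    metrics by set intersection, and keep per-question detail for drill-down.

    `pipeline`: query -> (retrieved chunk IDs, generated answer)
    `benchmark`: list of {"query": str, "expected_chunks": list[str]}
    """
    results = []
    for question in benchmark:
        retrieved, answer = pipeline(question["query"])     # one call per question
        r = set(retrieved)
        e = set(question["expected_chunks"])
        hits = len(r & e)                                   # set intersection
        results.append({
            "query": question["query"],
            "precision": hits / len(r) if r else 0.0,
            "recall": hits / len(e) if e else 0.0,
            "answer": answer,                               # judged separately by LLM
        })
    return results
```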

Evaluation Lifecycle

  1. Pending — Created but not yet started
  2. Running — Currently executing
  3. Completed — Finished successfully with results
  4. Failed — Encountered an error
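The four lifecycle states could be modeled as a simple enum, for instance when polling evaluation status programmatically (the class and values here are illustrative, not the Vecta API):

```python
from enum import Enum

class EvaluationStatus(Enum):
    """Illustrative status enum mirroring the documented lifecycle."""
    PENDING = "pending"        # created but not yet started
    RUNNING = "running"        # currently executing
    COMPLETED = "completed"    # finished successfully with results
    FAILED = "failed"          # encountered an error
```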

Viewing Results

Platform Dashboard

The Evaluations page shows all completed evaluations with:

  • Evaluation type badge (Retrieval / Generation / RAG)
  • Chunk, page, and document F1 scores
  • Generation accuracy and groundedness
  • Duration and question count
  • Drill-down into per-question results

PDF Export

Export a polished certification-style PDF report from the evaluations page. The report includes an overall grade, SLA compliance status, and metric tables with color-coded heatmaps.

Detailed Results

Click into any evaluation to see per-question breakdowns:

  • Which chunks were retrieved vs. expected
  • Which chunks matched and which were missed
  • Individual precision/recall/F1 per question
  • Generated answer vs. expected answer
  • Per-question accuracy and groundedness scores
