Evaluations
Evaluation types, metrics, and how scoring works
Overview
An evaluation runs your RAG pipeline against a benchmark and computes quantitative metrics. Vecta supports three evaluation types, each measuring different aspects of your system.
Evaluation Types
| Type | Function Signature | Metrics |
|---|---|---|
| Retrieval only | query → list[str] | Precision, Recall, F1 at chunk / page / document levels |
| Generation only | query → str | Accuracy, Groundedness |
| Retrieval + Generation | query → (list[str], str) | All of the above |
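The three signatures above can be sketched as plain Python callables. The function names and return values here are illustrative, not part of the Vecta API:

```python
from typing import List, Tuple

# Retrieval only: return the IDs of the retrieved chunks.
def retrieval_fn(query: str) -> List[str]:
    return ["chunk-1", "chunk-2"]

# Generation only: return the generated answer text.
def generation_fn(query: str) -> str:
    return "Paris is the capital of France."

# Retrieval + Generation: return both the chunk IDs and the answer.
def rag_fn(query: str) -> Tuple[List[str], str]:
    chunks = retrieval_fn(query)
    return chunks, generation_fn(query)
```

Whichever shape you provide determines which metrics can be computed: chunk IDs enable retrieval metrics, answer text enables generation metrics, and the tuple form enables both.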
Retrieval Metrics
Retrieval metrics are computed at three semantic granularities:
Chunk-level
Compares the exact set of chunk IDs your system retrieved against the ground-truth chunk IDs in the benchmark.
- Precision — What fraction of retrieved chunks are actually relevant?
- Recall — What fraction of relevant chunks did the system find?
- F1 Score — Harmonic mean of precision and recall
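As a rough sketch (not Vecta's internal code), chunk-level scoring reduces to set arithmetic over chunk IDs:

```python
def chunk_metrics(retrieved: set, expected: set) -> dict:
    """Precision, recall, and F1 over chunk-ID sets."""
    hits = len(retrieved & expected)  # chunks both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```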
Page-level
Compares page numbers of retrieved chunks against ground-truth pages (when available). Uses overlap logic: a retrieved chunk earns credit if any of its pages match an expected page for the same document.
Page-level metrics are only computed when benchmark entries have `page_nums` values.
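The overlap rule can be illustrated as follows; the chunk dictionary shape (`doc`, `pages`) is an assumption for the sketch, not the actual schema:

```python
def page_credit(chunk: dict, expected_pages: dict) -> bool:
    """A retrieved chunk earns page-level credit if any of its pages
    appears in the expected pages for the same source document."""
    expected = expected_pages.get(chunk["doc"], set())
    return bool(set(chunk["pages"]) & expected)
```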
Document-level
Compares the set of source documents (file names) represented by retrieved chunks against ground-truth documents.
Generation Metrics
Generation quality is assessed using LLM-as-a-judge scoring:
- Accuracy (0–1) — How well does the generated answer match the expected answer? Measures semantic equivalence, not exact string match.
- Groundedness (0–1) — Is the generated answer self-consistent and free from invented facts? Measures whether the answer stays grounded in retrievable information.
Each question is scored independently, and the final metrics are averages across all questions.
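A minimal sketch of the parse-and-average step, assuming the judge returns a JSON object with `accuracy` and `groundedness` keys (the prompt wording and reply format here are illustrative, not Vecta's actual judge prompt):

```python
import json
from statistics import mean

def parse_scores(judge_reply: str) -> dict:
    """Parse the judge's JSON reply into per-question scores."""
    scores = json.loads(judge_reply)
    return {"accuracy": float(scores["accuracy"]),
            "groundedness": float(scores["groundedness"])}

def aggregate(per_question: list) -> dict:
    """Final metrics are simple means across all questions."""
    return {key: mean(q[key] for q in per_question)
            for key in ("accuracy", "groundedness")}
```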
How Scoring Works
- Your function is called once per benchmark question
- Retrieval metrics are computed by set intersection between retrieved and expected IDs
- Generation metrics use a structured LLM call that returns JSON scores
- Per-question detailed results are stored for drill-down analysis
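The steps above can be sketched as a single loop; `pipeline_fn`, the benchmark entry keys, and the result fields are hypothetical names for illustration:

```python
def run_evaluation(pipeline_fn, benchmark):
    """Call the pipeline once per question and collect per-question results."""
    results = []
    for entry in benchmark:                        # one call per benchmark question
        retrieved = set(pipeline_fn(entry["query"]))
        expected = set(entry["chunk_ids"])
        hits = retrieved & expected                # set intersection
        results.append({
            "query": entry["query"],
            "matched": sorted(hits),
            "missed": sorted(expected - retrieved),
            "precision": len(hits) / len(retrieved) if retrieved else 0.0,
            "recall": len(hits) / len(expected) if expected else 0.0,
        })
    return results                                 # retained for drill-down analysis
```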
Evaluation Lifecycle
- Pending — Created but not yet started
- Running — Currently executing
- Completed — Finished successfully with results
- Failed — Encountered an error
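The lifecycle is a simple state machine; one way to model it (this enum and transition table are a sketch, not Vecta's implementation):

```python
from enum import Enum

class EvalStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

# Pending → Running → Completed | Failed; terminal states have no exits.
TRANSITIONS = {
    EvalStatus.PENDING: {EvalStatus.RUNNING},
    EvalStatus.RUNNING: {EvalStatus.COMPLETED, EvalStatus.FAILED},
    EvalStatus.COMPLETED: set(),
    EvalStatus.FAILED: set(),
}

def can_transition(src: EvalStatus, dst: EvalStatus) -> bool:
    return dst in TRANSITIONS[src]
```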
Viewing Results
Platform Dashboard
The Evaluations page shows all completed evaluations with:
- Evaluation type badge (Retrieval / Generation / RAG)
- Chunk, page, and document F1 scores
- Generation accuracy and groundedness
- Duration and question count
- Drill-down into per-question results
PDF Export
Export a polished certification-style PDF report from the evaluations page. The report includes an overall grade, SLA compliance status, and metric tables with color-coded heatmaps.
Detailed Results
Click into any evaluation to see per-question breakdowns:
- Which chunks were retrieved vs. expected
- Which chunks matched and which were missed
- Individual precision/recall/F1 per question
- Generated answer vs. expected answer
- Per-question accuracy and groundedness scores
Next Steps
- Retrieval Only — Evaluate search quality
- Generation Only — Evaluate LLM responses
- Retrieval + Generation — Evaluate your full pipeline
- Experiments — Compare evaluation runs