Measure search quality with precision, recall, and F1
Last updated: August 20, 2025
Category: evaluations
Retrieval-Only Evaluation
A retrieval-only evaluation measures how well your search system finds the right chunks — without assessing generation quality.
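For intuition, the chunk-level metrics follow the standard definitions: precision is the fraction of retrieved chunks that are relevant, recall is the fraction of relevant chunks that were retrieved, and F1 is their harmonic mean. A minimal sketch of those formulas for a single question (illustrative IDs; this shows the standard math, not necessarily the library's exact implementation):

# Standard precision/recall/F1 over chunk ID sets (example values are made up)
retrieved = {"chunk-12", "chunk-40", "chunk-7"}   # what your retriever returned
relevant = {"chunk-12", "chunk-7", "chunk-33"}    # the benchmark's expected chunk_ids

true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved)         # 2/3
recall = true_positives / len(relevant)              # 2/3
f1 = 2 * precision * recall / (precision + recall)   # 0.67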
Function Signature
Your retrieval function must accept a query string and return a list of chunk IDs:
def my_retriever(query: str) -> list[str]:
    """Return chunk IDs for the given query."""
    results = vector_db.search(query, k=10)
    return [r.id for r in results]
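The body can do anything (vector search, keyword search, a hybrid with reranking) as long as it returns chunk IDs that match the benchmark's ID scheme. A hedged sketch with a reranking step, where rerank_model and its score method are hypothetical placeholders rather than part of the vecta SDK:

def my_reranking_retriever(query: str) -> list[str]:
    """Over-fetch candidates, rerank them, and return the top 10 chunk IDs."""
    candidates = vector_db.search(query, k=50)  # over-fetch candidates
    scored = [(rerank_model.score(query, c.text), c.id) for c in candidates]
    scored.sort(reverse=True)  # highest score first
    return [chunk_id for _, chunk_id in scored[:10]]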
Using the API Client
from vecta import VectaAPIClient

client = VectaAPIClient()

def my_retriever(query: str) -> list[str]:
    results = my_vector_db.search(query, k=10)
    return [r.id for r in results]

results = client.evaluate_retrieval(
    benchmark_id="your-benchmark-id",
    retrieval_function=my_retriever,
    evaluation_name="baseline-k10",
    experiment_id=None,  # optional: group into an experiment
    metadata={"top_k": 10},  # optional: attach metadata for comparison
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| benchmark_id | str | required | ID of an active benchmark |
| retrieval_function | Callable[[str], List[str]] | required | Your retrieval function |
| evaluation_name | str | "API Evaluation" | Human-readable name |
| experiment_id | str \| None | None | Optional experiment to group under |
| metadata | dict \| None | None | Arbitrary key-value metadata |
Returns: BenchmarkResults
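To compare several configurations, you can call evaluate_retrieval once per configuration and attach metadata describing each run. A sketch using only the parameters documented above; the experiment ID and the choice of top_k values are placeholders:

def make_retriever(k: int):
    """Build a retriever closure with a fixed top_k."""
    def retriever(query: str) -> list[str]:
        return [r.id for r in my_vector_db.search(query, k=k)]
    return retriever

for k in (5, 10, 20):
    client.evaluate_retrieval(
        benchmark_id="your-benchmark-id",
        retrieval_function=make_retriever(k),
        evaluation_name=f"baseline-k{k}",
        experiment_id="your-experiment-id",  # placeholder: groups the runs together
        metadata={"top_k": k},
    )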
Using the Local Client
from vecta import VectaClient

vecta = VectaClient(data_source_connector=my_connector)
vecta.load_knowledge_base()
vecta.load_benchmark("my_benchmark.csv")

results = vecta.evaluate_retrieval(
    my_retriever,
    evaluation_name="baseline-k10",
)
Reading Results
# Aggregate metrics
print(f"Chunk Precision: {results.chunk_level.precision:.2%}")
print(f"Chunk Recall: {results.chunk_level.recall:.2%}")
print(f"Chunk F1: {results.chunk_level.f1_score:.2%}")

print(f"Document Precision: {results.document_level.precision:.2%}")
print(f"Document Recall: {results.document_level.recall:.2%}")
print(f"Document F1: {results.document_level.f1_score:.2%}")

# Page-level (only if benchmark has page_nums)
if results.page_level:
    print(f"Page F1: {results.page_level.f1_score:.2%}")

# Duration
print(f"Duration: {results.duration_seconds}s")

# Per-question details
for row in results.detailed_results:
    print(f"Q: {row.question[:60]}...")
    print(f"  Chunk F1: {row.chunk_f1:.2%}")
    print(f"  Matched: {row.matched_chunk_ids}")
    print(f"  Missed: {row.missed_chunk_ids}")
Result Schema
The BenchmarkResults object contains:
| Field | Type | Description |
|---|---|---|
| chunk_level | EvaluationMetrics | Precision, recall, F1 at chunk level |
| page_level | EvaluationMetrics \| None | Page-level metrics (if available) |
| document_level | EvaluationMetrics | Document-level metrics |
| detailed_results | List[EvaluationResultRow] | Per-question breakdowns |
| total_questions | int | Number of questions evaluated |
| duration_seconds | int | Wall-clock time in seconds |
| evaluation_name | str | Name of this run |
| metadata | dict \| None | Attached metadata |
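Because detailed_results is a plain list of per-question rows, you can sort it to surface the questions your retriever handles worst. A small sketch using only the fields listed above:

# Show the five questions with the lowest chunk-level F1
worst = sorted(results.detailed_results, key=lambda row: row.chunk_f1)[:5]
for row in worst:
    print(f"{row.chunk_f1:.2%}  {row.question[:80]}")
    print(f"  missed: {row.missed_chunk_ids}")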
Tips
- Chunk IDs must match — The IDs returned by your function must exactly match the chunk_ids in the benchmark. Ensure you're using the same ID scheme (see the sanity-check sketch after this list).
- Empty returns — If your function returns an empty list for a question, precision and recall will both be 0 for that question.
- Attach metadata — Use the metadata parameter to record configuration details like top_k, embedding_model, or chunk_size. This makes experiment comparisons much easier.
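As a quick sanity check on the first tip: if every question comes back with an empty matched_chunk_ids list, the likeliest cause is an ID-scheme mismatch rather than a genuinely bad retriever. A hedged sketch built on the detailed_results fields above, assuming missed_chunk_ids holds the benchmark's expected IDs:

# If nothing ever matches, compare a returned ID against an expected one
no_matches = all(not row.matched_chunk_ids for row in results.detailed_results)
if no_matches:
    first = results.detailed_results[0]
    sample_returned = my_retriever(first.question)[:1]
    sample_expected = first.missed_chunk_ids[:1]
    print(f"Returned ID: {sample_returned}  Expected ID: {sample_expected}")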
Next Steps
- Generation Only — Evaluate LLM output quality
- Retrieval + Generation — Evaluate your full pipeline
- Experiments — Compare multiple retrieval configurations