
Measure search quality with precision, recall, and F1

Last updated: August 20, 2025
Category: evaluations

Retrieval-Only Evaluation

A retrieval-only evaluation measures how well your search system finds the right chunks — without assessing generation quality.
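Concretely, the chunk-level metrics for a single question are set comparisons between the IDs your function returned and the IDs the benchmark expects. The library computes these for you; the sketch below is only to make the definitions explicit:

```python
def chunk_metrics(retrieved: list[str], expected: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over chunk-ID sets for one question."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    matched = retrieved_set & expected_set
    precision = len(matched) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(matched) / len(expected_set) if expected_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 2 of 4 retrieved chunks are relevant; 2 of 3 expected chunks were found
p, r, f1 = chunk_metrics(["c1", "c2", "c3", "c4"], ["c1", "c2", "c5"])
```

Note that a question with an empty retrieved list scores 0 on both precision and recall, which is why empty returns drag down the aggregate.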

Function Signature

Your retrieval function must accept a query string and return a list of chunk IDs:

def my_retriever(query: str) -> list[str]:
    """Return chunk IDs for the given query."""
    results = vector_db.search(query, k=10)
    return [r.id for r in results]
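Any callable with that signature works — it does not have to be backed by a vector database. For instance, a toy in-memory retriever that ranks chunks by word overlap satisfies the contract (the corpus, IDs, and scoring here are purely illustrative):

```python
# Toy corpus: chunk_id -> chunk text (stand-in for a real vector store)
CHUNKS = {
    "doc1:0": "how to reset your password",
    "doc1:1": "billing and invoices overview",
    "doc2:0": "password security best practices",
}

def my_retriever(query: str, k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query and return the top-k IDs."""
    terms = set(query.lower().split())
    scored = sorted(
        CHUNKS,
        key=lambda cid: len(terms & set(CHUNKS[cid].split())),
        reverse=True,
    )
    return scored[:k]
```

The returned IDs must use the same scheme as the benchmark's chunk_ids; see the tips below.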

Using the API Client

from vecta import VectaAPIClient

client = VectaAPIClient()

def my_retriever(query: str) -> list[str]:
    results = vector_db.search(query, k=10)
    return [r.id for r in results]

results = client.evaluate_retrieval(
    benchmark_id="your-benchmark-id",
    retrieval_function=my_retriever,
    evaluation_name="baseline-k10",
    experiment_id=None,       # optional: group into an experiment
    metadata={"top_k": 10},   # optional: attach metadata for comparison
)
Parameter           Type                         Default           Description
benchmark_id        str                          required          ID of an active benchmark
retrieval_function  Callable[[str], List[str]]   required          Your retrieval function
evaluation_name     str                          "API Evaluation"  Human-readable name
experiment_id       str | None                   None              Optional experiment to group under
metadata            dict | None                  None              Arbitrary key-value metadata

Returns: BenchmarkResults

Using the Local Client

from vecta import VectaClient

vecta = VectaClient(data_source_connector=my_connector)
vecta.load_knowledge_base()
vecta.load_benchmark("my_benchmark.csv")

results = vecta.evaluate_retrieval(
    my_retriever,
    evaluation_name="baseline-k10",
)

Reading Results

# Aggregate metrics
print(f"Chunk Precision:    {results.chunk_level.precision:.2%}")
print(f"Chunk Recall:       {results.chunk_level.recall:.2%}")
print(f"Chunk F1:           {results.chunk_level.f1_score:.2%}")

print(f"Document Precision: {results.document_level.precision:.2%}")
print(f"Document Recall:    {results.document_level.recall:.2%}")
print(f"Document F1:        {results.document_level.f1_score:.2%}")

# Page-level (only if benchmark has page_nums)
if results.page_level:
    print(f"Page F1: {results.page_level.f1_score:.2%}")

# Duration
print(f"Duration: {results.duration_seconds}s")

# Per-question details
for row in results.detailed_results:
    print(f"Q: {row.question[:60]}...")
    print(f"  Chunk F1: {row.chunk_f1:.2%}")
    print(f"  Matched: {row.matched_chunk_ids}")
    print(f"  Missed:  {row.missed_chunk_ids}")

Result Schema

The BenchmarkResults object contains:

Field             Type                        Description
chunk_level       EvaluationMetrics           Precision, recall, F1 at chunk level
page_level        EvaluationMetrics | None    Page-level metrics (if available)
document_level    EvaluationMetrics           Document-level metrics
detailed_results  List[EvaluationResultRow]   Per-question breakdowns
total_questions   int                         Number of questions evaluated
duration_seconds  int                         Wall-clock time
evaluation_name   str                         Name of this run
metadata          dict | None                 Attached metadata

Tips

  • Chunk IDs must match — The IDs returned by your function must exactly match the chunk_ids in the benchmark. Ensure you're using the same ID scheme.
  • Empty returns — If your function returns an empty list for a question, precision and recall will both be 0 for that question.
  • Attach metadata — Use the metadata parameter to record configuration details like top_k, embedding_model, or chunk_size. This makes experiment comparisons much easier.
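A common failure mode behind low scores is an ID-scheme mismatch — for example, your store returns integer IDs or IDs with stray whitespace while the benchmark stores plain strings. A small wrapper that canonicalizes IDs before returning them can guard against this (a sketch; `with_normalized_ids` and the stripped-string scheme are our own choices, not part of the library):

```python
from typing import Callable

def with_normalized_ids(retriever: Callable[[str], list]) -> Callable[[str], list[str]]:
    """Wrap a retriever so every returned chunk ID is a stripped string."""
    def wrapped(query: str) -> list[str]:
        return [str(cid).strip() for cid in retriever(query)]
    return wrapped

def raw_retriever(query: str) -> list:
    # Pretend backend that leaks integer IDs and stray whitespace
    return [101, " 202 "]

safe_retriever = with_normalized_ids(raw_retriever)
```

Whatever canonical form you pick, apply it consistently when building the benchmark and when retrieving.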
