
Retrieval + Generation Evaluation

A retrieval + generation evaluation measures your complete RAG pipeline end to end — both how well you find the right context and how well you generate answers from it.

Function Signature

Your function must accept a query string and return a tuple of (chunk_ids, generated_answer):

def my_rag_pipeline(query: str) -> tuple[list[str], str]:
    """Return (retrieved_chunk_ids, generated_answer)."""
    # 1. Retrieve
    results = vector_db.search(query, k=5)
    chunk_ids = [r.id for r in results]

    # 2. Generate
    context = "\n".join([r.text for r in results])
    response = llm.chat(
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ]
    )
    answer = response.choices[0].message.content

    return chunk_ids, answer
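
Because the evaluator expects the tuple in exactly this order, it can be worth sanity-checking the return shape on a single query before running a full evaluation (a minimal sketch; the query string is only a placeholder):

# Quick shape check: (chunk_ids, answer), in that order.
chunk_ids, answer = my_rag_pipeline("What does the warranty cover?")
assert isinstance(chunk_ids, list) and all(isinstance(c, str) for c in chunk_ids)
assert isinstance(answer, str)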

Using the API Client

from vecta import VectaAPIClient

client = VectaAPIClient()

results = client.evaluate_retrieval_and_generation(
    benchmark_id="your-benchmark-id",
    retrieval_generation_function=my_rag_pipeline,
    evaluation_name="rag-v1-k5",
    metadata={"top_k": 5, "model": "gpt-4o", "chunk_size": 512},
)

Parameters:

  • benchmark_id (str, required): ID of an active benchmark
  • retrieval_generation_function (Callable[[str], Tuple[List[str], str]], required): your RAG function
  • evaluation_name (str, default "API RAG Evaluation"): human-readable name
  • experiment_id (str | None, default None): optional experiment to group runs under
  • metadata (dict | None, default None): arbitrary key-value metadata

Returns: RetrievalAndGenerationResults

Using the Local Client

from vecta import VectaClient

vecta = VectaClient(
    data_source_connector=my_connector,
    openai_api_key="sk-...",  # needed for LLM-as-judge scoring
)
vecta.load_knowledge_base()
vecta.load_benchmark("my_benchmark.csv")

results = vecta.evaluate_retrieval_and_generation(
    my_rag_pipeline,
    evaluation_name="rag-v1-k5",
)

Reading Results

# Retrieval metrics
print(f"Chunk F1:    {results.chunk_level.f1_score:.2%}")
print(f"Document F1: {results.document_level.f1_score:.2%}")

if results.page_level:
    print(f"Page F1:     {results.page_level.f1_score:.2%}")

# Generation metrics
print(f"Accuracy:     {results.generation_metrics.accuracy:.2%}")
print(f"Groundedness: {results.generation_metrics.groundedness:.2%}")

print(f"Duration: {results.duration_seconds}s")
print(f"Questions: {results.total_questions}")

# Per-question details
for row in results.detailed_results:
    print(f"Q: {row.question[:50]}...")
    print(f"  Chunk F1:      {row.chunk_f1}")
    print(f"  Accuracy:      {row.accuracy}")
    print(f"  Groundedness:  {row.groundedness}")
    print(f"  Retrieved:     {row.retrieved_chunk_ids[:3]}...")
    print(f"  Generated:     {row.generated_answer[:60]}...")

Result Schema

The RetrievalAndGenerationResults object contains:

  • chunk_level (EvaluationMetrics): chunk-level precision, recall, and F1
  • page_level (EvaluationMetrics | None): page-level metrics (if available)
  • document_level (EvaluationMetrics): document-level metrics
  • generation_metrics (GenerationMetrics): accuracy and groundedness
  • detailed_results (List[EvaluationResultRow]): per-question breakdowns
  • total_questions (int): number of questions evaluated
  • duration_seconds (int): wall-clock time in seconds
  • evaluation_name (str): name of this run
  • metadata (dict | None): attached metadata

Server-Side Evaluation

If you've already collected retrieval results and generated answers elsewhere, you can submit them to the server for scoring without re-running your pipeline:

evaluation_results = [
    {
        "question": "What is X?",
        "retrieved_chunk_ids": ["chunk_1", "chunk_2"],
        "generated_answer": "X is ...",
    },
    # ... more results
]

results = client.evaluate_retrieval_and_generation_server(
    benchmark_id="your-benchmark-id",
    evaluation_name="pre-computed-rag",
    evaluation_results=evaluation_results,
)

The server runs the LLM-as-a-judge scoring and computes all metrics.
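
If your pipeline already logs its outputs, you can assemble this payload from those records instead of writing it out by hand. A minimal sketch, assuming a hypothetical rag_outputs.jsonl log where each line is a JSON object with the three keys shown above:

import json

# Hypothetical log: one JSON object per line with question,
# retrieved_chunk_ids, and generated_answer keys.
with open("rag_outputs.jsonl") as f:
    evaluation_results = [json.loads(line) for line in f]

The assembled list is then passed to evaluate_retrieval_and_generation_server exactly as shown above.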

Tips

  • Return order matters — Your function must return (chunk_ids, answer), not (answer, chunk_ids).
  • Benchmark needs chunk_ids — For retrieval metrics to work, benchmark entries must have chunk_ids. If they don't (e.g., CSV upload without ground truth), only generation metrics will be meaningful.
  • Use experiments — When tuning top_k, embedding models, or prompts, attach metadata and group runs into an Experiment for visual comparison (see the sketch below).
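
For example, runs that differ only in top_k can be grouped under one experiment. A sketch, assuming a hypothetical my_rag_pipeline_k variant of the earlier function that takes top_k as a parameter; the experiment ID and metadata values are placeholders:

from functools import partial

# Hypothetical variant of my_rag_pipeline: same body as the earlier example,
# but with k=top_k in the vector search call.
def my_rag_pipeline_k(query: str, top_k: int) -> tuple[list[str], str]:
    results = vector_db.search(query, k=top_k)
    chunk_ids = [r.id for r in results]
    context = "\n".join(r.text for r in results)
    response = llm.chat(
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ]
    )
    return chunk_ids, response.choices[0].message.content

for top_k in (3, 5, 10):
    client.evaluate_retrieval_and_generation(
        benchmark_id="your-benchmark-id",
        retrieval_generation_function=partial(my_rag_pipeline_k, top_k=top_k),
        evaluation_name=f"rag-v1-k{top_k}",
        experiment_id="your-experiment-id",  # groups the runs under one experiment
        metadata={"top_k": top_k, "model": "gpt-4o", "chunk_size": 512},
    )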

Next Steps

  • Experiments — Compare RAG configurations systematically
  • CI/CD — Automate evaluation in your deployment pipeline
