Last updated: August 20, 2025
Retrieval + Generation Evaluation
A retrieval + generation evaluation measures your complete RAG pipeline end to end — both how well you find the right context and how well you generate answers from it.
Function Signature
Your function must accept a query string and return a tuple of (chunk_ids, generated_answer):
def my_rag_pipeline(query: str) -> tuple[list[str], str]:
    """Return (retrieved_chunk_ids, generated_answer)."""
    # 1. Retrieve
    results = vector_db.search(query, k=5)
    chunk_ids = [r.id for r in results]

    # 2. Generate
    context = "\n".join([r.text for r in results])
    response = llm.chat(
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ]
    )
    answer = response.choices[0].message.content
    return chunk_ids, answer
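The shape of this contract is easiest to check with stubbed dependencies. The pipeline below is a placeholder with no real retriever or LLM (not part of the Vecta SDK); it only demonstrates the required return value:

```python
def stub_rag_pipeline(query: str) -> tuple[list[str], str]:
    """Placeholder pipeline that honors the (chunk_ids, answer) contract."""
    chunk_ids = ["chunk_42", "chunk_7"]        # pretend these chunks matched the query
    answer = f"Stub answer for: {query}"       # pretend the LLM generated this
    return chunk_ids, answer

chunk_ids, answer = stub_rag_pipeline("What is X?")
print(chunk_ids)  # → ['chunk_42', 'chunk_7']
```

Any function with this signature works, whether it wraps a vector database, a hybrid search stack, or a remote service.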
Using the API Client
from vecta import VectaAPIClient

client = VectaAPIClient()
results = client.evaluate_retrieval_and_generation(
    benchmark_id="your-benchmark-id",
    retrieval_generation_function=my_rag_pipeline,
    evaluation_name="rag-v1-k5",
    metadata={"top_k": 5, "model": "gpt-4o", "chunk_size": 512},
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| benchmark_id | str | required | ID of an active benchmark |
| retrieval_generation_function | Callable[[str], Tuple[List[str], str]] | required | Your RAG function |
| evaluation_name | str | "API RAG Evaluation" | Human-readable name |
| experiment_id | str \| None | None | Optional experiment to group under |
| metadata | dict \| None | None | Arbitrary key-value metadata |
Returns: RetrievalAndGenerationResults
Using the Local Client
from vecta import VectaClient

vecta = VectaClient(
    data_source_connector=my_connector,
    openai_api_key="sk-...",  # needed for LLM-as-judge scoring
)
vecta.load_knowledge_base()
vecta.load_benchmark("my_benchmark.csv")

results = vecta.evaluate_retrieval_and_generation(
    my_rag_pipeline,
    evaluation_name="rag-v1-k5",
)
Reading Results
# Retrieval metrics
print(f"Chunk F1: {results.chunk_level.f1_score:.2%}")
print(f"Document F1: {results.document_level.f1_score:.2%}")
if results.page_level:
    print(f"Page F1: {results.page_level.f1_score:.2%}")
# Generation metrics
print(f"Accuracy: {results.generation_metrics.accuracy:.2%}")
print(f"Groundedness: {results.generation_metrics.groundedness:.2%}")
print(f"Duration: {results.duration_seconds}s")
print(f"Questions: {results.total_questions}")
# Per-question details
for row in results.detailed_results:
    print(f"Q: {row.question[:50]}...")
    print(f"  Chunk F1: {row.chunk_f1}")
    print(f"  Accuracy: {row.accuracy}")
    print(f"  Groundedness: {row.groundedness}")
    print(f"  Retrieved: {row.retrieved_chunk_ids[:3]}...")
    print(f"  Generated: {row.generated_answer[:60]}...")
Result Schema
The RetrievalAndGenerationResults object contains:
| Field | Type | Description |
|---|---|---|
| chunk_level | EvaluationMetrics | Chunk-level precision, recall, F1 |
| page_level | EvaluationMetrics \| None | Page-level metrics (if available) |
| document_level | EvaluationMetrics | Document-level metrics |
| generation_metrics | GenerationMetrics | Accuracy and groundedness |
| detailed_results | List[EvaluationResultRow] | Per-question breakdowns |
| total_questions | int | Number of questions evaluated |
| duration_seconds | int | Wall-clock time |
| evaluation_name | str | Name of this run |
| metadata | dict \| None | Attached metadata |
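A practical use of detailed_results is to sort by groundedness and review the weakest answers first. This sketch uses a stand-in row type with the fields listed above (the real objects are EvaluationResultRow instances, not this dataclass):

```python
from dataclasses import dataclass

@dataclass
class Row:
    """Stand-in for EvaluationResultRow, reduced to the fields we sort on."""
    question: str
    groundedness: float

def weakest_answers(rows, n=5):
    """Return the n rows with the lowest groundedness, for manual review."""
    return sorted(rows, key=lambda r: r.groundedness)[:n]

rows = [Row("What is X?", 0.42), Row("What is Y?", 0.91), Row("What is Z?", 0.15)]
print([r.question for r in weakest_answers(rows, n=2)])  # → ['What is Z?', 'What is X?']
```

The same pattern works for chunk_f1 or accuracy to triage retrieval versus generation failures.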
Server-Side Evaluation
If you've already collected retrieval results and generated answers elsewhere, you can submit them to the server for scoring without re-running your pipeline:
evaluation_results = [
    {
        "question": "What is X?",
        "retrieved_chunk_ids": ["chunk_1", "chunk_2"],
        "generated_answer": "X is ...",
    },
    # ... more results
]

results = client.evaluate_retrieval_and_generation_server(
    benchmark_id="your-benchmark-id",
    evaluation_name="pre-computed-rag",
    evaluation_results=evaluation_results,
)
The server runs the LLM-as-a-judge scoring and computes all metrics.
Tips
- Return order matters — Your function must return `(chunk_ids, answer)`, not `(answer, chunk_ids)`.
- Benchmark needs `chunk_ids` — For retrieval metrics to work, benchmark entries must have `chunk_ids`. If they don't (e.g., a CSV upload without ground truth), only generation metrics will be meaningful.
- Use experiments — When tuning `top_k`, embedding models, or prompts, attach metadata and group runs into an Experiment for visual comparison.
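A one-off smoke test before launching a full evaluation can catch the return-order mistake early. `validate_rag_output` is a hypothetical helper, not part of the Vecta SDK:

```python
def validate_rag_output(fn, sample_query: str = "smoke test") -> None:
    """Raise early if fn does not return (list_of_chunk_ids, answer_string)."""
    out = fn(sample_query)
    if not (isinstance(out, tuple) and len(out) == 2):
        raise TypeError("pipeline must return a 2-tuple")
    chunk_ids, answer = out
    if not (isinstance(chunk_ids, list) and all(isinstance(c, str) for c in chunk_ids)):
        raise TypeError("first element must be a list of chunk ID strings")
    if not isinstance(answer, str):
        raise TypeError("second element must be the answer string; did you swap the order?")

# A correctly shaped pipeline passes silently
validate_rag_output(lambda q: (["chunk_1"], f"answer to {q}"))
```

Running this once against your pipeline is much cheaper than discovering a swapped tuple halfway through a benchmark run.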
Next Steps
- Experiments — Compare RAG configurations systematically
- CI/CD — Automate evaluation in your deployment pipeline