Retrieval + Generation Evaluation
Test your complete RAG pipeline. This evaluation measures both retrieval quality and generation accuracy in a single run.
Quick Example
from vecta import VectaAPIClient
client = VectaAPIClient(api_key="your-key")
# Define RAG function
def my_rag(query: str) -> tuple[list[str], str]:
    # Retrieve
    chunks = vector_db.search(query, k=5)
    chunk_ids = [c.id for c in chunks]

    # Generate
    context = "\n".join([c.content for c in chunks])
    answer = llm.generate(f"Context: {context}\n\nQ: {query}")
    return chunk_ids, answer

# Run evaluation
results = client.evaluate_retrieval_and_generation(
    benchmark_id="benchmark-id",
    retrieval_generation_function=my_rag,
    evaluation_name="Production RAG v2"
)
# View results
print("Retrieval:")
print(f" Chunk F1: {results.chunk_level.f1_score:.3f}")
print(f" Page F1: {results.page_level.f1_score:.3f}")
print(f" Doc F1: {results.document_level.f1_score:.3f}")
print("\nGeneration:")
print(f" Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f" Groundedness: {results.generation_metrics.groundedness:.3f}")
All Metrics
Retrieval:
- Precision, recall, and F1 at the chunk, page, and document levels
Generation:
- Accuracy (semantic similarity to the expected answer)
- Groundedness (the answer is supported by the retrieved context, i.e. no hallucinations)
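All of these are exposed on the results object. A minimal sketch for printing them; only f1_score, accuracy, and groundedness appear in the example above, so the precision and recall attribute names here are an assumption:
# Retrieval metrics at each level of granularity
# (precision/recall attribute names are assumed to mirror f1_score)
for label, level in [
    ("Chunk", results.chunk_level),
    ("Page", results.page_level),
    ("Document", results.document_level),
]:
    print(f"{label}: P={level.precision:.3f} R={level.recall:.3f} F1={level.f1_score:.3f}")

# Generation metrics
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Groundedness: {results.generation_metrics.groundedness:.3f}")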
Local SDK
from vecta import VectaClient

vecta = VectaClient(
    vector_db_connector=connector,
    openai_api_key="your-key"
)
vecta.load_benchmark("benchmark.csv")

def my_rag(query: str) -> tuple[list[str], str]:
    chunks = connector.semantic_search(query, k=5)
    chunk_ids = [c.id for c in chunks]
    context = "\n".join([c.content for c in chunks])
    prompt = f"Answer based on context:\n\n{context}\n\nQ: {query}"
    answer = llm.generate(prompt)
    return chunk_ids, answer

results = vecta.evaluate_retrieval_and_generation(my_rag)
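Note that llm.generate in these snippets is a stand-in for whatever generation client you use; it is not part of the Vecta SDK. One possible wrapper backed by the OpenAI chat completions API (the class name and model choice are illustrative):
from openai import OpenAI

class SimpleLLM:
    """Illustrative wrapper so the snippets' llm.generate(prompt) calls work."""
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

llm = SimpleLLM()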
Common RAG Patterns
Simple RAG
def simple_rag(query: str) -> tuple[list[str], str]:
    # Search
    chunks = vector_db.search(query, k=5)
    chunk_ids = [c.id for c in chunks]

    # Build context
    context = "\n\n".join([
        f"Document: {c.source_path}\n{c.content}"
        for c in chunks
    ])

    # Generate
    prompt = f"""Use the following context to answer the question.
Context:
{context}
Question: {query}
Answer:"""
    answer = llm.generate(prompt)
    return chunk_ids, answer
RAG with Reranking
def rag_with_reranking(query: str) -> tuple[list[str], str]:
    # Initial retrieval
    candidates = vector_db.search(query, k=20)

    # Rerank for relevance
    reranked = reranker.rerank(query, candidates)
    top_chunks = reranked[:5]
    chunk_ids = [c.id for c in top_chunks]

    # Generate
    context = "\n\n".join([c.content for c in top_chunks])
    answer = llm.generate(f"Context:\n{context}\n\nQ: {query}")
    return chunk_ids, answer
Multi-Query RAG
def multi_query_rag(query: str) -> tuple[list[str], str]:
    # Generate variations
    variations = llm.generate_variations(query, n=3)

    # Search with each
    all_chunks = []
    for variant in variations:
        chunks = vector_db.search(variant, k=10)
        all_chunks.extend(chunks)

    # Deduplicate and rerank
    unique = deduplicate(all_chunks)
    top_chunks = reranker.rerank(query, unique)[:5]
    chunk_ids = [c.id for c in top_chunks]

    # Generate
    context = "\n\n".join([c.content for c in top_chunks])
    answer = llm.generate(f"Context:\n{context}\n\nQ: {query}")
    return chunk_ids, answer
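The deduplicate helper above is not defined in these snippets; a minimal version that keeps the first occurrence of each chunk ID could look like this (purely illustrative):
def deduplicate(chunks: list) -> list:
    # Keep the first occurrence of each chunk ID, preserving search order
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk.id not in seen:
            seen.add(chunk.id)
            unique.append(chunk)
    return unique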
Comparing Approaches
strategies = {
    "Simple RAG": simple_rag,
    "With Reranking": rag_with_reranking,
    "Multi-Query": multi_query_rag
}

for name, rag_fn in strategies.items():
    results = client.evaluate_retrieval_and_generation(
        benchmark_id=benchmark_id,
        retrieval_generation_function=rag_fn,
        evaluation_name=name
    )
    print(f"\n{name}:")
    print(f" Retrieval F1: {results.chunk_level.f1_score:.3f}")
    print(f" Accuracy: {results.generation_metrics.accuracy:.3f}")
Interpreting Results
Good retrieval, poor generation:
- You're finding the right chunks
- The problem is in the prompt or the LLM
- Fix: Improve the prompt, or try a stronger model
Poor retrieval, good generation:
- The LLM is doing well with what it gets
- You're not finding the right information
- Fix: Improve embeddings, add reranking
Both poor:
- System-wide issues
- Fix: Start with retrieval optimization, since generation can't recover from missing context
Both good:
- The system is working well
- Monitor for regressions
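A rough way to automate this triage against the metrics printed earlier; the 0.7 retrieval and 0.8 generation thresholds are illustrative (they match the monitoring example below) and should be tuned to your own benchmark:
def diagnose(results, retrieval_threshold: float = 0.7, generation_threshold: float = 0.8) -> str:
    # Thresholds are illustrative; tune them against your own benchmark
    retrieval_ok = results.chunk_level.f1_score >= retrieval_threshold
    generation_ok = results.generation_metrics.accuracy >= generation_threshold

    if retrieval_ok and not generation_ok:
        return "Good retrieval, poor generation: improve the prompt or try a stronger model"
    if not retrieval_ok and generation_ok:
        return "Poor retrieval, good generation: improve embeddings or add reranking"
    if not retrieval_ok and not generation_ok:
        return "Both poor: start with retrieval optimization"
    return "Both good: monitor for regressions"

print(diagnose(results))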
Production Monitoring
from datetime import datetime

# Run daily evaluations
def daily_eval():
    results = client.evaluate_retrieval_and_generation(
        benchmark_id=production_benchmark_id,
        retrieval_generation_function=production_rag,
        evaluation_name=f"Daily {datetime.now().date()}"
    )

    # Alert if performance drops
    if results.chunk_level.f1_score < 0.7:
        alert_team("Retrieval F1 below threshold")
    if results.generation_metrics.accuracy < 0.8:
        alert_team("Generation accuracy below threshold")
Next Steps
- CI/CD Integration → Run evaluations in CI
- Retrieval Only → Focus on search
- Generation Only → Focus on the LLM