Evaluations
Overview
Understanding evaluation types and metrics
Last updated: August 19, 2025
Category: evaluations
Test your RAG system with three evaluation types. Each measures different components of your pipeline.
Evaluation Types
Retrieval Only
What: Test search quality without generation.
Metrics: Precision, Recall, F1
Use for:
- Comparing embedding models
- Tuning vector search parameters
- A/B testing retrieval strategies
Generation Only
What: Test LLM response quality without retrieval.
Metrics: Accuracy, Groundedness
Use for:
- Comparing LLM models
- Prompt engineering
- Testing reasoning ability
Retrieval + Generation
What: Test your complete RAG pipeline.
Metrics: All retrieval metrics + all generation metrics
Use for:
- End-to-end validation
- Production monitoring
- Identifying bottlenecks

Figure: Compare retrieval, generation, and end-to-end metrics side by side to choose the right evaluation type.
Quick Examples
from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# Retrieval only
def my_retriever(query: str) -> list[str]:
    results = vector_db.search(query, k=10)
    return [r.id for r in results]

results = client.evaluate_retrieval(
    benchmark_id="benchmark-id",
    retrieval_function=my_retriever
)
print(f"F1: {results.chunk_level.f1_score:.3f}")

# Generation only
def my_generator(query: str) -> str:
    return llm.generate(query)

results = client.evaluate_generation_only(
    benchmark_id="benchmark-id",
    generation_function=my_generator
)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")

# Full RAG
def my_rag(query: str) -> tuple[list[str], str]:
    chunks = vector_db.search(query, k=5)
    chunk_ids = [c.id for c in chunks]
    context = "\n".join([c.content for c in chunks])
    answer = llm.generate(f"Context: {context}\n\nQ: {query}")
    return chunk_ids, answer

results = client.evaluate_retrieval_and_generation(
    benchmark_id="benchmark-id",
    retrieval_generation_function=my_rag
)
Understanding Metrics

Figure: Precision and recall show how many relevant chunks you retrieved versus how many relevant chunks were found overall.
Retrieval Metrics
- Precision: Of retrieved chunks, how many were relevant?
- Recall: Of relevant chunks, how many were retrieved?
- F1: Harmonic mean of precision and recall (worked example below)
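As a quick sanity check on the arithmetic (made-up numbers, not benchmark output): if your retriever returns 10 chunks and 6 are relevant, precision is 0.6; if the benchmark lists 8 relevant chunks in total, recall is 6/8 = 0.75, so F1 = 2 × (0.6 × 0.75) / (0.6 + 0.75) ≈ 0.667. The same calculation in Python:

# Toy numbers for illustration only
retrieved_total = 10      # chunks returned by the retriever
retrieved_relevant = 6    # of those, how many the benchmark marks relevant
relevant_total = 8        # relevant chunks listed in the benchmark

precision = retrieved_relevant / retrieved_total       # 0.600
recall = retrieved_relevant / relevant_total           # 0.750
f1 = 2 * precision * recall / (precision + recall)     # ~0.667
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")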
Generation Metrics
- Accuracy: Does the answer semantically match the expected answer? (see the sketch after this list)
- Groundedness: Is the answer truthful and consistent?
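Semantic matching is looser than exact string matching: "Paris" and "The capital is Paris" should both count as correct. As a rough illustration of the idea only (this is not how the platform scores answers), cosine similarity between sentence embeddings can approximate a semantic match:

# Illustration only: embedding similarity as a stand-in for semantic accuracy.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
generated = "The capital of France is Paris."
expected = "Paris"
embeddings = model.encode([generated, expected], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {similarity:.2f}")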
Evaluation Levels
All evaluations report results at three levels of granularity:
results = client.evaluate_retrieval(...)
# Chunk level - individual text segments
print(f"Chunk F1: {results.chunk_level.f1_score:.3f}")
# Page level - document pages
print(f"Page F1: {results.page_level.f1_score:.3f}")
# Document level - entire documents
print(f"Doc F1: {results.document_level.f1_score:.3f}")
Which Evaluation Type?
Testing retrieval?
├── No → Generation Only
│   └── Use for: Prompt engineering, model comparison
└── Yes → Also testing generation?
    ├── Yes → Retrieval + Generation
    │   └── Use for: End-to-end validation, production
    └── No → Retrieval Only
        └── Use for: Search optimization, database tuning
Next Steps
- Retrieval Only → Test search quality
- Generation Only → Test LLM responses
- Retrieval + Generation → Test full pipeline