Evaluations
Overview
Understanding evaluation types and metrics
Last updated: August 19, 2025
Category: evaluations
Test your RAG system with three evaluation types. Each measures different components of your pipeline.
Evaluation Types
Retrieval Only
What: Test search quality without generation.
Metrics: Precision, Recall, F1
Use for:
- Comparing embedding models
- Tuning vector search parameters
- A/B testing retrieval strategies
Generation Only
What: Test LLM response quality without retrieval.
Metrics: Accuracy, Groundedness
Use for:
- Comparing LLM models
- Prompt engineering
- Testing reasoning ability
Retrieval + Generation
What: Test your complete RAG pipeline.
Metrics: All retrieval metrics + all generation metrics
Use for:
- End-to-end validation
- Production monitoring
- Identifying bottlenecks

Figure: Compare retrieval, generation, and end-to-end metrics side by side to choose the right evaluation type.
Quick Examples
from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# Retrieval only
def my_retriever(query: str) -> list[str]:
    results = vector_db.search(query, k=10)
    return [r.id for r in results]

results = client.evaluate_retrieval(
    benchmark_id="benchmark-id",
    retrieval_function=my_retriever
)
print(f"F1: {results.chunk_level.f1_score:.3f}")

# Generation only
def my_generator(query: str) -> str:
    return llm.generate(query)

results = client.evaluate_generation_only(
    benchmark_id="benchmark-id",
    generation_function=my_generator
)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")

# Full RAG
def my_rag(query: str) -> tuple[list[str], str]:
    chunks = vector_db.search(query, k=5)
    chunk_ids = [c.id for c in chunks]
    context = "\n".join([c.content for c in chunks])
    answer = llm.generate(f"Context: {context}\n\nQ: {query}")
    return chunk_ids, answer

results = client.evaluate_retrieval_and_generation(
    benchmark_id="benchmark-id",
    retrieval_generation_function=my_rag
)
Understanding Metrics

Figure: Precision and recall show how many relevant chunks you retrieved versus how many relevant chunks were found overall.
Retrieval Metrics
- Precision: Of retrieved chunks, how many were relevant?
- Recall: Of relevant chunks, how many were retrieved?
- F1: Harmonic mean of precision and recall (worked example below)
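As a quick sanity check on the arithmetic (made-up numbers, not benchmark output): if your retriever returns 10 chunks and 6 are relevant, precision is 0.6; if the benchmark lists 8 relevant chunks in total, recall is 6/8 = 0.75, so F1 = 2 × (0.6 × 0.75) / (0.6 + 0.75) ≈ 0.667. The same calculation in Python:

# Toy numbers for illustration only
retrieved_total = 10      # chunks returned by the retriever
retrieved_relevant = 6    # of those, how many the benchmark marks relevant
relevant_total = 8        # relevant chunks listed in the benchmark

precision = retrieved_relevant / retrieved_total       # 0.600
recall = retrieved_relevant / relevant_total           # 0.750
f1 = 2 * precision * recall / (precision + recall)     # ~0.667
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")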
Generation Metrics
- Accuracy: Does the answer semantically match the expected answer? (see the sketch after this list)
- Groundedness: Is the answer truthful and consistent?
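Semantic matching is looser than exact string matching: "Paris" and "The capital is Paris" should both count as correct. As a rough illustration of the idea only (this is not how the platform scores answers), cosine similarity between sentence embeddings can approximate a semantic match:

# Illustration only: embedding similarity as a stand-in for semantic accuracy.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
generated = "The capital of France is Paris."
expected = "Paris"
embeddings = model.encode([generated, expected], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {similarity:.2f}")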
Evaluation Levels
All evaluations report results at three levels of granularity:
results = client.evaluate_retrieval(...)
# Chunk level - individual text segments
print(f"Chunk F1: {results.chunk_level.f1_score:.3f}")
# Page level - document pages
print(f"Page F1: {results.page_level.f1_score:.3f}")
# Document level - entire documents
print(f"Doc F1: {results.document_level.f1_score:.3f}")
Which Evaluation Type?
Testing retrieval?
├── No → Generation Only
│   └── Use for: Prompt engineering, model comparison
└── Yes → Also testing generation?
    ├── Yes → Retrieval + Generation
    │   └── Use for: End-to-end validation, production
    └── No → Retrieval Only
        └── Use for: Search optimization, database tuning
Next Steps
- Retrieval Only → Test search quality
- Generation Only → Test LLM responses
- Retrieval + Generation → Test full pipeline