
Overview

Understanding evaluation types and metrics

Last updated: August 19, 2025
Category: evaluations

Evaluations

Test your RAG system with three evaluation types. Each one measures a different part of your pipeline.

Evaluation Types

Retrieval Only

What: Test search quality without generation.

Metrics: Precision, Recall, F1

Use for:

  • Comparing embedding models
  • Tuning vector search parameters
  • A/B testing retrieval strategies

Learn more →

Generation Only

What: Test LLM response quality without retrieval.

Metrics: Accuracy, Groundedness

Use for:

  • Comparing LLM models
  • Prompt engineering
  • Testing reasoning ability

Learn more →

Retrieval + Generation

What: Test your complete RAG pipeline.

Metrics: All retrieval metrics + all generation metrics

Use for:

  • End-to-end validation
  • Production monitoring
  • Identifying bottlenecks

Learn more →

Figure: Compare retrieval, generation, and end-to-end metrics side by side to choose the right evaluation type.

Quick Examples

from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# NOTE: vector_db and llm below are placeholders for your own
# vector database client and LLM wrapper.

# Retrieval only
def my_retriever(query: str) -> list[str]:
    results = vector_db.search(query, k=10)
    return [r.id for r in results]

results = client.evaluate_retrieval(
    benchmark_id="benchmark-id",
    retrieval_function=my_retriever
)
print(f"F1: {results.chunk_level.f1_score:.3f}")

# Generation only
def my_generator(query: str) -> str:
    return llm.generate(query)

results = client.evaluate_generation_only(
    benchmark_id="benchmark-id",
    generation_function=my_generator
)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")

# Full RAG
def my_rag(query: str) -> tuple[list[str], str]:
    chunks = vector_db.search(query, k=5)
    chunk_ids = [c.id for c in chunks]
    context = "\n".join([c.content for c in chunks])
    answer = llm.generate(f"Context: {context}\n\nQ: {query}")
    return chunk_ids, answer

results = client.evaluate_retrieval_and_generation(
    benchmark_id="benchmark-id",
    retrieval_generation_function=my_rag
)
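
The combined run returns retrieval and generation results together. A minimal way to read both, assuming the combined result object exposes the same chunk_level and generation_metrics fields shown in the examples above:

# Assumption: the combined results carry both sets of metrics
print(f"Retrieval F1: {results.chunk_level.f1_score:.3f}")
print(f"Generation accuracy: {results.generation_metrics.accuracy:.3f}")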

Understanding Metrics

Figure: Precision shows how many of the retrieved chunks were relevant; recall shows how many of the relevant chunks were retrieved.

Retrieval Metrics

  • Precision: Of retrieved chunks, how many were relevant?
  • Recall: Of relevant chunks, how many were retrieved?
  • F1: Harmonic mean of precision and recall
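
These can be computed by hand from the sets of retrieved and relevant chunk IDs. A standalone sketch (not part of the Vecta SDK) with made-up IDs:

retrieved = {"c1", "c2", "c3", "c4"}  # chunk IDs your retriever returned
relevant = {"c2", "c4", "c7"}         # chunk IDs labeled relevant in the benchmark

true_positives = len(retrieved & relevant)          # 2
precision = true_positives / len(retrieved)         # 2 / 4 = 0.50
recall = true_positives / len(relevant)             # 2 / 3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.57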

Generation Metrics

  • Accuracy: Does the answer semantically match the expected answer?
  • Groundedness: Is the answer truthful and consistent?

Evaluation Levels

All evaluations report results at three levels:

results = client.evaluate_retrieval(...)

# Chunk level - individual text segments
print(f"Chunk F1: {results.chunk_level.f1_score:.3f}")

# Page level - document pages
print(f"Page F1: {results.page_level.f1_score:.3f}")

# Document level - entire documents
print(f"Doc F1: {results.document_level.f1_score:.3f}")

Which Evaluation Type?

Testing retrieval?
ā”œā”€ No → Generation Only
│   └─ Use for: Prompt engineering, model comparison
└─ Yes → Also testing generation?
    ā”œā”€ Yes → Retrieval + Generation
    │   └─ Use for: End-to-end validation, production
    └─ No → Retrieval Only
        └─ Use for: Search optimization, database tuning
