
Evaluations

Learn about different RAG evaluation approaches



Vecta supports multiple evaluation approaches to comprehensively test different components of your RAG system. Choose the right evaluation type based on what you want to optimize.

Overview

| Evaluation Type | What It Tests | Key Metrics | Use Cases |
| --- | --- | --- | --- |
| Ingestion Only | Data processing and chunking | Levenshtein, BLEU, Duration | Database optimization, chunking strategies |
| Retrieval Only | Search and ranking quality | Precision, Recall, F1 | Vector search tuning, embedding models |
| Retrieval + Generation | Full RAG pipeline | Recall, Precision, F1, Accuracy, Factuality | End-to-end system validation |
| Generation Only | LLM response quality | Accuracy, Factuality, Duration | Prompt engineering, model comparison |

Ingestion Only Evaluation

Coming Soon - Currently in development

Tests how well your system processes and stores documents for retrieval. Perfect for optimizing data pipelines and chunking strategies.

What It Measures

  • Content Preservation: How much of the original information is retained during ingestion
  • Structural Integrity: Whether document structure (headings, lists, tables) is maintained
  • Processing Speed: Time taken to ingest and index documents
  • Chunking Quality: How effectively documents are split into retrievable pieces

Key Metrics

  • Levenshtein Distance: Character-level similarity between original and processed text
  • BLEU Score: N-gram overlap measuring content preservation
  • Accuracy: Percentage of content correctly ingested
  • Duration: Processing time per document or batch
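
The exact scoring for ingestion evaluations is still being finalized, but the text-similarity metrics above are standard. As a rough, purely illustrative sketch (the helpers below are hypothetical, not part of the Vecta SDK), a normalized Levenshtein-based preservation score can be computed like this:

def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def content_preservation(original: str, chunks: list[str]) -> float:
    """Similarity in [0, 1]; 1.0 means the reassembled chunks match the original exactly."""
    reassembled = "\n".join(chunks)
    distance = levenshtein_distance(original, reassembled)
    return 1.0 - distance / max(len(original), len(reassembled), 1)

BLEU works analogously, but over n-gram overlap rather than character edits.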

Example Usage

# Coming soon - ingestion evaluation
results = client.evaluate_ingestion(
    documents=my_documents,
    ingestion_function=my_ingestion_pipeline,
    chunking_strategy="semantic"
)

Retrieval Only Evaluation

Tests how effectively your system finds relevant information without considering answer generation.

What It Measures

  • Relevance: Do retrieved chunks contain information needed to answer the question?
  • Coverage: Are all relevant pieces of information found?
  • Precision vs Recall Trade-offs: Balance between accuracy and completeness

Key Metrics

  • Precision: Of retrieved chunks, how many are relevant?
  • Recall: Of all relevant chunks, how many were retrieved?
  • F1 Score: Harmonic mean of precision and recall
  • Duration: Time taken to perform retrieval
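
Vecta computes these metrics for you, but the set-based definitions are easy to sanity-check by hand. A minimal, purely illustrative sketch over chunk IDs:

def precision_recall_f1(retrieved: list[str], relevant: list[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over chunk IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 10 chunks retrieved, 3 of the 4 relevant ones found
p, r, f1 = precision_recall_f1(
    retrieved=[f"chunk-{i}" for i in range(10)],
    relevant=["chunk-1", "chunk-3", "chunk-7", "chunk-42"],
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.30 recall=0.75 f1=0.43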

Granularity Levels

Vecta measures retrieval quality at three levels:

  1. Chunk Level: Individual text segments that contain the answer
  2. Page Level: Document pages that contain relevant information
  3. Document Level: Entire documents that are relevant to the query
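
Page- and document-level scores are derived by rolling each retrieved chunk up to the page and document it came from. Vecta does this for you; conceptually it is just a metadata lookup, as in this simplified sketch with a hypothetical chunk-metadata mapping:

# Hypothetical metadata: chunk ID -> (document ID, page number)
CHUNK_METADATA = {
    "chunk-1": ("doc-a", 3),
    "chunk-2": ("doc-a", 4),
    "chunk-3": ("doc-b", 1),
}

def roll_up(chunk_ids: list[str]) -> tuple[set[str], set[tuple[str, int]]]:
    """Map retrieved chunk IDs to the documents and pages they belong to."""
    documents = {CHUNK_METADATA[c][0] for c in chunk_ids if c in CHUNK_METADATA}
    pages = {CHUNK_METADATA[c] for c in chunk_ids if c in CHUNK_METADATA}
    return documents, pages

documents, pages = roll_up(["chunk-1", "chunk-3"])
# documents == {"doc-a", "doc-b"}; pages == {("doc-a", 3), ("doc-b", 1)}

Precision, recall, and F1 at the page and document levels are then computed over these coarser sets.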

Example Implementation

def my_retrieval_function(query: str) -> list[str]:
    """Return chunk IDs for the most relevant content."""
    # Your retrieval logic here
    results = vector_db.search(query, k=10)
    return [result.id for result in results]

# Evaluate retrieval performance
results = client.evaluate_retrieval(
    benchmark_id="my-benchmark",
    retrieval_function=my_retrieval_function,
    evaluation_name="Vector Search Optimization v1"
)

print(f"Chunk F1: {results.chunk_level.f1_score:.3f}")
print(f"Page F1: {results.page_level.f1_score:.3f}")
print(f"Document F1: {results.document_level.f1_score:.3f}")

When to Use

  • Optimizing vector databases: Compare embedding models, distance metrics, indexing parameters
  • Tuning search parameters: Find optimal k values, similarity thresholds
  • A/B testing retrieval strategies: Hybrid search, reranking, query expansion
  • Debugging retrieval failures: Identify why relevant content isn't being found
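
Parameter tuning is usually just a loop over candidate settings that reuses the evaluate_retrieval call shown above. For example, a sketch of a k sweep (the k values and evaluation names are arbitrary; vector_db and client come from the earlier examples):

# Sweep the number of retrieved chunks and compare chunk-level F1
for k in (3, 5, 10, 20):
    def retrieval_at_k(query: str, k: int = k) -> list[str]:
        # Default argument captures the current k for this iteration
        return [result.id for result in vector_db.search(query, k=k)]

    results = client.evaluate_retrieval(
        benchmark_id="my-benchmark",
        retrieval_function=retrieval_at_k,
        evaluation_name=f"Vector Search k={k}",
    )
    print(f"k={k}: chunk F1 = {results.chunk_level.f1_score:.3f}")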

Retrieval + Generation Evaluation

Evaluates your complete RAG pipeline, measuring both retrieval quality and answer generation accuracy.

What It Measures

  • Retrieval Performance: All metrics from retrieval-only evaluation
  • Answer Quality: How well the LLM generates responses from retrieved context
  • Information Synthesis: Ability to combine information from multiple sources
  • Factual Accuracy: Whether generated answers are truthful and grounded

Key Metrics

Retrieval Metrics:

  • Precision, Recall, F1 at chunk/page/document levels

Generation Metrics:

  • Accuracy: Semantic similarity between generated and expected answers
  • Factuality: Whether the answer is factually correct and grounded in sources
  • Duration: Time taken to generate the answer
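
Vecta scores accuracy and factuality for you. To build intuition for the semantic-similarity idea behind accuracy, here is a rough, illustrative check using an off-the-shelf embedding model (the sentence-transformers model and this exact approach are assumptions for the example, not Vecta's internal implementation):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

def semantic_similarity(generated: str, expected: str) -> float:
    """Cosine similarity between embeddings of the generated and expected answers."""
    embeddings = model.encode([generated, expected], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

score = semantic_similarity(
    "The Eiffel Tower is about 330 metres tall.",
    "The Eiffel Tower stands roughly 330 m high.",
)
print(f"semantic similarity: {score:.3f}")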

Example Implementation

def my_rag_function(query: str) -> tuple[list[str], str]:
    """Return (chunk_ids, generated_answer) for a query."""
    # 1. Retrieve relevant context
    search_results = vector_db.search(query, k=5)
    chunk_ids = [result.id for result in search_results]
    
    # 2. Build context string
    context = "\n".join([result.content for result in search_results])
    
    # 3. Generate answer
    prompt = f"""Answer the question based on the provided context.
    
Context: {context}

Question: {query}

Answer:"""
    
    answer = llm.generate(prompt)
    
    return chunk_ids, answer

# Evaluate full RAG pipeline
results = client.evaluate_retrieval_and_generation(
    benchmark_id="my-benchmark",
    retrieval_generation_function=my_rag_function,
    evaluation_name="Full RAG Pipeline v2.1"
)

# View comprehensive results
print(f"Retrieval F1: {results.chunk_level.f1_score:.3f}")
print(f"Generation Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Generation Factuality: {results.generation_metrics.factuality:.3f}")
print(f"Generation Duration: {results.generation_metrics.duration:.2f}s")

When to Use

  • End-to-end system validation: Before deploying to production
  • Comparing RAG architectures: Different retrieval + generation strategies
  • Monitoring production performance: Detecting regressions in live systems
  • Customer-facing quality assurance: Ensuring user experience meets standards
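
For production monitoring, the returned metrics can be checked against agreed thresholds so regressions surface immediately (the threshold values below are placeholders; client and my_rag_function come from the example above):

# Fail a nightly or CI job if the pipeline drops below agreed thresholds
results = client.evaluate_retrieval_and_generation(
    benchmark_id="my-benchmark",
    retrieval_generation_function=my_rag_function,
    evaluation_name="Nightly production check",
)

MIN_CHUNK_F1 = 0.70      # placeholder threshold
MIN_FACTUALITY = 0.85    # placeholder threshold

assert results.chunk_level.f1_score >= MIN_CHUNK_F1, "Retrieval quality regressed"
assert results.generation_metrics.factuality >= MIN_FACTUALITY, "Factuality regressed"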

Generation Only Evaluation

Tests LLM response quality without retrieval. Perfect for prompt engineering and model comparison.

Key Metrics

  • Accuracy: Semantic correctness of the response
  • Factuality: Truthfulness and absence of hallucinations
  • Duration: Response generation time

Example Implementation

def my_generation_function(query: str) -> str:
    """Generate answer without retrieval context."""
    prompt = f"""You are a helpful AI assistant. Answer the following question accurately and concisely.

Question: {query}

Answer:"""
    
    return llm.generate(prompt)

# Evaluate generation quality
results = client.evaluate_generation_only(
    benchmark_id="my-benchmark",
    generation_function=my_generation_function,
    evaluation_name="LLM Comparison: GPT-4 vs Claude"
)

print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Factuality: {results.generation_metrics.factuality:.3f}")

When to Use

  • Prompt engineering: Optimizing prompts for better responses
  • Model comparison: Evaluating different LLMs (GPT-4, Claude, Llama, etc.)
  • Fine-tuning validation: Testing custom-trained models
  • Baseline establishment: Understanding base LLM performance before adding RAG
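
Model comparison typically means running the same benchmark once per candidate and diffing the metrics. A sketch, assuming gpt4_generate and claude_generate are hypothetical wrappers around the respective model APIs:

candidates = {
    "GPT-4": gpt4_generate,      # hypothetical wrapper around the OpenAI API
    "Claude": claude_generate,   # hypothetical wrapper around the Anthropic API
}

for name, generation_function in candidates.items():
    results = client.evaluate_generation_only(
        benchmark_id="my-benchmark",
        generation_function=generation_function,
        evaluation_name=f"LLM Comparison: {name}",
    )
    print(f"{name}: accuracy={results.generation_metrics.accuracy:.3f}, "
          f"factuality={results.generation_metrics.factuality:.3f}")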

Choosing the Right Evaluation Type

Decision Framework

Do you need to test retrieval?
├─ No → Generation Only
│   └─ Use for: Prompt engineering, model comparison
└─ Yes → Do you also need to test generation?
    ├─ Yes → Retrieval + Generation
    │   └─ Use for: Full pipeline validation, production monitoring
    └─ No → Retrieval Only
        └─ Use for: Vector search optimization, database tuning

Evaluation Strategy by Development Phase

🔬 Research & Development

  • Start with Generation Only to establish LLM baselines
  • Move to Retrieval Only to optimize search components
  • Combine with Retrieval + Generation for end-to-end validation

🔧 System Optimization

  • Use Retrieval Only for database and embedding experiments
  • Use Generation Only for prompt and model A/B tests
  • Validate changes with Retrieval + Generation

🚀 Production Deployment

  • Primary: Retrieval + Generation for comprehensive monitoring
  • Secondary: Retrieval Only for search performance tracking
  • Incident Response: Generation Only for LLM-specific issues

Advanced Evaluation Patterns

Multi-Stage Evaluation

# 1. Optimize retrieval first
retrieval_results = client.evaluate_retrieval(
    benchmark_id=benchmark_id,
    retrieval_function=my_retrieval_function
)

# 2. Then optimize generation
generation_results = client.evaluate_generation_only(
    benchmark_id=benchmark_id,
    generation_function=my_generation_function
)

# 3. Finally, test the complete pipeline
full_results = client.evaluate_retrieval_and_generation(
    benchmark_id=benchmark_id,
    retrieval_generation_function=my_rag_function
)

Next Steps

Ready to start evaluating? Check out our Cloud Quickstart or Local SDK guide.
