Evaluations
Learn about different RAG evaluation approaches
Vecta supports multiple evaluation approaches to comprehensively test different components of your RAG system. Choose the right evaluation type based on what you want to optimize.
Overview
Evaluation Type | What It Tests | Key Metrics | Use Cases |
---|---|---|---|
Ingestion Only | Data processing and chunking | Levenshtein, BLEU, Duration | Database optimization, chunking strategies |
Retrieval Only | Search and ranking quality | Precision, Recall, F1 | Vector search tuning, embedding models |
Retrieval + Generation | Full RAG pipeline | Recall, Precision, F1, Accuracy, Factuality | End-to-end system validation |
Generation Only | LLM response quality | Accuracy, Factuality, Duration | Prompt engineering, model comparison |
Ingestion Only Evaluation
Coming Soon - Currently in development
Tests how well your system processes and stores documents for retrieval. Perfect for optimizing data pipelines and chunking strategies.
What It Measures
- Content Preservation: How much of the original information is retained during ingestion
- Structural Integrity: Whether document structure (headings, lists, tables) is maintained
- Processing Speed: Time taken to ingest and index documents
- Chunking Quality: How effectively documents are split into retrievable pieces
Key Metrics
- Levenshtein Distance: Character-level similarity between original and processed text
- BLEU Score: N-gram overlap measuring content preservation
- Accuracy: Percentage of content correctly ingested
- Duration: Processing time per document or batch
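Vecta computes these scores for you during an ingestion evaluation, but the underlying ideas are simple to reproduce. The following hand-rolled sketch (an illustration only, not Vecta's implementation) shows what Levenshtein distance and a BLEU-style n-gram overlap measure when comparing original text to its ingested version.

# Illustration only: what the content-preservation metrics measure.
# This is not Vecta's implementation.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

def ngram_overlap(reference: str, candidate: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference
    (a simplified, BLEU-like precision without the brevity penalty)."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    ref_ngrams = {tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)}
    cand_ngrams = [tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)]
    if not cand_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in cand_ngrams) / len(cand_ngrams)

original = "Refunds are processed within 14 business days."
ingested = "Refunds are processed within 14 days."
print(levenshtein(original, ingested))          # character-level edit distance
print(ngram_overlap(original, ingested, n=2))   # bigram preservation ratio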
Example Use Cases
# Coming soon - ingestion evaluation
results = client.evaluate_ingestion(
    documents=my_documents,
    ingestion_function=my_ingestion_pipeline,
    chunking_strategy="semantic"
)
Retrieval Only Evaluation
Tests how effectively your system finds relevant information without considering answer generation.
What It Measures
- Relevance: Do retrieved chunks contain information needed to answer the question?
- Coverage: Are all relevant pieces of information found?
- Precision vs Recall Trade-offs: Balance between accuracy and completeness
Key Metrics
- Precision: Of retrieved chunks, how many are relevant?
- Recall: Of all relevant chunks, how many were retrieved?
- F1 Score: Harmonic mean of precision and recall
- Duration: Time taken to perform retrieval
Granularity Levels
Vecta measures retrieval quality at three levels:
- Chunk Level: Individual text segments that contain the answer
- Page Level: Document pages that contain relevant information
- Document Level: Entire documents that are relevant to the query
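All three levels are derived from the same comparison of predicted versus expected IDs. As a rough illustration (not Vecta's internal code), the sketch below computes precision, recall, and F1 for a single query at the chunk level and rolls the same predictions up to the document level, assuming a made-up mapping from chunk IDs to their source documents.

# Illustration only: set-based precision/recall/F1 at the chunk level,
# rolled up to the document level. The chunk_to_doc mapping is a made-up
# example, not a Vecta data structure.

def prf1(predicted: set[str], expected: set[str]) -> tuple[float, float, float]:
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

chunk_to_doc = {"c1": "doc-a", "c2": "doc-a", "c3": "doc-b", "c4": "doc-c"}

retrieved_chunks = {"c1", "c3", "c4"}
relevant_chunks = {"c1", "c2", "c3"}

# Chunk level: compare the raw IDs.
print(prf1(retrieved_chunks, relevant_chunks))

# Document level: map each chunk to its parent document before comparing.
retrieved_docs = {chunk_to_doc[c] for c in retrieved_chunks}
relevant_docs = {chunk_to_doc[c] for c in relevant_chunks}
print(prf1(retrieved_docs, relevant_docs))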
Example Implementation
def my_retrieval_function(query: str) -> list[str]:
    """Return chunk IDs for the most relevant content."""
    # Your retrieval logic here
    results = vector_db.search(query, k=10)
    return [result.id for result in results]

# Evaluate retrieval performance
results = client.evaluate_retrieval(
    benchmark_id="my-benchmark",
    retrieval_function=my_retrieval_function,
    evaluation_name="Vector Search Optimization v1"
)

print(f"Chunk F1: {results.chunk_level.f1_score:.3f}")
print(f"Page F1: {results.page_level.f1_score:.3f}")
print(f"Document F1: {results.document_level.f1_score:.3f}")
When to Use
- Optimizing vector databases: Compare embedding models, distance metrics, indexing parameters
- Tuning search parameters: Find optimal k values and similarity thresholds (see the sketch after this list)
- A/B testing retrieval strategies: Hybrid search, reranking, query expansion
- Debugging retrieval failures: Identify why relevant content isn't being found
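Several of these scenarios, such as finding an optimal k, reduce to running the same benchmark repeatedly with different settings. The sketch below reuses the client.evaluate_retrieval call and the vector_db object from the example above; which parameters are worth sweeping depends on your own retrieval stack.

# Sketch: sweep the k parameter of the retrieval function shown above
# and compare chunk-level F1 across runs. Assumes the same `client`
# and `vector_db` objects as the earlier example.

def make_retrieval_function(k: int):
    def retrieve(query: str) -> list[str]:
        return [r.id for r in vector_db.search(query, k=k)]
    return retrieve

for k in (5, 10, 20):
    results = client.evaluate_retrieval(
        benchmark_id="my-benchmark",
        retrieval_function=make_retrieval_function(k),
        evaluation_name=f"Vector Search k={k}"
    )
    print(f"k={k}: chunk F1={results.chunk_level.f1_score:.3f}")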
Retrieval + Generation Evaluation
Evaluates your complete RAG pipeline, measuring both retrieval quality and answer generation accuracy.
What It Measures
- Retrieval Performance: All metrics from retrieval-only evaluation
- Answer Quality: How well the LLM generates responses from retrieved context
- Information Synthesis: Ability to combine information from multiple sources
- Factual Accuracy: Whether generated answers are truthful and grounded
Key Metrics
Retrieval Metrics:
- Precision, Recall, F1 at chunk/page/document levels

Generation Metrics:
- Accuracy: Semantic similarity between generated and expected answers
- Factuality: Whether the answer is factually correct and grounded in sources
- Duration: Time taken to generate the answer
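Vecta scores these metrics during an evaluation run. If you want an intuition for what semantic similarity between a generated and an expected answer looks like, the sketch below uses the sentence-transformers library as one possible embedding backend; it is illustrative only and not necessarily how Vecta computes accuracy.

# Intuition only: semantic similarity between a generated and an expected
# answer, using sentence-transformers as an example embedder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

expected = "Refunds are issued within 14 business days."
generated = "You should receive your refund in about two weeks."

embeddings = model.encode([expected, generated], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.3f}")  # high score despite different wording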
Example Implementation
def my_rag_function(query: str) -> tuple[list[str], str]:
    """Return (chunk_ids, generated_answer) for a query."""
    # 1. Retrieve relevant context
    search_results = vector_db.search(query, k=5)
    chunk_ids = [result.id for result in search_results]

    # 2. Build context string
    context = "\n".join([result.content for result in search_results])

    # 3. Generate answer
    prompt = f"""Answer the question based on the provided context.
Context: {context}
Question: {query}
Answer:"""
    answer = llm.generate(prompt)
    return chunk_ids, answer

# Evaluate full RAG pipeline
results = client.evaluate_retrieval_and_generation(
    benchmark_id="my-benchmark",
    retrieval_generation_function=my_rag_function,
    evaluation_name="Full RAG Pipeline v2.1"
)

# View comprehensive results
print(f"Retrieval F1: {results.chunk_level.f1_score:.3f}")
print(f"Generation Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Generation Factuality: {results.generation_metrics.factuality:.3f}")
print(f"Generation Duration: {results.generation_metrics.duration:.2f}s")
When to Use
- End-to-end system validation: Before deploying to production
- Comparing RAG architectures: Different retrieval + generation strategies
- Monitoring production performance: Detecting regressions in live systems
- Customer-facing quality assurance: Ensuring user experience meets standards
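For production monitoring, a common pattern is to re-run the same benchmark on a schedule and flag runs that fall below an agreed baseline. Here is a minimal sketch reusing my_rag_function from the example above; the threshold values are placeholders you would choose for your own system.

# Sketch of a scheduled regression check. Thresholds are illustrative placeholders.
BASELINE_CHUNK_F1 = 0.80
BASELINE_FACTUALITY = 0.90

results = client.evaluate_retrieval_and_generation(
    benchmark_id="my-benchmark",
    retrieval_generation_function=my_rag_function,
    evaluation_name="Nightly production check"
)

if results.chunk_level.f1_score < BASELINE_CHUNK_F1:
    print("Retrieval regression detected: investigate index or embeddings")
if results.generation_metrics.factuality < BASELINE_FACTUALITY:
    print("Factuality regression detected: review prompts or model version")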
Generation Only Evaluation
Tests LLM response quality without retrieval, perfect for prompt engineering and model comparison.
Key Metrics
- Accuracy: Semantic correctness of the response
- Factuality: Truthfulness and absence of hallucinations
- Duration: Response generation time
Example Implementation
def my_generation_function(query: str) -> str:
    """Generate answer without retrieval context."""
    prompt = f"""You are a helpful AI assistant. Answer the following question accurately and concisely.
Question: {query}
Answer:"""
    return llm.generate(prompt)

# Evaluate generation quality
results = client.evaluate_generation_only(
    benchmark_id="my-benchmark",
    generation_function=my_generation_function,
    evaluation_name="LLM Comparison: GPT-4 vs Claude"
)

print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Factuality: {results.generation_metrics.factuality:.3f}")
When to Use
- Prompt engineering: Optimizing prompts for better responses
- Model comparison: Evaluating different LLMs (GPT-4, Claude, Llama, etc.)
- Fine-tuning validation: Testing custom-trained models
- Baseline establishment: Understanding base LLM performance before adding RAG
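Model comparison typically means running the same benchmark once per candidate model and comparing the scores. The sketch below assumes two hypothetical clients, gpt4_llm and claude_llm, that expose the same generate() interface as the llm object used earlier.

# Sketch: compare two models on the same benchmark. `gpt4_llm` and
# `claude_llm` are hypothetical clients exposing generate(prompt) -> str.

def make_generation_function(model):
    def generate(query: str) -> str:
        return model.generate(f"Answer accurately and concisely.\nQuestion: {query}\nAnswer:")
    return generate

for name, model in [("gpt-4", gpt4_llm), ("claude", claude_llm)]:
    results = client.evaluate_generation_only(
        benchmark_id="my-benchmark",
        generation_function=make_generation_function(model),
        evaluation_name=f"Generation baseline: {name}"
    )
    print(f"{name}: accuracy={results.generation_metrics.accuracy:.3f}, "
          f"factuality={results.generation_metrics.factuality:.3f}")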
Choosing the Right Evaluation Type
Decision Framework
Do you need to test retrieval?
├── No → Generation Only
│   └── Use for: Prompt engineering, model comparison
└── Yes → Do you also need to test generation?
    ├── Yes → Retrieval + Generation
    │   └── Use for: Full pipeline validation, production monitoring
    └── No → Retrieval Only
        └── Use for: Vector search optimization, database tuning
Evaluation Strategy by Development Phase
Research & Development
- Start with Generation Only to establish LLM baselines
- Move to Retrieval Only to optimize search components
- Combine with Retrieval + Generation for end-to-end validation
System Optimization
- Use Retrieval Only for database and embedding experiments
- Use Generation Only for prompt and model A/B tests
- Validate changes with Retrieval + Generation
Production Deployment
- Primary: Retrieval + Generation for comprehensive monitoring
- Secondary: Retrieval Only for search performance tracking
- Incident Response: Generation Only for LLM-specific issues
Advanced Evaluation Patterns
Multi-Stage Evaluation
# 1. Optimize retrieval first
retrieval_results = client.evaluate_retrieval(
    benchmark_id=benchmark_id,
    retrieval_function=my_retrieval_function
)

# 2. Then optimize generation
generation_results = client.evaluate_generation_only(
    benchmark_id=benchmark_id,
    generation_function=my_generation_function
)

# 3. Finally, test the complete pipeline
full_results = client.evaluate_retrieval_and_generation(
    benchmark_id=benchmark_id,
    retrieval_generation_function=my_rag_function
)
Next Steps
Ready to start evaluating? Check out our Cloud Quickstart or Local SDK guide.