Datasets
Working with evaluation datasets and imports
Vecta supports importing and working with popular evaluation datasets for RAG system benchmarking. This allows you to leverage existing research benchmarks and compare your system's performance against established baselines.
Supported Datasets
GPQA Diamond
Best for: Generation-only evaluation (pure LLM answer quality)
GPQA (Graduate-Level Google-Proof Q&A) Diamond is a high-quality dataset containing challenging questions across Physics, Chemistry, and Biology that require graduate-level knowledge.
- Questions: About 200 high-quality, expert-written questions (the Diamond subset of GPQA)
- Domains: Physics, Chemistry, Biology
- Use Case: Testing factual accuracy and reasoning capabilities
- Evaluation Type: Generation-only (no retrieval component)
MS MARCO
Best for: Retrieval + generation evaluation
MS MARCO is a large-scale dataset designed for machine reading comprehension and question answering, featuring real queries from the Bing search engine.
- Questions: Over one million real user queries
- Passages: Web passages with relevance annotations
- Use Case: Testing both retrieval accuracy and answer generation
- Evaluation Type: Retrieval + generation
Dataset Importer
The BenchmarkDatasetImporter provides easy access to these datasets:
from vecta import BenchmarkDatasetImporter
importer = BenchmarkDatasetImporter()
# Import GPQA Diamond for generation evaluation
chunks, benchmark_entries = importer.import_gpqa_diamond(
    split="train",  # Dataset split: "train", "test", "validation"
    max_items=50    # Limit number of items for testing
)
# Import MS MARCO for retrieval + generation evaluation
chunks, benchmark_entries = importer.import_msmarco(
    split="test",
    max_items=100
)
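Both importers return the same pair: a list of chunk records and a list of benchmark entries. Before running an evaluation, it can be worth sanity-checking what was imported; a minimal sketch, assuming each BenchmarkEntry exposes the id, question, and answer fields shown later on this page:
# Confirm the import produced what you expect
print(f"Imported {len(chunks)} chunks and {len(benchmark_entries)} benchmark entries")
# Peek at the first few entries
for entry in benchmark_entries[:3]:
    print(entry.id, "->", entry.question[:80])
    print("  expected answer:", str(entry.answer)[:80])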
Working with Imported Datasets
Using with VectaClient
from vecta import VectaClient
# Initialize VectaClient
vecta = VectaClient()
# Load benchmark entries from imported dataset
vecta.benchmark_entries = benchmark_entries
# Evaluate your generation function (GPQA Diamond)
def my_generation_function(question: str) -> str:
    # Your LLM implementation
    response = your_llm.generate(question)
    return response
results = vecta.evaluate_generation_only(
    my_generation_function,
    evaluation_name="GPT-4 on GPQA Diamond"
)
# Evaluate retrieval + generation (MS MARCO)
def my_rag_function(question: str) -> tuple[list[str], str]:
    # Your RAG implementation
    chunks = your_retriever.search(question)
    answer = your_generator.generate(question, chunks)
    return chunks, answer
results = vecta.evaluate_retrieval_and_generation(
    my_rag_function,
    evaluation_name="My RAG on MS MARCO"
)
Using with API Client
from vecta import BenchmarkDatasetImporter, VectaAPIClient
# Import dataset
importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=50)
# Upload to cloud platform (if using cloud evaluation)
client = VectaAPIClient(api_key="your-api-key")
# Create benchmark from imported data
# Note: This would require custom upload functionality
# For now, use local evaluation with imported datasets
Dataset Schemas
Vecta uses schema mapping to handle different dataset formats consistently:
DatasetSchema Structure
from vecta.core.schemas import DatasetSchema
# Example: GPQA Diamond schema
gpqa_schema = DatasetSchema(
    question_accessor="Question",
    answer_accessor="Correct Answer",
    id_accessor="Record ID",
    additional_accessors={
        "incorrect_answer_1": "Incorrect Answer 1",
        "incorrect_answer_2": "Incorrect Answer 2",
        "incorrect_answer_3": "Incorrect Answer 3",
        "explanation": "Explanation",
        "subdomain": "Subdomain",
        "high_level_domain": "High-level domain",
    }
)
# Example: MS MARCO schema
msmarco_schema = DatasetSchema(
    question_accessor="query",
    answer_accessor="answers",
    id_accessor="query_id",
    additional_accessors={
        "passages": "passages",
    }
)
Schema Field Extraction
The schema system uses DataAccessor patterns to extract fields from various data structures; a usage sketch follows the list below:
- "field_name": Direct field access
- "metadata.nested_field": Nested field access
- "[0]": Array index access
- "json(field).subfield": JSON parsing and extraction
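As an illustration, the sketch below combines direct accessors with one nested accessor and extracts fields from a hypothetical record. The record shape and the "metadata.source" field are assumptions for the example, and the extract_dataset_fields call mirrors the one in the custom-dataset section further down this page:
from vecta.core.schemas import DatasetSchema
# Hypothetical record with a nested metadata block
record = {
    "id": "q-001",
    "query": "what is a lock washer used for",
    "answer": "It keeps nuts and bolts from loosening under vibration.",
    "metadata": {"source": "faq.json"},
}
# Direct accessors plus one nested accessor, following the patterns above
example_schema = DatasetSchema(
    question_accessor="query",
    answer_accessor="answer",
    id_accessor="id",
    additional_accessors={
        "source": "metadata.source",  # nested field access
    },
)
fields = example_schema.extract_dataset_fields(record)
print(fields["id"], fields["question"], fields["answer"])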
Dataset-Specific Considerations
GPQA Diamond
Characteristics:
- Graduate-level difficulty
- Expert-validated questions and answers
- Multiple choice format with distractors
- Domain-specific terminology
Evaluation Focus:
- Factual accuracy
- Reasoning capability
- Domain expertise
- Answer completeness
Usage Tips:
# GPQA questions are quite challenging
chunks, entries = importer.import_gpqa_diamond(max_items=25)
# Consider using advanced models for meaningful results
def advanced_generation(question: str) -> str:
    # Use your most capable model
    return gpt4_model.generate(question)
results = vecta.evaluate_generation_only(advanced_generation)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
MS MARCO
Characteristics:
- Real user queries from search engine
- Web-based passages
- Multiple passages per query
- Relevance annotations
Evaluation Focus:
- Retrieval precision and recall
- Answer relevance to user intent
- Handling of web content
- Multi-passage reasoning
Usage Tips:
# MS MARCO has many passages per query
chunks, entries = importer.import_msmarco(max_items=50)
# Focus on both retrieval and generation quality
def rag_with_reranking(question: str) -> tuple[list[str], str]:
    # Initial retrieval
    candidates = initial_retriever.search(question, k=20)
    # Rerank for relevance
    top_chunks = reranker.rerank(question, candidates, k=5)
    # Generate answer
    answer = generator.generate(question, top_chunks)
    return [c.id for c in top_chunks], answer
results = vecta.evaluate_retrieval_and_generation(rag_with_reranking)
Custom Dataset Integration
You can extend the dataset importer for your own datasets:
Creating Custom Schema
from vecta.core.schemas import DatasetSchema, ChunkData, BenchmarkEntry
# Define schema for your dataset format
custom_schema = DatasetSchema(
    question_accessor="your_question_field",
    answer_accessor="your_answer_field",
    id_accessor="your_id_field",
    additional_accessors={
        "custom_field": "your_custom_field",
    }
)
# Process your dataset
def import_custom_dataset(dataset_path: str):
    # Load your dataset (CSV, JSON, etc.)
    raw_data = load_your_dataset(dataset_path)
    chunks = []
    benchmark_entries = []
    for item in raw_data:
        # Extract using schema
        fields = custom_schema.extract_dataset_fields(item)
        # Create benchmark entry
        entry = BenchmarkEntry(
            id=str(fields["id"]),
            question=fields["question"],
            answer=fields["answer"],
            chunk_ids=None,  # For generation-only tasks
            page_nums=None,
            source_paths=["custom_dataset"],
        )
        benchmark_entries.append(entry)
    return chunks, benchmark_entries
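Note that load_your_dataset above is a placeholder rather than a Vecta helper. A minimal sketch, assuming the custom dataset is a JSON file containing a list of records:
import json
def load_your_dataset(dataset_path: str) -> list[dict]:
    # Placeholder loader: read a JSON file that contains a list of records
    with open(dataset_path, "r", encoding="utf-8") as f:
        return json.load(f)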
Integration with Vecta
# Use custom dataset
chunks, entries = import_custom_dataset("my_dataset.json")
# Evaluate with Vecta
vecta = VectaClient()
vecta.benchmark_entries = entries
results = vecta.evaluate_generation_only(my_function)
Evaluation Patterns
Generation-Only Datasets (GPQA)
# Focus on answer quality metrics
results = vecta.evaluate_generation_only(generation_function)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Groundedness: {results.generation_metrics.groundedness:.3f}")
Retrieval + Generation Datasets (MS MARCO)
# Focus on both retrieval and generation
results = vecta.evaluate_retrieval_and_generation(rag_function)
print("Retrieval Performance:")
print(f" Chunk-level F1: {results.chunk_level.f1_score:.3f}")
print(f" Document-level F1: {results.document_level.f1_score:.3f}")
print("Generation Performance:")
print(f" Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f" Groundedness: {results.generation_metrics.groundedness:.3f}")
Dataset Management
Local Storage
# Save imported benchmark for reuse
vecta.benchmark_entries = benchmark_entries
vecta.save_benchmark("gpqa_diamond_benchmark.csv")
# Load for future evaluations
vecta.load_benchmark("gpqa_diamond_benchmark.csv")
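Once a benchmark has been saved, later sessions can reload it and evaluate directly, skipping the import step. A sketch composed from the calls shown above (my_generation_function is the example function defined earlier):
from vecta import VectaClient
# Re-run an evaluation from the saved benchmark without re-importing the dataset
vecta = VectaClient()
vecta.load_benchmark("gpqa_diamond_benchmark.csv")
results = vecta.evaluate_generation_only(
    my_generation_function,
    evaluation_name="GPT-4 on GPQA Diamond (reloaded)"
)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")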
Version Control
# Track different dataset versions
datasets = {
    "gpqa_diamond_v1": importer.import_gpqa_diamond(max_items=50),
    "gpqa_diamond_v2": importer.import_gpqa_diamond(max_items=100),
    "msmarco_test": importer.import_msmarco(split="test", max_items=25),
}
# Compare performance across datasets
for name, (chunks, entries) in datasets.items():
    vecta.benchmark_entries = entries
    results = vecta.evaluate_generation_only(my_function)
    print(f"{name}: Accuracy = {results.generation_metrics.accuracy:.3f}")
Best Practices
Dataset Selection
- GPQA Diamond: Use for testing factual accuracy and reasoning on complex topics
- MS MARCO: Use for testing real-world query handling and retrieval effectiveness
- Custom datasets: Create domain-specific benchmarks for specialized applications
Evaluation Strategy
- Start with small samples (25-50 questions) for rapid iteration; see the sketch after this list
- Scale to full datasets for comprehensive evaluation
- Compare against multiple datasets to understand system strengths/weaknesses
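For example, you might iterate on a small GPQA sample and only switch to a larger sample for the final run; a sketch using the importer and client calls shown above:
# Rapid iteration on a small sample
chunks, entries = importer.import_gpqa_diamond(max_items=25)
vecta.benchmark_entries = entries
quick_results = vecta.evaluate_generation_only(my_generation_function)
# Larger run once the pipeline is stable
chunks, entries = importer.import_gpqa_diamond(max_items=200)
vecta.benchmark_entries = entries
full_results = vecta.evaluate_generation_only(my_generation_function)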
Performance Interpretation
- GPQA results indicate pure generation capability
- MS MARCO results indicate end-to-end RAG performance
- Cross-dataset consistency suggests robust system design
This dataset integration capability allows you to leverage established benchmarks while maintaining the flexibility to create custom evaluations tailored to your specific use case.