
Datasets

Working with evaluation datasets and imports

Last updated: August 19, 2025
Category: data

Vecta supports importing and working with popular evaluation datasets for RAG system benchmarking. This allows you to leverage existing research benchmarks and compare your system's performance against established baselines.

Supported Datasets

GPQA Diamond

Best for: Generation-only evaluation (pure LLM answer quality)

GPQA (Graduate-Level Google-Proof Q&A) Diamond is a high-quality dataset containing challenging questions across Physics, Chemistry, and Biology that require graduate-level knowledge.

  • Questions: About 200 expert-written, expert-validated questions (the Diamond subset of GPQA)
  • Domains: Physics, Chemistry, Biology
  • Use Case: Testing factual accuracy and reasoning capabilities
  • Evaluation Type: Generation-only (no retrieval component)

MS MARCO

Best for: Retrieval + generation evaluation

MS MARCO is a large-scale dataset designed for machine reading comprehension and question answering, featuring real queries from the Bing search engine.

  • Questions: Over one million real user queries
  • Passages: Web passages with relevance annotations
  • Use Case: Testing both retrieval accuracy and answer generation
  • Evaluation Type: Retrieval + generation

Dataset Importer

The BenchmarkDatasetImporter provides easy access to these datasets:

from vecta import BenchmarkDatasetImporter

importer = BenchmarkDatasetImporter()

# Import GPQA Diamond for generation evaluation
chunks, benchmark_entries = importer.import_gpqa_diamond(
    split="train",        # Dataset split: "train", "test", "validation"
    max_items=50         # Limit number of items for testing
)

# Import MS MARCO for retrieval + generation evaluation
chunks, benchmark_entries = importer.import_msmarco(
    split="test",
    max_items=100
)

Working with Imported Datasets

Using with VectaClient

from vecta import VectaClient

# Initialize VectaClient
vecta = VectaClient()

# Load benchmark entries from imported dataset
vecta.benchmark_entries = benchmark_entries

# Evaluate your generation function (GPQA Diamond)
def my_generation_function(question: str) -> str:
    # Your LLM implementation
    response = your_llm.generate(question)
    return response

results = vecta.evaluate_generation_only(
    my_generation_function,
    evaluation_name="GPT-4 on GPQA Diamond"
)

# Evaluate retrieval + generation (MS MARCO)
def my_rag_function(question: str) -> tuple[list[str], str]:
    # Your RAG implementation: return the retrieved chunk IDs (or texts)
    # along with the generated answer
    retrieved = your_retriever.search(question)
    answer = your_generator.generate(question, retrieved)
    return [chunk.id for chunk in retrieved], answer

results = vecta.evaluate_retrieval_and_generation(
    my_rag_function,
    evaluation_name="My RAG on MS MARCO"
)

Using with API Client

from vecta import BenchmarkDatasetImporter, VectaAPIClient

# Import dataset
importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=50)

# Upload to cloud platform (if using cloud evaluation)
client = VectaAPIClient(api_key="your-api-key")

# Create benchmark from imported data
# Note: This would require custom upload functionality
# For now, use local evaluation with imported datasets

Dataset Schemas

Vecta uses schema mapping to handle different dataset formats consistently:

DatasetSchema Structure

from vecta.core.schemas import DatasetSchema

# Example: GPQA Diamond schema
gpqa_schema = DatasetSchema(
    question_accessor="Question",
    answer_accessor="Correct Answer", 
    id_accessor="Record ID",
    additional_accessors={
        "incorrect_answer_1": "Incorrect Answer 1",
        "incorrect_answer_2": "Incorrect Answer 2", 
        "incorrect_answer_3": "Incorrect Answer 3",
        "explanation": "Explanation",
        "subdomain": "Subdomain",
        "high_level_domain": "High-level domain",
    }
)

# Example: MS MARCO schema
msmarco_schema = DatasetSchema(
    question_accessor="query",
    answer_accessor="answers",
    id_accessor="query_id",
    additional_accessors={
        "passages": "passages",
    }
)

Schema Field Extraction

The schema system uses DataAccessor patterns to extract fields from various data structures (a combined example follows the list):

  • "field_name" - Direct field access
  • "metadata.nested_field" - Nested field access
  • "[0]" - Array index access
  • "json(field).subfield" - JSON parsing and extraction

Dataset-Specific Considerations

GPQA Diamond

Characteristics:

  • Graduate-level difficulty
  • Expert-validated questions and answers
  • Multiple choice format with distractors
  • Domain-specific terminology

Evaluation Focus:

  • Factual accuracy
  • Reasoning capability
  • Domain expertise
  • Answer completeness

Usage Tips:

# GPQA questions are quite challenging
chunks, entries = importer.import_gpqa_diamond(max_items=25)

# Consider using advanced models for meaningful results
def advanced_generation(question: str) -> str:
    # Use your most capable model
    return gpt4_model.generate(question)

results = vecta.evaluate_generation_only(advanced_generation)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")

MS MARCO

Characteristics:

  • Real user queries from search engine
  • Web-based passages
  • Multiple passages per query
  • Relevance annotations

Evaluation Focus:

  • Retrieval precision and recall
  • Answer relevance to user intent
  • Handling of web content
  • Multi-passage reasoning

Usage Tips:

# MS MARCO has many passages per query
chunks, entries = importer.import_msmarco(max_items=50)

# Focus on both retrieval and generation quality
def rag_with_reranking(question: str) -> tuple[list[str], str]:
    # Initial retrieval
    candidates = initial_retriever.search(question, k=20)
    
    # Rerank for relevance
    top_chunks = reranker.rerank(question, candidates, k=5)
    
    # Generate answer
    answer = generator.generate(question, top_chunks)
    
    return [c.id for c in top_chunks], answer

results = vecta.evaluate_retrieval_and_generation(rag_with_reranking)

Custom Dataset Integration

You can extend the dataset importer for your own datasets:

Creating Custom Schema

from vecta.core.schemas import DatasetSchema, ChunkData, BenchmarkEntry

# Define schema for your dataset format
custom_schema = DatasetSchema(
    question_accessor="your_question_field",
    answer_accessor="your_answer_field", 
    id_accessor="your_id_field",
    additional_accessors={
        "custom_field": "your_custom_field",
    }
)

# Process your dataset
def import_custom_dataset(dataset_path: str):
    # Load your dataset (CSV, JSON, etc.)
    raw_data = load_your_dataset(dataset_path)
    
    chunks = []  # stays empty for generation-only datasets; build ChunkData objects if your dataset includes passages
    benchmark_entries = []
    
    for item in raw_data:
        # Extract using schema
        fields = custom_schema.extract_dataset_fields(item)
        
        # Create benchmark entry
        entry = BenchmarkEntry(
            id=str(fields["id"]),
            question=fields["question"],
            answer=fields["answer"],
            chunk_ids=None,  # For generation-only tasks
            page_nums=None,
            source_paths=["custom_dataset"],
        )
        benchmark_entries.append(entry)
    
    return chunks, benchmark_entries

Integration with Vecta

# Use custom dataset
chunks, entries = import_custom_dataset("my_dataset.json")

# Evaluate with Vecta
vecta = VectaClient()
vecta.benchmark_entries = entries

results = vecta.evaluate_generation_only(my_function)

Evaluation Patterns

Generation-Only Datasets (GPQA)

# Focus on answer quality metrics
results = vecta.evaluate_generation_only(generation_function)

print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Groundedness: {results.generation_metrics.groundedness:.3f}")

Retrieval + Generation Datasets (MS MARCO)

# Focus on both retrieval and generation
results = vecta.evaluate_retrieval_and_generation(rag_function)

print("Retrieval Performance:")
print(f"  Chunk-level F1: {results.chunk_level.f1_score:.3f}")
print(f"  Document-level F1: {results.document_level.f1_score:.3f}")

print("Generation Performance:")
print(f"  Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"  Groundedness: {results.generation_metrics.groundedness:.3f}")

Dataset Management

Local Storage

# Save imported benchmark for reuse
vecta.benchmark_entries = benchmark_entries
vecta.save_benchmark("gpqa_diamond_benchmark.csv")

# Load for future evaluations
vecta.load_benchmark("gpqa_diamond_benchmark.csv")

Version Control

# Track different dataset versions
datasets = {
    "gpqa_diamond_v1": importer.import_gpqa_diamond(max_items=50),
    "gpqa_diamond_v2": importer.import_gpqa_diamond(max_items=100),
    "msmarco_test": importer.import_msmarco(split="test", max_items=25),
}

# Compare performance across datasets
for name, (chunks, entries) in datasets.items():
    vecta.benchmark_entries = entries
    results = vecta.evaluate_generation_only(my_function)
    print(f"{name}: Accuracy = {results.generation_metrics.accuracy:.3f}")

Best Practices

Dataset Selection

  • GPQA Diamond: Use for testing factual accuracy and reasoning on complex topics
  • MS MARCO: Use for testing real-world query handling and retrieval effectiveness
  • Custom datasets: Create domain-specific benchmarks for specialized applications

Evaluation Strategy

  • Start with small samples (25-50 questions) for rapid iteration
  • Scale to full datasets for comprehensive evaluation (see the sketch after this list)
  • Compare against multiple datasets to understand system strengths/weaknesses
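
A minimal sketch of that workflow, reusing the importer, VectaClient, and my_function placeholders from earlier sections; whether import_gpqa_diamond can be called without max_items to pull the full split is an assumption about the importer.

# Quick iteration: a small sample for fast feedback on the pipeline
chunks, entries = importer.import_gpqa_diamond(max_items=25)
vecta.benchmark_entries = entries
smoke = vecta.evaluate_generation_only(my_function, evaluation_name="GPQA smoke test")

# Comprehensive run once the pipeline is stable
chunks, entries = importer.import_gpqa_diamond()  # assumed: omitting max_items imports everything
vecta.benchmark_entries = entries
full = vecta.evaluate_generation_only(my_function, evaluation_name="GPQA full run")

print(f"Smoke test accuracy: {smoke.generation_metrics.accuracy:.3f}")
print(f"Full run accuracy:   {full.generation_metrics.accuracy:.3f}")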

Performance Interpretation

  • GPQA results indicate pure generation capability
  • MS MARCO results indicate end-to-end RAG performance
  • Cross-dataset consistency suggests robust system design

This dataset integration capability allows you to leverage established benchmarks while maintaining the flexibility to create custom evaluations tailored to your specific use case.
