Benchmarks

What are benchmarks and how to use them

Last updated: August 19, 2025
Category: benchmarks

Benchmarks are test datasets of questions, ground-truth answers, and the chunks, pages, and documents expected to be retrieved. They power all Vecta evaluations.

What's in a Benchmark?

Each benchmark entry pairs a question with its ground-truth answer and the chunk IDs, page numbers, and source paths that locate it:

{
    "question": "What is the API rate limit?",
    "answer": "1000 requests per minute",
    "chunk_ids": ["chunk_1", "chunk_2"],
    "page_nums": [15, 16],
    "source_paths": ["api_docs.pdf"]
}

Three Ways to Create Benchmarks

1. Synthetic Generation (Recommended)

Auto-generate from your data:

benchmark = client.create_benchmark(
    data_source_id="your-db-id",
    questions_count=100
)
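
Generation runs as a background job; the Quick Example below passes wait_for_completion=True to block until the benchmark is ready.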

Learn more →

2. CSV Upload

Import existing Q&A datasets:

# Format: question, answer, chunk_ids, page_nums, source_paths
vecta.load_benchmark("your_benchmark.csv")
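
If you're assembling the file programmatically, here is a minimal sketch using Python's csv module. The header names follow the format comment above; how list-valued fields are delimited within a cell is an assumption here, so check the CSV spec for your Vecta version.

import csv

# Sample rows mirroring the schema shown in "What's in a Benchmark?".
# Joining list-valued fields with ";" is an assumption, not a confirmed format.
rows = [
    {
        "question": "What is the API rate limit?",
        "answer": "1000 requests per minute",
        "chunk_ids": "chunk_1;chunk_2",
        "page_nums": "15;16",
        "source_paths": "api_docs.pdf",
    }
]

with open("your_benchmark.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["question", "answer", "chunk_ids", "page_nums", "source_paths"]
    )
    writer.writeheader()
    writer.writerows(rows)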

Learn more →

3. Hugging Face Datasets

Use popular evaluation datasets:

from vecta.core.dataset_importer import BenchmarkDatasetImporter

importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=50)
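
The importer returns the corpus chunks alongside the benchmark entries. A quick sanity check before evaluating against them (the exact entry shape is whatever your Vecta version produces):

# Inspect the imported data before using it in an evaluation
print(f"imported {len(chunks)} chunks and {len(entries)} benchmark entries")
print(entries[0])  # spot-check one entry's fields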

Learn more →

Quick Example

from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# Connect database
db = client.connect_pinecone(api_key="...", index="docs")

# Generate benchmark
benchmark = client.create_benchmark(
    data_source_id=db["id"],
    questions_count=50,
    wait_for_completion=True
)

# Use in evaluations
results = client.evaluate_retrieval(
    benchmark_id=benchmark["id"],
    retrieval_function=my_retriever
)
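
evaluate_retrieval calls your retrieval function on each benchmark question. A hypothetical stub is below; the expected signature and return type (question in, ranked chunk IDs out) are assumptions, so check the evaluation docs for your version:

def my_retriever(question: str) -> list[str]:
    # Replace this stub with your real pipeline: embed the question,
    # query your vector store, and return the retrieved chunk IDs in rank order.
    return ["chunk_1", "chunk_2"]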

Benchmark Quality

Good benchmarks have (see the spot-check sketch after this list):

  • Coverage: Questions across all document types
  • Difficulty range: Mix of easy and hard questions
  • Multi-hop: Some questions require multiple chunks
  • Ground truth accuracy: All relevant chunks identified
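
A quick way to spot-check coverage and multi-hop share, assuming entries is a list of dicts in the schema shown under "What's in a Benchmark?" (the sample entry here is a placeholder):

from collections import Counter

# Placeholder entries in the schema shown above; load your own instead.
entries = [
    {"question": "...", "answer": "...", "chunk_ids": ["chunk_1", "chunk_2"],
     "page_nums": [15, 16], "source_paths": ["api_docs.pdf"]},
]

# Coverage: how many questions touch each source document
coverage = Counter(p for e in entries for p in e["source_paths"])
print("questions per source:", dict(coverage))

# Multi-hop: questions whose ground truth spans more than one chunk
multi_hop = sum(1 for e in entries if len(e["chunk_ids"]) > 1)
print(f"multi-hop questions: {multi_hop}/{len(entries)}")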

Managing Benchmarks

# List all benchmarks
benchmarks = client.list_benchmarks()

# Get details
benchmark = client.get_benchmark("benchmark-id")

# Export to CSV
csv_data = client.export_benchmark("benchmark-id")

# Delete
client.delete_benchmark("benchmark-id")
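
To keep the export around, assuming export_benchmark returns the CSV as text:

# Write the exported benchmark to disk
with open("benchmark_export.csv", "w") as f:
    f.write(csv_data)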

Best Practices

  • Start small: 25-50 questions for initial testing
  • Scale up: 100-500 questions for comprehensive evaluation
  • Update regularly: Regenerate when your data changes significantly
  • Multiple benchmarks: Create separate benchmarks for different domains (see the sketch after this list)
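
For the last point, a hypothetical sketch, assuming each domain already has its own connected data source (the IDs here are placeholders):

# One benchmark per domain; data source IDs are hypothetical placeholders.
domain_sources = {"api": "api-docs-db-id", "billing": "billing-db-id"}

benchmarks = {
    domain: client.create_benchmark(data_source_id=source_id, questions_count=100)
    for domain, source_id in domain_sources.items()
}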
