Benchmarks
Overview
What are benchmarks and how to use them
Last updated: August 19, 2025
Category: benchmarks
Benchmarks are test datasets containing questions, answers, and expected chunks/pages/documents. They power all Vecta evaluations.
What's in a Benchmark?
Each benchmark entry contains:
{
  "question": "What is the API rate limit?",
  "answer": "1000 requests per minute",
  "chunk_ids": ["chunk_1", "chunk_2"],
  "page_nums": [15, 16],
  "source_paths": ["api_docs.pdf"]
}
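If you assemble entries yourself, a quick field check helps catch malformed records before upload. The sketch below is illustrative only and not part of the Vecta SDK:
# Illustrative validation helper (not part of the Vecta SDK).
REQUIRED_FIELDS = {"question", "answer", "chunk_ids", "page_nums", "source_paths"}
def validate_entry(entry: dict) -> bool:
    """Return True if the entry contains every field Vecta expects."""
    return REQUIRED_FIELDS.issubset(entry)
entry = {
    "question": "What is the API rate limit?",
    "answer": "1000 requests per minute",
    "chunk_ids": ["chunk_1", "chunk_2"],
    "page_nums": [15, 16],
    "source_paths": ["api_docs.pdf"],
}
assert validate_entry(entry)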
Three Ways to Create Benchmarks
1. Synthetic Generation (Recommended)
Auto-generate from your data:
benchmark = client.create_benchmark(
    data_source_id="your-db-id",
    questions_count=100
)
2. CSV Upload
Import existing Q&A datasets:
# Format: question, answer, chunk_ids, page_nums, source_paths
vecta.load_benchmark("your_benchmark.csv")
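For reference, a file in that column layout might look like the rows below. The values are illustrative, and the exact delimiter used inside list-valued columns (chunk_ids, page_nums, source_paths) may differ; check the CSV Upload docs:
question,answer,chunk_ids,page_nums,source_paths
What is the API rate limit?,1000 requests per minute,"chunk_1,chunk_2","15,16",api_docs.pdf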
3. Hugging Face Datasets
Use popular evaluation datasets:
from vecta.core.dataset_importer import BenchmarkDatasetImporter
importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=50)
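A quick way to inspect what the importer produced; the field layout of each entry is an assumption based on the schema shown earlier:
print(f"{len(chunks)} chunks, {len(entries)} benchmark entries")
print(entries[0])  # assumed to mirror the entry fields shown above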
Quick Example
from vecta import VectaAPIClient
client = VectaAPIClient(api_key="your-key")
# Connect database
db = client.connect_pinecone(api_key="...", index="docs")
# Generate benchmark
benchmark = client.create_benchmark(
    data_source_id=db["id"],
    questions_count=50,
    wait_for_completion=True
)
# Use in evaluations
results = client.evaluate_retrieval(
    benchmark_id=benchmark["id"],
    retrieval_function=my_retriever
)
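evaluate_retrieval calls your retrieval function once per benchmark question. The sketch below assumes my_retriever receives a query string and returns a list of chunk IDs; confirm the exact signature in the Evaluations docs:
def my_retriever(query: str) -> list[str]:
    # Hypothetical retriever: replace the body with a search against your own index.
    # The placeholder return keeps the sketch self-contained.
    return ["chunk_1", "chunk_2"]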
Benchmark Quality
Good benchmarks have:
- Coverage: Questions across all document types (see the sketch after this list)
- Difficulty range: Mix of easy and hard questions
- Multi-hop: Some questions require multiple chunks
- Ground truth accuracy: All relevant chunks identified
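One way to sanity-check coverage is to export a benchmark and count questions per source document. The sketch assumes the exported CSV uses the column layout shown earlier and that export_benchmark returns the CSV contents as a string:
import csv
import io
from collections import Counter
csv_text = client.export_benchmark("benchmark-id")  # assumed to return CSV text
per_source = Counter()
for row in csv.DictReader(io.StringIO(csv_text)):
    # A source_paths cell may list several documents; the "," separator is an assumption.
    for path in row["source_paths"].split(","):
        per_source[path.strip()] += 1
print(per_source.most_common())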
Managing Benchmarks
# List all benchmarks
benchmarks = client.list_benchmarks()
# Get details
benchmark = client.get_benchmark("benchmark-id")
# Export to CSV
csv_data = client.export_benchmark("benchmark-id")
# Delete
client.delete_benchmark("benchmark-id")
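If you keep benchmarks under version control, the exported CSV can be written straight to disk (again assuming export_benchmark returns the CSV contents as a string):
csv_data = client.export_benchmark("benchmark-id")
with open("benchmark.csv", "w", encoding="utf-8") as f:
    f.write(csv_data)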
Best Practices
- Start small: 25-50 questions for initial testing
- Scale up: 100-500 questions for comprehensive evaluation
- Update regularly: Regenerate when your data changes significantly
- Multiple benchmarks: Create separate benchmarks for different domains (see the sketch below)
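For the last point, a minimal sketch that builds one benchmark per connected data source; the domain names and data source IDs are illustrative:
domain_sources = {"api-docs": "db-id-1", "support-kb": "db-id-2"}  # illustrative IDs
domain_benchmarks = {
    domain: client.create_benchmark(
        data_source_id=source_id,
        questions_count=100,
        wait_for_completion=True
    )
    for domain, source_id in domain_sources.items()
}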
Next Steps
- Synthetic Generation → Auto-generate from your data
- CSV Upload → Import existing datasets
- Hugging Face → Use popular datasets
- Evaluations → Run tests with your benchmark