Hugging Face
Use GPQA Diamond, MS MARCO, and other datasets
Hugging Face Datasets
Import popular evaluation datasets from Hugging Face, ideal for standardized benchmarking and comparison against published results.

Figure: Pull gold-standard datasets directly from the Hugging Face Hub without leaving Vecta.
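If you want to inspect the raw source data before importing, it can also be loaded straight from the Hub with the standard datasets library. The snippet below is an illustrative sketch only: the repository IDs and config names are assumptions about the commonly used Hub mirrors, not part of the Vecta importer, and GPQA is gated, so you may need to log in and accept its terms first.
from datasets import load_dataset

# Illustrative sketch -- repository IDs and configs below are assumptions about
# the standard Hub mirrors, not part of the Vecta API.
gpqa = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")  # gated: Hub login + accepted terms
msmarco = load_dataset("microsoft/ms_marco", "v2.1", split="test")

print(gpqa)     # inspect columns and row counts before importing through Vecta
print(msmarco)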
Supported Datasets
GPQA Diamond
Graduate-level science questions - Tests reasoning and factual accuracy.

Figure: GPQA Diamond challenges your models with graduate-level science questions.
from vecta.core.dataset_importer import BenchmarkDatasetImporter

importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(
    split="train",
    max_items=50
)
Use for: Generation-only evaluation
Domains: Physics, Chemistry, Biology
Difficulty: Graduate-level
MS MARCO
Real search queries - Tests retrieval and generation with web passages.

Figure: MS MARCO measures real-world retrieval and question answering performance.
chunks, entries = importer.import_msmarco(
    split="test",
    max_items=100
)
Use for: Retrieval + generation evaluation
Domains: General web content
Difficulty: Mixed
Quick Example
from vecta import VectaClient
from vecta.core.dataset_importer import BenchmarkDatasetImporter

# Import dataset
importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=25)

# Load into Vecta
vecta = VectaClient(openai_api_key="your-key")
vecta.benchmark_entries = entries

# Evaluate: wrap your own model call in a generator function
def my_generator(query: str) -> str:
    return llm.generate(query)  # `llm` is a placeholder for your LLM client

results = vecta.evaluate_generation_only(my_generator)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
Via Web Dashboard
Import directly from the UI:
- Go to Platform → Benchmarks → Create
- Click Import from Hugging Face
- Choose dataset (GPQA Diamond or MS MARCO)
- Set number of questions
- Click Import
Dataset Details
GPQA Diamond
# Returns
chunks = []  # No chunks (generation-only)
entries = [
    BenchmarkEntry(
        question="What is the role of cytochrome P450...",
        answer="Cytochrome P450 enzymes...",
        chunk_ids=None,  # No retrieval component
        source_paths=["Chemistry"]
    )
]
When to use:
- Testing pure LLM reasoning
- Comparing model knowledge
- Scientific domain evaluation
MS MARCO
# Returns
chunks = [...]  # Web passages
entries = [
    BenchmarkEntry(
        question="how to reset windows password",
        answer="You can reset...",
        chunk_ids=["passage_1", "passage_2"],
        page_nums=None,  # No page concept
        source_paths=["https://example.com/help"]
    )
]
When to use:
- Testing retrieval accuracy
- Real-world query handling
- Web content evaluation
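The importer only returns the passages; exercising retrieval still requires indexing them yourself, and Vecta's dedicated retrieval-evaluation workflow is not shown on this page. As a minimal sketch that stays within the calls documented here, the example below wires a toy word-overlap retriever around the imported chunks and scores the combined pipeline with evaluate_generation_only; `llm` is a placeholder for your model client, and chunk objects are treated as plain text for simplicity.
from vecta import VectaClient
from vecta.core.dataset_importer import BenchmarkDatasetImporter

importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_msmarco(split="test", max_items=100)

# Toy in-memory retriever: rank passages by word overlap with the query.
# In practice, index `chunks` in your vector database and query that instead.
def retrieve(query: str, k: int = 3) -> list[str]:
    terms = set(query.lower().split())
    passages = [str(c) for c in chunks]  # chunk objects treated as plain text here
    return sorted(passages, key=lambda p: -len(terms & set(p.lower().split())))[:k]

def rag_generator(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")  # `llm` is a placeholder

vecta = VectaClient(openai_api_key="your-key")
vecta.benchmark_entries = entries
results = vecta.evaluate_generation_only(rag_generator)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
This scores end-to-end answers only; for scoring retrieval itself (e.g. against chunk_ids), see the Evaluations page linked under Next Steps.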
Save for Reuse
# Import once
chunks, entries = importer.import_gpqa_diamond(max_items=50)
# Save to CSV
vecta.benchmark_entries = entries
vecta.save_benchmark("gpqa_50.csv")
# Load later
vecta.load_benchmark("gpqa_50.csv")
Limitations
GPQA Diamond:
- Generation-only (no retrieval testing)
- Requires strong models (difficult questions)
MS MARCO:
- Requires ingesting passages first
- Web content may differ from your domain
- No page numbers (web passages)
Best Practices
GPQA Diamond:
- Use for baseline model comparison (see the sketch below)
- Test with your best LLM first
- Expect lower accuracy than on domain-specific benchmarks
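A minimal sketch of that baseline comparison, assuming the entries are loaded exactly as in the Quick Example and that strong_client and small_client are placeholders for your own model clients:
from vecta import VectaClient
from vecta.core.dataset_importer import BenchmarkDatasetImporter

importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=50)

vecta = VectaClient(openai_api_key="your-key")
vecta.benchmark_entries = entries

# `strong_client` and `small_client` are placeholders for your own LLM clients.
candidates = {
    "strong-model": lambda q: strong_client.generate(q),
    "small-model": lambda q: small_client.generate(q),
}

for name, generator in candidates.items():
    results = vecta.evaluate_generation_only(generator)
    print(f"{name}: accuracy={results.generation_metrics.accuracy:.3f}")
Running every candidate against the same fixed set of entries keeps the comparison apples-to-apples.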
MS MARCO:
- Ingest passages into your vector database first
- Use for retrieval algorithm comparison
- Adapt evaluation thresholds for web content
Next Steps
- Synthetic Generation → Create domain-specific benchmarks
- Evaluations → Run evaluations with imported datasets