
Hugging Face

Use GPQA Diamond, MS MARCO, and other datasets

Last updated: August 19, 2025
Category: benchmarks

Hugging Face Datasets

Import popular evaluation datasets from Hugging Face. Perfect for standardized benchmarking and research comparison.

Figure: Pull gold-standard datasets directly from the Hugging Face Hub without leaving Vecta.

Supported Datasets

GPQA Diamond

Graduate-level science questions that test reasoning and factual accuracy.

Figure: GPQA Diamond challenges your models with graduate-level science questions.

from vecta.core.dataset_importer import BenchmarkDatasetImporter

importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(
    split="train",
    max_items=50
)

Use for: Generation-only evaluation
Domains: Physics, Chemistry, Biology
Difficulty: Graduate-level
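
Before wiring the entries into an evaluation, it can help to spot-check what the importer returned. A minimal sketch, assuming the BenchmarkEntry fields shown under Dataset Details below:

from vecta.core.dataset_importer import BenchmarkDatasetImporter

importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(split="train", max_items=5)

# GPQA Diamond is generation-only, so chunks comes back empty
print(f"{len(entries)} entries, {len(chunks)} chunks")

# Inspect a few questions and their science domains
for entry in entries[:3]:
    print(entry.source_paths, entry.question[:80])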

MS MARCO

Real search queries that test retrieval and generation over web passages.

Figure: MS MARCO measures real-world retrieval and question answering performance.

chunks, entries = importer.import_msmarco(
    split="test",
    max_items=100
)

Use for: Retrieval + generation evaluation
Domains: General web content
Difficulty: Mixed

Quick Example

from vecta import VectaClient
from vecta.core.dataset_importer import BenchmarkDatasetImporter

# Import dataset
importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=25)

# Load into Vecta
vecta = VectaClient(openai_api_key="your-key")
vecta.benchmark_entries = entries

# Evaluate
def my_generator(query: str) -> str:
    # llm is your own model client; see the sketch below for a concrete example
    return llm.generate(query)

results = vecta.evaluate_generation_only(my_generator)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")

Via Web Dashboard

Import directly from the UI:

  1. Go to Platform → Benchmarks → Create
  2. Click Import from Hugging Face
  3. Choose dataset (GPQA Diamond or MS MARCO)
  4. Set number of questions
  5. Click Import

Dataset Details

GPQA Diamond

# Returns
chunks = []  # No chunks (generation-only)
entries = [
    BenchmarkEntry(
        question="What is the role of cytochrome P450...",
        answer="Cytochrome P450 enzymes...",
        chunk_ids=None,  # No retrieval component
        source_paths=["Chemistry"]
    )
]

When to use:

  • Testing pure LLM reasoning
  • Comparing model knowledge (see the sketch below)
  • Scientific domain evaluation
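
For model comparison, run the same entries through one generator per model and compare the resulting accuracy. A minimal sketch, again assuming the OpenAI Python SDK for the generators; the model names are only examples:

from vecta import VectaClient
from vecta.core.dataset_importer import BenchmarkDatasetImporter
from openai import OpenAI

client = OpenAI()

def make_generator(model_name: str):
    # Build a str -> str generator bound to one model
    def generate(query: str) -> str:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": query}],
        )
        return response.choices[0].message.content or ""
    return generate

importer = BenchmarkDatasetImporter()
_, entries = importer.import_gpqa_diamond(max_items=50)

vecta = VectaClient(openai_api_key="your-key")
vecta.benchmark_entries = entries

# Same benchmark entries, different models
for model_name in ["gpt-4o-mini", "gpt-4o"]:  # example model names
    results = vecta.evaluate_generation_only(make_generator(model_name))
    print(f"{model_name}: accuracy={results.generation_metrics.accuracy:.3f}")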

MS MARCO

# Returns
chunks = [...]  # Web passages
entries = [
    BenchmarkEntry(
        question="how to reset windows password",
        answer="You can reset...",
        chunk_ids=["passage_1", "passage_2"],
        page_nums=None,  # No page concept
        source_paths=["https://example.com/help"]
    )
]
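
Unlike GPQA Diamond, each MS MARCO entry references the returned chunks by id, so the passages must be available to your retrieval stack before you evaluate. A minimal sketch of the wiring, assuming each chunk exposes id and text fields (those names are an assumption; adapt them to your schema and to your vector database's ingest API):

chunks, entries = importer.import_msmarco(split="test", max_items=100)

# Map passage ids to text so chunk_ids in the entries can be resolved
# (chunk.id and chunk.text are assumed field names)
passage_index = {chunk.id: chunk.text for chunk in chunks}

for entry in entries[:3]:
    gold = [passage_index[cid] for cid in entry.chunk_ids]
    print(entry.question, "->", len(gold), "gold passages")

# Ingest the passages into your vector database of choice before
# running a retrieval + generation evaluation.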

When to use:

  • Testing retrieval accuracy
  • Real-world query handling
  • Web content evaluation

Save for Reuse

# Import once
chunks, entries = importer.import_gpqa_diamond(max_items=50)

# Save to CSV
vecta.benchmark_entries = entries
vecta.save_benchmark("gpqa_50.csv")

# Load later
vecta.load_benchmark("gpqa_50.csv")
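
In a later session you can skip the Hugging Face download entirely and evaluate straight from the saved CSV. A minimal sketch, reusing the calls shown above (my_generator is the same function from the Quick Example):

from vecta import VectaClient

vecta = VectaClient(openai_api_key="your-key")
vecta.load_benchmark("gpqa_50.csv")

# Evaluate against the reloaded benchmark entries
results = vecta.evaluate_generation_only(my_generator)
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")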

Limitations

GPQA Diamond:

  • Generation-only (no retrieval testing)
  • Requires strong models (difficult questions)

MS MARCO:

  • Requires ingesting passages first
  • Web content may differ from your domain
  • No page numbers (web passages)

Best Practices

GPQA Diamond:

  • Use for baseline model comparison
  • Test with your best LLM first
  • Expect lower accuracy than domain-specific benchmarks

MS MARCO:

  • Ingest passages into your vector database first
  • Use for retrieval algorithm comparison
  • Adapt evaluation thresholds for web content

