Evaluations

Generation Only

Score LLM response accuracy and groundedness

Last updated: August 20, 2025
Category: evaluations

Generation-Only Evaluation

A generation-only evaluation measures LLM output quality — accuracy and groundedness — without assessing retrieval. This is useful for testing your LLM's reasoning ability, validating prompt engineering changes, or evaluating on datasets like GPQA Diamond that have no retrieval component.

Function Signature

Your generation function must accept a query string and return the generated answer:

def my_generator(query: str) -> str:
    """Return a generated answer for the query."""
    response = llm.generate(query)
    return response

Using the API Client

from openai import OpenAI
from vecta import VectaAPIClient

client = VectaAPIClient()
openai_client = OpenAI()

def my_generator(query: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

results = client.evaluate_generation_only(
    benchmark_id="your-benchmark-id",
    generation_function=my_generator,
    evaluation_name="gpt-4o-baseline",
    metadata={"model": "gpt-4o", "temperature": 0.0},
)

Parameters:

  • benchmark_id (str, required): ID of an active benchmark
  • generation_function (Callable[[str], str], required): Your generation function
  • evaluation_name (str, default "API Generation Evaluation"): Human-readable name
  • experiment_id (str | None, default None): Optional experiment to group under
  • metadata (dict | None, default None): Arbitrary key-value metadata

Returns: GenerationOnlyResults

Using the Local Client

from vecta import VectaClient

vecta = VectaClient(
    data_source_connector=None,  # not needed for generation-only
    openai_api_key="sk-...",     # needed for LLM-as-judge scoring
)
vecta.load_benchmark("my_benchmark.csv")

results = vecta.evaluate_generation_only(
    my_generator,
    evaluation_name="gpt-4o-baseline",
)

Note: When using the local client, you must provide an openai_api_key to power the LLM-as-a-judge scoring. Without it, accuracy and groundedness will default to 0.
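
As the Tips section notes, a generation-only benchmark needs only question and answer fields. A minimal benchmark CSV could be built like this (the exact column names Vecta expects are an assumption here, not confirmed from its schema):

```python
import csv

# Hypothetical minimal benchmark: "question"/"answer" column names are
# an assumption; check your benchmark schema before relying on them.
rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]

with open("my_benchmark.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer"])
    writer.writeheader()
    writer.writerows(rows)
```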

How Scoring Works

For each question, Vecta sends the question, expected answer, and generated answer to an LLM judge that returns structured JSON scores:

  • Accuracy (0–1) — How well does the generated answer match the expected answer semantically?
  • Groundedness (0–1) — Is the answer self-consistent and free from unsupported claims?

Scoring is done in parallel batches of 8 for speed.
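
The batched-parallel pattern can be sketched with the standard library. The judge here is a stub that emits JSON-style scores; Vecta's real judge prompt and LLM call are internal:

```python
import json
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 8  # matches the batch size described above

def judge(item: dict) -> dict:
    """Stub LLM judge returning structured scores.

    A real judge would send question, expected_answer, and
    generated_answer to an LLM and parse its JSON response.
    """
    scored = {
        "accuracy": 1.0 if item["generated_answer"] == item["expected_answer"] else 0.0,
        "groundedness": 1.0,
    }
    return json.loads(json.dumps(scored))  # JSON round-trip, as a real judge would

def score_all(items: list[dict]) -> list[dict]:
    """Score items in parallel batches of BATCH_SIZE, preserving order."""
    results = []
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        for start in range(0, len(items), BATCH_SIZE):
            batch = items[start:start + BATCH_SIZE]
            results.extend(pool.map(judge, batch))
    return results
```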

Reading Results

print(f"Accuracy:     {results.generation_metrics.accuracy:.2%}")
print(f"Groundedness: {results.generation_metrics.groundedness:.2%}")
print(f"Duration:     {results.duration_seconds}s")

# Per-question details
for row in results.detailed_results:
    print(f"Q: {row.question[:60]}...")
    print(f"  Expected:  {row.expected_answer[:60]}...")
    print(f"  Generated: {row.generated_answer[:60]}...")
    print(f"  Accuracy:      {row.accuracy:.2%}")
    print(f"  Groundedness:  {row.groundedness:.2%}")

Result Schema

The GenerationOnlyResults object contains:

  • generation_metrics (GenerationMetrics): Accuracy and groundedness averages
  • detailed_results (List[EvaluationResultRow]): Per-question scores
  • total_questions (int): Number of questions evaluated
  • duration_seconds (int): Wall-clock time
  • evaluation_name (str): Name of this run
  • metadata (dict | None): Attached metadata
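
Given that schema, a common follow-up is sorting detailed_results to surface the weakest questions for manual review. A minimal sketch, using a local Row stand-in for EvaluationResultRow with the field names above:

```python
from dataclasses import dataclass

@dataclass
class Row:
    """Stand-in for EvaluationResultRow (illustrative, not the real class)."""
    question: str
    accuracy: float
    groundedness: float

rows = [
    Row("Q1: easy factual", 0.95, 1.00),
    Row("Q2: multi-hop reasoning", 0.40, 0.70),
    Row("Q3: ambiguous wording", 0.80, 0.90),
]

# Surface the lowest-accuracy questions first
worst = sorted(rows, key=lambda r: r.accuracy)[:2]
```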

Tips

  • Benchmarks don't need chunk_ids — Generation-only evaluation only requires question and answer fields.
  • GPQA Diamond is a great starting point for generation-only evaluation — import it from Hugging Face.
  • Compare models — Run the same benchmark against different LLMs and attach metadata={"model": "..."} to each run for easy comparison in Experiments.
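
The compare-models tip can be sketched as a loop that builds one generation function per model via a closure (make_generator is illustrative; the body is a stub to swap for a real LLM call):

```python
def make_generator(model: str):
    """Return a generation function bound to a specific model name."""
    def generator(query: str) -> str:
        # Stub answer; replace with a real call, e.g. the
        # openai_client.chat.completions.create(...) pattern shown above.
        return f"[{model}] answer to: {query}"
    return generator

for model in ["gpt-4o", "gpt-4o-mini"]:
    generation_function = make_generator(model)
    # One evaluation run per model, tagged for comparison in Experiments:
    # client.evaluate_generation_only(
    #     benchmark_id="your-benchmark-id",
    #     generation_function=generation_function,
    #     evaluation_name=f"{model}-baseline",
    #     metadata={"model": model},
    # )
```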
