# Generation-Only Evaluation

Score LLM response accuracy and groundedness.
A generation-only evaluation measures LLM output quality — accuracy and groundedness — without assessing retrieval. This is useful for testing your LLM's reasoning ability, validating prompt engineering changes, or evaluating on datasets like GPQA Diamond that have no retrieval component.
## Function Signature

Your generation function must accept a query string and return the generated answer:

```python
def my_generator(query: str) -> str:
    """Return a generated answer for the query."""
    response = llm.generate(query)
    return response
```
## Using the API Client

```python
from openai import OpenAI
from vecta import VectaAPIClient

client = VectaAPIClient()
openai_client = OpenAI()

def my_generator(query: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

results = client.evaluate_generation_only(
    benchmark_id="your-benchmark-id",
    generation_function=my_generator,
    evaluation_name="gpt-4o-baseline",
    metadata={"model": "gpt-4o", "temperature": 0.0},
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| benchmark_id | str | required | ID of an active benchmark |
| generation_function | Callable[[str], str] | required | Your generation function |
| evaluation_name | str | "API Generation Evaluation" | Human-readable name |
| experiment_id | str \| None | None | Optional experiment to group under |
| metadata | dict \| None | None | Arbitrary key-value metadata |

Returns: `GenerationOnlyResults`
## Using the Local Client

```python
from vecta import VectaClient

vecta = VectaClient(
    data_source_connector=None,  # not needed for generation-only
    openai_api_key="sk-...",     # needed for LLM-as-judge scoring
)

vecta.load_benchmark("my_benchmark.csv")

results = vecta.evaluate_generation_only(
    my_generator,
    evaluation_name="gpt-4o-baseline",
)
```
Note: When using the local client, you must provide an `openai_api_key` to power the LLM-as-a-judge scoring. Without it, accuracy and groundedness will default to 0.
## How Scoring Works
For each question, Vecta sends the question, expected answer, and generated answer to an LLM judge that returns structured JSON scores:
- Accuracy (0–1) — How well does the generated answer match the expected answer semantically?
- Groundedness (0–1) — Is the answer self-consistent and free from unsupported claims?
Scoring is done in parallel batches of 8 for speed.
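The judge flow described above can be sketched as follows. This is illustrative only: the prompt wording, the `judge_llm` callable, and the batching helper are assumptions, not Vecta's actual internals.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical judge prompt; Vecta's real prompt wording differs.
JUDGE_PROMPT = (
    "Score the generated answer against the expected answer.\n"
    "Question: {question}\nExpected: {expected}\nGenerated: {generated}\n"
    'Return JSON: {{"accuracy": <0-1>, "groundedness": <0-1>}}'
)

def judge_one(row: dict, judge_llm) -> tuple:
    """Ask the judge LLM for structured JSON scores on a single row."""
    raw = judge_llm(JUDGE_PROMPT.format(
        question=row["question"],
        expected=row["expected_answer"],
        generated=row["generated_answer"],
    ))
    scores = json.loads(raw)
    return scores["accuracy"], scores["groundedness"]

def judge_all(rows, judge_llm, batch_size=8):
    """Score rows in parallel batches of `batch_size` (8, per the docs)."""
    results = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            results.extend(pool.map(lambda r: judge_one(r, judge_llm), batch))
    return results
```

Batching keeps at most eight judge calls in flight at once, which bounds both latency and rate-limit pressure on the judge model.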
## Reading Results

```python
print(f"Accuracy: {results.generation_metrics.accuracy:.2%}")
print(f"Groundedness: {results.generation_metrics.groundedness:.2%}")
print(f"Duration: {results.duration_seconds}s")

# Per-question details
for row in results.detailed_results:
    print(f"Q: {row.question[:60]}...")
    print(f"  Expected: {row.expected_answer[:60]}...")
    print(f"  Generated: {row.generated_answer[:60]}...")
    print(f"  Accuracy: {row.accuracy:.2%}")
    print(f"  Groundedness: {row.groundedness:.2%}")
```
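When the aggregate numbers dip, sorting the per-question rows helps surface the weakest answers for manual review. A small local helper (not part of the Vecta API, just post-processing on `detailed_results`):

```python
def worst_rows(detailed_results, n=5):
    """Return the n per-question rows with the lowest accuracy."""
    return sorted(detailed_results, key=lambda r: r.accuracy)[:n]

# e.g. print the hardest questions from a results object:
# for row in worst_rows(results.detailed_results):
#     print(f"{row.accuracy:.2%}  {row.question[:80]}")
```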
## Result Schema

The `GenerationOnlyResults` object contains:

| Field | Type | Description |
|---|---|---|
| generation_metrics | GenerationMetrics | Accuracy and groundedness averages |
| detailed_results | List[EvaluationResultRow] | Per-question scores |
| total_questions | int | Number of questions evaluated |
| duration_seconds | int | Wall-clock time |
| evaluation_name | str | Name of this run |
| metadata | dict \| None | Attached metadata |
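For offline analysis or tests, the schema above can be mirrored with plain dataclasses. These are stand-ins that follow the field table, not the real classes shipped in the vecta package, which may carry additional fields.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GenerationMetrics:
    accuracy: float      # average accuracy across all questions
    groundedness: float  # average groundedness across all questions

@dataclass
class EvaluationResultRow:
    question: str
    expected_answer: str
    generated_answer: str
    accuracy: float
    groundedness: float

@dataclass
class GenerationOnlyResults:
    generation_metrics: GenerationMetrics
    detailed_results: List[EvaluationResultRow]
    total_questions: int
    duration_seconds: int      # wall-clock time
    evaluation_name: str
    metadata: Optional[dict] = None
```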
## Tips

- **Benchmarks don't need `chunk_ids`** — Generation-only evaluation only requires `question` and `answer` fields.
- **GPQA Diamond** is a great starting point for generation-only evaluation — import it from Hugging Face.
- **Compare models** — Run the same benchmark against different LLMs and attach `metadata={"model": "..."}` to each run for easy comparison in Experiments.
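The compare-models tip can be sketched as a small loop. Here `call_llm` is a placeholder for your actual model call, and `compare_models` is a hypothetical local helper, not part of the Vecta client:

```python
def call_llm(model: str, prompt: str) -> str:
    """Placeholder: swap in your real LLM call (e.g. an OpenAI chat
    completion with the given model)."""
    raise NotImplementedError

def make_generator(model: str):
    """Bind a model name into a (query) -> answer generation function."""
    def generator(query: str) -> str:
        return call_llm(model=model, prompt=query)
    return generator

def compare_models(client, benchmark_id: str, models):
    """Run the same benchmark once per model, tagging each run's metadata."""
    for model in models:
        client.evaluate_generation_only(
            benchmark_id=benchmark_id,
            generation_function=make_generator(model),
            evaluation_name=f"{model}-baseline",
            metadata={"model": model},  # shows up in Experiments for comparison
        )
```

Tagging each run with the model name in `metadata` is what makes the runs line up cleanly side by side later.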
## Next Steps
- Retrieval Only — Evaluate search quality independently
- Retrieval + Generation — Evaluate your full RAG pipeline
- Experiments — Compare generation results across models