Test LLM response quality
Last updated: August 19, 2025
Category: evaluations
Generation Only Evaluation
Test LLM response quality without retrieval. Perfect for prompt engineering and model comparison.
Quick Example
from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# Define generation function
def my_generator(query: str) -> str:
    return llm.chat(query)

# Run evaluation
results = client.evaluate_generation_only(
    benchmark_id="benchmark-id",
    generation_function=my_generator,
    evaluation_name="GPT-4o Mini"
)

# View results
print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Groundedness: {results.generation_metrics.groundedness:.3f}")
Metrics Explained
Accuracy
Does the generated answer match the expected answer semantically?
# Question: "What is the API rate limit?"
# Expected: "1000 requests per minute"
# Generated: "The limit is 1000 req/min"
# Accuracy: 0.95 (high semantic similarity)
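Conceptually, this kind of semantic match can be approximated by comparing answer embeddings. The sketch below only illustrates the idea (the embed helper is a hypothetical stand-in for your embedding model); it is not Vecta's actual scoring implementation.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# score = cosine_similarity(embed("1000 requests per minute"),
#                           embed("The limit is 1000 req/min"))  # high, e.g. ~0.95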
Groundedness
Is the answer internally consistent and truthful?
# Question: "What is the capital of France?"
# Generated: "Paris is the capital of France"
# Groundedness: 1.0 (factually correct, no contradictions)
# Generated: "The capital is Paris, which is in Germany"
# Groundedness: 0.3 (contains contradiction)
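In practice this is often scored by asking a judge LLM to rate the answer for contradictions and unsupported claims. The snippet below is a rough sketch of that idea (the judge client and prompt are placeholders), not the grading prompt Vecta uses.

JUDGE_PROMPT = (
    "Rate the following answer from 0.0 to 1.0 for internal consistency and "
    "factual soundness. Penalize contradictions and fabricated claims.\n\n"
    "Question: {question}\nAnswer: {answer}\n\nReturn only the number."
)

def groundedness_score(question: str, answer: str) -> float:
    # `judge` is a placeholder for whatever LLM client you use for grading.
    reply = judge.chat(JUDGE_PROMPT.format(question=question, answer=answer))
    return float(reply.strip())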
Local SDK
from vecta import VectaClient

vecta = VectaClient(openai_api_key="your-key")
vecta.load_benchmark("benchmark.csv")

def my_generator(query: str) -> str:
    return llm.generate(query)

results = vecta.evaluate_generation_only(my_generator)
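Assuming the local SDK returns the same result object shape as the API client above, the metrics can be inspected the same way:

print(f"Accuracy: {results.generation_metrics.accuracy:.3f}")
print(f"Groundedness: {results.generation_metrics.groundedness:.3f}")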
Common Use Cases
Compare LLM Models
models = ["gpt-4o-mini", "gpt-4", "claude-3-5-sonnet"]

for model in models:
    def generator(query: str) -> str:
        return llm.generate(query, model=model)

    results = client.evaluate_generation_only(
        benchmark_id=benchmark_id,
        generation_function=generator,
        evaluation_name=model
    )

    print(f"{model}:")
    print(f"  Accuracy: {results.generation_metrics.accuracy:.3f}")
    print(f"  Groundedness: {results.generation_metrics.groundedness:.3f}")
Test Prompts
prompts = [
    "Answer concisely:",
    "Provide a detailed explanation:",
    "Answer in one sentence:"
]

for prompt in prompts:
    def generator(query: str) -> str:
        return llm.generate(f"{prompt}\n\n{query}")

    results = client.evaluate_generation_only(
        benchmark_id=benchmark_id,
        generation_function=generator,
        evaluation_name=f"Prompt: {prompt[:20]}"
    )
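To compare the prompts, print or collect the metrics inside the loop, just as in the model comparison above:

    # Inside the loop body, after evaluate_generation_only:
    print(f"{prompt!r}: "
          f"accuracy={results.generation_metrics.accuracy:.3f}, "
          f"groundedness={results.generation_metrics.groundedness:.3f}")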
Validate Fine-tuning
# Before fine-tuning
def base_model(query: str) -> str:
    return llm.generate(query, model="gpt-3.5-turbo")

base_results = client.evaluate_generation_only(
    benchmark_id=benchmark_id,
    generation_function=base_model,
    evaluation_name="Base Model"
)

# After fine-tuning
def finetuned_model(query: str) -> str:
    return llm.generate(query, model="ft:gpt-3.5-turbo:...")

ft_results = client.evaluate_generation_only(
    benchmark_id=benchmark_id,
    generation_function=finetuned_model,
    evaluation_name="Fine-tuned Model"
)

# Compare
print(f"Accuracy improvement: {ft_results.generation_metrics.accuracy - base_results.generation_metrics.accuracy:.3f}")
With Hugging Face Datasets
Perfect for testing on standardized benchmarks:
from vecta.core.dataset_importer import BenchmarkDatasetImporter

# Import GPQA Diamond
importer = BenchmarkDatasetImporter()
chunks, entries = importer.import_gpqa_diamond(max_items=50)

# Load benchmark
vecta = VectaClient(openai_api_key="your-key")
vecta.benchmark_entries = entries

# Test your model
def my_generator(query: str) -> str:
    return llm.generate(query)

results = vecta.evaluate_generation_only(my_generator)
Interpreting Results
High Accuracy:
- Answers match expected answers well
- Good semantic understanding
Low Accuracy:
- Answers don't match the expected answers
- May indicate prompt issues or the wrong model
High Groundedness:
- No hallucinations
- Internally consistent responses
Low Groundedness:
- Contains contradictions or false info
- May need better system prompts or a different model
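These metrics also work as simple regression gates in CI; the thresholds below are illustrative, not recommended values:

# Fail the run if quality drops below agreed thresholds (values are illustrative).
assert results.generation_metrics.accuracy >= 0.80, "Accuracy regression"
assert results.generation_metrics.groundedness >= 0.90, "Groundedness regression"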
Next Steps
- Retrieval Only → Test search quality
- Retrieval + Generation → Test the full pipeline