Last updated: August 19, 2025
Retrieval Only Evaluation
Test how well your system finds relevant information. No generation component.
Quick Example
from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# Define retrieval function
def my_retriever(query: str) -> list[str]:
    results = vector_db.search(query, k=10)
    return [result.id for result in results]

# Run evaluation
results = client.evaluate_retrieval(
    benchmark_id="benchmark-id",
    retrieval_function=my_retriever,
    evaluation_name="Embeddings v2"
)

# View results
print(f"Chunk F1: {results.chunk_level.f1_score:.3f}")
print(f"Page F1: {results.page_level.f1_score:.3f}")
print(f"Doc F1: {results.document_level.f1_score:.3f}")
Metrics Explained
Precision
Of the chunks you retrieved, how many were actually relevant?
# Retrieved 10 chunks, 7 were relevant
precision = 7 / 10  # 0.70
Recall
Of all relevant chunks, how many did you retrieve?
# 10 relevant chunks exist, retrieved 7 of them
recall = 7 / 10  # 0.70
F1 Score
Harmonic mean of precision and recall. Single metric for overall quality.
f1 = 2 * (precision * recall) / (precision + recall)
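To make the formulas concrete, here is a small sketch that computes all three metrics from a set of retrieved chunk IDs and a set of ground-truth relevant IDs (both sets are hypothetical):

def precision_recall_f1(retrieved: set[str], relevant: set[str]) -> tuple[float, float, float]:
    # True positives: retrieved chunks that are actually relevant
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 10 chunks retrieved, 7 of them among the 10 relevant chunks
retrieved = {f"chunk-{i}" for i in range(10)}
relevant = {f"chunk-{i}" for i in range(3, 13)}
p, r, f1 = precision_recall_f1(retrieved, relevant)
print(f"P: {p:.2f}  R: {r:.2f}  F1: {f1:.2f}")  # P: 0.70  R: 0.70  F1: 0.70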
Three Evaluation Levels
results = client.evaluate_retrieval(...)

# Chunk level - Did you find the right paragraphs?
print(f"Chunk - P: {results.chunk_level.precision:.3f}")
print(f"Chunk - R: {results.chunk_level.recall:.3f}")
print(f"Chunk - F1: {results.chunk_level.f1_score:.3f}")

# Page level - Did you find the right pages?
if results.page_level:
    print(f"Page F1: {results.page_level.f1_score:.3f}")

# Document level - Did you find the right documents?
print(f"Doc F1: {results.document_level.f1_score:.3f}")
Local SDK
from vecta import VectaClient

vecta = VectaClient(vector_db_connector=connector)
vecta.load_benchmark("benchmark.csv")

def my_retriever(query: str) -> list[str]:
    chunks = connector.semantic_search(query, k=10)
    return [c.id for c in chunks]

results = vecta.evaluate_retrieval(my_retriever)
Common Use Cases
Compare Embedding Models
# Test different embeddings
models = ["text-embedding-3-small", "text-embedding-3-large"]

for model in models:
    def retriever(query: str) -> list[str]:
        embedding = embed(query, model=model)
        results = vector_db.search(embedding, k=10)
        return [r.id for r in results]

    results = client.evaluate_retrieval(
        benchmark_id=benchmark_id,
        retrieval_function=retriever,
        evaluation_name=f"Eval with {model}"
    )
    print(f"{model}: F1 = {results.chunk_level.f1_score:.3f}")
Tune Top-K
# Find the optimal k value
for k in [5, 10, 15, 20]:
    def retriever(query: str) -> list[str]:
        results = vector_db.search(query, k=k)
        return [r.id for r in results]

    results = client.evaluate_retrieval(
        benchmark_id=benchmark_id,
        retrieval_function=retriever,
        evaluation_name=f"k={k}"
    )
    print(f"k={k}: F1 = {results.chunk_level.f1_score:.3f}")
Test Hybrid Search
def hybrid_retriever(query: str) -> list[str]:
    # Vector search
    vector_results = vector_db.search(query, k=20)
    # Keyword search
    keyword_results = bm25_search(query, k=20)
    # Rerank the combined results
    combined = rerank(vector_results + keyword_results)
    return [c.id for c in combined[:10]]

results = client.evaluate_retrieval(
    benchmark_id=benchmark_id,
    retrieval_function=hybrid_retriever,
    evaluation_name="Hybrid Search"
)
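bm25_search and rerank above are placeholders for your own keyword search and reranking logic. If you do not have a dedicated reranker, reciprocal rank fusion (RRF) is a simple way to merge the two ranked lists; a minimal sketch, assuming each result exposes an id attribute:

def rrf_combine(vector_results, keyword_results, k: int = 60) -> list[str]:
    # Reciprocal rank fusion: score each ID by 1 / (k + rank) in every list it appears in
    scores: dict[str, float] = {}
    for ranked_list in (vector_results, keyword_results):
        for rank, result in enumerate(ranked_list, start=1):
            scores[result.id] = scores.get(result.id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

def rrf_hybrid_retriever(query: str) -> list[str]:
    vector_results = vector_db.search(query, k=20)
    keyword_results = bm25_search(query, k=20)
    return rrf_combine(vector_results, keyword_results)[:10]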
Interpreting Results
High Precision, Low Recall:
- System is conservative
- Returns highly relevant results but misses some relevant chunks
- Fix: increase k or lower the similarity threshold
Low Precision, High Recall:
- System is aggressive
- Returns many results, including irrelevant ones
- Fix: decrease k, raise the similarity threshold, or add reranking (see the sketch below)
Balanced F1:
- Good overall performance
- Precision and recall are in balance
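If precision is the problem, one common fix is to over-fetch and then keep only chunks above a minimum similarity score. A minimal sketch, assuming your vector store returns a score alongside each result (the score attribute and threshold value are illustrative):

def thresholded_retriever(query: str, min_score: float = 0.75) -> list[str]:
    # Over-fetch, then keep only chunks above the similarity threshold
    results = vector_db.search(query, k=20)
    return [r.id for r in results if r.score >= min_score][:10]

Re-run the evaluation after each adjustment so changes in precision and recall are measured rather than guessed.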
Next Steps
- Generation Only → Test LLM quality
- Retrieval + Generation → Test the full pipeline