Last updated: August 19, 2025
Retrieval Only Evaluation
Test how well your system finds relevant information. No generation component.
Quick Example
from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# Define retrieval function
def my_retriever(query: str) -> list[str]:
    results = vector_db.search(query, k=10)
    return [result.id for result in results]

# Run evaluation
results = client.evaluate_retrieval(
    benchmark_id="benchmark-id",
    retrieval_function=my_retriever,
    evaluation_name="Embeddings v2"
)

# View results
print(f"Chunk F1: {results.chunk_level.f1_score:.3f}")
print(f"Page F1: {results.page_level.f1_score:.3f}")
print(f"Doc F1: {results.document_level.f1_score:.3f}")
Metrics Explained
Precision
Of the chunks you retrieved, how many were actually relevant?
# Retrieved 10 chunks, 7 were relevant
precision = 7 / 10  # 0.70
Recall
Of all relevant chunks, how many did you retrieve?
# 10 relevant chunks exist, retrieved 7 of them
recall = 7 / 10  # 0.70
F1 Score
Harmonic mean of precision and recall. Single metric for overall quality.
f1 = 2 * (precision * recall) / (precision + recall)
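To make the formulas concrete, here is a small sketch that computes all three metrics from a set of retrieved chunk IDs and a set of ground-truth relevant IDs (both sets are hypothetical):

def precision_recall_f1(retrieved: set[str], relevant: set[str]) -> tuple[float, float, float]:
    # True positives: retrieved chunks that are actually relevant
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 10 chunks retrieved, 7 of them among the 10 relevant chunks
retrieved = {f"chunk-{i}" for i in range(10)}
relevant = {f"chunk-{i}" for i in range(3, 13)}
p, r, f1 = precision_recall_f1(retrieved, relevant)
print(f"P: {p:.2f}  R: {r:.2f}  F1: {f1:.2f}")  # P: 0.70  R: 0.70  F1: 0.70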
Three Evaluation Levels
results = client.evaluate_retrieval(...)

# Chunk level - Did you find the right paragraphs?
print(f"Chunk - P: {results.chunk_level.precision:.3f}")
print(f"Chunk - R: {results.chunk_level.recall:.3f}")
print(f"Chunk - F1: {results.chunk_level.f1_score:.3f}")

# Page level - Did you find the right pages?
if results.page_level:
    print(f"Page F1: {results.page_level.f1_score:.3f}")

# Document level - Did you find the right documents?
print(f"Doc F1: {results.document_level.f1_score:.3f}")
Local SDK
from vecta import VectaClient

vecta = VectaClient(vector_db_connector=connector)
vecta.load_benchmark("benchmark.csv")

def my_retriever(query: str) -> list[str]:
    chunks = connector.semantic_search(query, k=10)
    return [c.id for c in chunks]

results = vecta.evaluate_retrieval(my_retriever)
Common Use Cases
Compare Embedding Models
# Test different embeddings
models = ["text-embedding-3-small", "text-embedding-3-large"]

for model in models:
    def retriever(query: str) -> list[str]:
        embedding = embed(query, model=model)
        results = vector_db.search(embedding, k=10)
        return [r.id for r in results]

    results = client.evaluate_retrieval(
        benchmark_id=benchmark_id,
        retrieval_function=retriever,
        evaluation_name=f"Eval with {model}"
    )
    print(f"{model}: F1 = {results.chunk_level.f1_score:.3f}")
Tune Top-K
# Find the optimal k value
for k in [5, 10, 15, 20]:
    def retriever(query: str) -> list[str]:
        results = vector_db.search(query, k=k)
        return [r.id for r in results]

    results = client.evaluate_retrieval(
        benchmark_id=benchmark_id,
        retrieval_function=retriever,
        evaluation_name=f"k={k}"
    )
    print(f"k={k}: F1 = {results.chunk_level.f1_score:.3f}")
Test Hybrid Search
def hybrid_retriever(query: str) -> list[str]:
    # Vector search
    vector_results = vector_db.search(query, k=20)
    # Keyword search
    keyword_results = bm25_search(query, k=20)
    # Rerank the combined results
    combined = rerank(vector_results + keyword_results)
    return [c.id for c in combined[:10]]

results = client.evaluate_retrieval(
    benchmark_id=benchmark_id,
    retrieval_function=hybrid_retriever,
    evaluation_name="Hybrid Search"
)
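bm25_search and rerank above are placeholders for your own keyword search and reranking logic. If you do not have a dedicated reranker, reciprocal rank fusion (RRF) is a simple way to merge the two ranked lists; a minimal sketch, assuming each result exposes an id attribute:

def rrf_combine(vector_results, keyword_results, k: int = 60) -> list[str]:
    # Reciprocal rank fusion: score each ID by 1 / (k + rank) in every list it appears in
    scores: dict[str, float] = {}
    for ranked_list in (vector_results, keyword_results):
        for rank, result in enumerate(ranked_list, start=1):
            scores[result.id] = scores.get(result.id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

def rrf_hybrid_retriever(query: str) -> list[str]:
    vector_results = vector_db.search(query, k=20)
    keyword_results = bm25_search(query, k=20)
    return rrf_combine(vector_results, keyword_results)[:10]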
Interpreting Results
High Precision, Low Recall:
- System is conservative
- Returns highly relevant results but misses some relevant chunks
- Fix: increase k or lower the similarity threshold
Low Precision, High Recall:
- System is aggressive
- Returns many results, including irrelevant ones
- Fix: decrease k, raise the similarity threshold, or add reranking (see the sketch below)
Balanced F1:
- Good overall performance
- Precision and recall are in balance
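If precision is the problem, one common fix is to over-fetch and then keep only chunks above a minimum similarity score. A minimal sketch, assuming your vector store returns a score alongside each result (the score attribute and threshold value are illustrative):

def thresholded_retriever(query: str, min_score: float = 0.75) -> list[str]:
    # Over-fetch, then keep only chunks above the similarity threshold
    results = vector_db.search(query, k=20)
    return [r.id for r in results if r.score >= min_score][:10]

Re-run the evaluation after each adjustment so changes in precision and recall are measured rather than guessed.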
Next Steps
- Generation Only → Test LLM quality
- Retrieval + Generation → Test the full pipeline