CI/CD Integration
Run evaluations in GitHub Actions and catch regressions
Last updated: August 20, 2025
Category: advanced
Run Vecta evaluations automatically in your CI/CD pipeline to catch regressions before they reach production. This guide shows how to set up GitHub Actions, but the same approach works with any CI runner.
GitHub Actions Workflow
Create .github/workflows/vecta-eval.yml:
```yaml
name: RAG Evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install vecta
          pip install -r requirements.txt
      - name: Run evaluation
        env:
          VECTA_API_KEY: ${{ secrets.VECTA_API_KEY }}
        run: python scripts/evaluate.py
```
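The evaluation script below reads the branch name from its first command-line argument. If you want that wired up in CI, one option is to pass `github.ref_name` from the built-in `github` context as an argument to the final step (this wiring is a suggestion, not part of the Vecta setup itself):

```yaml
- name: Run evaluation
  env:
    VECTA_API_KEY: ${{ secrets.VECTA_API_KEY }}
  run: python scripts/evaluate.py "${{ github.ref_name }}"
```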
Evaluation Script
Create scripts/evaluate.py:
```python
#!/usr/bin/env python3
"""CI evaluation script that fails on regression."""
import sys

from vecta import VectaAPIClient

# Thresholds — adjust to your requirements
MIN_CHUNK_F1 = 0.70
MIN_ACCURACY = 0.80

client = VectaAPIClient()  # reads VECTA_API_KEY from env
BENCHMARK_ID = "your-benchmark-id"


def my_retriever(query: str) -> list[str]:
    """Your retrieval function."""
    # Replace with your actual retrieval logic
    from your_app import search
    return search(query, k=10)


def main():
    print("Running retrieval evaluation...")
    results = client.evaluate_retrieval(
        benchmark_id=BENCHMARK_ID,
        retrieval_function=my_retriever,
        evaluation_name=f"ci-{sys.argv[1] if len(sys.argv) > 1 else 'main'}",
        metadata={"trigger": "ci", "branch": sys.argv[1] if len(sys.argv) > 1 else "main"},
    )

    chunk_f1 = results.chunk_level.f1_score
    doc_f1 = results.document_level.f1_score
    print(f"Chunk F1: {chunk_f1:.2%}")
    print(f"Document F1: {doc_f1:.2%}")

    # Fail the build if below thresholds
    if chunk_f1 < MIN_CHUNK_F1:
        print(f"FAIL: Chunk F1 ({chunk_f1:.2%}) is below threshold ({MIN_CHUNK_F1:.0%})")
        sys.exit(1)
    print("PASS: All thresholds met")


if __name__ == "__main__":
    main()
```
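If you end up gating on several metrics, the pass/fail logic can be factored into a small helper so every metric is checked and reported uniformly. This is an illustrative sketch, not part of the Vecta SDK; the names and sample values are made up:

```python
def check_thresholds(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one failure message per metric that falls below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is not None and value < minimum:
            failures.append(f"FAIL: {name} ({value:.2%}) is below threshold ({minimum:.0%})")
    return failures


# Illustrative values: chunk F1 clears its bar, accuracy does not
failures = check_thresholds(
    {"chunk_f1": 0.74, "accuracy": 0.78},
    {"chunk_f1": 0.70, "accuracy": 0.80},
)
```

`main()` can then print each message and call `sys.exit(1)` whenever the returned list is non-empty.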
Full RAG Evaluation in CI
For a complete RAG evaluation that checks both retrieval and generation:
```python
def my_rag(query: str) -> tuple[list[str], str]:
    from your_app import rag_pipeline
    return rag_pipeline(query)


results = client.evaluate_retrieval_and_generation(
    benchmark_id=BENCHMARK_ID,
    retrieval_generation_function=my_rag,
    evaluation_name="ci-rag",
)

chunk_f1 = results.chunk_level.f1_score
accuracy = results.generation_metrics.accuracy
groundedness = results.generation_metrics.groundedness

print(f"Chunk F1: {chunk_f1:.2%}")
print(f"Accuracy: {accuracy:.2%}")
print(f"Groundedness: {groundedness:.2%}")

if accuracy < MIN_ACCURACY:
    print(f"FAIL: Accuracy ({accuracy:.2%}) is below threshold ({MIN_ACCURACY:.0%})")
    sys.exit(1)
```
Setting Up Secrets
Add your Vecta API key to GitHub Secrets:
- Go to your repository → Settings → Secrets and variables → Actions
- Click New repository secret
- Name: VECTA_API_KEY
- Value: your API key from the Settings page
Tips
- Pin your benchmark — Use a fixed benchmark_id so CI results are comparable over time. Don't regenerate the benchmark on every run.
- Use metadata — Attach the branch name, commit SHA, or PR number as metadata so you can trace results back to specific changes.
- Set reasonable thresholds — Start with loose thresholds and tighten them as your pipeline matures.
- Group into experiments — Create a CI-specific experiment to track all automated runs:
```python
# Run once to create
experiment = client.create_experiment(name="CI Evaluations")

# Use in every CI run
results = client.evaluate_retrieval(
    benchmark_id=BENCHMARK_ID,
    retrieval_function=my_retriever,
    experiment_id=experiment["id"],
    # ...
)
```
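For the metadata tip above, GitHub Actions automatically sets GITHUB_SHA and GITHUB_REF_NAME in the job environment, so a metadata dict can be built from them directly. A minimal sketch (the "local" fallbacks are only for running the script outside CI):

```python
import os

# GITHUB_SHA and GITHUB_REF_NAME are set automatically inside GitHub Actions
ci_metadata = {
    "trigger": "ci",
    "commit": os.environ.get("GITHUB_SHA", "local"),
    "branch": os.environ.get("GITHUB_REF_NAME", "local"),
}
```

Pass this as the metadata argument to evaluate_retrieval, as in the evaluation script above.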
Next Steps
- Experiments — Visualize CI runs over time
- Evaluations — Understand all available metrics