
CI/CD Integration

Run evaluations in GitHub Actions and catch regressions

Last updated: August 20, 2025
Category: advanced

Run Vecta evaluations automatically in your CI/CD pipeline to catch regressions before they reach production. This guide shows how to set up GitHub Actions, but the same approach works with any CI runner.

GitHub Actions Workflow

Create .github/workflows/vecta-eval.yml:

name: RAG Evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install vecta
          pip install -r requirements.txt

      - name: Run evaluation
        env:
          VECTA_API_KEY: ${{ secrets.VECTA_API_KEY }}
        run: python scripts/evaluate.py "${{ github.head_ref || github.ref_name }}"

Evaluation Script

Create scripts/evaluate.py:

#!/usr/bin/env python3
"""CI evaluation script that fails on regression."""

import sys
from vecta import VectaAPIClient

# Thresholds — adjust to your requirements
MIN_CHUNK_F1 = 0.70
MIN_ACCURACY = 0.80

client = VectaAPIClient()  # reads VECTA_API_KEY from env

BENCHMARK_ID = "your-benchmark-id"


def my_retriever(query: str) -> list[str]:
    """Your retrieval function."""
    # Replace with your actual retrieval logic
    from your_app import search
    return search(query, k=10)


def main():
    branch = sys.argv[1] if len(sys.argv) > 1 else "main"
    print("Running retrieval evaluation...")
    results = client.evaluate_retrieval(
        benchmark_id=BENCHMARK_ID,
        retrieval_function=my_retriever,
        evaluation_name=f"ci-{branch}",
        metadata={"trigger": "ci", "branch": branch},
    )

    chunk_f1 = results.chunk_level.f1_score
    doc_f1 = results.document_level.f1_score

    print(f"Chunk F1:    {chunk_f1:.2%}")
    print(f"Document F1: {doc_f1:.2%}")

    # Fail the build if below thresholds
    if chunk_f1 < MIN_CHUNK_F1:
        print(f"FAIL: Chunk F1 ({chunk_f1:.2%}) is below threshold ({MIN_CHUNK_F1:.0%})")
        sys.exit(1)

    print("PASS: All thresholds met")


if __name__ == "__main__":
    main()
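If you want to verify the CI wiring itself before plugging in your real search code, you can temporarily swap a deterministic stub in for my_retriever. The corpus and chunk IDs below are made up for illustration; your benchmark's IDs will differ:

```python
def fake_retriever(query: str) -> list[str]:
    """Deterministic stand-in for my_retriever, useful for smoke-testing
    the CI pipeline. The topics and chunk IDs here are hypothetical."""
    corpus = {
        "refund": ["doc-12#chunk-3", "doc-12#chunk-4"],
        "shipping": ["doc-7#chunk-1"],
    }
    q = query.lower()
    # Return every chunk whose topic keyword appears in the query
    return [
        chunk_id
        for topic, chunk_ids in corpus.items()
        if topic in q
        for chunk_id in chunk_ids
    ]
```

Once the workflow runs green with the stub, replace it with the real retrieval function.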

Full RAG Evaluation in CI

For a complete RAG evaluation that checks both retrieval and generation:

def my_rag(query: str) -> tuple[list[str], str]:
    from your_app import rag_pipeline
    return rag_pipeline(query)

results = client.evaluate_retrieval_and_generation(
    benchmark_id=BENCHMARK_ID,
    retrieval_generation_function=my_rag,
    evaluation_name="ci-rag",
)

chunk_f1 = results.chunk_level.f1_score
accuracy = results.generation_metrics.accuracy
groundedness = results.generation_metrics.groundedness

print(f"Chunk F1:      {chunk_f1:.2%}")
print(f"Accuracy:      {accuracy:.2%}")
print(f"Groundedness:  {groundedness:.2%}")

if accuracy < MIN_ACCURACY:
    print(f"FAIL: Accuracy ({accuracy:.2%}) is below threshold ({MIN_ACCURACY:.0%})")
    sys.exit(1)
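As you gate on more metrics, exiting on the first failing check hides any later failures. One pattern is a small helper (a sketch, not part of the Vecta SDK; the groundedness floor below is illustrative) that collects every miss before failing the build:

```python
def check_thresholds(
    metrics: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Return a readable message for each metric below its threshold."""
    return [
        f"{name} ({metrics[name]:.2%}) is below threshold ({minimum:.0%})"
        for name, minimum in thresholds.items()
        if metrics[name] < minimum
    ]

# Example with literal values: only accuracy misses its floor
failures = check_thresholds(
    metrics={"chunk_f1": 0.72, "accuracy": 0.78, "groundedness": 0.81},
    thresholds={"chunk_f1": 0.70, "accuracy": 0.80, "groundedness": 0.75},
)
```

In the CI script you would print each entry of `failures` and call `sys.exit(1)` if the list is non-empty.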

Setting Up Secrets

Add your Vecta API key to GitHub Secrets:

  1. Go to your repository → Settings → Secrets and variables → Actions
  2. Click New repository secret
  3. Name: VECTA_API_KEY
  4. Value: your API key from the Settings page
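A mis-named or missing secret usually surfaces as an opaque authentication error deep in the run. A quick guard at the top of the script (a sketch; it only checks the environment variable, not the key's validity) fails fast with a clear message instead:

```python
import os


def require_api_key() -> str:
    """Return the Vecta API key, failing fast with a clear message
    if the secret is missing or mis-named."""
    key = os.environ.get("VECTA_API_KEY")
    if not key:
        raise SystemExit(
            "VECTA_API_KEY is not set. Check that the repository secret "
            "exists and that the workflow step exposes it via env."
        )
    return key
```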

Tips

  • Pin your benchmark — Use a fixed benchmark_id so CI results are comparable over time. Don't regenerate the benchmark on every run.
  • Use metadata — Attach the branch name, commit SHA, or PR number as metadata so you can trace results back to specific changes.
  • Set reasonable thresholds — Start with loose thresholds and tighten them as your pipeline matures.
  • Group into experiments — Create a CI-specific experiment to track all automated runs:
# Run once to create
experiment = client.create_experiment(name="CI Evaluations")

# Use in every CI run
results = client.evaluate_retrieval(
    benchmark_id=BENCHMARK_ID,
    retrieval_function=my_retriever,
    experiment_id=experiment["id"],
    # ...
)
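For the metadata tip, GitHub Actions already exposes the commit SHA, branch, and run ID as environment variables on every run, so a traceable metadata dict needs no extra wiring (the fallback values here are arbitrary placeholders for local runs):

```python
import os

# GITHUB_REF_NAME, GITHUB_SHA, and GITHUB_RUN_ID are set automatically
# on every GitHub Actions run; the fallbacks cover local execution
ci_metadata = {
    "trigger": "ci",
    "branch": os.environ.get("GITHUB_REF_NAME", "local"),
    "commit": os.environ.get("GITHUB_SHA", "unknown"),
    "run_id": os.environ.get("GITHUB_RUN_ID", "unknown"),
}
```

Pass this dict as the `metadata` argument to `evaluate_retrieval` so every CI result links back to the exact commit and run that produced it.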
