Data Sources

Data sources are how Vecta connects to your data. They provide the chunks used for benchmark generation and evaluation.

Types of Data Sources

File Stores (Recommended for Getting Started)

Upload documents directly; Vecta processes and chunks them for you.

Supported:

Local files
PDF, DOCX, TXT, and other formats

Use when:

Getting started with Vecta
You want Vecta to handle document processing
Testing with new document sets

Vector Databases (For Production Systems)

Connect to an existing vector database that already holds your embedded content.

Supported:

ChromaDB (Local & Cloud)
Pinecone
PostgreSQL + pgvector
Azure Cosmos DB
Databricks Vector Search
Weaviate
LangChain VectorStores
LlamaIndex

Use when:

You already have embedded documents in a vector database
You need to evaluate a production RAG system
You want to keep your own chunking strategy

How Data Sources Work

from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# 1. Upload your documents (simplest approach)
data_source = client.upload_local_files([
    "docs/manual.pdf",
    "guide.docx"
])

# Or connect to an existing vector database
db = client.connect_pinecone(
    api_key="...",
    index_name="docs"
)

# 2. Vecta samples chunks from your data
# 3. Generates questions based on those chunks
# 4. Builds comprehensive benchmarks
benchmark = client.create_benchmark(
    data_source_id=data_source["id"],  # or db["id"]
    questions_count=100
)

# 5. Uses the same data source for evaluation
# (my_retriever is your own retrieval function, defined elsewhere)
results = client.evaluate_retrieval(
    benchmark_id=benchmark["id"],
    retrieval_function=my_retriever
)
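
The snippet above passes my_retriever without defining it. As a rough illustration only, here is a toy keyword retriever over an in-memory corpus. The contract Vecta expects from a retrieval function is not shown on this page, so the signature used here (query string in, list of chunk IDs out), the top_k parameter, and the CORPUS data are all assumptions made for the sketch.

```python
from typing import Dict, List

# Toy in-memory corpus keyed by chunk ID (illustrative data only).
CORPUS: Dict[str, str] = {
    "chunk-001": "Install the pump before first use.",
    "chunk-002": "Warranty claims must be filed within 30 days.",
    "chunk-003": "Clean the pump filter every month.",
}


def my_retriever(query: str, top_k: int = 2) -> List[str]:
    """Rank chunks by the number of lowercase words shared with the query.

    Assumed contract (not confirmed by Vecta's docs): query string in,
    list of chunk IDs out.
    """
    query_words = set(query.lower().split())
    scored = []
    for chunk_id, text in CORPUS.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap:
            scored.append((overlap, chunk_id))
    # Highest overlap first; ties broken by chunk ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:top_k]]
```

In a real setup the body would query your own retrieval stack; the point is only that the function wraps whatever system you want evaluated.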

Required Metadata

All data sources must provide:

id: Unique identifier for each chunk
content: The text content
source_path: Document name (e.g., "report.pdf")
page_nums: Page numbers (e.g., [1, 2, 3])
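
To make the four required fields concrete, here is a minimal sketch of what one chunk record could look like, with a small check for missing fields. ChunkRecord and missing_fields are hypothetical helpers written for this example and are not part of Vecta; the field types are inferred from the examples in the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkRecord:
    """Hypothetical container for the four required fields (not a Vecta class)."""
    id: str
    content: str
    source_path: str
    page_nums: List[int]


def missing_fields(record: dict) -> List[str]:
    """Return the required field names that are absent or None in a raw record."""
    required = ("id", "content", "source_path", "page_nums")
    return [name for name in required if record.get(name) is None]


raw = {
    "id": "chunk-001",
    "content": "Install the pump before use.",
    "source_path": "report.pdf",
    "page_nums": [1, 2, 3],
}

# Build the record only if nothing is missing.
problems = missing_fields(raw)
chunk = ChunkRecord(**raw) if not problems else None
```

Running a check like this over a sample of your records before connecting a data source can surface gaps early.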

Schema Configuration

Use a schema to map the fields in your data structure to the required metadata:

from vecta.core.schemas import VectorDBSchema

schema = VectorDBSchema(
    id_accessor="id",
    content_accessor="text",
    source_path_accessor="metadata.filename",
    page_nums_accessor="metadata.pages"
)
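
The accessor strings above read like dot-separated paths into a nested record. The sketch below shows that reading on a sample record. The resolve_accessor helper is hypothetical and only illustrates plain dotted paths; Vecta's actual accessor syntax may support more than this.

```python
from functools import reduce
from typing import Any


def resolve_accessor(record: dict, accessor: str) -> Any:
    """Follow a dot-separated path through nested dicts (assumed semantics)."""
    return reduce(lambda node, key: node[key], accessor.split("."), record)


# Sample record shaped to match the schema's accessor strings.
record = {
    "id": "chunk-001",
    "text": "Install the pump before use.",
    "metadata": {"filename": "report.pdf", "pages": [1, 2, 3]},
}

# Required field name -> accessor string, as in the schema.
accessors = {
    "id": "id",
    "content": "text",
    "source_path": "metadata.filename",
    "page_nums": "metadata.pages",
}

normalized = {name: resolve_accessor(record, path) for name, path in accessors.items()}
```

Here normalized ends up holding the four required fields, whatever the original record called them.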

Learn about accessor syntax →

Cloud vs Local

Cloud API

from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")
db = client.connect_pinecone(...)

Pros:

No infrastructure to manage
Web dashboard for results
Team collaboration

Local SDK

from vecta import VectaClient
from vecta.connectors.chroma_local_connector import ChromaLocalConnector

connector = ChromaLocalConnector(...)
vecta = VectaClient(vector_db_connector=connector)