Data Sources
Overview
Understanding data sources in Vecta
Last updated: August 19, 2025
Category: data-sources
Data sources are how Vecta connects to your data. They provide the chunks used for benchmark generation and evaluation.
Types of Data Sources
File Stores (Recommended for Getting Started)
Upload documents directly; Vecta processes and chunks them for you.
Supported:
- Local files
- PDFs, DOCX, TXT, and more
Use when:
- Getting started with Vecta
- You want Vecta to handle document processing
- Testing with new document sets
Vector Databases (For Production Systems)
Connect to an existing vector database that already holds your embedded content.
Supported:
- ChromaDB (Local & Cloud)
- Pinecone
- PostgreSQL + pgvector
- Azure Cosmos DB
- Databricks Vector Search
- Weaviate
- LangChain VectorStores
- LlamaIndex
Use when:
- You already have embedded documents in a vector database
- You need production RAG evaluation
- You want custom chunking strategies
How Data Sources Work
# `client` is a VectaAPIClient instance (see "Cloud API" below).
# `my_retriever` is your own retrieval function, defined elsewhere.

# 1. Upload your documents (simplest approach)
data_source = client.upload_local_files([
    "docs/manual.pdf",
    "guide.docx",
])

# Or connect to an existing vector database
db = client.connect_pinecone(
    api_key="...",
    index_name="docs",
)

# 2. Vecta samples chunks from your data
# 3. Generates questions based on those chunks
# 4. Builds comprehensive benchmarks
benchmark = client.create_benchmark(
    data_source_id=data_source["id"],  # or db["id"]
    questions_count=100,
)

# 5. Uses the same data source for evaluation
results = client.evaluate_retrieval(
    benchmark_id=benchmark["id"],
    retrieval_function=my_retriever,
)
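The walkthrough above passes `my_retriever` to `evaluate_retrieval` without defining it. The following is a minimal sketch of what such a function might look like, assuming it takes a query string and returns a ranked list of chunk ids. That signature is an assumption, not something this page states, and the in-memory `CHUNKS` corpus and keyword-overlap scoring are stand-ins for a real vector search.

```python
import re

# Hypothetical in-memory corpus standing in for a real vector database.
CHUNKS = {
    "chunk-001": "Press the reset button for five seconds to restore factory settings.",
    "chunk-002": "The warranty covers manufacturing defects for two years.",
    "chunk-003": "Use the mobile app to change the thermostat schedule.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def my_retriever(query: str, k: int = 2) -> list[str]:
    """Return up to k chunk ids ranked by keyword overlap with the query.

    Assumed contract: query string in, list of chunk ids out.
    """
    query_terms = tokenize(query)
    scored = [
        (len(query_terms & tokenize(text)), chunk_id)
        for chunk_id, text in CHUNKS.items()
    ]
    # Highest overlap first; ties broken by chunk id for stable output.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for score, chunk_id in ranked[:k] if score > 0]
```

Whatever the real contract is, the ids a retriever returns need to match the `id` values in the data source the benchmark was built from, otherwise retrieved chunks cannot be compared against the expected ones.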
Required Metadata
All data sources must provide:
- id: Unique identifier for each chunk
- content: The text content
- source_path: Document name (e.g., "report.pdf")
- page_nums: Page numbers (e.g., [1, 2, 3])
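Before connecting a source, it can help to check that your chunks carry all four fields. The sketch below is a standalone helper, not part of Vecta. The function name is hypothetical, and the type check on `page_nums` (a list of integers) is an assumption based on the `[1, 2, 3]` example.

```python
from typing import Any

# Field names taken from the required-metadata list.
REQUIRED_FIELDS = ("id", "content", "source_path", "page_nums")


def missing_or_invalid_fields(chunk: dict[str, Any]) -> list[str]:
    """Return the required field names that are absent, None, or malformed."""
    problems = []
    for name in REQUIRED_FIELDS:
        if chunk.get(name) is None:
            problems.append(name)
    # Assumption: page_nums is a list of integers, as in the [1, 2, 3] example.
    pages = chunk.get("page_nums")
    if pages is not None and not (
        isinstance(pages, list) and all(isinstance(p, int) for p in pages)
    ):
        problems.append("page_nums")
    return problems


good_chunk = {
    "id": "chunk-001",
    "content": "Quarterly revenue grew in all regions.",
    "source_path": "report.pdf",
    "page_nums": [1, 2, 3],
}
bad_chunk = {"id": "chunk-002", "content": "Text without provenance."}

print(missing_or_invalid_fields(good_chunk))  # no problems reported
print(missing_or_invalid_fields(bad_chunk))   # source_path and page_nums reported
```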
Schema Configuration
Use schemas to map your data structure:
from vecta.core.schemas import VectorDBSchema

schema = VectorDBSchema(
    id_accessor="id",
    content_accessor="text",
    source_path_accessor="metadata.filename",
    page_nums_accessor="metadata.pages",
)
Learn about accessor syntax →
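To make the idea of an accessor concrete, here is a rough illustration of how a dotted path such as `metadata.filename` could be resolved against a nested record. This is a sketch under the assumption that accessors are plain dot-separated key paths. The helper `resolve_accessor` is hypothetical and is not Vecta's implementation; the accessor syntax page is the authority on what is actually supported.

```python
from typing import Any


def resolve_accessor(record: dict[str, Any], accessor: str) -> Any:
    """Follow a dot-separated key path through nested dictionaries."""
    value: Any = record
    for key in accessor.split("."):
        value = value[key]  # raises KeyError if the path does not exist
    return value


# Example record shaped like the schema above: text at the top level,
# filename and pages nested under "metadata".
record = {
    "id": "chunk-001",
    "text": "Quarterly revenue grew in all regions.",
    "metadata": {"filename": "report.pdf", "pages": [1, 2]},
}

chunk = {
    "id": resolve_accessor(record, "id"),
    "content": resolve_accessor(record, "text"),
    "source_path": resolve_accessor(record, "metadata.filename"),
    "page_nums": resolve_accessor(record, "metadata.pages"),
}
```

The resulting `chunk` has exactly the four required fields, which is the job a schema does: translating your storage layout into the shape Vecta expects.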
Cloud vs Local
Cloud API
from vecta import VectaAPIClient
client = VectaAPIClient(api_key="your-key")
db = client.connect_pinecone(...)
Pros:
- No infrastructure to manage
- Web dashboard for results
- Team collaboration
Local SDK
from vecta import VectaClient
from vecta.connectors.chroma_local_connector import ChromaLocalConnector
connector = ChromaLocalConnector(...)
vecta = VectaClient(vector_db_connector=connector)
Pros:
- Full control over data
- No API calls for evaluation
- Works offline
Next Steps
- File Stores → Upload files directly (recommended first step)
- Vector Databases → Connect existing databases
- Benchmarks → Generate test datasets