Data Sources

Overview

Understanding data sources in Vecta

Last updated: August 19, 2025
Category: data-sources

Data sources are how Vecta connects to your data. They provide the chunks used for benchmark generation and evaluation.

Types of Data Sources

File Stores (Recommended for Getting Started)

Upload documents directly; Vecta processes and chunks them for you.

Supported:

  • Local files
  • PDFs, DOCX, TXT, and more

Use when:

  • Getting started with Vecta
  • You want Vecta to handle document processing
  • Testing with new document sets

Vector Databases (For Production Systems)

Connect to an existing vector database that already holds your embedded content.

Supported:

  • ChromaDB (Local & Cloud)
  • Pinecone
  • PostgreSQL + pgvector
  • Azure Cosmos DB
  • Databricks Vector Search
  • Weaviate
  • LangChain VectorStores
  • LlamaIndex

Use when:

  • You already have embedded documents in a vector database
  • You need production RAG evaluation
  • You want to keep your own chunking strategy instead of Vecta's

How Data Sources Work

from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")

# 1. Upload your documents (simplest approach)
data_source = client.upload_local_files([
    "docs/manual.pdf",
    "guide.docx"
])

# Or connect to an existing vector database
db = client.connect_pinecone(
    api_key="...",
    index_name="docs"
)

# 2. Vecta samples chunks from your data
# 3. Generates questions based on those chunks
# 4. Builds comprehensive benchmarks
benchmark = client.create_benchmark(
    data_source_id=data_source["id"],  # or db["id"]
    questions_count=100
)

# 5. Uses the same data source for evaluation
# my_retriever is your own retrieval function, defined elsewhere
results = client.evaluate_retrieval(
    benchmark_id=benchmark["id"],
    retrieval_function=my_retriever
)
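
The example above passes my_retriever without defining it. As a rough illustration only, the sketch below shows one shape such a function could take. It assumes a retriever receives a query string and returns a ranked list of chunk IDs; this page does not state the signature Vecta expects, so check the evaluation docs before relying on it. CHUNKS, top_k, and the keyword-overlap scoring are all hypothetical stand-ins for your real retrieval stack.

```python
from typing import Dict, List

# Hypothetical in-memory corpus: chunk ID -> chunk text.
CHUNKS: Dict[str, str] = {
    "c1": "reset the device by holding the power button",
    "c2": "the warranty covers parts for two years",
    "c3": "power adapters are sold separately",
}


def my_retriever(query: str, top_k: int = 2) -> List[str]:
    """Toy retriever: rank chunks by how many query words they share.

    Assumed contract (not confirmed by this page): query string in,
    ranked list of chunk IDs out.
    """
    query_terms = set(query.lower().split())
    scored = []
    for chunk_id, text in CHUNKS.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, chunk_id))
    # Highest overlap first; ties broken by chunk ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:top_k]]


top_ids = my_retriever("how do I reset the power button")
```

In a real setup the body would query your vector database; what matters for evaluation is that the IDs returned match the id values your data source exposes.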

Required Metadata

Every data source must expose the following fields for each chunk:

  • id: Unique identifier for each chunk
  • content: The text content
  • source_path: Document name (e.g., "report.pdf")
  • page_nums: Page numbers (e.g., [1, 2, 3])

Schema Configuration

Use a schema to map the fields in your stored records onto the required metadata:

from vecta.core.schemas import VectorDBSchema

schema = VectorDBSchema(
    id_accessor="id",
    content_accessor="text",
    source_path_accessor="metadata.filename",
    page_nums_accessor="metadata.pages"
)
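
One way to read the dotted accessors above is as paths into a nested record. The sketch below illustrates that reading with a hypothetical helper, resolve_accessor, applied to a made-up record. It is an assumption about how simple dotted paths behave, not Vecta's implementation, and the real accessor syntax may support more than this.

```python
from typing import Any, Dict


def resolve_accessor(record: Dict[str, Any], accessor: str) -> Any:
    """Walk a nested dict along a dotted path such as 'metadata.filename'."""
    value: Any = record
    for key in accessor.split("."):
        value = value[key]  # raises KeyError if the path does not exist
    return value


# Made-up record shaped like the schema example above.
record = {
    "id": "chunk-001",
    "text": "Install the pump before priming.",
    "metadata": {"filename": "report.pdf", "pages": [1, 2, 3]},
}

# Required field -> accessor, mirroring the VectorDBSchema arguments.
mapping = {
    "id": "id",
    "content": "text",
    "source_path": "metadata.filename",
    "page_nums": "metadata.pages",
}

normalized = {field: resolve_accessor(record, path) for field, path in mapping.items()}
```

Here normalized holds the four required fields under their standard names, which is the result a schema mapping is meant to produce.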

Cloud vs Local

Cloud API

from vecta import VectaAPIClient

client = VectaAPIClient(api_key="your-key")
db = client.connect_pinecone(...)

Pros:

  • No infrastructure to manage
  • Web dashboard for results
  • Team collaboration

Local SDK

from vecta import VectaClient
from vecta.connectors.chroma_local_connector import ChromaLocalConnector

connector = ChromaLocalConnector(...)
vecta = VectaClient(vector_db_connector=connector)

Pros:

  • Full control over data
  • No API calls for evaluation
  • Works offline
