File Stores

Ingest PDFs, DOCX, PPTX, and other files directly

Last updated: August 20, 2025
Category: data-sources

File Store Connectors

File store connectors let you ingest raw documents (PDF, DOCX, PPTX, XLSX, TXT, HTML, etc.) directly into Vecta. The files are converted to Markdown using markitdown, split into page-level chunks, and embedded automatically.

Supported Formats

File stores accept any format that markitdown supports, including:

  • PDF
  • Microsoft Word (.docx)
  • PowerPoint (.pptx)
  • Excel (.xlsx)
  • Plain text (.txt)
  • HTML
  • And many more

Using the API Client

The simplest way to ingest files is through the API client:

from vecta import VectaAPIClient

client = VectaAPIClient()

# Upload local files; the server converts, chunks, embeds, and stores them.
data_source = client.upload_local_files(
    file_paths=[
        "/path/to/report.pdf",
        "/path/to/manual.docx",
        "/path/to/data.xlsx",
    ],
)

print(f"ID: {data_source['id']}")
print(f"Chunks: {data_source['chunks_count']}")
print(f"Status: {data_source['status']}")

The server handles conversion, chunking, embedding, and storage. The resulting data source appears in your Data Sources dashboard.
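
When a directory holds many documents, it can help to build the file_paths list programmatically. The sketch below is one way to do that with the standard library only. The helper name collect_files and the SUPPORTED_SUFFIXES set are hypothetical and not part of Vecta; the suffixes simply mirror the formats listed above and are not an exhaustive list of what markitdown supports.

```python
from pathlib import Path
from typing import List

# Assumption: suffixes taken from the "Supported Formats" list; not exhaustive.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".txt", ".html"}


def collect_files(directory: str) -> List[str]:
    """Return sorted paths of files under `directory` with a supported suffix.

    Hypothetical helper, not part of the Vecta client.
    """
    root = Path(directory)
    return sorted(
        str(path)
        for path in root.rglob("*")  # walk the directory recursively
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

The returned list could then be passed as the file_paths argument of client.upload_local_files.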

Using the Local SDK

For local-only workflows, use FileStoreConnector:

from vecta import VectaClient, FileStoreConnector

# Entries in file_paths are resolved relative to base_path.
connector = FileStoreConnector(
    file_paths=["document1.pdf", "document2.docx"],
    base_path="/path/to/files",
)

vecta = VectaClient(
    data_source_connector=connector,
    openai_api_key="sk-...",
)

# Load the files as page-level chunks.
chunks = vecta.load_knowledge_base()
print(f"Loaded {len(chunks)} chunks")

Note: FileStoreConnector does not require a VectorDBSchema. Vecta controls the chunking format internally.

Constructor Parameters

Parameter    Type        Default               Description
file_paths   List[str]   required              File paths relative to base_path
base_path    str         "/mnt/file_stores"    Base directory where files are located
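
Because file_paths entries are relative to base_path, a quick local check can catch typos before ingestion. The following is a minimal sketch using pathlib; resolve_paths and find_missing are hypothetical helpers, and the assumption that the connector joins the two values in this way is not confirmed by the documentation.

```python
from pathlib import Path
from typing import List


def resolve_paths(file_paths: List[str], base_path: str = "/mnt/file_stores") -> List[Path]:
    """Join each relative file path onto base_path (assumed behaviour)."""
    base = Path(base_path)
    return [base / relative for relative in file_paths]


def find_missing(file_paths: List[str], base_path: str = "/mnt/file_stores") -> List[Path]:
    """Return the resolved paths that do not point to an existing file."""
    return [p for p in resolve_paths(file_paths, base_path) if not p.is_file()]
```

For example, find_missing(["document1.pdf", "document2.docx"], "/path/to/files") returns the resolved paths that are absent on disk, or an empty list if both files exist.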

How Chunking Works

  1. Each file is converted to Markdown via markitdown.
  2. The Markdown is split into pages using form-feed characters (\f) for PDFs, or top-level headings (# ...) as a fallback.
  3. Each page becomes one ChunkData object with these fields:
    • id: a unique identifier derived from the filename and page index
    • content: the page's Markdown text
    • source_path: the filename (e.g., "report.pdf")
    • page_nums: a single-element list, [page_index]
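
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not Vecta's actual implementation: PageChunk is a stand-in for ChunkData, the id format "filename:index" is made up, page indices are assumed to start at 0, and the markitdown conversion step is skipped by passing Markdown text in directly.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class PageChunk:
    """Illustrative stand-in for ChunkData; fields follow the list above."""
    id: str
    content: str
    source_path: str
    page_nums: List[int]


def split_into_pages(markdown: str) -> List[str]:
    """Split on form feeds if present, otherwise before each top-level heading."""
    if "\f" in markdown:
        pages = markdown.split("\f")
    else:
        # Zero-width split before every line starting with "# " (not "## ").
        pages = re.split(r"^(?=# )", markdown, flags=re.MULTILINE)
    # Drop empty pages and surrounding whitespace.
    return [page.strip() for page in pages if page.strip()]


def chunk_markdown(markdown: str, filename: str) -> List[PageChunk]:
    """Turn each page into one chunk (id format and 0-based index are assumptions)."""
    return [
        PageChunk(
            id=f"{filename}:{index}",
            content=page,
            source_path=filename,
            page_nums=[index],
        )
        for index, page in enumerate(split_into_pages(markdown))
    ]
```

Calling chunk_markdown("page one\fpage two", "report.pdf") yields two chunks, one per form-feed-delimited page, while a document without form feeds is split at its top-level headings instead.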

Using the Platform UI

  1. Navigate to Data Sources
  2. Click Upload Files
  3. Drag and drop your files
  4. Vecta will ingest, chunk, and embed them

Once the files are ingested, you can browse individual chunks, sync the data source, and use it as the basis for benchmark generation.
