File Stores
Ingest PDFs, DOCX, PPTX, and other files directly
File Store Connectors
File store connectors let you ingest raw documents (PDF, DOCX, PPTX, XLSX, TXT, HTML, etc.) directly into Vecta. The files are converted to Markdown using markitdown, split into page-level chunks, and embedded automatically.
Supported Formats
Any format that markitdown supports, including:
- Microsoft Word (.docx)
- PowerPoint (.pptx)
- Excel (.xlsx)
- Plain text (.txt)
- HTML
- And many more
Using the API Client
The simplest way to ingest files is through the API client:
from vecta import VectaAPIClient

# Create the API client.
client = VectaAPIClient()

# Upload local files; the server converts, chunks, and embeds them.
data_source = client.upload_local_files(
    file_paths=[
        "/path/to/report.pdf",
        "/path/to/manual.docx",
        "/path/to/data.xlsx",
    ],
)

# Inspect the resulting data source.
print(f"ID: {data_source['id']}")
print(f"Chunks: {data_source['chunks_count']}")
print(f"Status: {data_source['status']}")
The server handles conversion, chunking, embedding, and storage. The resulting data source appears in your Data Sources dashboard.
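When uploading a whole folder, it can help to build the `file_paths` list programmatically first. The following is a minimal sketch, not part of Vecta: the helper name `collect_files` and the `CANDIDATE_EXTENSIONS` set are hypothetical, and the extension set is only an illustrative subset of the formats named on this page.

```python
from pathlib import Path
from typing import Iterable, List

# Illustrative subset of the formats named on this page; not an official list.
CANDIDATE_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".txt", ".html"}


def collect_files(
    directory: str,
    extensions: Iterable[str] = CANDIDATE_EXTENSIONS,
) -> List[str]:
    """Return sorted absolute paths of files under `directory` (recursively)
    whose extension matches one of `extensions`, ignoring case."""
    allowed = {ext.lower() for ext in extensions}
    root = Path(directory)
    return sorted(
        str(path.resolve())
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in allowed
    )
```

The returned list could then be passed as the `file_paths` argument of `client.upload_local_files(...)` shown above.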
Using the Local SDK
For local-only workflows, use FileStoreConnector:
from vecta import VectaClient, FileStoreConnector

# File paths are relative to base_path.
connector = FileStoreConnector(
    file_paths=["document1.pdf", "document2.docx"],
    base_path="/path/to/files",
)

vecta = VectaClient(
    data_source_connector=connector,
    openai_api_key="sk-...",
)

# Load the files as chunks.
chunks = vecta.load_knowledge_base()
print(f"Loaded {len(chunks)} chunks")
Note: FileStoreConnector does not require a VectorDBSchema. Vecta controls the chunking format internally.
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_paths | List[str] | required | File paths relative to base_path |
| base_path | str | "/mnt/file_stores" | Base directory where files are located |
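Because `file_paths` are relative to `base_path`, a quick pre-flight check can catch typos before constructing the connector. This sketch assumes the connector resolves each entry by joining it onto `base_path`; that joining behavior and the helper names `resolve_paths` and `find_missing` are assumptions for illustration, not documented Vecta behavior.

```python
import os
from typing import List

DEFAULT_BASE_PATH = "/mnt/file_stores"  # default value from the table above


def resolve_paths(file_paths: List[str], base_path: str = DEFAULT_BASE_PATH) -> List[str]:
    """Join each relative path onto base_path (assumed resolution rule)."""
    return [os.path.normpath(os.path.join(base_path, p)) for p in file_paths]


def find_missing(file_paths: List[str], base_path: str = DEFAULT_BASE_PATH) -> List[str]:
    """Return the relative paths that do not point to an existing file."""
    resolved = resolve_paths(file_paths, base_path)
    return [rel for rel, full in zip(file_paths, resolved) if not os.path.isfile(full)]
```

An empty result from `find_missing(["document1.pdf", "document2.docx"], "/path/to/files")` would suggest that every listed file exists under the base directory.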
How Chunking Works
- Each file is converted to Markdown via markitdown.
- The Markdown is split into pages using form-feed characters (\f) for PDFs, or top-level headings (# ...) as a fallback.
- Each page becomes one ChunkData object with:
  - A unique id derived from the filename and page index
  - content: the page's Markdown text
  - source_path: the filename (e.g., "report.pdf")
  - page_nums: [page_index]
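The page-splitting steps above can be sketched in plain Python. This is an illustration under stated assumptions, not Vecta's implementation: chunks are plain dictionaries rather than real ChunkData objects, the id format `filename:index` is hypothetical, page indices are assumed to start at 0, and empty pages are dropped before indexing as a simplification.

```python
import re
from typing import Dict, List


def split_pages(markdown: str) -> List[str]:
    """Split on form feeds when present, otherwise before each top-level heading."""
    if "\f" in markdown:
        pages = markdown.split("\f")
    else:
        # Zero-width split before lines that start with "# " (not "## ").
        pages = re.split(r"(?m)^(?=# )", markdown)
    # Drop empty pages and trim surrounding whitespace.
    return [page.strip() for page in pages if page.strip()]


def make_chunks(markdown: str, filename: str) -> List[Dict]:
    """Build one chunk dictionary per page, mirroring the fields listed above."""
    return [
        {
            "id": f"{filename}:{index}",  # hypothetical id format
            "content": page,
            "source_path": filename,
            "page_nums": [index],  # assumed 0-based page index
        }
        for index, page in enumerate(split_pages(markdown))
    ]
```

For example, `make_chunks("page one\fpage two", "report.pdf")` yields two chunks whose `content` values are `"page one"` and `"page two"`.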
Using the Platform UI
- Navigate to Data Sources
- Click Upload Files
- Drag and drop your files
- Vecta will ingest, chunk, and embed them
Once connected, you can browse individual chunks, sync the data source, and use it as the basis for benchmark generation.
Next Steps
- Vector DB Connectors — Connect existing vector databases
- Benchmarks — Create evaluation datasets from your data