Managing Documents

Add documents to your knowledge bases for semantic search and RAG-powered AI responses.

Supported File Types

Format	Extension	Description
PDF	`.pdf`	Full text extraction with layout preservation
Word	`.docx`, `.doc`	Microsoft Word documents
Text	`.txt`	Plain text files
Markdown	`.md`	Markdown with formatting
HTML	`.html`, `.htm`	Web pages and HTML documents
JSON	`.json`	Structured JSON data
CSV	`.csv`	Tabular data
Code	Various	Source code files (JS, TS, PY, etc.)

Adding Documents

Via Dashboard

Navigate to Knowledge Bases > your knowledge base
Click Add Documents
Drag and drop files or click to browse
Wait for processing to complete

Via API

# Upload a file
curl -X POST https://api.flowmaestro.ai/knowledge-bases/{id}/documents \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"

# Add from URL
curl -X POST https://api.flowmaestro.ai/knowledge-bases/{id}/documents \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/page",
    "title": "Page Title"
  }'
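
The same "Add from URL" request can be assembled from a script. The following is a minimal Python sketch using only the standard library. It builds the request without sending it, and the `kb_123` ID and the helper name `build_add_url_request` are placeholders for illustration, not part of the FlowMaestro API.

```python
import json
import urllib.request

API_BASE = "https://api.flowmaestro.ai"  # base URL as used in the curl examples


def build_add_url_request(kb_id: str, api_key: str, url: str, title: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that adds a document from a URL."""
    body = json.dumps({"url": url, "title": title}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/knowledge-bases/{kb_id}/documents",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# "kb_123" is a hypothetical knowledge base ID
request = build_add_url_request("kb_123", "YOUR_API_KEY", "https://example.com/page", "Page Title")
# Sending is left to the caller, e.g. urllib.request.urlopen(request)
```

Keeping construction separate from sending makes the request easy to inspect or log before any network call happens.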

From URLs

Add web pages directly:

Click Add from URL
Enter the URL
Optionally scrape subpages
FlowMaestro fetches and processes the content

URL options:

Scraping mode: html, text, or markdown
Include subpages: Follow links on the page
Max depth: How many levels deep to crawl
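
To illustrate how "Include subpages" and "Max depth" interact, here is a small sketch of a depth-limited, breadth-first crawl over an in-memory link graph. It assumes depth 0 means the starting page only. How FlowMaestro actually counts depth and orders pages is not specified here, so treat `crawl` and `SITE` as hypothetical.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start_url: str, get_links: Callable[[str], Iterable[str]], max_depth: int) -> list[str]:
    """Breadth-first crawl; returns pages in visit order, never deeper than max_depth."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/a/1": [],
}

pages = crawl("https://example.com/", lambda u: SITE.get(u, []), max_depth=1)
```

With `max_depth=1` the crawl covers the start page and the pages it links to directly, but not `https://example.com/a/1`.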

From Connected Apps

Import documents from your connected integrations (Google Drive, Dropbox, Notion, etc.):

Click Import (cloud icon)
Select a connected app from the dropdown
Browse folders or search for files
Select files or choose "Import entire folder"
Configure sync options (optional)
Click Import

Supported integrations:

Provider	Content Type	Features
Google Drive	Files	Browse folders, import files, continuous sync
Dropbox	Files	Browse folders, import files, continuous sync
OneDrive	Files	Browse folders, import files, continuous sync
Box	Files	Browse folders, import files, continuous sync
Notion	Pages	Pages converted to markdown, continuous sync
Confluence	Pages	Pages converted to markdown, continuous sync

Sync options:

Enable sync: Automatically check for updates
Sync interval: How often to check (15 min to 24 hours)
Manual sync: Trigger a sync on demand

Change detection:

During sync, FlowMaestro only processes files that have changed:

Compares modification timestamps
Computes content hashes
Skips unchanged files for efficiency
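
The change-detection steps above can be sketched as follows. This is an illustration under assumptions: the order in which the two checks are combined, the use of SHA-256, and the `SyncRecord` type are all hypothetical rather than FlowMaestro's documented behavior.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncRecord:
    modified_at: float  # modification timestamp recorded at the last sync
    content_hash: str   # SHA-256 hex digest of the content at the last sync


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def needs_processing(previous: Optional[SyncRecord], modified_at: float, data: bytes) -> bool:
    """Return True if the file should be (re)processed during this sync."""
    if previous is None:
        return True  # file has never been synced
    if modified_at == previous.modified_at:
        return False  # cheap check: timestamp unchanged, skip hashing
    # Timestamp moved; only reprocess if the bytes actually differ
    return content_hash(data) != previous.content_hash


record = SyncRecord(modified_at=1705314600.0, content_hash=content_hash(b"hello"))
```

Checking the timestamp first avoids hashing files that were not touched, while the hash comparison catches files that were re-saved without any content change.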

Via API — Integration Import

# List available providers with document capabilities
curl -X GET https://api.flowmaestro.ai/knowledge-bases/{id}/integration/providers \
  -H "Authorization: Bearer YOUR_API_KEY"

# Browse files in a provider
curl -X GET "https://api.flowmaestro.ai/knowledge-bases/{id}/integration/{connectionId}/browse?folderId=root" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Create an integration source (starts import)
curl -X POST https://api.flowmaestro.ai/knowledge-bases/{id}/integration/sources \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "connectionId": "conn_xxx",
    "sourceType": "folder",
    "sourceConfig": {
      "folderId": "folder_id",
      "recursive": true
    },
    "syncEnabled": true,
    "syncIntervalMinutes": 60
  }'

# Trigger manual sync
curl -X POST https://api.flowmaestro.ai/knowledge-bases/{id}/integration/sources/{sourceId}/sync \
  -H "Authorization: Bearer YOUR_API_KEY"
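
When creating integration sources from code, it can help to check the sync interval before sending the request. The sketch below builds the same JSON body as the curl example and enforces the 15-minute to 24-hour range listed under sync options. Whether the API rejects or clamps out-of-range values is not stated here, so the client-side check and the helper name `build_source_payload` are assumptions.

```python
import json

MIN_SYNC_MINUTES = 15        # lower bound listed under sync options
MAX_SYNC_MINUTES = 24 * 60   # 24 hours, expressed in minutes


def build_source_payload(connection_id: str, folder_id: str, *, recursive: bool = True,
                         sync_enabled: bool = True, sync_interval_minutes: int = 60) -> dict:
    """Build the JSON body for creating a folder-based integration source."""
    if sync_enabled and not (MIN_SYNC_MINUTES <= sync_interval_minutes <= MAX_SYNC_MINUTES):
        raise ValueError(
            f"sync_interval_minutes must be between {MIN_SYNC_MINUTES} and {MAX_SYNC_MINUTES}"
        )
    return {
        "connectionId": connection_id,
        "sourceType": "folder",
        "sourceConfig": {"folderId": folder_id, "recursive": recursive},
        "syncEnabled": sync_enabled,
        "syncIntervalMinutes": sync_interval_minutes,
    }


payload = build_source_payload("conn_xxx", "folder_id")
body = json.dumps(payload)  # request body for POST .../integration/sources
```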

Document Processing Pipeline

When you upload a document, it goes through five stages:

1. Text Extraction

Content is extracted based on file type:

File Type	Extraction Method
PDF	Text layer extraction, OCR fallback
DOCX	XML parsing with style preservation
HTML	DOM parsing, content extraction
Markdown	Parsed and normalized
JSON	Converted to readable text
CSV	Tabular structure preserved

2. Cleaning

Text is normalized:

Whitespace normalization
Special character handling
Encoding fixes
Noise removal (headers, footers, page numbers)
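
As a rough illustration of the cleaning steps listed above, the sketch below normalizes Unicode and whitespace and drops lines that look like page numbers. The exact rules FlowMaestro applies are not documented here. The `PAGE_NUMBER` pattern and the `clean_text` function are assumptions chosen for the example.

```python
import re
import unicodedata

# Matches lines such as "12", "Page 3", "Page 3 of 10", or "3 / 10".
# Caveat: a line consisting only of a number is also treated as a page number.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s*(?:of|/)\s*\d+)?\s*$", re.IGNORECASE)


def clean_text(raw: str) -> str:
    # Fold compatibility characters (ligatures, non-breaking spaces) into plain forms
    text = unicodedata.normalize("NFKC", raw)
    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = []
    for line in text.split("\n"):
        if PAGE_NUMBER.match(line):
            continue  # drop page-number noise
        lines.append(re.sub(r"[ \t]+", " ", line).strip())  # collapse runs of spaces and tabs
    text = "\n".join(lines)
    # Collapse three or more consecutive newlines into a single blank line
    return re.sub(r"\n{3,}", "\n\n", text).strip()


cleaned = clean_text("Intro\u00a0 text\r\n\r\n\r\n\r\nPage 3 of 10\r\nThe \ufb01nal   line\t ")
```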

3. Chunking

Documents are split into searchable chunks:

{
  chunkSize: 1000,     // Characters per chunk
  chunkOverlap: 200    // Overlap between chunks
}
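
To make `chunkSize` and `chunkOverlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It shows the general technique, not FlowMaestro's internal implementation, and the function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks of chunk_size characters, each sharing
    chunk_overlap characters with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2600))
chunks = chunk_fixed(sample, chunk_size=1000, chunk_overlap=200)
```

For a 2,600-character input with the default settings, the window starts at offsets 0, 800, and 1600, so the text is covered by three chunks.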

Chunking strategies:

Strategy	Description	Best For
Fixed	Fixed character count	General use
Semantic	Respects sentence/paragraph boundaries	Prose documents
Recursive	Splits by headers, then paragraphs	Structured documents
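
For comparison with fixed-size splitting, the sketch below shows one simple way to respect sentence boundaries: sentences are packed greedily into chunks up to a size limit. It only approximates the idea behind the Semantic strategy. The sentence-splitting regex and the `chunk_by_sentence` function are assumptions, and overlap is left out for brevity.

```python
import re


def chunk_by_sentence(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    # Naive sentence split: break after ., !, or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            current = candidate  # a single oversized sentence is kept whole
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


parts = chunk_by_sentence("One two three. Four five six! Seven eight? Nine.", chunk_size=30)
```

No chunk ends mid-sentence, at the cost of chunk lengths that vary with the text.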

4. Embedding

Each chunk is converted to a vector embedding:

"This is chunk text..." → [0.012, -0.034, 0.056, ...] (1536 dimensions)

FlowMaestro uses text-embedding-3-small by default.
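
Embeddings make semantic search possible because chunks with similar meaning end up with vectors that point in similar directions. The sketch below ranks chunks by cosine similarity using toy 3-dimensional vectors in place of real 1536-dimensional ones. Cosine similarity is a common choice for this, but the metric FlowMaestro uses is not stated here, so treat it as an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> list[str]:
    """Return chunk IDs ordered from most to least similar to the query."""
    return sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )


# Toy vectors; real embeddings would come from the embedding model
toy_chunks = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.5, 0.5, 0.0],
}
ranking = rank_chunks([1.0, 0.0, 0.0], toy_chunks)
```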

5. Storage

Embeddings are stored in a vector database (PostgreSQL with pgvector):

-- Each chunk becomes a row
documents (
  id,
  knowledge_base_id,
  content,
  embedding,
  metadata,
  created_at
)

Processing Status

Documents progress through statuses:

Status	Description
`pending`	Queued for processing
`extracting`	Extracting text content
`chunking`	Splitting into chunks
`embedding`	Generating embeddings
`ready`	Available for queries
`failed`	Processing error

View status in the dashboard or via API:

GET /api/knowledge-bases/{id}/documents/{docId}

{
  "id": "doc_123",
  "status": "ready",
  "filename": "guide.pdf",
  "chunks": 45,
  "created_at": "2024-01-15T10:30:00Z"
}
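
Since `ready` and `failed` are the two terminal statuses, a client can poll until one of them appears. The following sketch assumes a caller-supplied `fetch_status` function that returns the current `status` string, for example by calling the endpoint above. The polling interval, the poll limit, and the helper name `wait_for_document` are hypothetical.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"ready", "failed"}


def wait_for_document(fetch_status: Callable[[], str], poll_seconds: float = 2.0,
                      max_polls: int = 150, sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the document reaches a terminal status, then return that status."""
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"document still processing after {max_polls} polls")


# Simulated status sequence standing in for real API responses
simulated = iter(["pending", "extracting", "chunking", "embedding", "ready"])
final_status = wait_for_document(lambda: next(simulated), sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.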

Chunking Configuration

Adjust chunking for your content:

Small Chunks (500 chars)

{
  chunkSize: 500,
  chunkOverlap: 100
}

Pros: More precise retrieval
Cons: May lose context
Best for: FAQ, glossaries, short answers

Large Chunks (2000 chars)

{
  chunkSize: 2000,
  chunkOverlap: 400
}

Pros: More context per result
Cons: Less precise matching
Best for: Technical docs, long-form content

Overlap

Overlap ensures context isn't lost at chunk boundaries:

Chunk 1: [-------content-------]
Chunk 2:              [-------content-------]
                      ^^^^^^^^^^
                       overlap
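
Overlap also affects how many chunks a document produces, because each new chunk advances by only `chunkSize - chunkOverlap` characters. The sketch below estimates the count for fixed-size chunking. It is a back-of-the-envelope calculation under that assumption, and `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of fixed-size chunks for a text of text_length characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each additional chunk
    return math.ceil((text_length - chunk_overlap) / step)


# A 10,000-character document under the three configurations discussed in this section
small = estimate_chunk_count(10_000, 500, 100)
default = estimate_chunk_count(10_000, 1000, 200)
large = estimate_chunk_count(10_000, 2000, 400)
```

For a 10,000-character document this gives 25 chunks with the small configuration, 13 with the default, and 6 with the large one.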

Metadata

Documents include metadata for filtering:

{
  filename: "guide.pdf",
  title: "Product Guide",
  mimeType: "application/pdf",
  fileSize: 1024000,
  pageCount: 25,
  author: "Jane Smith",
  createdAt: "2024-01-15T10:30:00Z",
  url: "https://example.com/guide",
  tags: ["product", "guide"]
}
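
To show what filtering on metadata means in practice, here is a small client-side sketch that keeps only documents whose metadata matches every requested field. The filter syntax FlowMaestro accepts in queries is not covered here, so the `matches` function and its semantics (exact match for scalars, membership for list fields such as `tags`) are assumptions.

```python
from typing import Any


def matches(metadata: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Return True if every filter key matches the document's metadata."""
    for key, expected in filters.items():
        actual = metadata.get(key)
        if isinstance(actual, list):
            if expected not in actual:  # list fields match by membership
                return False
        elif actual != expected:        # scalar fields match by equality
            return False
    return True


docs = [
    {"filename": "guide.pdf", "mimeType": "application/pdf", "tags": ["product", "guide"]},
    {"filename": "notes.md", "mimeType": "text/markdown", "tags": ["internal"]},
]
pdf_guides = [d["filename"] for d in docs if matches(d, {"mimeType": "application/pdf", "tags": "guide"})]
```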

Add custom metadata via API:

curl -X POST https://api.flowmaestro.ai/knowledge-bases/{id}/documents \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@doc.pdf" \
  -F 'metadata={"department":"engineering","version":"2.0"}'