Data Sources
Data sources supply the knowledge your AI agents draw on through RAG (Retrieval-Augmented Generation).
Overview
Source types:
- File - Documents, PDFs, text files
- URL - Web pages, APIs
- Database - External databases
- API - REST/GraphQL endpoints
Adding Sources
Via Dashboard
- Navigate to Context
- Click Add Source
- Select source type
- Configure and upload
Via API
bash
# File upload
curl -X POST https://your-domain.com/api/agents/{id}/sources \
-F "type=file" \
-F "name=Product Manual" \
-F "file=@/path/to/manual.pdf"
# URL source
curl -X POST https://your-domain.com/api/agents/{id}/sources \
-H "Content-Type: application/json" \
-d '{
"type": "url",
"name": "Documentation",
"url": "https://docs.example.com"
}'
Source Types
File
Supported formats:
- PDF (.pdf)
- Text (.txt)
- Markdown (.md)
- Word (.docx)
- JSON (.json)
- CSV (.csv)
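Before uploading, a client can check that a file's extension appears in the list above. This is a minimal sketch: the helper `is_supported` is hypothetical and not part of the platform's API, and the server may apply its own validation.

```python
from pathlib import Path

# Extensions copied from the supported-formats list above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".json", ".csv"}


def is_supported(filename: str) -> bool:
    """Return True if the file's extension is in the supported list (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only saves a round trip for obviously unsupported files; it says nothing about whether the file's contents can actually be parsed.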
json
{
"type": "file",
"name": "User Guide",
"file": "<binary>"
}
URL
Crawl web pages:
json
{
"type": "url",
"name": "Help Center",
"url": "https://help.example.com",
"config": {
"depth": 2,
"max_pages": 100
}
}
Database
Connect to databases:
json
{
"type": "database",
"name": "Product Data",
"config": {
"connection_string": "postgresql://...",
"query": "SELECT * FROM products"
}
}
API
Fetch from APIs, optionally on a schedule (the cron expression in this example runs at the top of every hour):
json
{
"type": "api",
"name": "CRM Contacts",
"config": {
"url": "https://api.crm.com/contacts",
"method": "GET",
"headers": {
"Authorization": "Bearer {{api_key}}"
},
"schedule": "0 * * * *"
}
}
Processing Pipeline
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Source    │────▶│   Extract    │────▶│    Chunk     │
│    Input     │     │   Content    │     │     Text     │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Store in   │◀────│   Generate   │◀────│   Clean &    │
│  Vectorize   │     │  Embeddings  │     │  Normalize   │
└──────────────┘     └──────────────┘     └──────────────┘
Source Status
| Status | Description |
|---|---|
| pending | Waiting to process |
| processing | Currently processing |
| ready | Available for search |
| error | Processing failed |
| updating | Refreshing content |
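A client that needs a source to be searchable can poll its status until it reaches `ready` or `error`. The sketch below rests on assumptions: `fetch_status` is a hypothetical callable standing in for a request to the Get Source Details endpoint, whose response shape this document does not specify, and the interval and attempt limits are arbitrary.

```python
import time
from typing import Callable

# Statuses from the table above after which polling can stop.
TERMINAL_STATUSES = {"ready", "error"}


def wait_until_processed(
    fetch_status: Callable[[], str],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
) -> str:
    """Poll until the source reports ready or error, or give up after max_attempts checks."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)  # still pending, processing, or updating
    raise TimeoutError(f"source still not ready after {max_attempts} checks")
```

Passing the status lookup in as a callable keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned values.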
Chunking Strategy
Content is split into searchable chunks:
json
{
"chunking": {
"method": "semantic",
"max_size": 1000,
"overlap": 200
}
}
Methods:
- semantic - Smart paragraph splitting
- fixed - Fixed character count
- sentence - Sentence boundaries
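To make `max_size` and `overlap` concrete, here is a minimal sketch of the `fixed` method. It assumes both values are measured in characters (this document does not state the unit) and that each chunk starts `max_size - overlap` characters after the previous one. The function `chunk_fixed` is illustrative, not the platform's implementation.

```python
def chunk_fixed(text: str, max_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most max_size characters,
    where consecutive chunks share `overlap` characters."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be in [0, max_size)")

    step = max_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Under these assumptions and the default values, a 2,500-character text yields three chunks starting at offsets 0, 800, and 1600.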
Managing Sources
List Sources
bash
curl https://your-domain.com/api/agents/{id}/sources
Get Source Details
bash
curl https://your-domain.com/api/agents/{id}/sources/{sourceId}
Delete Source
bash
curl -X DELETE https://your-domain.com/api/agents/{id}/sources/{sourceId}
Refresh Source
bash
curl -X POST https://your-domain.com/api/agents/{id}/sources/{sourceId}/refresh
Storage
R2 Storage
Files are stored in Cloudflare R2:
- Automatic replication
- No egress fees
- Unlimited storage
Vectorize Index
Embeddings stored in Vectorize:
- Fast similarity search
- Automatic indexing
- Scalable to millions
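Similarity search ranks stored chunk embeddings by how close they are to the query embedding. The sketch below illustrates the idea with cosine similarity over tiny in-memory vectors. It does not call the Vectorize API, the choice of metric is an assumption, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float],
    index: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This brute-force scan compares the query against every stored vector; a vector index exists precisely to avoid that cost at larger scales.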
Integration
With Chat
Sources are automatically searched:
User: "What's the return policy?"
Agent: [Searches sources] → [Finds relevant chunks] → [Generates response]
With Workflows
json
{
"type": "search-sources",
"data": {
"query": "{{input.question}}",
"limit": 5
}
}
Best Practices
1. Organize Sources
Group related content:
- Product documentation
- FAQ and support
- Policies and terms
2. Keep Content Fresh
Schedule regular updates with a cron expression (this example runs daily at midnight):
json
{
"refresh_schedule": "0 0 * * *"
}
3. Optimize Chunk Size
Balance context and precision:
- Larger chunks: More context
- Smaller chunks: Higher precision
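Chunk size also determines how many chunks end up in the index. As a rough sketch, assuming fixed-size chunks measured in characters and a stride of `max_size - overlap`, the chunk count for a document can be estimated as follows. The function name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(doc_length: int, max_size: int, overlap: int) -> int:
    """Estimate how many fixed-size chunks a document of doc_length characters produces."""
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be in [0, max_size)")
    if doc_length <= 0:
        return 0
    if doc_length <= max_size:
        return 1
    stride = max_size - overlap  # new characters covered by each additional chunk
    return 1 + math.ceil((doc_length - max_size) / stride)
```

Under these assumptions, a 10,000-character document yields 13 chunks with `max_size` 1000 and `overlap` 200, and 25 chunks with `max_size` 500 and `overlap` 100. Halving the chunk size roughly doubles the number of embeddings to store and search.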
4. Use Metadata
Add descriptive metadata:
json
{
"metadata": {
"category": "support",
"version": "2.0",
"language": "en"
}
}
5. Monitor Quality
Review search results:
- Check relevance
- Update stale content
- Remove duplicates
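Exact duplicates can be found by hashing normalized text. This is a minimal sketch, assuming that lowercasing and collapsing whitespace is an acceptable definition of "same content"; near-duplicates would need a fuzzier comparison. `find_duplicates` is a hypothetical helper, not a platform feature.

```python
import hashlib


def _fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents: dict[str, str]) -> dict[str, str]:
    """Map each duplicate document name to the first document with the same content."""
    first_seen: dict[str, str] = {}
    duplicates: dict[str, str] = {}
    for name, text in documents.items():
        key = _fingerprint(text)
        if key in first_seen:
            duplicates[name] = first_seen[key]
        else:
            first_seen[key] = name
    return duplicates
```

The returned mapping points each duplicate at the copy that was seen first, which makes it straightforward to decide which sources to delete.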
API Reference
See Sources API for complete endpoint documentation.