Files

Upload File

This endpoint uploads local files directly to Carbon. The POST request must be a multipart form request. The set_page_as_boundary query parameter currently applies only to PDFs. When it is set, each PDF chunk is at most one page long. With this setting, additional information can also be retrieved for each chunk, namely the coordinates of the bounding box around the chunk (useful for things like text highlighting). The following query parameters are described here; see Query Parameters for the full list:

chunk_size: the chunk size (in tokens) applied when splitting the document
chunk_overlap: the chunk overlap (in tokens) applied when splitting the document
skip_embedding_generation: whether or not to skip the generation of chunks and embeddings
set_page_as_boundary: described above
embedding_model: the model used to generate embeddings for the document chunks
use_ocr: whether or not to use OCR as a preprocessing step prior to generating chunks. Valid for PDFs, JPEGs, and PNGs
generate_sparse_vectors: whether or not to generate sparse vectors for the file. Required for hybrid search.
prepend_filename_to_chunks: whether or not to prepend the filename to the chunk text
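As a rough illustration of how these query parameters end up on the request URL, here is a minimal Python sketch. The helper name build_upload_url and the sample values are hypothetical, and serializing booleans as lowercase "true"/"false" is an assumption rather than something this reference states.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.carbon.ai/uploadfile"  # URL taken from the curl example


def build_upload_url(**params):
    """Return the /uploadfile URL with the given query parameters appended.

    Parameters left as None are omitted so the server-side defaults apply.
    Booleans are written as lowercase "true"/"false" (an assumption about
    the accepted encoding).
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = value
    return f"{BASE_URL}?{urlencode(query)}" if query else BASE_URL


# Hypothetical values, chosen only for illustration.
url = build_upload_url(chunk_size=400, chunk_overlap=20, set_page_as_boundary=True)
print(url)
```

The resulting URL can be passed to any HTTP client together with the multipart body and the authorization header.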

Carbon supports multiple models for generating file embeddings. For images, we support Vertex AI’s multimodal model; for text, we support OpenAI’s text-embedding-ada-002 and Cohere’s embed-multilingual-v3.0. The model is specified via the embedding_model parameter (in the POST body for /embeddings, and as a query parameter for /uploadfile). If no model is supplied, text-embedding-ada-002 is used by default.

When performing embedding queries, only embeddings from files that used the specified model are considered. For example, if files A and B have embeddings generated with OPENAI, and files C and D have embeddings generated with COHERE_MULTILINGUAL_V3, then by default, queries will only consider files A and B. If COHERE_MULTILINGUAL_V3 is specified as the embedding_model in /embeddings, then only files C and D will be considered. Make sure that all files you want considered for a query have embeddings generated with the same model.

For now, do not set VERTEX_MULTIMODAL as an embedding_model. Carbon uses this model automatically when it detects an image file.
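The A/B/C/D example above can be modeled locally in a few lines of Python. This is only a sketch of the matching rule as described, not Carbon's implementation; the file_models mapping and the function name files_considered are hypothetical.

```python
DEFAULT_MODEL = "OPENAI"  # default value of embedding_model per this reference

# Hypothetical bookkeeping: the model each file was embedded with.
file_models = {
    "A": "OPENAI",
    "B": "OPENAI",
    "C": "COHERE_MULTILINGUAL_V3",
    "D": "COHERE_MULTILINGUAL_V3",
}


def files_considered(file_models, embedding_model=None):
    """Return the files whose embeddings match the query's embedding_model."""
    model = embedding_model or DEFAULT_MODEL
    return sorted(name for name, used in file_models.items() if used == model)


print(files_considered(file_models))                            # ['A', 'B']
print(files_considered(file_models, "COHERE_MULTILINGUAL_V3"))  # ['C', 'D']
```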

POST /uploadfile

# <path-to-file> is a placeholder for the local file to upload
curl --request POST \
  --url https://api.carbon.ai/uploadfile \
  --header 'Content-Type: multipart/form-data' \
  --header 'authorization: <api-key>' \
  --form 'file=@<path-to-file>'

{
  "id": 123,
  "source": "GOOGLE_CLOUD_STORAGE",
  "organization_id": 123,
  "organization_user_id": 123,
  "organization_supplied_user_id": "<string>",
  "organization_user_data_source_id": 123,
  "external_file_id": "<string>",
  "external_url": "<string>",
  "sync_status": "DELAYED",
  "sync_error_message": "<string>",
  "last_sync": "2023-11-07T05:31:56Z",
  "tags": {},
  "file_statistics": {
    "file_format": "TXT",
    "file_size": 123,
    "num_characters": 123,
    "num_tokens": 123,
    "num_embeddings": 123,
    "mime_type": "<string>"
  },
  "file_metadata": {},
  "embedding_properties": {},
  "chunk_size": 123,
  "chunk_overlap": 123,
  "chunk_properties": {
    "set_page_as_boundary": false,
    "prepend_filename_to_chunks": false,
    "max_items_per_chunk": 123
  },
  "ocr_properties": {},
  "ocr_job_started_at": "2023-11-07T05:31:56Z",
  "name": "<string>",
  "parent_id": 123,
  "enable_auto_sync": true,
  "presigned_url": "<string>",
  "parsed_text_url": "<string>",
  "additional_presigned_urls": {},
  "skip_embedding_generation": true,
  "source_created_at": "2023-11-07T05:31:56Z",
  "generate_sparse_vectors": true,
  "request_id": "<string>",
  "upload_id": "<string>",
  "sync_properties": {},
  "messages_metadata": {},
  "file_contents_deleted": false,
  "supports_cold_storage": true,
  "hot_storage_time_to_live": 123,
  "embedding_storage_status": "HOT_STORAGE",
  "created_at": "2023-11-07T05:31:56Z",
  "updated_at": "2023-11-07T05:31:56Z"
}
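A client will usually read only a few fields from this response. The sketch below assumes the response has already been received as JSON text and uses a trimmed copy of the example; the function name summarize_upload is hypothetical. It guards against file_statistics being null, which the response schema allows.

```python
import json

# Trimmed copy of the example response; only the fields used below.
raw = """
{
  "id": 123,
  "sync_status": "DELAYED",
  "sync_error_message": null,
  "file_statistics": {"file_format": "TXT", "num_tokens": 123, "num_embeddings": 123},
  "embedding_storage_status": "HOT_STORAGE"
}
"""


def summarize_upload(record):
    """Pick out a few fields, tolerating the nullable ones."""
    stats = record.get("file_statistics") or {}  # object | null in the schema
    return {
        "id": record["id"],                    # required field
        "sync_status": record["sync_status"],  # required field
        "error": record.get("sync_error_message"),
        "num_tokens": stats.get("num_tokens"),
    }


summary = summarize_upload(json.loads(raw))
print(summary)
```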

Authorizations

authorization

string

header

required

Format: token <token>, where <token> is a temporary access token.

Query Parameters

chunk_size

integer | null

Chunk size in tiktoken tokens to be used when processing file.

chunk_overlap

integer | null

Chunk overlap in tiktoken tokens to be used when processing file.

skip_embedding_generation

boolean

default:false

Flag to control whether or not embeddings should be generated and stored when processing file.

set_page_as_boundary

boolean

default:false

Flag to control whether or not to set a page's worth of content as the maximum amount of content that can appear in a chunk. Only valid for PDFs. See the route description for more information.

embedding_model

default:OPENAI

Embedding model that will be used to embed file chunks.

Available options:

OPENAI,

AZURE_OPENAI,

COHERE_MULTILINGUAL_V3,

OPENAI_ADA_LARGE_256,

OPENAI_ADA_LARGE_1024,

OPENAI_ADA_LARGE_3072,

OPENAI_ADA_SMALL_512,

OPENAI_ADA_SMALL_1536,

AZURE_ADA_LARGE_256,

AZURE_ADA_LARGE_1024,

AZURE_ADA_LARGE_3072,

AZURE_ADA_SMALL_512,

AZURE_ADA_SMALL_1536,

SOLAR_1_MINI

use_ocr

boolean

default:false

Whether or not to use OCR when processing files. Valid for PDFs, JPEGs, and PNGs. Useful for documents with tables, images, and/or scanned text.

generate_sparse_vectors

boolean

default:false

Whether or not to generate sparse vectors for the file. This is required for the file to be a candidate for hybrid search.

prepend_filename_to_chunks

boolean

default:false

Whether or not to prepend the file's name to chunks.

max_items_per_chunk

integer | null

Number of objects per chunk. For csv, tsv, xlsx, and json files only.

Required range: x > 0

parse_pdf_tables_with_ocr

boolean

default:false

Whether to use rich table parsing when use_ocr is enabled.

detect_audio_language

boolean

default:false

Whether to automatically detect the language of the uploaded audio file.

transcription_service

enum<string> | null

The transcription service to use for audio files. If no service is specified, 'deepgram' will be used.

Available options:

assemblyai,

deepgram

include_speaker_labels

boolean

default:false

Detect multiple speakers and label segments of speech by speaker for audio files.

media_type

enum<string> | null

The media type of the file. If not provided, it will be inferred from the file extension.

Available options:

TEXT,

IMAGE,

AUDIO,

VIDEO

split_rows

boolean

default:false

Whether to split tabular rows into chunks. Currently only valid for CSV, TSV, and XLSX files.

enable_cold_storage

boolean

default:false

Enable cold storage for the file. If set to true, the file will be moved to cold storage after a certain period of inactivity. Default is false.

hot_storage_time_to_live

integer | null

Time in days after which the file will be moved to cold storage. Must be one of [1, 3, 7, 14, 30].

generate_chunks_only

boolean

default:false

If this flag is enabled, the file will be chunked and stored with Carbon, but no embeddings will be generated. This overrides the skip_embedding_generation flag.

store_file_only

boolean

default:false

If this flag is enabled, the file will be stored with Carbon, but no processing will be done.
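Several of the parameters above have restricted value sets, and checking them client-side can catch mistakes before the upload is sent. The following is a minimal sketch built only from the constraints listed in this section; the function name validate_upload_params is hypothetical, and the server may enforce further rules not covered here.

```python
ALLOWED_TTL_DAYS = {1, 3, 7, 14, 30}
TRANSCRIPTION_SERVICES = {"assemblyai", "deepgram"}
MEDIA_TYPES = {"TEXT", "IMAGE", "AUDIO", "VIDEO"}


def validate_upload_params(params):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    items = params.get("max_items_per_chunk")
    if items is not None and items <= 0:
        problems.append("max_items_per_chunk must be greater than 0")
    ttl = params.get("hot_storage_time_to_live")
    if ttl is not None and ttl not in ALLOWED_TTL_DAYS:
        problems.append("hot_storage_time_to_live must be one of 1, 3, 7, 14, 30")
    service = params.get("transcription_service")
    if service is not None and service not in TRANSCRIPTION_SERVICES:
        problems.append("transcription_service must be assemblyai or deepgram")
    media = params.get("media_type")
    if media is not None and media not in MEDIA_TYPES:
        problems.append("media_type must be TEXT, IMAGE, AUDIO, or VIDEO")
    return problems


print(validate_upload_params({"hot_storage_time_to_live": 5, "media_type": "AUDIO"}))
```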

Body

multipart/form-data

file

required
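For readers building the request without an HTTP library that handles multipart encoding, the sketch below shows how a single file field is laid out in a multipart/form-data body, including the boundary that must also appear in the Content-Type header. The field name file comes from this section; the helper name build_multipart_body, the filename, and the content are hypothetical.

```python
import uuid


def build_multipart_body(field_name, filename, content, content_type="application/octet-stream"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


# Hypothetical file name and content, for illustration only.
body, header = build_multipart_body("file", "notes.txt", b"hello carbon", "text/plain")
print(header)
```

Most HTTP clients, including curl with --form, generate this body and header automatically, so hand-encoding is rarely needed in practice.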

Response

200

application/json

Successful Response

id

integer

required

source

enum<string>

required

Available options:

GOOGLE_CLOUD_STORAGE,

GOOGLE_DRIVE,

NOTION,

NOTION_DATABASE,

INTERCOM,

DROPBOX,

ONEDRIVE,

SHAREPOINT,

CONFLUENCE,

BOX,

ZENDESK,

ZOTERO,

S3,

AZURE_BLOB_STORAGE,

GMAIL,

OUTLOOK,

SERVICENOW,

TEXT,

CSV,

TSV,

PDF,

DOCX,

PPTX,

XLSX,

XLSM,

MD,

RTF,

JSON,

HTML,

RAW_TEXT,

WEB_SCRAPE,

RSS_FEED,

FRESHDESK,

GITBOOK,

SALESFORCE,

GITHUB,

SLACK,

GURU,

GONG,

DOCUMENT360,

JPG,

PNG,

JPEG,

MP3,

MP2,

AAC,

WAV,

FLAC,

PCM,

M4A,

OGG,

OPUS,

MPEG,

MPG,

MP4,

WMV,

AVI,

MOV,

MKV,

FLV,

WEBM,

EML,

MSG

organization_id

integer

required

organization_user_id

integer | null

required

organization_supplied_user_id

string

required

external_file_id

string

required

sync_status

enum<string>

required

Available options:

DELAYED,

QUEUED_FOR_SYNC,

SYNCING,

READY,

SYNC_ERROR,

EVALUATING_RESYNC,

RATE_LIMITED,

SYNC_ABORTED,

QUEUED_FOR_OCR,

READY_TO_SYNC

skip_embedding_generation

boolean

required

supports_cold_storage

boolean

required

embedding_storage_status

enum<string>

required

Available options:

HOT_STORAGE,

HOT_TO_COLD,

COLD_STORAGE,

COLD_TO_HOT

created_at

string

required

updated_at

string

required

organization_user_data_source_id

integer | null

external_url

string | null

sync_error_message

string | null

last_sync

string | null

tags

object | null

file_statistics

object | null

file_metadata

object | null

embedding_properties

object | null

chunk_size

integer | null

chunk_overlap

integer | null

chunk_properties

object | null

ocr_properties

object

ocr_job_started_at

string | null

name

string | null

parent_id

integer | null

enable_auto_sync

boolean | null

presigned_url

string | null

parsed_text_url

string | null

additional_presigned_urls

object | null

source_created_at

string | null

generate_sparse_vectors

boolean | null

request_id

string | null

upload_id

string | null

sync_properties

object

messages_metadata

object

file_contents_deleted

boolean

default:false

hot_storage_time_to_live

integer | null

