Upload File
This endpoint is used to directly upload local files to Carbon. The POST
request should be a multipart form request.
Note that the set_page_as_boundary query parameter currently applies only to PDFs. When this value is set,
PDF chunks are at most one page long, and additional information can be retrieved for each chunk, namely the coordinates
of the bounding box around the chunk (this can be used for things like text highlighting). The following is a description
of all possible query parameters:
chunk_size: the chunk size (in tokens) applied when splitting the document
chunk_overlap: the chunk overlap (in tokens) applied when splitting the document
skip_embedding_generation: whether or not to skip the generation of chunks and embeddings
set_page_as_boundary: described above
embedding_model: the model used to generate embeddings for the document chunks
use_ocr: whether or not to use OCR as a preprocessing step prior to generating chunks. Valid for PDFs, JPEGs, and PNGs
generate_sparse_vectors: whether or not to generate sparse vectors for the file. Required for hybrid search.
prepend_filename_to_chunks: whether or not to prepend the filename to the chunk text
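As an illustration of how the query parameters listed above might be attached to the request URL, here is a minimal sketch using only the Python standard library. The base URL is a placeholder, the helper name build_upload_url is hypothetical, the numeric values are arbitrary, and the "true"/"false" encoding of booleans is an assumption rather than something this page confirms. The multipart body itself is not shown.

```python
from urllib.parse import urlencode

# Assumption: placeholder host; this page only names the /uploadfile route.
BASE_URL = "https://carbon.example"


def build_upload_url(base_url: str, **params) -> str:
    """Return the /uploadfile URL with the given query parameters appended."""
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # leave unset parameters out so server defaults apply
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase "true"/"false".
            value = "true" if value else "false"
        query[name] = value
    url = f"{base_url.rstrip('/')}/uploadfile"
    return f"{url}?{urlencode(query)}" if query else url


# Illustrative values only; they are not recommended defaults.
url = build_upload_url(
    BASE_URL,
    chunk_size=400,
    chunk_overlap=20,
    set_page_as_boundary=True,
    use_ocr=None,
)
print(url)
```

Parameters passed as None are dropped, so only the options you explicitly set appear in the query string.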
Carbon supports multiple models for generating file embeddings. For images, we support Vertex AI’s
multimodal model; for text, we support OpenAI’s text-embedding-ada-002 and Cohere’s embed-multilingual-v3.0.
The model can be specified via the embedding_model parameter (in the POST body for /embeddings, and as a query
parameter for /uploadfile). If no model is supplied, text-embedding-ada-002 is used by default. When performing
embedding queries, only embeddings from files that used the specified model are considered.
For example, if files A and B have embeddings generated with OPENAI, and files C and D have embeddings generated with
COHERE_MULTILINGUAL_V3, then by default, queries will only consider files A and B. If COHERE_MULTILINGUAL_V3 is
specified as the embedding_model in /embeddings, then only files C and D will be considered. Make sure that
all files you want considered for a query have embeddings generated via the same model. For now, do not
set VERTEX_MULTIMODAL as an embedding_model. This model is used automatically by Carbon when it detects an image file.
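The A/B/C/D example above can be mirrored with a small local sketch. This only models the selection rule as described on this page; it is not Carbon's implementation, and the function name files_considered and the dictionary layout are hypothetical.

```python
# Default model identifier, per the example above (queries default to OPENAI).
DEFAULT_MODEL = "OPENAI"


def files_considered(file_models: dict[str, str],
                     embedding_model: str = DEFAULT_MODEL) -> list[str]:
    """Return the files whose embeddings were generated with the queried model."""
    return sorted(
        name for name, model in file_models.items() if model == embedding_model
    )


# Which model each file was embedded with at upload time.
file_models = {
    "A": "OPENAI",
    "B": "OPENAI",
    "C": "COHERE_MULTILINGUAL_V3",
    "D": "COHERE_MULTILINGUAL_V3",
}

print(files_considered(file_models))                            # ['A', 'B']
print(files_considered(file_models, "COHERE_MULTILINGUAL_V3"))  # ['C', 'D']
```

A query can never span both groups, which is why all files meant to be searched together should be uploaded with the same embedding_model.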
Authorizations
token <token>, corresponds to temporary access tokens.
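A minimal sketch of formatting the credential in the token <token> form shown above. The header name "Authorization" is an assumption, since this page does not state which header carries the value, and auth_headers is a hypothetical helper.

```python
def auth_headers(access_token: str) -> dict[str, str]:
    """Build request headers carrying a temporary access token."""
    if not access_token:
        raise ValueError("an access token is required")
    # Assumption: the value is sent in a header named "Authorization".
    return {"Authorization": f"token {access_token}"}


headers = auth_headers("example-access-token")
print(headers)
```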
Query Parameters
Chunk size in tiktoken tokens to be used when processing the file.
Chunk overlap in tiktoken tokens to be used when processing the file.
Flag to control whether or not embeddings should be generated and stored when processing the file.
Flag to control whether or not a page's worth of content is the maximum amount of content that can appear in a chunk. Only valid for PDFs. See the route description for more information.
Embedding model that will be used to embed file chunks.
OPENAI, AZURE_OPENAI, COHERE_MULTILINGUAL_V3, OPENAI_ADA_LARGE_256, OPENAI_ADA_LARGE_1024, OPENAI_ADA_LARGE_3072, OPENAI_ADA_SMALL_512, OPENAI_ADA_SMALL_1536, AZURE_ADA_LARGE_256, AZURE_ADA_LARGE_1024, AZURE_ADA_LARGE_3072, AZURE_ADA_SMALL_512, AZURE_ADA_SMALL_1536, SOLAR_1_MINI
Whether or not to use OCR when processing files. Valid for PDFs, JPEGs, and PNGs. Useful for documents with tables, images, and/or scanned text.
Whether or not to generate sparse vectors for the file. This is required for the file to be a candidate for hybrid search.
Whether or not to prepend the file's name to chunks.
Number of objects per chunk. For csv, tsv, xlsx, and json files only.
x > 0
Whether to use rich table parsing when use_ocr is enabled.
Whether to automatically detect the language of the uploaded audio file.
The transcription service to use for audio files. If no service is specified, 'deepgram' will be used.
assemblyai, deepgram
Detect multiple speakers and label segments of speech by speaker for audio files.
The media type of the file. If not provided, it will be inferred from the file extension.
TEXT, IMAGE, AUDIO, VIDEO
Whether to split tabular rows into chunks. Currently only valid for CSV, TSV, and XLSX files.
Enable cold storage for the file. If set to true, the file will be moved to cold storage after a certain period of inactivity. Default is false.
Time in days after which the file will be moved to cold storage. Must be one of [1, 3, 7, 14, 30].
If this flag is enabled, the file will be chunked and stored with Carbon, but no embeddings will be generated. This overrides the skip_embedding_generation flag.
If this flag is enabled, the file will be stored with Carbon, but no processing will be done.
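Several of the parameters described above carry constraints (a positive count of objects per chunk, a fixed set of cold storage periods) and precedence rules between flags. The sketch below checks them client-side before a request is sent. All function and argument names are hypothetical, because this page lists these parameters by description only. Treating the store-only flag as taking precedence over the chunks-only flag is an assumption, and the effect of skip_embedding_generation follows the route description (it skips both chunks and embeddings).

```python
# Allowed cold storage periods in days, as listed above.
ALLOWED_COLD_STORAGE_DAYS = (1, 3, 7, 14, 30)


def check_items_per_chunk(count: int) -> int:
    """Objects per chunk must satisfy x > 0."""
    if count <= 0:
        raise ValueError("objects per chunk must be greater than 0")
    return count


def check_cold_storage_days(days: int) -> int:
    """The cold storage period must be one of the listed values."""
    if days not in ALLOWED_COLD_STORAGE_DAYS:
        raise ValueError(f"days must be one of {list(ALLOWED_COLD_STORAGE_DAYS)}")
    return days


def processing_plan(store_only: bool = False,
                    chunks_only: bool = False,
                    skip_embedding_generation: bool = False) -> dict[str, bool]:
    """Resolve which processing steps run for a given combination of flags."""
    if store_only:
        # Assumption: store-only wins over every other flag (no processing).
        return {"chunks": False, "embeddings": False}
    if chunks_only:
        # Stated above: this overrides skip_embedding_generation.
        return {"chunks": True, "embeddings": False}
    if skip_embedding_generation:
        return {"chunks": False, "embeddings": False}
    return {"chunks": True, "embeddings": True}


plan = processing_plan(chunks_only=True, skip_embedding_generation=True)
print(plan)
```

Failing fast on the client avoids uploading a large file only to have the request rejected for an invalid parameter value.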
Body
Response
HOT_STORAGE, HOT_TO_COLD, COLD_STORAGE, COLD_TO_HOT
GOOGLE_CLOUD_STORAGE, GOOGLE_DRIVE, NOTION, NOTION_DATABASE, INTERCOM, DROPBOX, ONEDRIVE, SHAREPOINT, CONFLUENCE, BOX, ZENDESK, ZOTERO, S3, AZURE_BLOB_STORAGE, GMAIL, OUTLOOK, SERVICENOW, TEXT, CSV, TSV, PDF, DOCX, PPTX, XLSX, XLSM, MD, RTF, JSON, HTML, RAW_TEXT, WEB_SCRAPE, RSS_FEED, FRESHDESK, GITBOOK, SALESFORCE, GITHUB, SLACK, GURU, GONG, DOCUMENT360, JPG, PNG, JPEG, MP3, MP2, AAC, WAV, FLAC, PCM, M4A, OGG, OPUS, MPEG, MPG, MP4, WMV, AVI, MOV, MKV, FLV, WEBM, EML, MSG
DELAYED, QUEUED_FOR_SYNC, SYNCING, READY, SYNC_ERROR, EVALUATING_RESYNC, RATE_LIMITED, SYNC_ABORTED, QUEUED_FOR_OCR, READY_TO_SYNC