The maximum file size is 20 MB, but it can be increased upon request. However, please note that 3rd party connectors may have their own file size limits, which you can check here.

You have the option to select from multiple Embedding models based on your use case.

Supported File Formats

Carbon supports the following text files formats:

FileMore Details
PDFCarbon supports rich metadata such as document author and title. When chunking PDFs, we can return coordinates points so users can highlight specific blocks of text in the actual PDF document.
XLSXEach row is split on its own line, and each element within the row has the header of its corresponding column added to it as a prefix. Our parser assumes that the first row and only the first row is the header. Images and charts aren’t supported yet.
XLSMEach row is split on its own line, and each element within the row has the header of its corresponding column added to it as a prefix. Our parser assumes that the first row and only the first row is the header. Images and charts aren’t supported yet.
CSVEvery row is situated on an individual line, and each element within the row is prefixed with the header of its respective column. Our parser assumes that the first row and only the first row is the header.
DOCXText content is extracted and converted to plain text. Images and charts aren’t supported yet.
TXTText content is extracted and converted to plain text.
MDText content is extracted and converted to plain text.
RTFText content is extracted and converted to plain text.
TSVEvery row is situated on an individual line, and each element within the row is prefixed with the header of its respective column. Our parser assumes that the first row and only the first row is the header.
PPTXText content is extracted and converted to plain text. Images and charts aren’t supported yet.
JSONCarbon iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.
HTMLText content is extracted and converted to plain text.
EMLText content is extracted and converted to plain text.
MSGText content is extracted and converted to plain text.

Processing Semi-Structured Data

Chunking

Applies to JSON, CSV, TSV, XSLX, and GSheet files.
A new chunk is created if either the max_items_per_chunk or chunk_size limit is reached.

For example:

  • If each JSON object is 250 tokens, chunk_size of 800 tokens and no max_items_per_chunk set, then each chunk will contain 3 JSON objects.
  • If each JSON object is 250 tokens, chunk_size of 800 tokens and max_items_per_chunk set to 1, then each chunk will contain 1 JSON object.

Consequently, it is essential to ensure that the number of tokens in a flattened JSON does not surpass the token limits established by the embedding models.

Parsing JSON

The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.

For example, given this object

{
  key: value,
  nested_dict: { nested_key: nested_value },
  nested_list: [list_element_0, list_element_1],
}

it becomes

{
  key: value,
  nested_dict.nested_key: nested_value,
  nested_list.0: list_element_0,
  nested_list.1: list_element_1,
}

When processing a file containing a list of objects, the initial element of each path represents the object’s index in the topmost list. The key-value pairs of each flattened object are combined using commas and newlines to form a chunk. Any braces or quotation marks surrounding strings are omitted.