Text

The maximum file size is 20 MB, but it can be increased upon request. However, please note that 3rd party connectors may have their own file size limits, which you can check here.

You have the option to select from multiple Embedding models based on your use case.

Supported File Formats

Carbon supports the following text files formats:

File	More Details
`PDF`	Carbon supports rich metadata such as document `author` and `title`. When chunking PDFs, we can return coordinates points so users can highlight specific blocks of text in the actual PDF document.
`XLSX`	Each row is split on its own line, and each element within the row has the header of its corresponding column added to it as a prefix. Our parser assumes that the first row and only the first row is the header. Images and charts aren’t supported yet.
`XLSM`	Each row is split on its own line, and each element within the row has the header of its corresponding column added to it as a prefix. Our parser assumes that the first row and only the first row is the header. Images and charts aren’t supported yet.
`CSV`	Every row is situated on an individual line, and each element within the row is prefixed with the header of its respective column. Our parser assumes that the first row and only the first row is the header.
`DOCX`	Text content is extracted and converted to plain text. Images and charts aren’t supported yet.
`TXT`	Text content is extracted and converted to plain text.
`MD`	Text content is extracted and converted to plain text.
`RTF`	Text content is extracted and converted to plain text.
`TSV`	Every row is situated on an individual line, and each element within the row is prefixed with the header of its respective column. Our parser assumes that the first row and only the first row is the header.
`PPTX`	Text content is extracted and converted to plain text. Images and charts aren’t supported yet.
`JSON`	Carbon iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.
`HTML`	Text content is extracted and converted to plain text.
`EML`	Text content is extracted and converted to plain text.
`MSG`	Text content is extracted and converted to plain text.

Processing Semi-Structured Data

Chunking

Applies to JSON, CSV, TSV, XSLX, and GSheet files.

A new chunk is created if either the max_items_per_chunk or chunk_size limit is reached.

For example:

If each JSON object is 250 tokens, chunk_size of 800 tokens and no max_items_per_chunk set, then each chunk will contain 3 JSON objects.
If each JSON object is 250 tokens, chunk_size of 800 tokens and max_items_per_chunk set to 1, then each chunk will contain 1 JSON object.

Consequently, it is essential to ensure that the number of tokens in a flattened JSON does not surpass the token limits established by the embedding models.

Parsing JSON

The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.

For example, given this object

{
  key: value,
  nested_dict: { nested_key: nested_value },
  nested_list: [list_element_0, list_element_1],
}

it becomes

{
  key: value,
  nested_dict.nested_key: nested_value,
  nested_list.0: list_element_0,
  nested_list.1: list_element_1,
}

When processing a file containing a list of objects, the initial element of each path represents the object’s index in the topmost list. The key-value pairs of each flattened object are combined using commas and newlines to form a chunk. Any braces or quotation marks surrounding strings are omitted.

Get Started

Source Connectors

Destination Connectors

Learn

Resources

Migrations

Supported File Formats

Processing Semi-Structured Data

Chunking

Parsing JSON

Get Started

Source Connectors

Destination Connectors

Learn

Resources

Migrations

​Supported File Formats

​Processing Semi-Structured Data

​Chunking

​Parsing JSON

Supported File Formats

Processing Semi-Structured Data

Chunking

Parsing JSON