Overview

Text Embeddings

Reranking Models

Image Embeddings

Video Embeddings

Usage

Supported Models

Documentation

Carbon

API Reference

Website

Support

Models

Introduction

Streamline third-party application management and file uploads with Carbon's pre-built React components

Implementation

Open source client libraries for your favorite platforms.

SDKs

The Carbon Connect `enabledIntegrations` value for Google Drive is `GOOGLE_DRIVE`.

Google Drive

The Carbon Connect `enabledIntegrations` value for SharePoint is `SHAREPOINT`.

SharePoint

The Carbon Connect `enabledIntegrations` value for OneDrive is `ONEDRIVE`.

OneDrive

The Carbon Connect `enabledIntegrations` value for Dropbox is `DROPBOX`.

Dropbox

The Carbon Connect `enabledIntegrations` value for Box is `BOX`.

The Carbon Connect `enabledIntegrations` value for Zotero is `ZOTERO`.

Zotero

The Carbon Connect `enabledIntegrations` value for Intercom is `INTERCOM`.

Intercom

The Carbon Connect `enabledIntegrations` value for Zendesk is `ZENDESK`.

Zendesk

The Carbon Connect `enabledIntegrations` value for Freshdesk is `FRESHDESK`.

Freshdesk

The Carbon Connect `enabledIntegrations` value for Notion is `NOTION`.

Notion

The Carbon Connect `enabledIntegrations` value for Confluence is `CONFLUENCE`.

Confluence

The Carbon Connect `enabledIntegrations` value for Gitbook is `GITBOOK`.

Gitbook

The Carbon Connect `enabledIntegrations` value for Salesforce is `SALESFORCE`.

Salesforce

The Carbon Connect `enabledIntegrations` value for Gmail is `GMAIL`.

Gmail

The Carbon Connect `enabledIntegrations` value for Outlook is `OUTLOOK`.

Outlook

The Carbon Connect `enabledIntegrations` value for Slack is `SLACK`.

Slack

The Carbon Connect `enabledIntegrations` value for Github is `GITHUB`.

Github

The Carbon Connect `enabledIntegrations` value for AWS S3 is `S3`.

The Carbon Connect `enabledIntegrations` value for Google Cloud Storage is `GCS`.

Google Cloud Storage

The Carbon Connect `enabledIntegrations` value for RSS is `RSS_FEED`.

The Carbon Connect `enabledIntegrations` value for websites is `WEB_SCRAPE`.

Websites

Sync embeddings generated from source connectors to any Pinecone instance.

Pinecone

Sync embeddings generated from source connectors to any Turbopuffer instance.

Turbopuffer

Hybrid Search

Use webhooks to notify your application about Carbon events.

Event Types

Filtering

Carbon allows for complete white-labeling of our product.

White Label

An overview of Carbon security features and practices.

Security

Rate Limits

Tutorial

An entity-relationship diagram mapping objects at the API level.

Entity-Relationship Diagram

Migration to Carbon

Carbon simplifies migrations between embedding models.

Embedding Models

Carbon simplifies migrations between vector databases.

Vector Databases

Understand general concepts, response codes, and authentication strategies.

Get Started

This guide provides a breakdown of common errors you might encounter while using the Carbon API. If you encounter an error message not listed here, don't hesitate to contact our support team for assistance.

Errors

Get Access Token

Returns whether or not the organization is white labeled and which integrations are white labeled.
The response is a `WhiteLabelingResponse`.

Get White Label

Please note that not all connectors use OAuth; Gitbook and Freshdesk, for example, do not. Use the connector-specific endpoints for authentication in those cases.

Generate OAuth URL

This endpoint uploads local files directly to Carbon. The `POST` request should be a multipart form request.
Note that the `set_page_as_boundary` query parameter currently applies only to PDFs. When it is set,
PDF chunks are at most one page long. In return, additional information can be retrieved for each chunk, namely the coordinates
of the bounding box around the chunk (useful for features such as text highlighting). The available query parameters
are:
- `chunk_size`: the chunk size (in tokens) applied when splitting the document
- `chunk_overlap`: the chunk overlap (in tokens) applied when splitting the document
- `skip_embedding_generation`: whether or not to skip the generation of chunks and embeddings
- `set_page_as_boundary`: described above
- `embedding_model`: the model used to generate embeddings for the document chunks
- `use_ocr`: whether or not to use OCR as a preprocessing step prior to generating chunks (only valid for PDFs currently)
- `generate_sparse_vectors`: whether or not to generate sparse vectors for the file. Required for hybrid search.
- `prepend_filename_to_chunks`: whether or not to prepend the filename to the chunk text
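
As a rough illustration of how these query parameters might be assembled, the sketch below builds an `/uploadfile` URL with Python's standard library. The base URL is a placeholder and the parameter values are arbitrary examples, not recommendations. Serializing booleans as lowercase `true`/`false` is an assumption. No request is sent.

```python
from urllib.parse import urlencode

# Assumption: placeholder host, not taken from the documentation above.
BASE_URL = "https://api.example.com"


def build_uploadfile_url(**params):
    """Return the /uploadfile URL with the given query parameters.

    Booleans are serialized as lowercase "true"/"false" (an assumption);
    parameters set to None are omitted.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = value
    return f"{BASE_URL}/uploadfile?{urlencode(query)}"


# Illustrative values only.
url = build_uploadfile_url(
    chunk_size=1500,
    chunk_overlap=20,
    set_page_as_boundary=True,
    use_ocr=False,
    generate_sparse_vectors=True,
)
print(url)
```

The file itself would still be sent as the multipart form body of the `POST` request.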


Carbon supports multiple models for use in generating embeddings for files. For images, we support Vertex AI's
multimodal model; for text, we support OpenAI's `text-embedding-ada-002` and Cohere's `embed-multilingual-v3.0`.
The model can be specified via the `embedding_model` parameter (in the POST body for `/embeddings`, and as a query
parameter in `/uploadfile`). If no model is supplied, `text-embedding-ada-002` is used by default. When performing
embedding queries, only embeddings from files that used the specified model are considered.
For example, if files A and B have embeddings generated with `OPENAI`, and files C and D have embeddings generated with
`COHERE_MULTILINGUAL_V3`, then by default, queries will only consider files A and B. If `COHERE_MULTILINGUAL_V3` is
specified as the `embedding_model` in `/embeddings`, then only files C and D will be considered. Make sure that
the set of all files you want considered for a query have embeddings generated via the same model. For now, **do not**
set `VERTEX_MULTIMODAL` as an `embedding_model`. This model is used automatically by Carbon when it detects an image file.
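
The A/B/C/D example above can be mimicked locally. The sketch below is only a model of the selection rule, not Carbon's implementation. The function name and the file-to-model mapping are hypothetical.

```python
# Slug applied when no embedding_model is supplied (per the text above).
DEFAULT_MODEL = "OPENAI"


def files_considered(file_models, embedding_model=None):
    """Return the files whose embeddings were generated with the query's model."""
    model = embedding_model or DEFAULT_MODEL
    return sorted(name for name, used in file_models.items() if used == model)


# Hypothetical mapping of file name -> model used at upload time.
file_models = {
    "A": "OPENAI",
    "B": "OPENAI",
    "C": "COHERE_MULTILINGUAL_V3",
    "D": "COHERE_MULTILINGUAL_V3",
}

print(files_considered(file_models))
print(files_considered(file_models, "COHERE_MULTILINGUAL_V3"))
```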

Upload File

Carbon supports multiple models for use in generating embeddings for files. For images, we support Vertex AI's
multimodal model; for text, we support OpenAI's `text-embedding-ada-002` and Cohere's `embed-multilingual-v3.0`.
The model can be specified via the `embedding_model` parameter (in the POST body for `/embeddings`, and as a query
parameter in `/uploadfile`). If no model is supplied, `text-embedding-ada-002` is used by default. When performing
embedding queries, only embeddings from files that used the specified model are considered.
For example, if files A and B have embeddings generated with `OPENAI`, and files C and D have embeddings generated with
`COHERE_MULTILINGUAL_V3`, then by default, queries will only consider files A and B. If `COHERE_MULTILINGUAL_V3` is
specified as the `embedding_model` in `/embeddings`, then only files C and D will be considered. Make sure that
the set of all files you want considered for a query have embeddings generated via the same model. For now, **do not**
set `VERTEX_MULTIMODAL` as an `embedding_model`. This model is used automatically by Carbon when it detects an image file.

Upload Text

Upload File via URL

For pre-filtering documents, using `tags_v2` is preferred to using `tags` (which is now deprecated). If both `tags_v2`
and `tags` are specified, `tags` is ignored. `tags_v2` enables
building complex filters through the use of "AND", "OR", and negation logic. Take the below input as an example:
```json
{
    "OR": [
        {
            "key": "subject",
            "value": "holy-bible",
            "negate": false
        },
        {
            "key": "person-of-interest",
            "value": "jesus christ",
            "negate": false
        },
        {
            "key": "genre",
            "value": "religion",
            "negate": true
        },
        {
            "AND": [
                {
                    "key": "subject",
                    "value": "tao-te-ching",
                    "negate": false
                },
                {
                    "key": "author",
                    "value": "lao-tzu",
                    "negate": false
                }
            ]
        }
    ]
}
```
In this case, files will be filtered such that:
1. "subject" = "holy-bible" OR
2. "person-of-interest" = "jesus christ" OR
3. "genre" != "religion" OR
4. "subject" = "tao-te-ching" AND "author" = "lao-tzu"

Note that the top level of the query must be either an "OR" or "AND" array. Currently, nesting is limited to 3 levels.
For tag blocks (those with "key", "value", and "negate" keys), the following typing rules apply:
1. "key" isn't optional and must be a `string`
2. "value" isn't optional and can be `any` or list[`any`]
3. "negate" is optional and must be `true` or `false`. If present and `true`, then the filter block is negated in
the resulting query. It is `false` by default.
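
The structural rules above can be checked before a request is sent. The validator below is a minimal sketch. It assumes the nesting limit counts "AND"/"OR" groups with the top level as level 1, and that a group object holds exactly one of "AND" or "OR". Neither assumption is confirmed by the text.

```python
MAX_NESTING = 3  # assumption: counts nested "AND"/"OR" groups, top level included


def validate_tags_v2(node, depth=1, top_level=True):
    """Raise ValueError if `node` breaks the tags_v2 rules described above."""
    if not isinstance(node, dict):
        raise ValueError("each filter node must be an object")
    group_keys = [k for k in ("AND", "OR") if k in node]
    if group_keys:
        # Assumption: a group holds exactly one of "AND" or "OR" and nothing else.
        if len(node) != 1:
            raise ValueError("a group must contain exactly one of 'AND' or 'OR'")
        if depth > MAX_NESTING:
            raise ValueError(f"nesting deeper than {MAX_NESTING} levels")
        children = node[group_keys[0]]
        if not isinstance(children, list):
            raise ValueError("'AND'/'OR' must map to an array")
        for child in children:
            validate_tags_v2(child, depth + 1, top_level=False)
        return
    if top_level:
        raise ValueError("top level must be an 'AND' or 'OR' array")
    # Tag block rules: "key" required string, "value" required, "negate" optional bool.
    if not isinstance(node.get("key"), str):
        raise ValueError("'key' is required and must be a string")
    if "value" not in node:
        raise ValueError("'value' is required")
    if not isinstance(node.get("negate", False), bool):
        raise ValueError("'negate' must be true or false")


example = {
    "OR": [
        {"key": "subject", "value": "holy-bible", "negate": False},
        {"key": "genre", "value": "religion", "negate": True},
        {"AND": [
            {"key": "subject", "value": "tao-te-ching"},
            {"key": "author", "value": "lao-tzu"},
        ]},
    ]
}
validate_tags_v2(example)  # passes silently
```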

View Files

Delete Files V2

Resync File

Get User

Toggle User Features

Delete Users

Update Users

List Users

Submit a URL for a web scrape. Set a recursive depth and a maximum number of pages to scrape.

Web Scrape

Extract URLs from a sitemap and perform a web scrape on each URL.

Scrape Sitemap

Return the content and all URLs found on a specific webpage.

Fetch Webpage

View Webpages

Return all URLs found on a specific sitemap.

Fetch Sitemap URLs

Fetch transcripts from YouTube videos in English.

YouTube Transcript

Initialize a web search and return a list of search results.

Web Search

Our RSS connector parses content from web-hosted RSS and Atom feeds (all versions).

Sync RSS Feed

This endpoint revokes the access token for a particular data source connection.

Revoke Connection

This endpoint retrieves active connections across all data sources for a user.

View Connections

You can bypass the authentication flow on Carbon by directly passing in an access token. This endpoint eliminates the need for users to go through the typical authentication flow. By providing the access token directly, users can gain immediate access to Carbon's features and functionality without any additional steps.

Add Connection

This endpoint syncs all items in a user's data source connection. Note that only the directory structure and accompanying metadata will be synced, not the content within files.

Sync Connection

Use this endpoint to access and navigate a user's file directory, allowing you to create a custom file selector interface.

List Connection Items

After listing files and folders via `/integrations/items/sync` and `/integrations/items/list`, use the selected items' external IDs
as the IDs in this endpoint to sync them into Carbon. SharePoint items take an additional parameter, `root_id`, which identifies
the drive the file or folder is in and is stored in `root_external_id`. This additional parameter is optional; excluding it
tells the sync to assume the item is stored in the default Documents drive.

Sync Connection Files

You will need an access token to connect your Gitbook account. Note that the permissions are defined by the user
who generates the access token, so make sure you have permission to access the spaces you will be syncing.
Refer to this article for more details: https://developer.gitbook.com/gitbook-api/authentication. Additionally, you
need to specify the name of the organization you will be syncing data from.

Sync Gitbook Connection

After connecting your Gitbook account, you can use this endpoint to list all of your spaces under the current organization.

Get Gitbook Spaces

You can sync up to 20 Gitbook spaces at a time using this endpoint. The additional parameters below can be used to associate
data with the synced pages or modify the sync behavior.

Sync Gitbook Spaces

Create a new IAM user with permissions to:
<ol>
<li>List all buckets.</li>
<li>Read from the specific buckets and objects to sync with Carbon. Ensure any future buckets or objects carry 
the same permissions.</li>
</ol>
Once created, generate an access key for this user and share the credentials with us. We recommend testing this key beforehand.

Sync S3 Connection

After optionally loading the items via `/integrations/items/sync` and `/integrations/items/list`, use the bucket name
and object key as the ID in this endpoint to sync them into Carbon. The additional parameters below can associate
data with the selected items or modify the sync behavior.

Sync S3 Files

Refer to this article to obtain an access token: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
Make sure that your access token has permission to read content from your desired repos. Note that if your access token
expires, you will need to update it manually through this endpoint.

Sync GitHub Connection

Once you have connected your GitHub account, you can use this endpoint to list the
repositories your account has access to. You can use a data source ID or username to fetch from a specific account.

List GitHub Repos

You can retrieve the repos your token has access to using `/integrations/github/repos` and sync their content.
You can also pass the full name of any public repository (`username/repo-name`). This stores the repo content with
Carbon, where it can be accessed through the `/integrations/items/list` endpoint. A maximum of 25 repositories is accepted per request.
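
Because of the per-request limit, a longer repository list has to be split across several requests. The helper below is a generic batching sketch. The repository names are made up, and sending each batch is left to the caller.

```python
def batches(items, size=25):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical repository names in username/repo-name form.
repos = [f"octocat/repo-{i}" for i in range(60)]
sizes = [len(batch) for batch in batches(repos)]
print(sizes)  # [25, 25, 10]
```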

Sync GitHub Repos

Once you have successfully connected your Gmail account, you can choose which emails to sync with us
using the `filters` parameter. `filters` is a JSON object of key-value pairs. It also supports AND and OR operations.
For now, we support the limited set of keys listed below.

- `label`: Built-in Gmail labels, for example "Important", or a custom label you created.
- `after` or `before`: A date in YYYY/mm/dd format (for example, 2023/12/31). Gets emails after or before that date.
  You can also combine them to get emails from a certain period.
- `is`: Can have the following values: starred, important, snoozed, and unread.
- `from`: Email address of the sender.
- `to`: Email address of the recipient.

Using keys or values other than those listed above can lead to unexpected behaviour.

A basic query with filters looks like this:
```json
{
    "filters": {
        "key": "label",
        "value": "Test"
    }
}
```
This lists all emails that have the label "Test".

You can use AND and OR operations in the following way:
```json
{
    "filters": {
        "AND": [
            {
                "key": "after",
                "value": "2024/01/07"
            },
            {
                "OR": [
                    {
                        "key": "label",
                        "value": "Personal"
                    },
                    {
                        "key": "is",
                        "value": "starred"
                    }
                ]
            }
        ]
    }
}
```
This returns emails after January 7, 2024 that are either starred or have the label "Personal".
Note that this is the highest level of nesting we support, i.e. you can't add more AND/OR filters within the OR filter
in the above example.
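
Since unsupported keys or values can lead to unexpected behaviour, it may help to build the filter with a few checks in place. The helpers below (`condition`, `group`, `depth`, `gmail_payload`) are hypothetical names for this sketch and are not part of any Carbon SDK. The two-level limit reflects the nesting note above.

```python
import re

ALLOWED_KEYS = {"label", "after", "before", "is", "from", "to"}
ALLOWED_IS = {"starred", "important", "snoozed", "unread"}
DATE_RE = re.compile(r"^\d{4}/\d{2}/\d{2}$")  # YYYY/mm/dd


def condition(key, value):
    """Build one key/value filter after checking it against the supported set."""
    if key not in ALLOWED_KEYS:
        raise ValueError(f"unsupported key: {key}")
    if key == "is" and value not in ALLOWED_IS:
        raise ValueError(f"unsupported 'is' value: {value}")
    if key in ("after", "before") and not DATE_RE.match(value):
        raise ValueError(f"date must be YYYY/mm/dd: {value}")
    return {"key": key, "value": value}


def group(op, *children):
    """Combine filters with "AND" or "OR"."""
    if op not in ("AND", "OR"):
        raise ValueError(f"unsupported operation: {op}")
    return {op: list(children)}


def depth(node):
    """Number of nested AND/OR levels in `node` (0 for a single condition)."""
    for op in ("AND", "OR"):
        if op in node:
            return 1 + max((depth(child) for child in node[op]), default=0)
    return 0


def gmail_payload(filters):
    """Wrap `filters` in a request body, rejecting more than two AND/OR levels."""
    if depth(filters) > 2:
        raise ValueError("at most two levels of AND/OR are supported")
    return {"filters": filters}


payload = gmail_payload(
    group(
        "AND",
        condition("after", "2024/01/07"),
        group("OR", condition("label", "Personal"), condition("is", "starred")),
    )
)
```

The resulting `payload` has the same shape as the AND/OR example above.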

Sync Gmail Connection

After connecting your Gmail account, you can use this endpoint to list all of your labels. User-created labels
have the type "user" and Gmail's default labels have the type "system".

Get Gmail Labels

List all of your public and private channels, DMs, and group DMs. The ID from the response
can be used as a filter to sync messages to Carbon.

- `types`: Comma-separated list of types. Available types are `im` (DMs), `mpim` (group DMs), `public_channel`, and `private_channel`.
  Defaults to `public_channel`.
- `cursor`: Used for pagination. If `next_cursor` is returned in the response, pass it as the `cursor` in the next request.
- `data_source_id`: The data source must be specified if you have linked multiple Slack accounts.
- `exclude_archived`: Whether archived conversations should be excluded. Defaults to true.
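
The cursor handling can be sketched as a simple loop. In the example below, `fetch_page` is a hypothetical callable standing in for the HTTP request, and the `results` field name is an assumption. Only `cursor` and `next_cursor` come from the description above.

```python
def list_all_conversations(fetch_page, types="public_channel"):
    """Collect conversations from every page of results.

    `fetch_page(types, cursor)` is a hypothetical callable that performs the
    request and returns the parsed JSON response as a dict.
    """
    conversations, cursor = [], None
    while True:
        page = fetch_page(types, cursor)
        conversations.extend(page.get("results", []))  # assumed field name
        cursor = page.get("next_cursor")
        if not cursor:  # no next_cursor means this was the last page
            return conversations


# Stand-in for the real endpoint: two pages keyed by the cursor that requests them.
fake_pages = {
    None: {"results": ["C1", "C2"], "next_cursor": "abc"},
    "abc": {"results": ["C3"]},
}


def fake_fetch(types, cursor):
    return fake_pages[cursor]


all_ids = list_all_conversations(fake_fetch, types="public_channel,im")
print(all_ids)  # ['C1', 'C2', 'C3']
```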

Slack List Conversations

You can list all conversations using the endpoint `/integrations/slack/conversations`. The ID of a
conversation is used as an input for this endpoint, with timestamps as optional filters.

Slack Sync

Once you have successfully connected your Outlook account, you can choose which emails to sync with us
using the `filters` and `folder` parameters. `folder` should be the folder you want to sync from Outlook. By default,
we get messages from your inbox folder.
`filters` is a JSON object of key-value pairs. It also supports AND and OR operations.
For now, we support the limited set of keys listed below.

- `category`: Custom categories that you created in Outlook.
- `after` or `before`: A date in YYYY/mm/dd format (for example, 2023/12/31). Gets emails after or before that date. You can also combine them to get emails from a certain period.
- `is`: Can have the following value: flagged.
- `from`: Email address of the sender.

A basic query with filters looks like this:
```json
{
    "filters": {
        "key": "category",
        "value": "Test"
    }
}
```
This lists all emails that have the category "Test".

To specify a custom folder in the same query:
```json
{
    "folder": "Folder Name",
    "filters": {
        "key": "category",
        "value": "Test"
    }
}
```

You can use AND and OR operations in the following way:
```json
{
    "filters": {
        "AND": [
            {
                "key": "after",
                "value": "2024/01/07"
            },
            {
                "OR": [
                    {
                        "key": "category",
                        "value": "Personal"
                    },
                    {
                        "key": "category",
                        "value": "Test"
                    }
                ]
            }
        ]
    }
}
```
This returns emails after January 7, 2024 that have either Personal or Test as a category.
Note that this is the highest level of nesting we support, i.e. you can't add more AND/OR filters within the OR filter
in the above example.
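
To select emails from a certain period, `after` and `before` can be combined under a single AND. The sketch below formats Python dates in the YYYY/mm/dd form. The function name is hypothetical, and whether the boundary dates themselves are included is not specified here.

```python
from datetime import date


def outlook_period_payload(start, end, folder=None):
    """Build a request body selecting emails between two dates.

    `folder` is optional; when omitted, the inbox default described above applies.
    """
    fmt = "%Y/%m/%d"
    payload = {
        "filters": {
            "AND": [
                {"key": "after", "value": start.strftime(fmt)},
                {"key": "before", "value": end.strftime(fmt)},
            ]
        }
    }
    if folder is not None:
        payload["folder"] = folder
    return payload


period = outlook_period_payload(date(2024, 1, 7), date(2024, 1, 31), folder="Folder Name")
```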

Sync Outlook Connection

After connecting your Outlook account, you can use this endpoint to list all of your folders in Outlook. This includes
both system folders like "inbox" and user-created folders.

Get Outlook Folders

After connecting your Outlook account, you can use this endpoint to list all of your categories in Outlook. We currently
support listing up to 250 categories.

Get Outlook Categories

Get Details

Update Organization

Use this endpoint to reaggregate the statistics for an organization, for example `aggregate_file_size`. The reaggregation
process is asynchronous, so a webhook with the event type `FILE_STATISTICS_AGGREGATED` is sent to notify you when the
process is complete. After the aggregation is complete, the updated statistics can be retrieved using the `/organization`
endpoint. The response of `/organization` will also contain a timestamp of the last time the statistics were reaggregated.

Update Statistics

A tag is a key-value pair that can be added to a file. This pair can then be used
for searches (e.g. embedding searches) in order to narrow down the scope of the search.
A file can have any number of tags. The following are reserved keys that cannot be used:
- db_embedding_id
- organization_id
- user_id
- organization_user_file_id

Carbon currently supports two data types for tag values - `string` and `list<string>`.
Keys can only be `string`. If values other than `string` and `list<string>` are used,
they're automatically converted to strings (e.g. 4 will become "4").
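
The rules above can be mirrored on the client to preview what will be stored. The sketch below is a literal reading of those rules: strings and lists of strings are kept, and everything else is passed through Python's `str()`. How Carbon itself stringifies other types, such as mixed lists or booleans, is not stated, so treat that part as an assumption.

```python
RESERVED_KEYS = {
    "db_embedding_id",
    "organization_id",
    "user_id",
    "organization_user_file_id",
}


def normalize_tags(tags):
    """Return a copy of `tags` with values coerced to string or list of strings."""
    normalized = {}
    for key, value in tags.items():
        if not isinstance(key, str):
            raise ValueError(f"tag keys must be strings: {key!r}")
        if key in RESERVED_KEYS:
            raise ValueError(f"reserved key cannot be used: {key}")
        is_str_list = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if isinstance(value, str) or is_str_list:
            normalized[key] = value
        else:
            # Assumption: str() stands in for the server-side conversion.
            normalized[key] = str(value)
    return normalized


tags = normalize_tags({"page": 4, "topics": ["a", "b"], "title": "x"})
print(tags)  # {'page': '4', 'topics': ['a', 'b'], 'title': 'x'}
```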

Create File Tags

Delete File Tags

List Chunks and Embeddings

Upload Chunks / Embeddings

For pre-filtering documents, using `tags_v2` is preferred to using `tags` (which is now deprecated). If both `tags_v2`
and `tags` are specified, `tags` is ignored. `tags_v2` enables
building complex filters through the use of "AND", "OR", and negation logic. Take the below input as an example:
```json
{
    "OR": [
        {
            "key": "subject",
            "value": "holy-bible",
            "negate": false
        },
        {
            "key": "person-of-interest",
            "value": "jesus christ",
            "negate": false
        },
        {
            "key": "genre",
            "value": "religion",
            "negate": true
        },
        {
            "AND": [
                {
                    "key": "subject",
                    "value": "tao-te-ching",
                    "negate": false
                },
                {
                    "key": "author",
                    "value": "lao-tzu",
                    "negate": false
                }
            ]
        }
    ]
}
```
In this case, files will be filtered such that:
1. "subject" = "holy-bible" OR
2. "person-of-interest" = "jesus christ" OR
3. "genre" != "religion" OR
4. "subject" = "tao-te-ching" AND "author" = "lao-tzu"

Note that the top level of the query must be either an "OR" or "AND" array. Currently, nesting is limited to 3 levels.
For tag blocks (those with "key", "value", and "negate" keys), the following typing rules apply:
1. "key" isn't optional and must be a `string`
2. "value" isn't optional and can be `any` or list[`any`]
3. "negate" is optional and must be `true` or `false`. If present and `true`, then the filter block is negated in
the resulting query. It is `false` by default.
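
To make the filter semantics concrete, the sketch below evaluates a `tags_v2` filter against a file's tags locally. It illustrates the logic listed above and is not Carbon's implementation. It assumes plain equality on scalar values and that a file lacking the key counts as "not equal".

```python
def matches(node, tags):
    """Return True if a file's `tags` dict satisfies the filter `node`."""
    if "OR" in node:
        return any(matches(child, tags) for child in node["OR"])
    if "AND" in node:
        return all(matches(child, tags) for child in node["AND"])
    # Tag block. Assumption: plain equality; a missing key counts as "not equal".
    equal = node["key"] in tags and tags[node["key"]] == node["value"]
    return not equal if node.get("negate", False) else equal


tag_filter = {
    "OR": [
        {"key": "subject", "value": "holy-bible"},
        {"key": "genre", "value": "religion", "negate": True},
        {"AND": [
            {"key": "subject", "value": "tao-te-ching"},
            {"key": "author", "value": "lao-tzu"},
        ]},
    ]
}

# Hypothetical files and their tags.
tao = {"subject": "tao-te-ching", "author": "lao-tzu", "genre": "religion"}
other = {"subject": "tao-te-ching", "author": "unknown", "genre": "religion"}
history = {"genre": "history"}

print(matches(tag_filter, tao))      # True: the AND branch matches
print(matches(tag_filter, other))    # False: no branch matches
print(matches(tag_filter, history))  # True: "genre" != "religion"
```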


When querying embeddings, you can optionally specify the `media_type` parameter in your request. By default (if
not set), it is equal to "TEXT". This means that the query will be performed over files that have
been parsed as text (for now, this covers all files except image files). If it is equal to "IMAGE",
the query will be performed over image files (for now, `.jpg` and `.png` files). You can think of this
field as an additional filter on top of any filters set in `file_ids` and


When `hybrid_search` is set to true, a combination of keyword search and semantic search is used to rank
and select candidate embeddings during information retrieval. By default, these search methods are weighted
equally during the ranking process. To adjust the weight (or "importance") of each search method, you can use
the `hybrid_search_tuning_parameters` property. The tuning parameters are:
- `weight_a`: weight to assign to semantic search
- `weight_b`: weight to assign to keyword search

You must ensure that `sum(weight_a, weight_b,..., weight_n)` for all *n* weights is equal to 1. The equality
has an error tolerance of 0.001 to account for possible floating point issues.
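
The sum constraint can be checked on the client before sending a query. The function below is a minimal sketch with a hypothetical name. The tolerance is the 0.001 stated above.

```python
TOLERANCE = 0.001  # error tolerance stated above


def check_weights(params):
    """Return `params` unchanged if the weights sum to 1 within the tolerance."""
    total = sum(params.values())
    if abs(total - 1.0) > TOLERANCE:
        raise ValueError(f"weights must sum to 1, got {total}")
    return params


tuning = check_weights({"weight_a": 0.7, "weight_b": 0.3})
```

Here `0.7 + 0.3` is not exactly 1 in floating point, which is the case the tolerance is meant to absorb.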

In order to use hybrid search for a customer across a set of documents, two flags need to be enabled:
1. Use the `/modify_user_configuration` endpoint to enable `sparse_vectors` for the customer. The payload
body for this request is below:
```
{
  "configuration_key_name": "sparse_vectors",
  "value": {
    "enabled": true
  }
}
```
2. Make sure hybrid search is enabled for the documents across which you want to perform the search. For the
`/uploadfile` endpoint, this can be done by setting the following query parameter: `generate_sparse_vectors=true`.


Carbon supports multiple models for use in generating embeddings for files. For images, we support Vertex AI's
multimodal model; for text, we support OpenAI's `text-embedding-ada-002` and Cohere's `embed-multilingual-v3.0`.
The model can be specified via the `embedding_model` parameter (in the POST body for `/embeddings`, and as a query
parameter in `/uploadfile`). If no model is supplied, `text-embedding-ada-002` is used by default. When performing
embedding queries, only embeddings from files that used the specified model are considered.
For example, if files A and B have embeddings generated with `OPENAI`, and files C and D have embeddings generated with
`COHERE_MULTILINGUAL_V3`, then by default, queries will only consider files A and B. If `COHERE_MULTILINGUAL_V3` is
specified as the `embedding_model` in `/embeddings`, then only files C and D will be considered. Make sure that
the set of all files you want considered for a query have embeddings generated via the same model. For now, **do not**
set `VERTEX_MULTIMODAL` as an `embedding_model`. This model is used automatically by Carbon when it detects an image file.

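| Model | Developer | Compression Factor | Embedding Size | Average MTEB Score | Carbon Slug |
|---|---|---|---|---|---|
| ada v2 | OpenAI | - | 1536 | 61.0 | `OPENAI` |
| text-embedding-3-small | OpenAI | - | 512 | 61.6 | `OPENAI_ADA_SMALL_512` |
| text-embedding-3-small | OpenAI | - | 1536 | 62.3 | `OPENAI_ADA_SMALL_1536` |
| text-embedding-3-large | OpenAI | - | 256 | 62.0 | `OPENAI_ADA_LARGE_256` |
| text-embedding-3-large | OpenAI | - | 1024 | 64.1 | `OPENAI_ADA_LARGE_1024` |
| text-embedding-3-large | OpenAI | - | 3072 | 64.6 | `OPENAI_ADA_LARGE_3072` |
| Cohere Embed v3 Multilingual | Cohere | - | 1024 | 64.0 | `COHERE_MULTILINGUAL_V3` |
| Cohere Embed v3 Multilingual | Cohere | int8 | 1024 | - | Launching soon |
| Cohere Embed v3 Multilingual | Cohere | binary | 1024 | - | Launching soon |
| Solar Embeddings | Upstage | - | 4096 | - | `SOLAR_1_MINI` |
| jina-embeddings-v2 | Jina | - | 768 | 60.4 | Launching soon |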

| Model | Developer | Carbon Slug |
|---|---|---|
| jina-reranker-v2-base-multilingual | Jina AI | `JINA_MULTILINGUAL_BASE_V2` |
| Cohere Rerank 3 Multilingual | Cohere | `COHERE_RERANK_MULTILINGUAL_V3` |

Get Started

Source Connectors

Destination Connectors

Learn

Resources

Migrations

Models
