Search
For pre-filtering documents, using `tags_v2` is preferred to using `tags` (which is now deprecated). If both `tags_v2` and `tags` are specified, `tags` is ignored. `tags_v2` enables building complex filters through the use of "AND", "OR", and negation logic. Take the below input as an example:
```json
{
  "OR": [
    {
      "key": "subject",
      "value": "holy-bible",
      "negate": false
    },
    {
      "key": "person-of-interest",
      "value": "jesus christ",
      "negate": false
    },
    {
      "key": "genre",
      "value": "religion",
      "negate": true
    },
    {
      "AND": [
        {
          "key": "subject",
          "value": "tao-te-ching",
          "negate": false
        },
        {
          "key": "author",
          "value": "lao-tzu",
          "negate": false
        }
      ]
    }
  ]
}
```
In this case, files will be filtered such that:
- “subject” = “holy-bible” OR
- “person-of-interest” = “jesus christ” OR
- “genre” != “religion” OR
- (“subject” = “tao-te-ching” AND “author” = “lao-tzu”)
Note that the top level of the query must be either an “OR” or “AND” array. Currently, nesting is limited to 3 levels. For tag blocks (those with “key”, “value”, and “negate” keys), the following typing rules apply:
- `key` is required and must be a `string`
- `value` is required and can be `any` or `list[any]`
- `negate` is optional and must be `true` or `false`. If present and `true`, then the filter block is negated in the resulting query. It is `false` by default.
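The rules above can be sketched as a small client-side validator. This is a hedged illustration only; the function names are hypothetical and not part of any official Carbon SDK.

```python
# Hypothetical client-side validator for tags_v2 filters, based on the
# rules above: top-level "OR"/"AND" array, nesting limited to 3 levels,
# and the typing rules for tag blocks.

def validate_node(node, depth=1):
    if depth > 3:
        raise ValueError("nesting is limited to 3 levels")
    if "AND" in node or "OR" in node:
        op = "AND" if "AND" in node else "OR"
        for child in node[op]:
            validate_node(child, depth + 1)
        return
    # Otherwise the node must be a tag block.
    if not isinstance(node.get("key"), str):
        raise ValueError('"key" is required and must be a string')
    if "value" not in node:
        raise ValueError('"value" is required')
    if not isinstance(node.get("negate", False), bool):
        raise ValueError('"negate" must be true or false')

def validate_tags_v2(filter_obj):
    # The top level must be an "OR" or "AND" array.
    if not ({"AND", "OR"} & set(filter_obj)):
        raise ValueError('top level must be an "OR" or "AND" array')
    validate_node(filter_obj)
```

Running this against the example filter above succeeds, while passing a bare tag block at the top level raises an error.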
When querying embeddings, you can optionally specify the `media_type` parameter in your request. By default (if not set), it is equal to “TEXT”. This means that the query will be performed over files that have been parsed as text (for now, this covers all files except image files). If it is equal to “IMAGE”, the query will be performed over image files (for now, `.jpg` and `.png` files). You can think of this field as an additional filter on top of any other filters set in your request, such as `file_ids`.
When `hybrid_search` is set to true, a combination of keyword search and semantic search is used to rank and select candidate embeddings during information retrieval. By default, these search methods are weighted equally during the ranking process. To adjust the weight (or “importance”) of each search method, you can use the `hybrid_search_tuning_parameters` property. The descriptions of the different tuning parameters are:
- `weight_a`: weight to assign to semantic search
- `weight_b`: weight to assign to keyword search

You must ensure that `sum(weight_a, weight_b, ..., weight_n)` for all n weights is equal to 1. The equality has an error tolerance of 0.001 to account for possible floating point issues.
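The weight constraint can be checked client-side before sending a request. A minimal sketch, using the tolerance stated above (the helper name is hypothetical):

```python
import math

def weights_valid(*weights):
    # All hybrid search weights must sum to 1, within a tolerance
    # of 0.001 to account for floating point issues.
    return math.isclose(sum(weights), 1.0, abs_tol=0.001)
```

For example, `weights_valid(0.7, 0.3)` passes even though the floating-point sum is not exactly 1, while `weights_valid(0.6, 0.3)` fails.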
In order to use hybrid search for a customer across a set of documents, two flags need to be enabled:
- Use the `/modify_user_configuration` endpoint to enable `sparse_vectors` for the customer. The payload body for this request is below:

```json
{
  "configuration_key_name": "sparse_vectors",
  "value": {
    "enabled": true
  }
}
```

- Make sure hybrid search is enabled for the documents across which you want to perform the search. For the `/uploadfile` endpoint, this can be done by setting the following query parameter: `generate_sparse_vectors=true`
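The two enablement steps above can be expressed as request payloads. This is a sketch; the helper functions are hypothetical and only package the body and query parameter named in the text, not part of any official SDK:

```python
def sparse_vectors_request():
    # Request body for /modify_user_configuration (step 1).
    return {
        "configuration_key_name": "sparse_vectors",
        "value": {"enabled": True},
    }

def uploadfile_query_params():
    # Query parameter for /uploadfile (step 2), enabling sparse
    # vector generation for the uploaded file.
    return {"generate_sparse_vectors": "true"}
```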
Carbon supports multiple models for generating embeddings for files. For images, we support Vertex AI’s multimodal model; for text, we support OpenAI’s `text-embedding-ada-002` and Cohere’s `embed-multilingual-v3.0`. The model can be specified via the `embedding_model` parameter (in the POST body for `/embeddings`, and as a query parameter in `/uploadfile`). If no model is supplied, `text-embedding-ada-002` is used by default. When performing embedding queries, only embeddings from files that used the specified model will be considered in the query. For example, if files A and B have embeddings generated with `OPENAI`, and files C and D have embeddings generated with `COHERE_MULTILINGUAL_V3`, then by default, queries will only consider files A and B. If `COHERE_MULTILINGUAL_V3` is specified as the `embedding_model` in `/embeddings`, then only files C and D will be considered. Make sure that the set of all files you want considered for a query have embeddings generated via the same model. For now, do not set `VERTEX_MULTIMODAL` as an `embedding_model`; this model is used automatically by Carbon when it detects an image file.
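Putting the pieces together, a request body for `/embeddings` might look like the sketch below. The values are illustrative, and the chunk-count field name (`k`) is an assumption, not confirmed by this page:

```python
# Illustrative /embeddings request body selecting a Cohere model,
# per the model-matching rule above. Only files embedded with
# COHERE_MULTILINGUAL_V3 would be considered for this query.
query_body = {
    "query": "teachings on humility",
    "k": 5,  # hypothetical field name for the number of chunks to return
    "embedding_model": "COHERE_MULTILINGUAL_V3",
    "hybrid_search": True,
    "hybrid_search_tuning_parameters": {
        "weight_a": 0.6,  # semantic search weight
        "weight_b": 0.4,  # keyword search weight; weights must sum to 1
    },
}
```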
Authorizations
`token <token>`, corresponds to temporary access tokens.
Body
Number of related chunks to return. Required range: `x > 1`
Query for which to get related chunks and embeddings. Minimum length: 1
Embedding model that should be used to embed the query. For this to be effective, the files being searched must also have embeddings in Carbon that were generated by the same embedding model.
Available options: `OPENAI`, `AZURE_OPENAI`, `AZURE_ADA_LARGE_256`, `AZURE_ADA_LARGE_1024`, `AZURE_ADA_LARGE_3072`, `AZURE_ADA_SMALL_512`, `AZURE_ADA_SMALL_1536`, `COHERE_MULTILINGUAL_V3`, `VERTEX_MULTIMODAL`, `OPENAI_ADA_LARGE_256`, `OPENAI_ADA_LARGE_1024`, `OPENAI_ADA_LARGE_3072`, `OPENAI_ADA_SMALL_512`, `OPENAI_ADA_SMALL_1536`, `SOLAR_1_MINI`
Flag to control whether or not to exclude files that are not in hot storage. If set to False, then an error will be returned if any filtered files are in cold storage.
Optional list of file IDs to limit the search to
Filter files based on their type at the source (for example, help center tickets and articles).
Available options: `TICKET`, `ARTICLE`, `CONVERSATION`
Flag to control whether or not to perform a high accuracy embedding search. By default, this is set to false. If true, the search may return more accurate results, but may take longer to complete.
Flag to control whether or not to perform hybrid search.
Hybrid search tuning parameters. See the endpoint description for more details.
Flag to control whether or not to include all children of filtered files in the embedding search.
Flag to control whether or not to include file-level metadata in the response. This metadata will be included in the `content_metadata` field of each document along with chunk/embedding level metadata.
Flag to control whether or not to include a signed URL to the raw file containing each chunk in the response.
Flag to control whether or not to include tags for each chunk in the response.
Flag to control whether or not to include embedding vectors in the response.
Used to filter the kind of files (e.g. `TEXT` or `IMAGE`) over which to perform the search. Also plays a role in determining what embedding model is used to embed the query. If `IMAGE` is chosen as the media type, then the embedding model used will be an embedding model that is not text-only, regardless of what value is passed for `embedding_model`.
Available options: `TEXT`, `IMAGE`, `AUDIO`, `VIDEO`
Optional list of parent file IDs to limit the search to. A parent file describes a file to which another file belongs (e.g. a folder)
Optional query vector for which to get related chunks and embeddings. It must have been generated by the same model used to generate the embeddings across which the search is being conducted. You cannot provide both `query` and `query_vector`.
Parameters for reranking the chunks using a specified model. This field accepts an object with details of the reranking model to be used; either 'Cohere' or 'Jina'. If provided, the specified reranking model will reorder the retrieved chunks based on their relevance to the query.
A set of tags to limit the search to. Deprecated and may be removed in the future.
A set of tags to limit the search to. Use this instead of `tags`, which is deprecated.