POST /scrape_sitemap

Authorizations

authorization
string
header
required

Format: token <token>, where <token> is a temporary access token.
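
As a minimal sketch, the header could be built in Python as follows. It assumes the value is the literal word token, a space, then the access token, as the format above suggests. The helper name build_auth_headers is hypothetical, and the Content-Type entry is included only because the body is documented as application/json.

```python
def build_auth_headers(access_token: str) -> dict[str, str]:
    """Return request headers for a temporary access token (hypothetical helper)."""
    if not access_token:
        raise ValueError("access_token must be a non-empty string")
    return {
        # Header name and "token <token>" format are taken from the reference above.
        "authorization": f"token {access_token}",
        # The request body is documented as application/json.
        "Content-Type": "application/json",
    }
```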

Body

application/json
url
string
required
tags
object | null
max_pages_to_scrape
integer | null
chunk_size
integer | null
default: 1500
chunk_overlap
integer | null
default: 20
skip_embedding_generation
boolean | null
default: false
enable_auto_sync
boolean | null
default: false
generate_sparse_vectors
boolean | null
default: false
prepend_filename_to_chunks
boolean | null
default: false
html_tags_to_skip
string[] | null
css_classes_to_skip
string[] | null
css_selectors_to_skip
string[] | null
embedding_model
enum<string>
default: OPENAI
Available options:
OPENAI,
AZURE_OPENAI,
AZURE_ADA_LARGE_256,
AZURE_ADA_LARGE_1024,
AZURE_ADA_LARGE_3072,
AZURE_ADA_SMALL_512,
AZURE_ADA_SMALL_1536,
COHERE_MULTILINGUAL_V3,
VERTEX_MULTIMODAL,
OPENAI_ADA_LARGE_256,
OPENAI_ADA_LARGE_1024,
OPENAI_ADA_LARGE_3072,
OPENAI_ADA_SMALL_512,
OPENAI_ADA_SMALL_1536,
SOLAR_1_MINI
url_paths_to_include
string[] | null

URL subpaths or directories to include. For example, to include only URLs that start with /questions on stackoverflow.com, add /questions/ to this field.

url_paths_to_exclude
string[] | null

URL subpaths or directories to exclude. For example, to exclude URLs that start with /questions on stackoverflow.com, add /questions/ to this field.

urls_to_scrape
string[] | null

A subset of URLs from the sitemap to scrape. To get the list of URLs, use the /process_sitemap endpoint. If left empty, all URLs from the sitemap are scraped.

download_css_and_media
boolean | null
default: false

Whether the scraper should download CSS and media (images, fonts, etc.) from the page. Scrapes may take longer with this flag enabled, but the success rate improves.
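
The body fields above can be assembled as in the following sketch. The helper build_scrape_sitemap_body is hypothetical, and the sitemap URL in the usage note is a placeholder. The sketch assumes that leaving an optional field out of the body lets the service apply the documented default; that behavior is not confirmed by this reference.

```python
from typing import Any

# Optional field names copied from the Body section above.
OPTIONAL_FIELDS = {
    "tags", "max_pages_to_scrape", "chunk_size", "chunk_overlap",
    "skip_embedding_generation", "enable_auto_sync", "generate_sparse_vectors",
    "prepend_filename_to_chunks", "html_tags_to_skip", "css_classes_to_skip",
    "css_selectors_to_skip", "embedding_model", "url_paths_to_include",
    "url_paths_to_exclude", "urls_to_scrape", "download_css_and_media",
}


def build_scrape_sitemap_body(url: str, **options: Any) -> dict[str, Any]:
    """Build a request body dict (hypothetical helper, not part of any SDK)."""
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        # Catch misspelled field names before anything is sent.
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    body: dict[str, Any] = {"url": url}  # url is the only required field
    # Assumption: omitted fields fall back to the documented defaults.
    body.update({k: v for k, v in options.items() if v is not None})
    return body
```

For example, build_scrape_sitemap_body("https://example.com/sitemap.xml", url_paths_to_include=["/questions/"], chunk_size=1500) yields a dict with only url, url_paths_to_include, and chunk_size set.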

Response

200 - application/json

The response is of type any.
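
Putting the pieces together, a call could look like the sketch below, which uses only the Python standard library. BASE_URL is a hypothetical placeholder, since this reference does not state the host, and the function names are illustrative. Because the response type is any, the sketch returns whatever JSON value comes back without assuming a shape.

```python
import json
import urllib.request
from typing import Any

BASE_URL = "https://api.example.com"  # hypothetical placeholder, not from the reference


def make_scrape_sitemap_request(
    token: str, body: dict[str, Any], base_url: str = BASE_URL
) -> urllib.request.Request:
    """Prepare (but do not send) a POST /scrape_sitemap request."""
    return urllib.request.Request(
        url=f"{base_url}/scrape_sitemap",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape_sitemap(token: str, body: dict[str, Any], base_url: str = BASE_URL) -> Any:
    """Send the request and decode the JSON response, whatever its type."""
    request = make_scrape_sitemap_request(token, body, base_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx status codes.
    with urllib.request.urlopen(request, timeout=30) as response:
        raw = response.read().decode("utf-8")
    return json.loads(raw) if raw else None
```

Splitting request construction from sending keeps the first function free of network access, so it can be inspected or unit-tested on its own.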