POST /web_scrape

Authorizations

authorization
string
header
required

Value format: token <token>, where <token> is a temporary access token.
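
A minimal sketch of building the header described above. The helper name build_headers is hypothetical, and the Content-Type entry is an assumption based on the application/json body type rather than something this page states.

```python
from typing import Dict


def build_headers(token: str) -> Dict[str, str]:
    """Return headers using the documented `token <token>` value format."""
    if not token:
        raise ValueError("a temporary access token is required")
    return {
        # Header name and value prefix as documented in Authorizations.
        "authorization": f"token {token}",
        # Assumption: the JSON body is declared explicitly.
        "Content-Type": "application/json",
    }
```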

Body

application/json · object[]
url
string
required
tags
object | null
recursion_depth
integer | null
default: 3
Required range: x > 0
max_pages_to_scrape
integer | null
default: 100
Required range: x > 1
chunk_size
integer | null
default: 1500
chunk_overlap
integer | null
default: 20
skip_embedding_generation
boolean | null
default: false
enable_auto_sync
boolean | null
default: false
generate_sparse_vectors
boolean | null
default: false
prepend_filename_to_chunks
boolean | null
default: false
html_tags_to_skip
string[] | null
css_classes_to_skip
string[] | null
css_selectors_to_skip
string[] | null
embedding_model
enum<string>
Available options:
OPENAI,
AZURE_OPENAI,
AZURE_ADA_LARGE_256,
AZURE_ADA_LARGE_1024,
AZURE_ADA_LARGE_3072,
AZURE_ADA_SMALL_512,
AZURE_ADA_SMALL_1536,
COHERE_MULTILINGUAL_V3,
VERTEX_MULTIMODAL,
OPENAI_ADA_LARGE_256,
OPENAI_ADA_LARGE_1024,
OPENAI_ADA_LARGE_3072,
OPENAI_ADA_SMALL_512,
OPENAI_ADA_SMALL_1536,
SOLAR_1_MINI
url_paths_to_include
string[] | null

URL subpaths or directories to include. For example, to include only stackoverflow.com URLs that start with /questions, add /questions/ to this list.
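
To preview which URLs a given url_paths_to_include value would keep, the rule can be approximated locally. This is only a sketch: it assumes plain prefix matching on the URL path, which is one reading of "start with" above. The server-side matching is not documented here, and path_is_included is a hypothetical helper.

```python
from typing import List, Optional
from urllib.parse import urlparse


def path_is_included(url: str, url_paths_to_include: Optional[List[str]]) -> bool:
    """Approximate the filter: keep a URL whose path starts with a listed subpath."""
    if not url_paths_to_include:
        # Assumption: null or an empty list means no path restriction.
        return True
    path = urlparse(url).path
    return any(path.startswith(prefix) for prefix in url_paths_to_include)


kept = path_is_included("https://stackoverflow.com/questions/123/title", ["/questions/"])
dropped = path_is_included("https://stackoverflow.com/users/1", ["/questions/"])
```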

download_css_and_media
boolean | null
default: false

Whether the scraper should download CSS and media from the page (images, fonts, etc.). Scrapes may take longer to finish with this flag enabled, but the success rate improves.

generate_chunks_only
boolean
default: false

If this flag is enabled, the file will be chunked and stored with Carbon, but no embeddings will be generated. This overrides the skip_embedding_generation flag.

store_file_only
boolean
default: false

If this flag is enabled, the file will be stored with Carbon, but no processing will be done.

use_premium_proxies
boolean
default: false

If the default proxies are blocked and not returning results, this flag can be enabled to use alternate proxies (residential and office). Scrapes might take longer to finish with this flag enabled.
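
The body fields above can be assembled as follows. This is a sketch, not an official client. It builds the documented array of objects and checks the two documented ranges before serializing. build_web_scrape_item and the example URL are hypothetical, and omitted fields are assumed to fall back to the defaults listed above.

```python
import json
from typing import Any, Dict, List


def build_web_scrape_item(url: str, **options: Any) -> Dict[str, Any]:
    """Build one element of the request body array, checking documented ranges."""
    if not url:
        raise ValueError("url is required")
    depth = options.get("recursion_depth")
    if depth is not None and depth <= 0:
        raise ValueError("recursion_depth must be > 0")
    max_pages = options.get("max_pages_to_scrape")
    if max_pages is not None and max_pages <= 1:
        raise ValueError("max_pages_to_scrape must be > 1")
    return {"url": url, **options}


# The body is an array of objects (object[]), one per URL to scrape.
body: List[Dict[str, Any]] = [
    build_web_scrape_item(
        "https://example.com/docs",  # hypothetical target site
        recursion_depth=2,
        max_pages_to_scrape=50,
        embedding_model="OPENAI",
        html_tags_to_skip=["nav", "footer"],
        url_paths_to_include=["/docs/"],
    )
]
payload = json.dumps(body)
```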

Response

200 - application/json

The response is of type any.
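
Because the response type is unconstrained, a client should decode it as generic JSON and inspect it before relying on any field. The sketch below builds the POST request with the Python standard library and decodes a raw response. The base URL is not given on this page, so base_url is a placeholder the caller must supply. The /web_scrape path and both helper names are assumptions for illustration.

```python
import json
import urllib.request
from typing import Any, Dict, List


def make_web_scrape_request(
    base_url: str, token: str, body: List[Dict[str, Any]]
) -> urllib.request.Request:
    """Build (but do not send) the POST request for the scrape endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/web_scrape",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_response(raw: bytes) -> Any:
    """Decode the 200 response; its shape is unspecified, so return it as-is."""
    return json.loads(raw.decode("utf-8"))


# Sending performs a network call, so it is left commented out:
# req = make_web_scrape_request(base_url, token, [{"url": "https://example.com"}])
# with urllib.request.urlopen(req) as resp:
#     result = decode_response(resp.read())
```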