Scrape Sitemap
Extract URLs from a sitemap and perform a web scrape on each URL.
Authorizations
token <token>
, corresponds to temporary access tokens.
Body
x > 1
OPENAI
, AZURE_OPENAI
, AZURE_ADA_LARGE_256
, AZURE_ADA_LARGE_1024
, AZURE_ADA_LARGE_3072
, AZURE_ADA_SMALL_512
, AZURE_ADA_SMALL_1536
, COHERE_MULTILINGUAL_V3
, VERTEX_MULTIMODAL
, OPENAI_ADA_LARGE_256
, OPENAI_ADA_LARGE_1024
, OPENAI_ADA_LARGE_3072
, OPENAI_ADA_SMALL_512
, OPENAI_ADA_SMALL_1536
, SOLAR_1_MINI
URL subpaths or directories that you want to include. For example if you want to only include URLs that start with /questions in stackoverflow.com, you will add /questions/ in this input
URL subpaths or directories that you want to exclude. For example if you want to exclude URLs that start with /questions in stackoverflow.com, you will add /questions/ in this input
You can submit a subset of URLs from the sitemap that should be scraped. To get the list of URLs, you can check out /process_sitemap endpoint. If left empty, all URLs from the sitemap will be scraped.
Whether the scraper should download css and media from the page (images, fonts, etc). Scrapes might take longer to finish with this flag enabled, but the success rate is improved.
If this flag is enabled, the file will be chunked and stored with Carbon, but no embeddings will be generated. This overrides the skip_embedding_generation flag.
If this flag is enabled, the file will be stored with Carbon, but no processing will be done.
If the default proxies are blocked and not returning results, this flag can be enabled to use alternate proxies (residential and office). Scrapes might take longer to finish with this flag enabled.
Response
The response is of type any
.