Web Scrape
Scrape Sitemap
Extract URLs from a sitemap and perform a web scrape on each URL.
POST /scrape_sitemap
Authorizations
authorization
string
header
required
token <token>, where <token> is a temporary access token.
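As a minimal sketch, the authorization header described above could be built like this in Python. The helper name `auth_headers` is hypothetical, and the `Content-Type` entry is an assumption based on the body being sent as application/json.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build request headers using the `token <token>` scheme described above."""
    if not token:
        raise ValueError("an access token is required")
    return {
        # Header name and value format come from the Authorizations section.
        "authorization": f"token {token}",
        # Assumption: the JSON body is declared explicitly.
        "Content-Type": "application/json",
    }
```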
Body
application/json
url
string
required
tags
object | null
max_pages_to_scrape
integer | null
chunk_size
integer | null
chunk_overlap
integer | null
skip_embedding_generation
boolean | null
enable_auto_sync
boolean | null
generate_sparse_vectors
boolean | null
prepend_filename_to_chunks
boolean | null
html_tags_to_skip
string[] | null
css_classes_to_skip
string[] | null
css_selectors_to_skip
string[] | null
embedding_model
enum<string>
Available options: OPENAI, AZURE_OPENAI, AZURE_ADA_LARGE_256, AZURE_ADA_LARGE_1024, AZURE_ADA_LARGE_3072, AZURE_ADA_SMALL_512, AZURE_ADA_SMALL_1536, COHERE_MULTILINGUAL_V3, VERTEX_MULTIMODAL, OPENAI_ADA_LARGE_256, OPENAI_ADA_LARGE_1024, OPENAI_ADA_LARGE_3072, OPENAI_ADA_SMALL_512, OPENAI_ADA_SMALL_1536, SOLAR_1_MINI
url_paths_to_include
string[] | null
URL subpaths or directories to include. For example, to include only URLs that start with /questions on stackoverflow.com, add /questions/ to this input.
url_paths_to_exclude
string[] | null
URL subpaths or directories to exclude. For example, to exclude URLs that start with /questions on stackoverflow.com, add /questions/ to this input.
urls_to_scrape
string[] | null
A subset of URLs from the sitemap to scrape. To get the list of URLs, use the /process_sitemap endpoint. If left empty, all URLs from the sitemap are scraped.
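The body fields above could be assembled as in the following sketch. The function name `build_scrape_sitemap_body` and the example sitemap URL are hypothetical. Dropping unset options rather than sending explicit nulls is an assumption, since the reference only states that these fields are nullable.

```python
import json
from typing import Any

# Values copied from the embedding_model enum listed above.
EMBEDDING_MODELS = {
    "OPENAI", "AZURE_OPENAI", "AZURE_ADA_LARGE_256", "AZURE_ADA_LARGE_1024",
    "AZURE_ADA_LARGE_3072", "AZURE_ADA_SMALL_512", "AZURE_ADA_SMALL_1536",
    "COHERE_MULTILINGUAL_V3", "VERTEX_MULTIMODAL", "OPENAI_ADA_LARGE_256",
    "OPENAI_ADA_LARGE_1024", "OPENAI_ADA_LARGE_3072", "OPENAI_ADA_SMALL_512",
    "OPENAI_ADA_SMALL_1536", "SOLAR_1_MINI",
}


def build_scrape_sitemap_body(url: str, **options: Any) -> str:
    """Serialize a request body; `url` is the only required field."""
    if not url:
        raise ValueError("url is required")
    model = options.get("embedding_model")
    if model is not None and model not in EMBEDDING_MODELS:
        raise ValueError(f"unknown embedding_model: {model}")
    body: dict[str, Any] = {"url": url}
    # Assumption: options left as None are omitted instead of sent as null.
    body.update({key: value for key, value in options.items() if value is not None})
    return json.dumps(body)


# Example: scrape at most 50 pages under /questions/ from a hypothetical sitemap.
payload = build_scrape_sitemap_body(
    "https://example.com/sitemap.xml",
    max_pages_to_scrape=50,
    url_paths_to_include=["/questions/"],
    embedding_model="OPENAI",
    chunk_size=None,
)
```

Checking `embedding_model` on the client side is optional; it only catches typos before the request is sent.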
Response
200 - application/json
The response is of type any.
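Because the response type is any, a client should decode the JSON without assuming a particular shape. The sketch below prepares the POST request and decodes a response using only the Python standard library. `BASE_URL` is a hypothetical placeholder (the host is not given in this reference), and the helper names are illustrative.

```python
import json
import urllib.request
from typing import Any

BASE_URL = "https://api.example.com"  # hypothetical placeholder, not from this reference


def make_request(token: str, body: dict[str, Any]) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to /scrape_sitemap."""
    return urllib.request.Request(
        url=f"{BASE_URL}/scrape_sitemap",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_response(raw: bytes) -> Any:
    """Decode the JSON response body; it may be an object, a list, or a scalar."""
    return json.loads(raw.decode("utf-8"))


# Sending is left to the caller, for example:
# with urllib.request.urlopen(make_request(token, {"url": "..."})) as resp:
#     result = decode_response(resp.read())
```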