Scrape Sitemap

POST /scrape_sitemap

curl --request POST \
  --url https://api.carbon.ai/scrape_sitemap \
  --header 'Content-Type: application/json' \
  --header 'authorization: <api-key>' \
  --data '{
  "url": "<string>",
  "tags": {},
  "max_pages_to_scrape": 2,
  "chunk_size": 123,
  "chunk_overlap": 123,
  "skip_embedding_generation": true,
  "enable_auto_sync": true,
  "generate_sparse_vectors": true,
  "prepend_filename_to_chunks": true,
  "html_tags_to_skip": [
    "<string>"
  ],
  "css_classes_to_skip": [
    "<string>"
  ],
  "css_selectors_to_skip": [
    "<string>"
  ],
  "embedding_model": "OPENAI",
  "url_paths_to_include": [
    "<string>"
  ],
  "url_paths_to_exclude": [
    "<string>"
  ],
  "urls_to_scrape": [
    "<string>"
  ],
  "download_css_and_media": true,
  "generate_chunks_only": false,
  "store_file_only": false,
  "use_premium_proxies": false
}'

"<any>"

Authorizations

authorization

string

header

required

Format: token <token>. This scheme corresponds to temporary access tokens.

Body

application/json

url

string

required

tags

object | null

max_pages_to_scrape

integer | null

Required range: x >= 1

chunk_size

integer | null

default:1500

chunk_overlap

integer | null

default:20
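As a minimal sketch of how the fields above fit together, the helper below assembles a request body and checks the one documented constraint (max_pages_to_scrape must be at least 1) before anything is sent. The function name build_sitemap_body is hypothetical and is not part of any Carbon SDK; the defaults mirror the ones listed on this page.

```python
from typing import Any, Dict, Optional


def build_sitemap_body(
    url: str,
    max_pages_to_scrape: Optional[int] = None,
    chunk_size: int = 1500,    # documented default
    chunk_overlap: int = 20,   # documented default
) -> Dict[str, Any]:
    """Assemble a JSON-serializable body for POST /scrape_sitemap.

    Hypothetical helper for illustration only.
    """
    if not url:
        raise ValueError("url is required")
    # The page documents the required range x >= 1 for this field.
    if max_pages_to_scrape is not None and max_pages_to_scrape < 1:
        raise ValueError("max_pages_to_scrape must be >= 1")

    body: Dict[str, Any] = {
        "url": url,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }
    # Nullable fields are simply omitted when not set.
    if max_pages_to_scrape is not None:
        body["max_pages_to_scrape"] = max_pages_to_scrape
    return body
```

Validating on the client side is optional; it only surfaces an out-of-range value earlier than the API would.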

skip_embedding_generation

boolean | null

default:false

enable_auto_sync

boolean | null

default:false

generate_sparse_vectors

boolean | null

default:false

prepend_filename_to_chunks

boolean | null

default:false

html_tags_to_skip

string[] | null

css_classes_to_skip

string[] | null

css_selectors_to_skip

string[] | null

embedding_model

enum<string>

Available options: OPENAI, AZURE_OPENAI, AZURE_ADA_LARGE_256, AZURE_ADA_LARGE_1024, AZURE_ADA_LARGE_3072, AZURE_ADA_SMALL_512, AZURE_ADA_SMALL_1536, COHERE_MULTILINGUAL_V3, VERTEX_MULTIMODAL, OPENAI_ADA_LARGE_256, OPENAI_ADA_LARGE_1024, OPENAI_ADA_LARGE_3072, OPENAI_ADA_SMALL_512, OPENAI_ADA_SMALL_1536, SOLAR_1_MINI

url_paths_to_include

string[] | null

URL subpaths or directories to include. For example, to include only URLs on stackoverflow.com whose path starts with /questions, add /questions/ to this list.

url_paths_to_exclude

string[] | null

URL subpaths or directories to exclude. For example, to exclude URLs on stackoverflow.com whose path starts with /questions, add /questions/ to this list.
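To preview which URLs a given pair of url_paths_to_include and url_paths_to_exclude lists would keep, a local approximation can help. The sketch below assumes plain prefix matching on the URL path, based on the "starts with" wording in the descriptions; the server-side matching rules are not specified on this page and may differ. The function preview_path_filter is hypothetical.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def preview_path_filter(
    urls: Iterable[str],
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
) -> List[str]:
    """Locally approximate the include/exclude path filters.

    Assumption: a URL matches when its path starts with one of the
    given subpaths. This is an illustration, not the service's logic.
    """
    kept = []
    for url in urls:
        path = urlparse(url).path
        # With an include list, drop anything that matches none of it.
        if include and not any(path.startswith(p) for p in include):
            continue
        # Drop anything that matches an exclude prefix.
        if exclude and any(path.startswith(p) for p in exclude):
            continue
        kept.append(url)
    return kept
```

For example, with include set to ["/questions/"], a URL such as https://stackoverflow.com/questions/1 is kept and https://stackoverflow.com/tags/python is dropped under this assumption.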

urls_to_scrape

string[] | null

A subset of URLs from the sitemap to scrape. To get the list of URLs in a sitemap, use the /process_sitemap endpoint. If left empty, all URLs from the sitemap are scraped.
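One way to arrive at a urls_to_scrape subset is to list the sitemap's URLs and filter them before building the request. The response shape of /process_sitemap is not shown on this page, so the sketch below instead reads a standard sitemap XML document locally (using the sitemaps.org 0.9 namespace) as a stand-in. The names urls_from_sitemap_xml and body_for_subset are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict, List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap_xml(xml_text: str) -> List[str]:
    """Return every <loc> value found under <url> entries of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def body_for_subset(
    sitemap_url: str,
    all_urls: List[str],
    keep: Callable[[str], bool],
) -> Dict[str, Any]:
    """Build a request body that scrapes only the URLs accepted by keep()."""
    return {
        "url": sitemap_url,
        "urls_to_scrape": [u for u in all_urls if keep(u)],
    }
```

Sitemap index files, which nest further sitemaps, are not handled by this sketch.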

download_css_and_media

boolean | null

default:false

Whether the scraper should download CSS and media (images, fonts, etc.) from the page. Scrapes may take longer to finish with this flag enabled, but the success rate improves.

generate_chunks_only

boolean

default:false

If this flag is enabled, the file will be chunked and stored with Carbon, but no embeddings will be generated. This overrides the skip_embedding_generation flag.

store_file_only

boolean

default:false

If this flag is enabled, the file will be stored with Carbon, but no processing will be done.

use_premium_proxies

boolean

default:false

If the default proxies are blocked and not returning results, this flag can be enabled to use alternate proxies (residential and office). Scrapes might take longer to finish with this flag enabled.

Response

200

application/json

Successful Response

The response is of type any.
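The curl sample above can be mirrored with the Python standard library. This is a minimal sketch, assuming the authorization header takes the token <token> form described under Authorizations; the curl sample shows an <api-key> placeholder instead, so adjust the header value to whatever credential your account uses. The helper names build_request and send are hypothetical. Because the response is typed as any, the body is returned as decoded JSON without further interpretation.

```python
import json
import urllib.request
from typing import Any, Dict

API_URL = "https://api.carbon.ai/scrape_sitemap"


def build_request(token: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for /scrape_sitemap."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed format, taken from the Authorizations section.
            "authorization": f"token {token}",
        },
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before any network call is made.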

