Functionality

Our web scraper executes Puppeteer jobs in a full-featured Chromium browser, enabling it to render client-side JavaScript and scrape dynamic content. This approach handles websites with interactive and dynamically rendered content effectively.

Carbon’s web scraper supports recursive scraping, allowing for a depth-first exploration of web pages.
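
For instance, a recursive scrape can be bounded so the crawler doesn’t wander indefinitely. Here’s a minimal sketch of a web_scrape request body, assuming recursion_depth and max_pages_to_scrape parameters and a hypothetical starting URL:

[
    {
        "url": "https://example.com/docs",
        "recursion_depth": 2,
        "max_pages_to_scrape": 50
    }
]

With a body like this, the scraper would start at the given URL, follow links up to two levels deep, and stop after 50 pages.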

To enhance anonymity and avoid detection, each request is routed through a different IP address, and we rotate IPs whenever a potential flag is detected. This keeps the scraping process discreet and resilient.

Synchronization

Syncs are triggered when end-users re-submit a URL via the web_scrape endpoint or Carbon Connect. You can also use the resync_file API endpoint to programmatically resync specific web pages, or the delete_files endpoint to remove websites from Carbon.
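
For example, a resync_file request might look like the following, assuming a file_id parameter referencing the ID Carbon returned when the page was first scraped (the ID shown is hypothetical):

{
    "file_id": 1234
}

Deletion follows the same pattern, assuming delete_files accepts an array of file IDs:

{
    "file_ids": [1234, 5678]
}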

To sync websites on a 24-hour schedule (more frequent schedules are available upon request), use the /update_users endpoint. This endpoint allows organizations to customize syncing settings to their requirements, including enabling syncing for all data sources by passing the string "ALL". Note that each request supports up to 100 customer IDs.

Here’s an example showing how to enable automatic syncing of updated website content for specific users:

{
    "customer_ids": ["team@carbon.ai", "sam@openai.com"],
    "auto_sync_enabled_sources": ["WEB_SCRAPE"]
}
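
And per the note above, to enable syncing for all data sources rather than just web scraping, the same endpoint accepts the string "ALL". A sketch, assuming "ALL" is passed in place of the source array:

{
    "customer_ids": ["team@carbon.ai", "sam@openai.com"],
    "auto_sync_enabled_sources": "ALL"
}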