This article describes the options for getting the dataset once the scraper’s code is ready, including saving to production, initiating the scraper, setting delivery preferences, and defining the output schema.
When you write scraper code in the IDE, the system auto-saves the scraper as a draft in the development environment. From inside the IDE, you can run one page at a time to sample how your scraper will behave. To get a full production run, save the scraper to production by clicking the ‘Save to production’ button at the top-right corner of the IDE screen. All scrapers appear under the ‘My scrapers’ tab in the control panel; any inactive collector is shown in a faded state.
Initiate Collector
To start collecting the data, choose one of three options:
A. Initiate by API
B. Initiate manually
C. Schedule a collector
Initiate by API
You can start a data collection through the API without accessing the Bright Data control panel: see the Getting started with the API documentation.
Before initiating an API request, create an API token. To create one, go to:
Dashboard side menu > Settings > Account settings > API tokens
1. Set up inputs - provide the input manually or through the API request.
2. Trigger behavior - you can submit several requests that the system will process in parallel, and queue multiple crawls to run in parallel.
3. Preview of the API request - Bright Data provides you with a REST API call to initiate the collector. Select the "Linux Bash" viewer to see the cURL command. As soon as you send the request, you will receive a job ID (see the sketch below).
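As a rough sketch, a trigger request could look like the following. The endpoint path, collector ID, and input fields here are placeholders, not exact values; copy the real request from the API request preview in the control panel:

```bash
# Illustrative sketch only: copy the real URL and payload from the
# control panel's API request preview; all values here are placeholders.
API_TOKEN="YOUR_API_TOKEN"        # created under Settings > Account settings > API tokens
COLLECTOR_ID="YOUR_COLLECTOR_ID"

curl -s "https://api.brightdata.com/dca/trigger?collector=${COLLECTOR_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '[{"url": "https://example.com/product/123"}]'
# The response contains a job ID; keep it to track the run and fetch the results.
```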
Initiate Manually
Bright Data's control panel makes it easy to get started collecting data.
1. Trigger behavior - you can add several requests that are activated in the order they're defined; you can also add job runs to the queue so that multiple jobs run simultaneously.
2. Set up inputs manually.
3. Upload a CSV file - if you’d like to add a large amount of input, such as a list of URLs, the easiest way is to add it to a CSV file and upload it to the system. See the example below for reference.
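A minimal sketch of such a file, assuming the collector takes a single url input column (your collector’s input names may differ):

```csv
url
https://example.com/product/1
https://example.com/product/2
https://example.com/product/3
```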
Schedule Configuration
Choose when to initiate the collector:
1. Choose a date and time for the collector to start.
2. Select the frequency at which it will run (hourly, daily, weekly, etc.).
3. Set a deadline by which the collection must be complete.
4. Review your setup.
Delivery preferences
You can set your delivery preferences for the dataset. To do that, click the collector row in the ‘My scrapers’ tab, then click ‘Delivery preferences’.
Choose when to get the data:
- Batch: an efficient way of managing large amounts of data.
  - Full batch: deliver all the data at once when the collection is complete.
  - Split batch: deliver the data in smaller batches as soon as each is ready. Note: with this setting, filenames are appended with numbers (and possibly a period) to distinguish the records; see the example below.
- Streaming: results are delivered as a stream while the collection runs. Note: if results are split into success & error records, the same indicator strings are included in the streaming results.
- Real-time: an ideal way to get a fast response for a single request.
  - Skip retries: do not retry when an error occurs; this can speed up collection.
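For instance, a split-batch run that outputs JSON might produce file names like the following (the names are illustrative; the actual pattern depends on your collector and file format):

```text
my_collector_results_1.json
my_collector_results_2.json
my_collector_results_3.json
```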
Choose file format:
- JSON
- NDJSON
- CSV
- XLSX
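To illustrate the difference between the text-based formats, here is the same pair of records (with hypothetical title and price fields) in each; XLSX is a binary spreadsheet format, so it has no plain-text representation:

```text
JSON - a single array of records:
[{"title": "Sample product", "price": 9.99}, {"title": "Another product", "price": 4.5}]

NDJSON - one JSON object per line:
{"title": "Sample product", "price": 9.99}
{"title": "Another product", "price": 4.5}

CSV - a header row plus one row per record:
title,price
Sample product,9.99
Another product,4.5
```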
Choose how to receive the data:
- API Download
- Webhook
- Cloud storage providers: Amazon S3, Google Cloud Storage, Azure
- SFTP/FTP
- Note: media files cannot be delivered when the delivery method is set to Email or API download.
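When using API download, you fetch the finished dataset with the job ID returned by the trigger request. A minimal sketch, assuming a hypothetical result endpoint (check the API documentation for the exact URL):

```bash
# Illustrative sketch only: the result endpoint below is a placeholder;
# copy the exact URL from the API documentation or the control panel.
API_TOKEN="YOUR_API_TOKEN"
JOB_ID="YOUR_JOB_ID"   # returned by the trigger request

curl -s "https://api.brightdata.com/dca/dataset?id=${JOB_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -o results.json      # saves the dataset in your chosen file format
```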
Choose result format:
- Results and errors in separate files
- Results and errors together in one file
- Only successful results
- Only errors
Define notifications:
- Notify when the collection is complete
- Notify about success rates
- Notify when an error occurs
Output schema
The schema defines the structure of the data points and how the data will be organized. You can change the schema to suit your needs: modify and re-order data points, set default values, and add additional data to your output configuration. You can also add new field names by going into the advanced settings and editing the code (a schema sketch follows the list below).
- Input / Output schema: choose the tab you’d like to configure
- Custom validation: validate the schema
- Parsed data: the data points collected by the collector
- Add new field: if you need an additional data point, add a field and define its name and type
- Additional data: additional information you can add to the schema (timestamp, HTML, screenshot, etc.)
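As a rough sketch, an output schema for a hypothetical product collector could define data points like the ones below. The field names, types, and the JSON syntax itself are illustrative assumptions, not the IDE’s actual schema format:

```json
{
  "title":     { "type": "text",    "required": true },
  "price":     { "type": "number",  "default": 0 },
  "in_stock":  { "type": "boolean" },
  "timestamp": { "type": "date" }
}
```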