We are thrilled to introduce a new feature in our Marketplace Dataset API that gives you finer control over data acquisition. With this update you can request, inspect, and start data collections as separate steps, so that each dataset is generated according to your specific needs.
Understanding When to Use Each API:
Initial Collection Without Customer-Defined View:
The 3 primary API endpoints serve distinct purposes in the data collection workflow, facilitating a structured and efficient process in obtaining tailored datasets.
- Requesting a Collection: The initial step, in which you define what data you want to collect. Use it to specify the parameters of a new dataset collection: the dataset ID and the collection type (discovering new data or a URL collection).
- Checking Status of the Collection: This step is about querying information on an existing data collection request to understand its current state. It's used when checking the status or details of a collection request, such as the total number of lines, the freshness of data, or other pertinent details.
- Initiating a Collection: This step triggers the data collection, transitioning from the request phase to the actual data gathering phase. It's used after defining the collection parameters and when ready to start the data collection process.
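The order of the three steps can be sketched in Python as follows. This is only an illustration of the sequence: the function name `run_collection`, the polling interval, and the retry limit are assumptions, and each step is passed in as a caller-supplied callable rather than a real HTTP call.

```python
import time


def run_collection(request_collection, check_status, initiate_collection,
                   poll_seconds=30, max_polls=20, sleep=time.sleep):
    """Run request -> status check -> initiate, in that order.

    All three steps are hypothetical callables supplied by the caller.
    """
    request_id = request_collection()      # step 1: define what to collect
    for _ in range(max_polls):             # step 2: poll the request status
        status = check_status(request_id)
        if "error" in status:
            raise RuntimeError(f"request {request_id} failed: {status['error']}")
        if status.get("status") == "done":
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"request {request_id} not done after {max_polls} polls")
    initiate_collection(request_id)        # step 3: start the actual collection
    return status


# Dry run with stand-in callables (no network access).
responses = iter([
    {"status": "running", "total_lines": 100},
    {"status": "done", "total_lines": 100, "fresh_count": 30},
])
started = []
final = run_collection(
    request_collection=lambda: "REQUEST_ID",
    check_status=lambda request_id: next(responses),
    initiate_collection=started.append,
    sleep=lambda seconds: None,
)
print(final["status"], started)  # done ['REQUEST_ID']
```

Passing the steps in as callables keeps the sketch independent of any particular HTTP client.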
Collection After Defining a View:
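- Requesting a Collection: You only define what data you want to collect. A single request starts the collection, and delivery follows the settings configured for the view.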
Solution:
Initial Collection Without Customer-Defined View:
The new feature introduces 3 primary endpoints, one for each stage of data collection:
Requesting a Collection:
- Endpoint: `POST https://api.brightdata.com/datasets/request_collection`
- Parameters:
- `dataset_id` (required)
- `type` (required): `discover_new` OR `url_collection`
- Body: `inputs` (Array - json), `file` (multipart - csv)
Example:
curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -k -d '[{"id":"user-id"}]' "https://api.brightdata.com/datasets/request_collection?dataset_id=gd_l1viktl72bvl7bjuj0&type=discover_new"
Processing may take several minutes, depending on the number of inputs. For discovery requests (`discover_new`), finding all links (PDPs) may take additional time.
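For readers calling the API from code rather than curl, the request above could be assembled in Python roughly as follows. This is a minimal sketch using only the standard library; the helper name `build_request_collection` is an assumption, and the request is built but deliberately not sent.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.brightdata.com/datasets"


def build_request_collection(api_token, dataset_id, collection_type, inputs):
    """Build (but do not send) the POST request that defines a collection."""
    if collection_type not in ("discover_new", "url_collection"):
        raise ValueError("type must be 'discover_new' or 'url_collection'")
    # dataset_id and type travel in the query string, as in the curl example.
    query = urllib.parse.urlencode({"dataset_id": dataset_id, "type": collection_type})
    return urllib.request.Request(
        f"{BASE_URL}/request_collection?{query}",
        data=json.dumps(inputs).encode("utf-8"),  # JSON array of inputs
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request_collection(
    "API_TOKEN", "gd_l1viktl72bvl7bjuj0", "discover_new", [{"id": "user-id"}]
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req). The response format is not
# described here, so parsing it is left out of this sketch.
```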
Checking Status of the Collection Above:
- Endpoint: `GET https://api.brightdata.com/datasets/request_collection`
- Parameters:
- `request_id` (required) - Obtain from the previous API.
- `freshness_ms` (required) - Sets the data freshness window. If existing data was collected within this window (e.g., you request 1 week and the data was collected 5 days ago), no new scrape occurs. If the data is older than the window, we scrape it now.
- Example freshness values:
- 1 week: 604,800,000 ms
- 1 month (30 days): 2,592,000,000 ms
Example:
curl -H "Authorization: Bearer API_TOKEN" -k "https://api.brightdata.com/datasets/request_collection?request_id=REQUEST_ID&freshness_ms=2592000000"
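The `freshness_ms` values are plain durations in milliseconds, so they can be computed instead of typed by hand. The sketch below shows one way to do that in Python; the helper names are illustrative, and treating data exactly as old as the window as still fresh is an assumption.

```python
from datetime import timedelta


def freshness_ms(**duration):
    """Convert a duration (timedelta keyword arguments) to whole milliseconds."""
    return timedelta(**duration) // timedelta(milliseconds=1)


def needs_scrape(age_ms, window_ms):
    """Data older than the freshness window is not fresh (boundary is an assumption)."""
    return age_ms > window_ms


print(freshness_ms(weeks=1))   # 604800000
print(freshness_ms(days=30))   # 2592000000

# Requested 1 week of freshness, data collected 5 days ago: no new scrape.
print(needs_scrape(freshness_ms(days=5), freshness_ms(weeks=1)))  # False
```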
Response Indicating Number of Records and Freshness Found:
{ "dataset_id": "DATASET_ID", "total_lines": 100, "fresh_count": 30, "name": "linkedin_companies custom input", "status": "done", "request_id": "XXXX" }
The request is still running:
{ "total_lines": 100, "status": "running" }
Issue with one (or more) inputs. In this case the required field `url` was sent as `URL`:
{ "request_id": "xxxx",
"error": "Validation failed", "error_code": "validation", "validation_errors": [{ "line": "{\"URL\":\"https://www.tiktok.com/search?q=tjd\"}", "index": 1, "errors": [ ["url", "Required field"] ] }]
}
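A client has to tell these three response shapes apart. The following Python sketch shows one possible way to do so; the function name `summarize_status` is an assumption, and reading `total_lines - fresh_count` as the number of records that are not fresh is an interpretation of the sample response, not a documented guarantee.

```python
def summarize_status(response):
    """Turn a parsed status response into a one-line summary."""
    if "error" in response:
        # Each validation error names the input index and the failing fields.
        problems = [
            f"input {item['index']}: {field}: {message}"
            for item in response.get("validation_errors", [])
            for field, message in item["errors"]
        ]
        detail = f" ({'; '.join(problems)})" if problems else ""
        return f"failed: {response['error']}{detail}"
    if response.get("status") == "done":
        total, fresh = response["total_lines"], response["fresh_count"]
        # Assumption: records that are not fresh are total minus fresh.
        return f"done: {total} lines, {fresh} fresh, {total - fresh} not fresh"
    return f"{response.get('status', 'unknown')}: {response.get('total_lines', 0)} lines"


done = {"dataset_id": "DATASET_ID", "total_lines": 100, "fresh_count": 30,
        "name": "linkedin_companies custom input", "status": "done",
        "request_id": "XXXX"}
running = {"total_lines": 100, "status": "running"}
failed = {"request_id": "xxxx", "error": "Validation failed",
          "error_code": "validation",
          "validation_errors": [{
              "line": '{"URL":"https://www.tiktok.com/search?q=tjd"}',
              "index": 1,
              "errors": [["url", "Required field"]],
          }]}

print(summarize_status(done))     # done: 100 lines, 30 fresh, 70 not fresh
print(summarize_status(running))  # running: 100 lines
print(summarize_status(failed))   # failed: Validation failed (input 1: url: Required field)
```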
Initiating a Collection:
- Endpoint: `POST https://api.brightdata.com/datasets/initiate_collection`
- Parameters:
- `request_id` (required): The unique identifier of the collection request you want to initiate, obtained from the request step.
- `freshness_ms` (required): The desired data freshness, in milliseconds.
- Body: JSON object containing `request_id` (required) and `freshness_ms` (required)
Example:
curl -X POST -H "Authorization: Bearer API_TOKEN" -H "content-type: application/json" -k "https://api.brightdata.com/datasets/initiate_collection" -d '{"request_id":"j_ln2x567b2961de0d1x","freshness_ms":2592000000}'
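The same call can be prepared in Python. This is a sketch under the assumption that both fields are sent as a JSON object in the body, as the curl example does; the helper name `build_initiate_collection` is illustrative, and the request is built without being sent.

```python
import json
import urllib.request

INITIATE_URL = "https://api.brightdata.com/datasets/initiate_collection"


def build_initiate_collection(api_token, request_id, freshness_ms):
    """Build (but do not send) the POST request that starts a collection."""
    body = {"request_id": request_id, "freshness_ms": int(freshness_ms)}
    return urllib.request.Request(
        INITIATE_URL,
        data=json.dumps(body).encode("utf-8"),  # lowercase keys, as in the curl example
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_initiate_collection("API_TOKEN", "j_ln2x567b2961de0d1x", 2592000000)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```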
Collection After Defining a View:
Initiating a Collection:
- Endpoint: `POST https://api.brightdata.com/datasets/initiate`
- Parameters:
- `dataset_id` (required)
- `view` (required)
- `type` (required): `discover_new` OR `url_collection`
- Body: `inputs` (Array - json), `file` (multipart - csv)
Example:
curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -k -d '[{"id":"user-id"}]' "https://api.brightdata.com/datasets/initiate?dataset_id=XXX_DATASET_ID&type=url_collection&view=XXX_VIEW_ID"
The dataset will be delivered according to the delivery settings configured for this view.
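With a view defined, a single request is enough. A possible Python equivalent of the curl call is sketched below; the helper name `build_view_initiate` is an assumption, the placeholder IDs are the ones used in the curl example, and nothing is sent over the network.

```python
import json
import urllib.parse
import urllib.request

VIEW_INITIATE_URL = "https://api.brightdata.com/datasets/initiate"


def build_view_initiate(api_token, dataset_id, view_id, collection_type, inputs):
    """Build (but do not send) the single POST request used with a defined view."""
    for name, value in (("dataset_id", dataset_id), ("view", view_id)):
        if not value:
            raise ValueError(f"{name} is required")
    if collection_type not in ("discover_new", "url_collection"):
        raise ValueError("type must be 'discover_new' or 'url_collection'")
    query = urllib.parse.urlencode(
        {"dataset_id": dataset_id, "type": collection_type, "view": view_id}
    )
    return urllib.request.Request(
        f"{VIEW_INITIATE_URL}?{query}",
        data=json.dumps(inputs).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_view_initiate(
    "API_TOKEN", "XXX_DATASET_ID", "XXX_VIEW_ID", "url_collection", [{"id": "user-id"}]
)
print(req.full_url)
```

Compared with the flow without a view, the only additions are the `view` query parameter and the different endpoint path.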
By leveraging these enhanced capabilities, users can now tailor their data collection processes more efficiently, ensuring that the datasets generated are aligned with their project requirements.