This article lists common use cases that scraping code has to handle and explains how to approach each one.
Note: Based on the challenge you’re facing and the action you need to perform, you might need to change the worker type of your collector. Read more about the importance of the worker type.
Action: Set the collector to run from a specific geo location
Worker: Browser Worker or Code Worker
Part: Interaction Code
Solution: Specify a 2-character ISO country code or a proxy location
Function: country()
Code example:
country('US')
Action: Navigate to the initial page
Worker: Browser Worker or Code Worker
Part: Interaction Code
Solution: Hard code the URL or use an input variable for the URL or part of it
Function: navigate()
Code example:
navigate('https://google.com')
navigate(input.url)
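When only part of the URL comes from the input, the full URL can be assembled before it is passed to navigate(). The following is a minimal sketch in plain JavaScript; the helper name buildSearchUrl, the `keyword` input field, the example.com address, and the `q` parameter are all illustrative assumptions, not part of the IDE.

```javascript
// Hypothetical helper: builds a search URL from a base URL and a keyword.
// The base URL and the `q` parameter name are assumptions for illustration.
function buildSearchUrl(baseUrl, keyword) {
  const url = new URL(baseUrl)
  // searchParams.set() encodes the value, so spaces and '&' are safe
  url.searchParams.set('q', keyword)
  return url.href
}

const input = {keyword: 'running shoes'} // stand-in for the collector input
const target = buildSearchUrl('https://example.com/search', input.keyword)
// target is 'https://example.com/search?q=running+shoes'
```

In the IDE, the resulting string would then be passed to navigate() in place of a hard-coded URL.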
Action: Close a popup
Worker: Browser Worker only
Part: Interaction Code
Solution: Find the popup in advance and use the function to close it (add it at the top of your code)
Function: close_popup()
Code example:
close_popup('#onetrust-banner-sdk', '#onetrust-accept-btn-handler')
Action: Use the website's search function (if URL can’t be used for that)
Worker: Browser Worker or Code Worker
Part: Interaction Code or Parser Code
Solution: Navigate to the page > wait for the search field > type the search term > click the search button
Function: navigate() + wait() + type() + click()
Code example:
navigate('https://www.zillow.com/')
wait('#search-box-input')
type('#search-box-input', 'Los Angeles, CA')
click('#search-icon')
Action: Find the total number of results pages
Worker: Browser Worker or Code Worker
Part: Interaction Code or Parser Code
Solution: Identify the element and parse the content
Function: parse()
Code example:
return {
pages: $('[class*="Pagination-module__PageNumber__text"]').toArray().map(v => $(v).text().trim()).pop()
}
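Taking the last pagination element works only when that element is a page number. If the list can also contain labels such as "Next" or an ellipsis, the numeric entries can be filtered first. This is a sketch under that assumption; lastPageNumber is a hypothetical helper, not an IDE function, and it takes the label texts as a plain array.

```javascript
// Hypothetical helper: returns the highest page number found in a list of
// pagination labels, ignoring non-numeric entries such as "Next" or "…".
function lastPageNumber(labels) {
  const numbers = labels
    .map(label => parseInt(label.trim(), 10))
    .filter(n => Number.isInteger(n) && n > 0)
  // Fall back to a single page when no numeric label is present
  return numbers.length ? Math.max(...numbers) : 1
}

const pages = lastPageNumber(['1', '2', '3', '…', '12', 'Next'])
// pages is 12
```

Returning a number rather than text also keeps later comparisons such as `i <= pages` free of implicit string conversion.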
Action: Create list of pagination URLs and send them to the next stage
Worker: Browser Worker or Code Worker
Part: Interaction Code
Solution: Detect and parse the maximum page number, then loop over the pages, update the URL's 'page' query parameter, and send each URL to the next stage
Function: next_stage()
Code example:
const {pages} = parse()
for(let i = 1; i <= pages; i++) {
let url = new URL(location.href)
url.searchParams.set('page', i)
next_stage({
url
})
}
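The URL-building part of this loop can be tried outside the IDE. The sketch below assumes the site paginates through a `page` query parameter, as in the example above; buildPageUrls and the example.com address are hypothetical.

```javascript
// Hypothetical helper: returns one URL string per results page.
// Assumes the site paginates through a `page` query parameter.
function buildPageUrls(baseUrl, pages) {
  const total = Number(pages) // the parser may return the count as text
  const urls = []
  for (let i = 1; i <= total; i++) {
    const url = new URL(baseUrl)
    // Existing query parameters are kept; only `page` is added or replaced
    url.searchParams.set('page', String(i))
    urls.push(url.href)
  }
  return urls
}

const pageUrls = buildPageUrls('https://example.com/list?sort=new', '3')
// pageUrls[0] is 'https://example.com/list?sort=new&page=1'
```

Each entry could then be sent on with next_stage({url}). If the page count cannot be converted to a number, the loop does not run and the list is empty.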
Action: Scrape list of URLs and send them to the next stage
Worker: Browser Worker or Code Worker
Part: Interaction Code
Solution: Identify the required URLs on the page, parse them, and send each one to the next stage
Function: next_stage()
Code example:
const {items} = parse()
items.forEach(i => {
next_stage({url: i})
})
Action: Find the URLs of all relevant pages
Worker: Browser Worker or Code Worker
Part: Parser Code
Solution: Parse the results page to get all links
Function: parse()
Code example:
return {
pages: $('.pagination-paginationMeta').text().split(' ').pop(),
links: $('li.product-base a[target="_blank"]').toArray().map(e => new URL($(e).attr('href'), location.href))
}
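Links taken from a results page are often relative, repeated, or missing an href. The sketch below shows one way to normalize them before they are passed on; resolveLinks is a hypothetical helper and the example.com addresses are placeholders.

```javascript
// Hypothetical helper: resolves hrefs against the page URL and removes
// duplicates. Not an IDE function.
function resolveLinks(hrefs, pageUrl) {
  const seen = new Set()
  for (const href of hrefs) {
    if (!href) continue // skip anchors without an href attribute
    const url = new URL(href, pageUrl)
    url.hash = '' // a fragment does not change which document is requested
    seen.add(url.href)
  }
  return [...seen]
}

const links = resolveLinks(
  ['/item/1', 'item/2', 'https://example.com/item/1#reviews', undefined],
  'https://example.com/shop/'
)
// links is ['https://example.com/item/1', 'https://example.com/shop/item/2']
```

Returning strings instead of URL objects keeps the parser output easy to serialize.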
Action: Page has inner options that affect the data on the page
Worker: Browser Worker or Code Worker
Part: Interaction Code or Parser Code
Solution: For a small number of variations (~5), use a loop to go through them one by one. For a larger number, split the work into separate sessions rather than running everything in the same one
Function: click() | rerun_stage()
Code example:
if(input.is_rerun && input.size_index != null) { // != null so that index 0 is not skipped
wait('.size-item')
$('.size-item').eq(input.size_index).click()
wait('.price-amount')
}
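// Assumes the parser returns a `sizes` array describing the available options
const {sizes} = parse()
if(sizes.length <= 5 && sizes.length > 1) {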
let prices = []
for(let i = 0; i < sizes.length; i++) {
$('.size-item').eq(i).click()
wait('.price-amount')
prices.push(parse().price)
}
} else if(sizes.length > 5){
sizes.forEach(size => {
rerun_stage({
url: location.href,
is_rerun: true,
size_index: size.index
})
})
}
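The choice between looping in one session and splitting into reruns can be isolated as a small decision function. This is only a sketch of that rule of thumb; planVariations, its return shape, and the default limit of 5 are assumptions made for illustration.

```javascript
// Hypothetical helper: decides how to visit the option variations on a page.
// `limit` mirrors the ~5 variations guideline; it is a rule of thumb,
// not an IDE setting.
function planVariations(count, limit = 5) {
  // Nothing to iterate over when there is at most one variation
  if (count <= 1) return {mode: 'single', indexes: []}
  const indexes = Array.from({length: count}, (_, i) => i)
  // Few variations: click through them in the same session
  if (count <= limit) return {mode: 'loop', indexes}
  // Many variations: hand each index to its own rerun
  return {mode: 'rerun', indexes}
}

const plan = planVariations(8)
// plan.mode is 'rerun' and plan.indexes holds 0 through 7
```

In 'loop' mode each index would be clicked in turn; in 'rerun' mode each index would be passed to rerun_stage() as an input such as size_index.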
Action: Parse the data on the page
Worker: Browser Worker or Code Worker
Part: Parser Code
Solution: Define selectors and use parse function
Function: parse()
Code example:
const {items, pages} = parse()
Action: Collect the required data on a defined page
Worker: Browser Worker or Code Worker
Part: Interaction Code
Solution: Define the required data in the output schema and use the collect function in the code to collect the outputs
Function: collect()
Code example:
const {data} = parse()
collect({
product_name: data.name,
price: data.price
})
Action: Collect downloaded files
Worker: Browser Worker only
Part: Interaction Code or Parser Code
Solution: For an image or video, use the parse function, or tag the image, video, or other file
Function: parse() | tag_image() | tag_video() | tag_download()
Code example:
tag_video('video', '#product-video', {download: true})
tag_image('image', '#product-image')
let download = tag_download(/google\.com\/foo\/bar/)
click('button#download')
let file1 = download.next_file({timeout: 10e3})
let file2 = download.next_file({timeout: 20e3})
let {image, video} = parse()
collect({file1, file2, image, video})
Action: Navigate to the next page
Worker: Browser Worker or Code Worker
Part: Interaction Code or Parser Code
Solution: Parse the next-page link, then navigate to that URL in the same session or in a new one
Function: parse() + navigate()
Code example:
//Interaction:
const {next_url} = parse()
navigate(next_url)
//Parser:
return {
next_url: $('a.last-page').attr('href')
}
Action: Deal with infinite scroll where the data is fully loaded but not displayed by JS
Worker: Browser Worker or Code Worker
Part: Interaction Code or Parser Code
Solution: Find a place where we can collect data (network/API call/script/window field)
Function: tag_response() | tag_script() | tag_window_field()
Code example:
tag_response('reviews', /searchQueryState/)
tag_script('json', '#__NEXT_DATA__')
tag_window_field('window', 'preload_data')
Action: Deal with infinite scroll where data is loading in batches
Worker: Browser Worker or Code Worker
Part: Interaction Code or Parser Code
Solution: Identify the pagination logic in the browser's network requests and call the API directly to get the data
Function: request() | navigate()
Code example:
//Interaction:
navigate(input.url)
const {last_page} = parse()
//Parser:
return {
last_page: JSON.parse($('body').html()).cat1.last_page
}
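Once the batch logic is known, the list of API calls can be generated up front. The sketch below assumes an API that accepts `offset` and `limit` query parameters; those names, the helper buildBatchUrls, and the example.com endpoint are hypothetical, so substitute whatever the site's own requests show in the network tab.

```javascript
// Hypothetical helper: lists the API URLs needed to fetch `total` items
// in batches. The `offset` and `limit` parameter names are assumptions.
function buildBatchUrls(apiUrl, total, batchSize) {
  if (batchSize <= 0) return [] // guard against a loop that never advances
  const urls = []
  for (let offset = 0; offset < total; offset += batchSize) {
    const url = new URL(apiUrl)
    url.searchParams.set('offset', String(offset))
    url.searchParams.set('limit', String(batchSize))
    urls.push(url.href)
  }
  return urls
}

const batchUrls = buildBatchUrls('https://example.com/api/reviews', 45, 20)
// Three URLs, with offsets 0, 20, and 40
```

Each URL could then be fetched with request() or navigate() and the JSON body parsed as in the example above.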
Action: Deal with pagination
Worker: Browser Worker or Code Worker
Part: Interaction Code or Parser Code
Solution: Determine the last page, then loop through all pages by page number
Function: parse() + rerun_stage() + navigate()
Code example:
//Interaction:
navigate(input.url)
const {last_page} = parse()
collect({last_page})
//Parser:
return {
last_page: JSON.parse($('body').html()).cat1.last_page
}
Action: Deal with 'read more' button (or similar)
Worker: Browser Worker only
Part: Interaction Code or Parser Code
Solution: Click the button and then parse the data
Function: click() + parse()
Code example:
//Interaction:
click('.description-more-btn')
const {description} = parse()
//Parser:
return {
description: $('.description-full').text().trim()
}
Action: Scroll to the bottom of an inner element
Worker: Browser Worker only
Part: Interaction Code or Parser Code
Solution: Identify the last element inside the container you want to scroll, then scroll to it
Function: scroll_to()
Code example:
scroll_to('footer')
Action: Solve Captcha
Worker: Browser Worker only
Part: Interaction Code
Solution: Usually handled by the unlocker, but in some cases you may need to call solve_captcha explicitly
Function: solve_captcha()
Code example:
solve_captcha({
selector: '#captcha_container',
input: '#captcha_input',
submit_selector: '.captcha_sbmt',
solved_selector: '.price-container',
solve_timeout: 2000,
type: 'simple'
})
Action: Emulate a different device
Worker: Browser Worker only
Part: Interaction Code
Solution: Specify a device name to change user agent and screen parameters (check the Help window within the IDE for a full list of devices)
Function: emulate_device()
Code example:
emulate_device('iPhone X')