Use cases with code examples

This article lists common use cases that scraping code has to handle and explains how to solve each one.

 

Note: Based on the challenge you’re facing and the action you need to perform, you might need to change the worker type of your collector. Read more about the importance of the worker type.

 

Action: Set the collector to run from a specific geo location

Worker: Browser Worker or Code Worker

Part: Interaction Code

Solution: Specify a 2-character ISO country code or a proxy location

Function: country() 

Code example: 

country('US')



Action: Navigate to the initial page

Worker: Browser Worker or Code Worker

Part: Interaction Code

Solution: Hard code the URL or use an input variable for the URL or part of it

Function: navigate()

Code example: 

navigate('https://google.com')
navigate(input.url)



Action: Close a popup

Worker: Browser Worker only

Part: Interaction Code

Solution: Identify the popup's selectors in advance and use the function to close it (add the call at the top of your code)

Function: close_popup()

Code example: 

close_popup('#onetrust-banner-sdk', '#onetrust-accept-btn-handler')



Action: Use the website's search function (if URL can’t be used for that)

Worker: Browser Worker or Code Worker

Part: Interaction Code or Parser Code

Solution: Navigate to the page > wait for the search field > type the search term > click the search button

Function: navigate() + wait() + type() + click()

Code example: 

navigate('https://www.zillow.com/')
// Wait for the search field before typing into it
wait('#search-box-input')
type('#search-box-input', 'Los Angeles, CA')
click('#search-icon')



Action: Find the total number of results pages

Worker: Browser Worker or Code Worker

Part: Interaction Code or Parser Code

Solution: Identify the element and parse the content 

Function: parse() 

Code example: 

return {
pages: $('[class*="Pagination-module__PageNumber__text"]').toArray().map(v => $(v).text().trim()).pop()
}
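
The parser above returns the last pagination label as text. The sketch below shows one way such labels could be turned into a numeric page count. `lastPageNumber` is a hypothetical helper, not an IDE function, and it assumes each label holds at most one number.

```javascript
// Hypothetical helper: picks the highest page number out of pagination labels.
// Assumes each label contains at most one number (e.g. '12' or '1,234').
function lastPageNumber(labels) {
  const numbers = labels
    .map(label => parseInt(String(label).replace(/\D/g, ''), 10)) // keep digits only
    .filter(n => Number.isInteger(n) && n > 0) // drop labels such as 'Next' or '…'
  // Fall back to a single page when no numeric label is found
  return numbers.length ? Math.max(...numbers) : 1
}

console.log(lastPageNumber(['1', '2', '3', '…', '12', 'Next']))
```

Taking the maximum rather than the last element keeps the result stable when the pagination bar ends with a non-numeric control such as a "Next" button.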

 

 

Action: Create a list of pagination URLs and send them to the next stage

Worker: Both

Part: Interaction Code

Solution: Detect and parse the maximum page number, then loop over the page numbers, set the URL's 'page' query parameter for each one, and send each URL to the next stage

Function: next_stage()

Code example: 

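const {pages} = parse()

// Send one input per results page to the next stage
for (let i = 1; i <= pages; i++) {
  const url = new URL(location.href)
  url.searchParams.set('page', i)
  // Pass the URL as a string rather than a URL object
  next_stage({url: url.href})
}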
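
The URL-building part of this loop can be tried outside the IDE. The following is a minimal sketch, assuming the site paginates through a `page` query parameter. `buildPageUrls` and the example.com address are hypothetical and only illustrate the technique.

```javascript
// Hypothetical helper: builds one URL string per results page.
// Assumes the site paginates through a 'page' query parameter.
function buildPageUrls(baseUrl, pages) {
  const total = Number(pages) // the parser may return the page count as text
  if (!Number.isInteger(total) || total < 1) return []
  const urls = []
  for (let i = 1; i <= total; i++) {
    const url = new URL(baseUrl)
    url.searchParams.set('page', String(i))
    urls.push(url.href)
  }
  return urls
}

console.log(buildPageUrls('https://example.com/search?q=tv', '3'))
```

Using `searchParams.set` instead of string concatenation preserves any query parameters already present in the URL and overwrites an existing `page` value instead of duplicating it.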



Action: Scrape a list of URLs and send them to the next stage

Worker: Both

Part: Interaction Code

Solution: Detect the required URLs on the page, parse them, and send each one to the next stage

Function: next_stage()

Code example: 

const {items} = parse()
items.forEach(i => {
       next_stage({url: i})
})



Action: Find the URLs of all relevant pages

Worker: Browser Worker or Code Worker

Part: Parser Code

Solution: Parse the results page to get all links

Function: parse()

Code example: 

return {
pages: $('.pagination-paginationMeta').text().split(' ').pop(),
links: $('li.product-base a[target="_blank"]').toArray().map(e => new URL($(e).attr('href'), location.href))
}
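
The parser above resolves each `href` against the current page with `new URL(href, location.href)`. The sketch below shows the same idea as a standalone function that also skips missing values and removes duplicates. `absoluteLinks` and the example.com URLs are hypothetical.

```javascript
// Hypothetical helper: resolves hrefs against the page URL and removes duplicates.
function absoluteLinks(hrefs, pageUrl) {
  const seen = new Set()
  for (const href of hrefs) {
    if (!href) continue // skip anchors without an href attribute
    try {
      seen.add(new URL(href, pageUrl).href)
    } catch (e) {
      // Ignore values that cannot be parsed as a URL
    }
  }
  return [...seen]
}

console.log(absoluteLinks(['/p/1', 'https://example.com/p/1', 'p/2'], 'https://example.com/list/'))
```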



Action: Handle a page whose inner options affect the data on the page

Worker: Browser Worker or Code Worker

Part: Interaction Code or Parser Code

Solution: For a small number of variations (~5), use a loop to go through them one by one. For a larger number, break the work into separate sessions instead of running all variations in the same one

Function: click() | rerun_stage()

Code example: 

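// Assumption: the parser returns a 'sizes' array with one entry per size option
const {sizes} = parse()
if (input.is_rerun && input.size_index !== undefined) {
  // Rerun session: select the single size passed in the input (index 0 is valid)
  wait('.size-item')
  $('.size-item').eq(input.size_index).click()
  wait('.price-amount')
} else if (sizes.length > 1 && sizes.length <= 5) {
  // Few variations: click through each one in the same session
  let prices = []
  for (let i = 0; i < sizes.length; i++) {
    $('.size-item').eq(i).click()
    wait('.price-amount')
    prices.push(parse().price)
  }
} else if (sizes.length > 5) {
  // Many variations: start a separate session for each one
  sizes.forEach((size, index) => {
    rerun_stage({
      url: location.href,
      is_rerun: true,
      size_index: index
    })
  })
}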
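
The decision between looping in one session and splitting into several can be expressed as a small planning function. This is a sketch only: `planVariations` and its `maxInline` threshold are hypothetical names, and the returned objects simply mirror the inputs used in the example above.

```javascript
// Hypothetical helper: decides how to cover all variations of a page.
// maxInline is an assumed threshold, following the "~5 variations" guideline.
function planVariations(count, url, maxInline = 5) {
  const indexes = Array.from({length: count}, (_, i) => i)
  if (count <= maxInline) {
    // Few variations: handle them one by one in the current session
    return {mode: 'loop', indexes}
  }
  // Many variations: one rerun input (one session) per variation
  return {
    mode: 'rerun',
    inputs: indexes.map(i => ({url, is_rerun: true, size_index: i}))
  }
}

console.log(planVariations(3, 'https://example.com/item'))
console.log(planVariations(8, 'https://example.com/item').inputs.length)
```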



Action: Parse the data on the page

Worker: Browser Worker or Code Worker

Part: Parser Code

Solution: Define selectors and use parse function

Function: parse()

Code example: 

const {items, pages} = parse()



Action: Collect the required data on a defined page

Worker: Browser Worker or Code Worker

Part: Interaction Code

Solution: Define the required data in the output schema and use the collect function in the code to collect the outputs

Function: collect()

Code example: 

const {data} = parse()
collect({
product_name: data.name,
price: data.price
})



Action: Collect downloaded files

Worker: Browser Worker only

Part: Interaction Code or Parser Code

Solution: For an image or video, use the parse function, or tag the image, video, or other file

Function: parse() | tag_image() | tag_video() | tag_download()

Code example: 

tag_video('video', '#product-video', {download: true})
tag_image('image', '#product-image')
let download = tag_download(/google.com\/foo\/bar/)
click('button#download')
let file1 = download.next_file({timeout: 10e3})
let file2 = download.next_file({timeout: 20e3})
let {image, video} = parse()
collect({file1, file2, image, video})



Action: Navigate to the next page

Worker: Browser Worker or Code Worker

Part: Interaction Code or Parser Code

Solution: Parse the link to the next page, then navigate to that URL within the same session or in a new session

Function: parse() + navigate()

Code example: 

//Interaction:
const {next_url} = parse()
navigate(next_url)

//Parser:
return {
next_url: $('a.last-page').attr('href')
}



Action: Deal with infinite scroll where the data is fully loaded but not rendered by JS

Worker: Browser Worker or Code Worker

Part: Interaction Code or Parser Code

Solution: Find a place where we can collect data (network/API call/script/window field)

Function: tag_response() | tag_script() | tag_window_field()

Code example: 

tag_response('reviews', /searchQueryState/)
tag_script('json', '#__NEXT_DATA__')
tag_window_field('window', 'preload_data')



Action: Deal with infinite scroll where data is loading in batches

Worker: Browser Worker or Code Worker

Part: Interaction Code or Parser Code

Solution: Identify the pagination logic in the browser's network requests and call the API directly to get the data

Function: request() | navigate()

Code example: 

//Interaction:
navigate(input.url)
const {last_page} = parse()

//Parser:
return {
last_page: JSON.parse($('body').html()).cat1.last_page
}
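
Once the API call behind the scroll has been identified, the batch URLs can be generated up front. The following is a minimal sketch under stated assumptions: `batchUrls` is a hypothetical helper, and `offset` and `limit` are assumed parameter names that should be replaced with the ones observed in the network requests.

```javascript
// Hypothetical helper: builds the API URLs needed to fetch all batches.
// 'offset' and 'limit' are assumed parameter names; use the ones the site's API expects.
function batchUrls(apiUrl, totalItems, batchSize) {
  if (!(batchSize > 0)) return [] // guard against an endless loop
  const urls = []
  for (let offset = 0; offset < totalItems; offset += batchSize) {
    const url = new URL(apiUrl)
    url.searchParams.set('offset', String(offset))
    url.searchParams.set('limit', String(batchSize))
    urls.push(url.href)
  }
  return urls
}

console.log(batchUrls('https://example.com/api/reviews', 45, 20))
```

Each resulting URL could then be loaded with one of the functions listed above and its JSON body parsed as in the parser example.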



Action: Deal with pagination

Worker: Browser Worker or Code Worker

Part: Interaction Code or Parser Code

Solution: Determine the last page, then loop through all pages by page number

Function: parse() + rerun_stage() + navigate()

Code example: 

//Interaction:
navigate(input.url)
const {last_page} = parse()
collect({last_page})

//Parser:
return {
last_page: JSON.parse($('body').html()).cat1.last_page
}



Action: Deal with 'read more' button (or similar)

Worker: Browser Worker only

Part: Interaction Code or Parser Code

Solution: Click the button and then parse the data

Function: click() + parse()

Code example: 

//Interaction:
click('.description-more-btn')
const {description} = parse()

//Parser:
return {
description: $('.description-full').text().trim()
}



Action: Scroll to the bottom of an inner element

Worker: Browser Worker only

Part: Interaction Code or Parser Code

Solution: Identify the last element inside the container you want to scroll and scroll to it

Function: scroll_to()

Code example: 

scroll_to('footer')



Action: Solve Captcha

Worker: Browser Worker only

Part: Interaction Code

Solution: Usually handled by the unlocker, but in some cases you may need to call solve_captcha() explicitly

Function: solve_captcha()

Code example: 

solve_captcha({
selector: '#captcha_container',
input: '#captcha_input',
submit_selector: '.captcha_sbmt',
solved_selector: '.price-container',
solve_timeout: 2000,
type: 'simple'
})




Action: Emulate a different device

Worker: Browser Worker only

Part: Interaction Code

Solution: Specify a device name to change user agent and screen parameters (check the Help window within the IDE for a full list of devices)

Function: emulate_device()

Code example: 

emulate_device('iPhone X')


