Q: What is Scraping Browser and how can I best use it for collecting data?
A: Scraping Browser is one of our proxy-unlocking solutions and is designed to help you focus on your data collection from browsers while we take care of the full proxy and unblocking infrastructure for you.
It is a multi-stage unlocking product, where you can navigate target websites via libraries such as puppeteer, playwright, and selenium and interact with the site's HTML code to extract the data you need.
Check out our Getting Started guide to see how simple it is to create a Scraping Browser and integrate it into your code, then explore some common browser functions and examples to support your specific data collection needs.
Q: Which coding languages does Scraping Browser support?
A: Bright Data's Scraping Browser supports a wide range of programming languages and libraries. We currently have full native support for Node.js and Python using puppeteer, playwright, and selenium, and other languages can be integrated as well using the libraries listed below, giving you the flexibility to fit Scraping Browser right into your current tech stack.
| Language/Platform | puppeteer | playwright | selenium |
| --- | --- | --- | --- |
| Python | pyppeteer | playwright-python | Selenium WebDriver |
| JS / Node | Native support | Native support | WebDriverJS |
| Ruby | Puppeteer-Ruby | playwright-ruby-client | Selenium WebDriver for Ruby |
| C# | Puppeteer Sharp (.NET) | Playwright for .NET | Selenium WebDriver for .NET |
| Java | Puppeteer Java | Playwright for Java | Native support |
| Go | chromedp | playwright-go | Selenium WebDriver for Go |
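For example, here is a minimal playwright-python connection sketch (the WebSocket endpoint format matches the one used in the examples later in this guide; USER:PASS and the target URL are placeholders for your own zone credentials and site):
# python playwright - Minimal connection sketch (replace USER:PASS with your zone credentials)
from playwright.sync_api import sync_playwright

AUTH = 'USER:PASS'
SBR_WS_ENDPOINT = f'wss://{AUTH}@brd.superproxy.io:9222'

with sync_playwright() as pw:
    # Connect to the remote Scraping Browser over CDP
    browser = pw.chromium.connect_over_cdp(SBR_WS_ENDPOINT)
    page = browser.new_page()
    page.goto('https://example.com', timeout=2 * 60 * 1000)
    print(page.title())
    browser.close()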
Q: How can I debug what's happening behind the scenes during my Scraping Browser session?
A: We understand the importance of having visibility into the inner workings of your Scraping Browser sessions. To help with this, we've created the Scraping Browser Debugger, a built-in debugger that integrates seamlessly with Chrome DevTools and gives you visibility into your live browser sessions.
The debugger lets you inspect, analyze, and optimize your code, giving you better control, visibility, and efficiency within your Scraping Browser sessions. To learn how to access the debugger and get started with it, please refer to our comprehensive debugger guide.
Q: How can I see a visual of what's happening in the browser?
A1: Triggering a screenshot
You can easily trigger a screenshot of the browser at any time by adding the following to your code:
// node.js puppeteer - Taking screenshot to file screenshot.png
await page.screenshot({ path: 'screenshot.png', fullPage: true });
To take screenshots in Python and C#, see here.
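For reference, a minimal Python sketch using Playwright's sync API (assuming you already have a connected page object) would look like this:
# python playwright - Taking screenshot to file screenshot.png
page.screenshot(path='screenshot.png', full_page=True)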
A2: Automatically opening devtools to view your live browser session
See our full section on opening devtools automatically.
Q: Why does the initial navigation for certain pages take longer than others?
A: There is a lot of “behind the scenes” work that goes into unlocking your targeted site. Some sites take just a few seconds to navigate, while others may take up to a minute or two, as they require more complex unlocking procedures. As such, we recommend setting your navigation timeout to 2 minutes to give the navigation enough time to succeed if needed.
You can set your navigation timeout to 2 minutes by adding the following line to your script before your page.goto call.
// node.js puppeteer - Navigate to site with 2 min timeout
await page.goto('https://example.com', { timeout: 2*60*1000 });
# python playwright - Navigate to site with 2 min timeout
page.goto('https://example.com', timeout=2*60*1000)
// C# PuppeteerSharp - Navigate to site with 2 min timeout
await page.GoToAsync("https://example.com", new NavigationOptions()
{
Timeout = 2*60*1000,
});
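If you are using Selenium with the Python bindings, a comparable sketch (assuming an existing driver instance; note that this API takes seconds rather than milliseconds) is:
# python selenium - Navigate to site with 2 min page load timeout
driver.set_page_load_timeout(2 * 60)
driver.get('https://example.com')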
Q: Why does it seem that sometimes the bandwidth billed for my scraping session is more than if I scraped the page myself? For instance, I tried to scrape the page myself and it came out to a total of 200kb - why was I charged for 500kb on the same page?
A: Our scraping browsers download many resources (JS, CSS, images, etc.) during page load and run a number of processes behind the scenes in order to unlock the pages you navigate to. The traffic calculated by Scraping Browser is the total traffic required to unlock your page at the given time.
Q: I came across an error code while using Scraping Browser. Can you list the error codes and the meanings behind them?
| Error Code | Meaning | What can you do about it? |
| --- | --- | --- |
| Unexpected server response: 407 | An issue with the remote browser's port | Check your remote browser's port. The correct port for Scraping Browser is port 9222. |
| Unexpected server response: 403 | Authentication error | Check your authentication credentials (username, password) and make sure you are using the correct "Browser API" zone from the Bright Data control panel. |
| Unexpected server response: 503 | Service unavailable | We are likely scaling browsers right now to meet demand. Try to reconnect in 1 minute. |
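For the 503 case, a minimal reconnect sketch in Python with playwright (connect_with_retry is just an illustrative helper name, and the retry count and delay are example values) could look like this:
# python playwright - Reconnect with a short wait if the endpoint returns 503
import time
from playwright.sync_api import sync_playwright, Error

AUTH = 'USER:PASS'
SBR_WS_ENDPOINT = f'wss://{AUTH}@brd.superproxy.io:9222'

def connect_with_retry(pw, retries=3, delay_seconds=60):
    for attempt in range(1, retries + 1):
        try:
            # Attempt to connect to the remote Scraping Browser
            return pw.chromium.connect_over_cdp(SBR_WS_ENDPOINT)
        except Error:
            if attempt == retries:
                raise
            print(f'Connection failed (attempt {attempt}), retrying in {delay_seconds}s...')
            time.sleep(delay_seconds)

with sync_playwright() as pw:
    browser = connect_with_retry(pw)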
Q: I can’t seem to connect. Do I have a connection issue?
A: You can check your connection with the following curl:
curl -v -u USER:PASS https://brd.superproxy.io:9222/json/protocol
For any other issues please see our Troubleshooting guide or contact support.
Q: What are some tips for reducing bandwidth while scraping?
A: When optimizing your web scraping projects, conserving bandwidth is key. The tips and guidelines below cover bandwidth-saving techniques you can apply within your script to keep your scraping efficient and resource-friendly.
1. Avoid unnecessary media content during scraping
A common inefficiency in browser-based scraping is the unnecessary downloading of media content, such as images and videos, from your targeted domains. Below you can see how to exclude such content right from within your script.
Given that anti-bot systems expect specific resources to load for particular domains, approach resource blocking cautiously, as it can directly impact Scraping Browser's ability to successfully load your target domains. If you encounter issues after applying resource blocks, please verify that the issues still occur after your blocking logic has been reverted before contacting our support team.
Puppeteer:
- Block All Images:
const page = await browser.newPage();
// Enable request interception
await page.setRequestInterception(true);
// Listen for requests
page.on('request', (request) => {
    if (request.resourceType() === 'image') {
        // If the request is for an image, block it
        request.abort();
    } else {
        // If it's not an image request, allow it to continue
        request.continue();
    }
});
- Block Specific Image Formats:
const page = await browser.newPage();
// Enable request interception
await page.setRequestInterception(true);
// Listen for requests
page.on('request', (interceptedRequest) => {
    // Check if the request URL ends with '.png' or '.jpg'
    if (
        interceptedRequest.url().endsWith('.png') ||
        interceptedRequest.url().endsWith('.jpg')
    ) {
        // If the request is for a PNG or JPG image, block it
        interceptedRequest.abort();
    } else {
        // If it's not a PNG or JPG image request, allow it to continue
        interceptedRequest.continue();
    }
});
Playwright:
- Block specific resource types such as images and fonts.
// Create a new context and block image and font requests via request routing
const context = await browser.newContext();
await context.route('**/*', (route) => {
    // Abort requests for images and fonts; let everything else continue
    if (['image', 'font'].includes(route.request().resourceType())) {
        route.abort();
    } else {
        route.continue();
    }
});
const page = await context.newPage();
// Navigate to a webpage
await page.goto('https://example.com');
Selenium:
- Use WebDriver functionality to disable images and other media content.
# python selenium - Disable image loading via Chrome preferences
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Set the preference to not load images
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)
# Create a new Chrome browser instance with the defined options
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://example.com')
2. Effectively using cached pages
One common inefficiency in scraping jobs is the repeated downloading of the same page during a single session.
Leveraging cached pages - a version of a previously scraped page - can significantly increase your scraping efficiency, as it can be used to avoid repeated network requests to the same domain. Not only does it save on bandwidth by avoiding redundant fetches, but it also ensures faster and more responsive interactions with the preloaded content.
Please note: A single Scraping Browser session can persist for up to 20 minutes. This duration allows you ample opportunity to revisit and re-navigate the page as needed within the same session, eliminating the need for redundant sessions on identical pages during your scraping job.
Let’s see an example
In a multi-step web scraping workflow, you often gather links from a page and then dive into each link for more detailed data extraction. You’ll often need to revisit the initial page for cross-referencing or validation. By leveraging caching, these revisits don't trigger new network requests as the data is simply loaded from the cache.
See an example of this below using puppeteer:
const puppeteer = require('puppeteer-core');
const AUTH = 'USER:PASS';
const SBR_WS_ENDPOINT = `wss://${AUTH}@brd.superproxy.io:9222`;

async function main() {
    console.log('Connecting to Scraping Browser...');
    const browser = await puppeteer.connect({
        browserWSEndpoint: SBR_WS_ENDPOINT,
    });
    try {
        console.log('Connected! Navigating...');
        const page = await browser.newPage();
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });
        // Extract product links from the listing page
        const productLinks = await page.$$eval('.product-link', links => links.map(link => link.href));
        const productDetails = [];
        // Navigate to each individual product page
        for (let link of productLinks) {
            await page.goto(link);
            // Extract the product's name
            const productName = await page.$eval('.product-name', el => el.textContent);
            // Apply a coupon (assuming it doesn't navigate away)
            await page.click('.apply-coupon-button');
            // Extract the discounted product's price from the cached product detail page
            const productPrice = await page.$eval('.product-price', el => el.textContent);
            // Store product details
            productDetails.push({ productName, productPrice });
        }
    } finally {
        await browser.close();
    }
}

main().catch(err => {
    console.error(err.stack || err);
    process.exit(1);
});
3. Other general strategies to minimize bandwidth and ensure efficient scraping
- Limit Your Requests: Only scrape what you need, rather than downloading entire webpages or sites.
- Concurrency Control: Limit the number of concurrent pages or browsers you open; too many parallel processes can exhaust resources (see the sketch after this list).
- Session Management: Ensure you properly manage and close sessions after scraping. This prevents resource and memory leaks.
- Opt for APIs: If the target website offers an API, use it instead of direct scraping. APIs are typically more efficient and less bandwidth-intensive than scraping full web pages.
- Fetch Incremental Data: If scraping periodically, try to fetch only new or updated data rather than re-fetching everything.
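As an illustration of the concurrency-control point above, here is a minimal sketch in Python with playwright's async API (the limit of 3 concurrent pages is an arbitrary example value, and scrape_one/main are illustrative names) that caps how many pages are open at once:
# python playwright - Limit the number of pages open at the same time
import asyncio
from playwright.async_api import async_playwright

AUTH = 'USER:PASS'
SBR_WS_ENDPOINT = f'wss://{AUTH}@brd.superproxy.io:9222'
MAX_CONCURRENT_PAGES = 3  # example limit; tune to your workload

async def scrape_one(browser, semaphore, url):
    # The semaphore caps how many pages are open in parallel
    async with semaphore:
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=2 * 60 * 1000)
            return await page.title()
        finally:
            await page.close()  # close pages promptly to free resources

async def main(urls):
    async with async_playwright() as pw:
        browser = await pw.chromium.connect_over_cdp(SBR_WS_ENDPOINT)
        try:
            semaphore = asyncio.Semaphore(MAX_CONCURRENT_PAGES)
            return await asyncio.gather(*(scrape_one(browser, semaphore, u) for u in urls))
        finally:
            await browser.close()  # always close the session when done

print(asyncio.run(main(['https://example.com'])))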