This article explains how to refer to elements of the website you’re scraping within your scraping code, and how to find them within the website’s HTML and CSS.
Most commands in the interaction code require a selector to target the correct element. To target an element (to click it or pull text out of it), you specify it with a CSS selector. A CSS selector can match one or more elements on the page.
wait('selector'); // wait for this element to appear
click('selector'); // click on this element after it appears
If you're having trouble finding the correct selector for the element you want, Google Chrome DevTools has a menu option for it:
- Right-click the desired element and choose the Inspect option. This opens the element inspector and highlights the element.
- Right-click the highlighted element, hover over the Copy option and choose the Copy selector option.
Note that a copied selector is a good starting point, but it is usually too specific. Try to make it less specific by removing the parts that are not needed to identify the element, so that it keeps matching the element you want even when the rest of the page changes.
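As a rough illustration of loosening a copied selector, the sketch below uses a hypothetical helper, loosenSelector, which is not part of the interaction code. It assumes the copied selector is a chain of steps joined by ">" and that dropping positional :nth-child(...) parts and the outermost steps is acceptable for your page. That is only one possible heuristic, so always re-check the result against the page.

```javascript
// Hypothetical helper (an illustration, not a platform function).
// Assumes the copied selector looks like "a > b > c"; it does not handle
// attribute values that contain a ">" character.
function loosenSelector(copied, keepLast = 2) {
  const steps = copied
    .split('>')
    // Drop positional parts such as ":nth-child(3)" from every step.
    .map((step) => step.trim().replace(/:nth-child\(\d+\)/g, ''))
    .filter((step) => step !== '');
  // Keep only the last few steps, which are closest to the target element.
  return steps.slice(-Math.max(1, keepLast)).join(' > ');
}

const copied = '#main > div:nth-child(3) > ul > li.result:nth-child(2) > a';
console.log(loosenSelector(copied, 2)); // li.result > a
```

A looser selector can match more elements than the original, so confirm that it still identifies the element you want.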
Building element selectors
Selectors are built out of 4 basic components:
- p : the element type selector. This example will match any <p></p> element on the page
- [href] : square brackets are an attribute selector. This example will match any element with the href attribute set (e.g. <a href="/home"></a>)
  You can also specify a value in an attribute selector: [href="/about"] will match the <a></a> that links to the about page
- .price : if it starts with a ".", it's a class selector. It will match any element whose class attribute includes "price" (e.g. <span class="price">...</span>)
- #search : if it starts with a "#", it's an ID selector. It will match any element with the id attribute set to "search" (e.g. <button id="search">Search</button>)
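The four components above can be told apart by their first character. The following is a minimal sketch with a hypothetical helper, selectorKind, that only handles a single simple selector, not chained ones. It is meant as a reading aid, not as a real CSS parser.

```javascript
// Hypothetical helper: names the kind of ONE simple selector.
function selectorKind(selector) {
  if (selector.startsWith('#')) return 'id';
  if (selector.startsWith('.')) return 'class';
  if (selector.startsWith('[') && selector.endsWith(']')) return 'attribute';
  if (/^[a-zA-Z][a-zA-Z0-9-]*$/.test(selector)) return 'type';
  return 'unknown'; // chained or compound selectors are out of scope here
}

console.log(selectorKind('p'));               // type
console.log(selectorKind('[href="/about"]')); // attribute
console.log(selectorKind('.price'));          // class
console.log(selectorKind('#search'));         // id
```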
You can chain these components together to specify the element you want. In this example:
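<form class="search-form">
  <input class="query" placeholder="Enter a query"/>
  <button id="submit">Search</button>
</form>
<ul>
  <li class="result"><a href="/results/1">First result</a></li>
  <li class="result"><a href="/results/2">Second result</a></li>
</ul>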
You can use the following selectors:
type('.search-form .query', 'nike'); // enter some text in the search input
click('.search-form #submit'); // run the search
wait('.result'); // wait for the results to appear
click('.result:nth-child(2) a'); // click the 2nd result link
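To make the chaining concrete, here is a small sketch of how a chained selector is resolved. It is a toy matcher over plain objects, not a real DOM and not the engine used by the interaction code. The page structure (a .search-form wrapping .query and #submit, and .result list items wrapping links) is an assumption made for the example, and only type, class, ID, :nth-child(n) and the space (descendant) combinator are supported.

```javascript
// Build a tiny element tree out of plain objects (an assumed page layout).
function el(tag, { id = null, classes = [] } = {}, children = []) {
  const node = { tag, id, classes, children, parent: null };
  children.forEach((child) => { child.parent = node; });
  return node;
}

// Does one element satisfy one step such as "li.result:nth-child(2)"?
function matchesCompound(node, compound) {
  const tokens = compound.match(/:nth-child\(\d+\)|[.#]?[\w-]+/g) || [];
  return tokens.every((token) => {
    if (token.startsWith('.')) return node.classes.includes(token.slice(1));
    if (token.startsWith('#')) return node.id === token.slice(1);
    if (token.startsWith(':nth-child(')) {
      const n = Number(token.slice(':nth-child('.length, -1));
      return node.parent !== null && node.parent.children.indexOf(node) === n - 1;
    }
    return node.tag === token; // a bare name is a type selector
  });
}

// Steps separated by spaces must match the element and then its ancestors,
// checked from right to left.
function matches(node, selector) {
  const parts = selector.trim().split(/\s+/);
  if (!matchesCompound(node, parts[parts.length - 1])) return false;
  let ancestor = node.parent;
  for (let i = parts.length - 2; i >= 0; i--) {
    while (ancestor && !matchesCompound(ancestor, parts[i])) ancestor = ancestor.parent;
    if (!ancestor) return false;
    ancestor = ancestor.parent;
  }
  return true;
}

// Collect every element in the tree that matches the selector.
function queryAll(root, selector) {
  const found = [];
  (function walk(node) {
    if (matches(node, selector)) found.push(node);
    node.children.forEach(walk);
  })(root);
  return found;
}

const page = el('body', {}, [
  el('form', { classes: ['search-form'] }, [
    el('input', { classes: ['query'] }),
    el('button', { id: 'submit' }),
  ]),
  el('ul', {}, [
    el('li', { classes: ['result'] }, [el('a')]),
    el('li', { classes: ['result'] }, [el('a')]),
  ]),
]);

console.log(queryAll(page, '.search-form .query').length);    // 1
console.log(queryAll(page, '.search-form #submit').length);   // 1
console.log(queryAll(page, '.result').length);                // 2
console.log(queryAll(page, '.result:nth-child(2) a').length); // 1
```

Note that a selector such as .result matches both list items, which is why the last selector adds :nth-child(2) to single out the second one.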
You can easily verify that your selector works on the target website:
- Open Chrome DevTools, either by inspecting any element as described above or with the shortcut Ctrl+Shift+I.
- Go to the Elements tab if it is not already selected.
- Press Ctrl+F to open the search input below the elements and type your selector.
Chrome will show you all the elements matched by it.
Note: Chrome element search also matches by text. This can be confusing for simple selectors. For instance, searching for p matches every <p></p> element, but it also matches the letter "p" wherever it appears in the page's content.
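One way to avoid the text-matching confusion is to count matches from the DevTools Console with the standard DOM method querySelectorAll, which interprets its argument only as a CSS selector. The helper name countMatches below is hypothetical; the sketch assumes it is pasted into the Console of the target page, where document is available.

```javascript
// Hypothetical helper: counts the elements matching a CSS selector.
// `root` is assumed to be a DOM Document or Element. In the DevTools
// Console the default (the page's `document`) is used.
function countMatches(selector, root = globalThis.document) {
  if (!root || typeof root.querySelectorAll !== 'function') {
    throw new Error('countMatches needs a DOM document or element');
  }
  // querySelectorAll never matches plain text, only elements.
  return root.querySelectorAll(selector).length;
}
```

In the Console, countMatches('p') then returns the number of <p></p> elements only. An invalid selector makes querySelectorAll throw, which is also a quick way to spot a typo in the selector.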
Sometimes selecting the element you want can be complex. There are a few tricks you can use to make it simpler:
- [class^="product"]: select an element by class prefix. This will match <div class="product"></div> and <div class="product-special"></div>. This notation works for any attribute (e.g. [href^="/jobs"] will match <a href="/jobs/123"></a> and <a href="/jobs/234"></a>).
- .product>a: select only an "a" that is a direct child of the .product element. Will match the <a> in <div class="product"><a href="..."></a></div> but not <div class="product"><div class="seller-info"> <a href="..."></a></div></div>
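- .pagination a:nth-child(2): select the link that is the 2nd child of its parent, i.e. the 2nd link in a pagination bar that contains only links
- .pagination a:nth-last-child(2): select the link that is the 2nd-to-last child of its parent, i.e. the 2nd-to-last link in such a pagination bar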
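The matching rules behind these tricks can be sketched as plain functions. The helper names below are hypothetical and the siblings are modelled as a simple array rather than real DOM nodes. The sketch only illustrates what each notation checks.

```javascript
// [attr^="prefix"]: the attribute VALUE must start with the prefix.
// An empty prefix never matches.
function attrStartsWith(value, prefix) {
  return typeof value === 'string' && prefix !== '' && value.startsWith(prefix);
}

// :nth-child(n): the element is the n-th of ALL its siblings (1-based).
function isNthChild(siblings, node, n) {
  const index = siblings.indexOf(node);
  return index !== -1 && index === n - 1;
}

// :nth-last-child(n): the same, but counted from the end.
function isNthLastChild(siblings, node, n) {
  const index = siblings.indexOf(node);
  return index !== -1 && index === siblings.length - n;
}

console.log(attrStartsWith('product-special', 'product')); // true
console.log(attrStartsWith('big product', 'product'));     // false

// A pagination bar modelled as the labels of its links.
const links = ['prev', '1', '2', '3', 'next'];
console.log(isNthChild(links, '1', 2));     // true
console.log(isNthLastChild(links, '3', 2)); // true
```

The second attrStartsWith call shows a common pitfall: the prefix is compared with the whole attribute value, so an element whose class attribute is "big product" is not matched by [class^="product"].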
Check out the MDN documentation about CSS selectors for more information.