A Quick Guide to CSS and jQuery Selectors for Web Scraper
June 28, 2022
JQuery selectors, web scraper, Scraping public data, scraper, CSS selectors
In CSS and jQuery, selectors are patterns that identify specific elements in the DOM. They are particularly useful when you want to style or act on certain elements of a document without affecting the others. Through careful use of selectors, you can manipulate a page exactly as intended or isolate the specific data you’re looking to scrape.
Requirements
To get the most from this article, you’ll need at least basic knowledge of HTML and the DOM, plus some experience with CSS and jQuery. We’ll show how to use selectors within the Web Scraper Chrome extension, so installing it will help you follow along with the guide.
What to Expect
In this article, we’ll cover 19 of the most relevant CSS selectors as well as the nth-child and nth-of-type selectors. Afterwards, we’ll cover a few jQuery selectors, focusing mainly on how you can chain them like if-then statements. For all these selector examples, we’ll be isolating elements on the webscraper.io home page.
CSS Selectors
The CSS selectors we’re going to cover primarily isolate HTML elements through classes, ids, element types, and attributes. By understanding the notation and logic used in the most common selectors, you can concisely write and combine selector logic to be even more precise about which elements you select from a page.
.class
The class selector selects all elements that have the specified class. For example, the .under-hero__content selector selects every element with this class found on the page:
.class1.class2
The class selector can be extended with a second class name, written without a space, to select only elements that carry both classes. In this example, we select specific blocks of text by referencing two classes, home-cta__title and home-cta__title--testi:
.class1 .class2
By leaving a space between class names within the selector, you can select elements with the second class that are descendants of an element with the first class. In this example, we select the .home-features__text elements nested inside a .cell element, each of which contains a header and subtext:
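The difference between the three class selectors above can be sketched in a few lines of Python. This is only an illustration, not how a browser or Web Scraper evaluates selectors: the markup is hypothetical (only the class names come from this article, with the modifier class assumed to be written with a double hyphen), and the helper names `classes`, `select_classes` and `select_descendant` are invented for the example. It uses the standard-library XML parser, so it assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class names are taken from the article.
HTML = """
<div>
  <div class="cell">
    <div class="home-features__text"><h3>Header</h3><p>Subtext</p></div>
  </div>
  <p class="home-cta__title home-cta__title--testi">Testimonials</p>
  <p class="home-cta__title">Plain title</p>
  <div class="home-features__text">Outside any cell</div>
</div>
"""


def classes(el):
    """Return the set of whitespace-separated class names on an element."""
    return set(el.get("class", "").split())


def select_classes(root, *names):
    """Emulate .a or .a.b: elements that carry every listed class."""
    return [el for el in root.iter() if set(names) <= classes(el)]


def select_descendant(root, outer, inner):
    """Emulate .outer .inner: .inner elements nested at any depth in an .outer."""
    found = []
    for ancestor in select_classes(root, outer):
        for el in ancestor.iter():
            if el is not ancestor and inner in classes(el) and el not in found:
                found.append(el)
    return found


root = ET.fromstring(HTML)

# .home-cta__title
print(len(select_classes(root, "home-cta__title")))  # 2
# .home-cta__title.home-cta__title--testi
print(len(select_classes(root, "home-cta__title", "home-cta__title--testi")))  # 1
# .home-features__text (anywhere) versus .cell .home-features__text
print(len(select_classes(root, "home-features__text")))  # 2
print(len(select_descendant(root, "cell", "home-features__text")))  # 1
```

Adding a second class without a space narrows the match to elements carrying both classes, while adding it after a space narrows the match by ancestry instead.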
#id
Referencing the id with a hash sign (#) lets you select the element with this id within a page. Since ids are meant to be unique in a document, this normally yields a single element. Here #menu-main-menu applies specifically to the main navigation bar:
*
Perhaps you want to select all elements on a page. This selector serves as a catch-all. We see all page elements captured here:
Element-Type
You can select every element of a given type, such as p or div, by referencing the element type. Here we see only the span elements are selected:
Element-Type.class
By combining the element type and a class selector you can more specifically select elements on a page. Unlike the class selector alone, this only matches elements of that type with the corresponding class. Since a class can apply to several elements of the same type, this can still yield more than one result. Here we reference only div elements with the home-cta__text class:
Element-Type#id
By combining the element type and an id selector you can pick out a specific element on the page with precision. Since every id should be unique in the document, this yields at most one result: the same element as the id selector alone when the element type matches, and nothing when it does not. Here we reference an li element only with the id corresponding to the first menu item:
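To make the effect of adding an element type concrete, here is a minimal Python sketch. It is an approximation written for this guide rather than a real selector engine: the markup and the id `menu-item-1` are hypothetical, and `select` is an invented helper that emulates a selector of the form tag.class#id where any part may be left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the id "menu-item-1" is made up for illustration.
HTML = """
<nav>
  <ul id="menu-main-menu">
    <li id="menu-item-1" class="menu-item">Home</li>
    <li class="menu-item">Pricing</li>
  </ul>
  <div class="home-cta__text">CTA in a div</div>
  <p class="home-cta__text">CTA in a p</p>
</nav>
"""


def select(root, tag=None, cls=None, id_=None):
    """Emulate tag.cls#id; every part that is None is simply not checked."""
    matches = []
    for el in root.iter():
        if tag is not None and el.tag != tag:
            continue
        if cls is not None and cls not in el.get("class", "").split():
            continue
        if id_ is not None and el.get("id") != id_:
            continue
        matches.append(el)
    return matches


root = ET.fromstring(HTML)

print(len(select(root, cls="home-cta__text")))             # .home-cta__text     -> 2
print(len(select(root, tag="div", cls="home-cta__text")))  # div.home-cta__text  -> 1
print(len(select(root, id_="menu-item-1")))                # #menu-item-1        -> 1
print(len(select(root, tag="li", id_="menu-item-1")))      # li#menu-item-1      -> 1
print(len(select(root, tag="div", id_="menu-item-1")))     # div#menu-item-1     -> 0
```

The last line shows the one case where the type and id combination differs from the id alone: when the element with that id is of a different type, nothing is selected.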
Element-Type-1, Element-Type-2
You can reference multiple comma-separated element types to increase the range of your selection. Here we can capture both p and h3 types:
Element-Type-1 Element-Type-2
You can limit your selection to only element types which are inside of a specific element type. A direct parent-child relationship doesn’t matter here as long as Element-Type-2 is within Element-Type-1. Here we can isolate span types only found within h2 types:
Element-Type-1>Element-Type-2
You can limit your selection to only elements which are direct children of another element. This differs from the previous selector because Element-Type-1 must be the immediate parent of Element-Type-2. Here we select a span type immediately within a p type:
Element-Type-1+Element-Type-2
You can limit your selection to only an Element-Type-2 placed directly after an Element-Type-1 under the same parent. In this case, only the ordering of the sibling elements is relevant. We can see which p elements immediately follow an h2 element here:
Element-Type-1~Element-Type-2
More loosely, you can select every Element-Type-2 that follows an Element-Type-1 under the same parent, whether or not it comes immediately after it. Here we select ul elements that have a p element somewhere before them among their siblings.
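The four relationships described above (descendant, child, adjacent sibling and general sibling) are easy to confuse, so here is a small Python sketch that spells out each rule. It is an illustration under assumptions: the markup is hypothetical, the helper names are invented for this example, and a real browser evaluates selectors differently under the hood.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup used only to contrast the four combinators.
HTML = """
<section>
  <h2>Title <span>inside h2</span></h2>
  <p>First <span>direct child</span> <b><span>nested deeper</span></b></p>
  <ul><li>one</li></ul>
  <p>Second</p>
</section>
"""

root = ET.fromstring(HTML)
# ElementTree has no parent pointers, so build a child -> parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}


def earlier_siblings(el):
    """Return the siblings that come before el, in document order."""
    siblings = list(parent_of[el])
    return siblings[: siblings.index(el)]


def descendant(a, b):
    """a b: every b nested at any depth inside an a."""
    return [el for anc in root.iter(a) for el in anc.iter(b) if el is not anc]


def child(a, b):
    """a>b: every b whose immediate parent is an a."""
    return [el for el in root.iter(b) if el in parent_of and parent_of[el].tag == a]


def adjacent(a, b):
    """a+b: every b whose immediately preceding sibling is an a."""
    out = []
    for el in root.iter(b):
        if el in parent_of:
            before = earlier_siblings(el)
            if before and before[-1].tag == a:
                out.append(el)
    return out


def general(a, b):
    """a~b: every b that has an a somewhere among its earlier siblings."""
    return [
        el
        for el in root.iter(b)
        if el in parent_of and any(s.tag == a for s in earlier_siblings(el))
    ]


print([el.text for el in descendant("p", "span")])  # ['direct child', 'nested deeper']
print([el.text for el in child("p", "span")])       # ['direct child']
print(len(adjacent("h2", "p")))                     # 1 (only the first p)
print(len(general("h2", "p")))                      # 2 (both p elements)
print(len(general("p", "ul")))                      # 1
```

Note that both sibling combinators only ever look backwards from Element-Type-2: a ul matches p~ul because a p comes before it, never because one comes after.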
[attribute]
Select all elements that have the attribute. Here we select elements where the target attribute exists, regardless of its content:
[attribute=value]
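Select all elements where the attribute exactly matches a specific value. Here the class attribute must equal “home-features__text” and nothing else. This behaves like the class selector .home-features__text only for elements that carry no other classes, because an element with additional classes has a longer class attribute and is not matched: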
[attribute~=value]
Select all elements where the attribute contains a whitespace-separated word equal to the value. In this case we’re selecting elements whose class attribute contains the word “cell”:
[attribute|=value]
Selects all elements where the attribute either equals the value exactly or starts with the value immediately followed by a hyphen. Here we look for class attributes that equal “home” or begin with “home-”:
element[attribute^="value"]
Selects every element of the given type whose attribute begins with the value. Here we’re looking only for class attributes beginning with “button”, so we can identify all the buttons on a page:
element[attribute$="value"]
Select every element with the attribute ending with the input value. In this case, we look for href attributes ending in “pricing-section” to select elements related to pricing links.
element[attribute*="value"]
Select every element with the attribute containing the input value. In this case, we look for href attributes containing “test”. This works best with generated values: if one item has [href='test-123'] and another has [href='test-345'], the selector [href*='test'] returns both elements:
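Since the attribute operators differ only in how they compare two strings, their rules can be summarised as plain Python string checks. The function `attr_matches` below is a hypothetical helper written for this guide, not part of any library; it is a sketch of the matching rules and ignores details such as case-insensitivity flags.

```python
def attr_matches(actual, op, value=""):
    """Return True if an attribute string satisfies a CSS attribute operator.

    actual is the attribute's value, or None when the attribute is absent.
    """
    if actual is None:   # attribute missing: no operator can match
        return False
    if op == "":         # [attribute]
        return True
    if op == "=":        # exact match of the whole string
        return actual == value
    if op == "~=":       # one of the whitespace-separated words
        return value in actual.split()
    if op == "|=":       # exact match, or value followed by a hyphen
        return actual == value or actual.startswith(value + "-")
    if op not in ("^=", "$=", "*="):
        raise ValueError(f"unknown operator: {op}")
    if value == "":      # substring operators never match an empty value
        return False
    if op == "^=":       # prefix of the whole string
        return actual.startswith(value)
    if op == "$=":       # suffix of the whole string
        return actual.endswith(value)
    return value in actual  # "*=": substring anywhere


print(attr_matches("cell home-features__text", "=", "home-features__text"))  # False
print(attr_matches("cell small-12", "~=", "cell"))                           # True
print(attr_matches("home-features__text", "|=", "home"))                     # True
print(attr_matches("homepage", "|=", "home"))                                # False
print(attr_matches("cell button", "^=", "button"))                           # False
print(attr_matches("/#pricing-section", "$=", "pricing-section"))            # True
print(attr_matches("test-345", "*=", "test"))                                # True
```

The `^=` example is worth noting when scraping: the comparison is made against the whole attribute string, so a class attribute of "cell button" does not begin with "button" even though the element has that class.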
element:nth-child(#)
Nth-child selects the element only if it sits at position # among all the children of its parent, counting siblings of every type. With this selector, we don’t have to be specific about the parent element. Here we’re selecting a p element which is the 2nd child of its parent. In this case, it follows a div which is the first child:
element:nth-of-type(#)
Nth-of-type selects the element at position # among the siblings of the same element type under a parent element. Siblings of other types are ignored when counting. Here we’re selecting the 2nd p element in a group of siblings, which is not necessarily the 2nd element overall:
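The counting difference between the two selectors can be sketched as follows. The markup is hypothetical and the helpers `nth_child` and `nth_of_type` are invented for this example; they handle only a plain number, not formulas such as 2n+1.

```python
import xml.etree.ElementTree as ET

# Hypothetical sibling group: a div, a p, an h3 and another p.
HTML = "<section><div>intro</div><p>first p</p><h3>heading</h3><p>second p</p></section>"
root = ET.fromstring(HTML)


def nth_child(parent, tag, n):
    """tag:nth-child(n): the n-th child overall, but only if it is a <tag>."""
    children = list(parent)
    if 1 <= n <= len(children) and children[n - 1].tag == tag:
        return children[n - 1]
    return None


def nth_of_type(parent, tag, n):
    """tag:nth-of-type(n): the n-th child counting only <tag> siblings."""
    same_type = [c for c in parent if c.tag == tag]
    return same_type[n - 1] if 1 <= n <= len(same_type) else None


print(nth_child(root, "p", 2).text)    # first p  (2nd child overall, after the div)
print(nth_of_type(root, "p", 2).text)  # second p (2nd among the p siblings)
print(nth_child(root, "p", 3))         # None     (the 3rd child is an h3)
```

With the same number, the two selectors pick different paragraphs here, and p:nth-child(3) matches nothing at all because the third child is not a p.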
jQuery Selectors
jQuery Selectors Overview
We’re going to cover some of the main jQuery selectors particularly with respect to contains and has. Then, we’ll show how you can chain the two together for more specific element selection.
element:contains('text')
Contains lets us choose specific elements which contain the input text string. Here we want an h3 containing the string “Point”:
element:not(:contains('text'))
By adding not before the contains selector, we can choose specific elements which do not contain the input text string. Here we want an h3 element not containing “Point”:
Element-Type-1:has(Element-Type-2)
Matches an Element-Type-1 only if there is an Element-Type-2 anywhere among its descendants. Here we select ul elements containing li elements:
Element-Type-1:not(:has(Element-Type-2))
Matches an Element-Type-1 only if it does not have an Element-Type-2 anywhere among its descendants. Here we select link elements which do not have img elements under them:
Chaining These Selectors
While there are many other jQuery selectors, these are valuable because they are often enough on their own to logically isolate elements on a page. You can think of both :has() and :contains() as the if-clause in an if-then statement. For example, if a div element has a p element or contains some text, then perform some action in jQuery. Conversely, you can use the not selector to invert the logic.
All this being said, you can chain them together to isolate specific elements on a page you’re scraping. Here we look for all li elements which have descendants that do not contain the text “Pricing”.
Here’s another example where we chain CSS selector conditions with jQuery selector conditions. By specifying the hierarchy between the .dropdown class, the select element, and the option elements which do not contain the ‘Select colour’ text, we’re able to isolate the three colour options for this product.
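The if-then reading of these selectors can be sketched in Python by treating each one as a predicate and requiring all of them to hold. This is an approximation, not jQuery itself: the markup is hypothetical, the helpers `contains`, `has`, `negate` and `select` are invented for this example, and the first query (li:has(a):not(:contains('Pricing'))) is just one possible chain chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only "Pricing" and "Select colour" come from the article.
HTML = """
<div>
  <ul>
    <li><a>Pricing</a></li>
    <li><a>Learn</a></li>
    <li>Plain item</li>
  </ul>
  <div class="dropdown">
    <select>
      <option>Select colour</option>
      <option>Red</option>
      <option>Green</option>
      <option>Blue</option>
    </select>
  </div>
</div>
"""
root = ET.fromstring(HTML)


def contains(text):
    """:contains('text'): text appears in the element or any of its descendants."""
    return lambda el: text in "".join(el.itertext())


def has(tag):
    """:has(tag): at least one descendant of the given type exists."""
    return lambda el: any(d is not el for d in el.iter(tag))


def negate(pred):
    """:not(...): invert another condition."""
    return lambda el: not pred(el)


def select(scope, tag, *preds):
    """Elements of type tag within scope that satisfy every condition."""
    return [el for el in scope.iter(tag) if all(p(el) for p in preds)]


# li:has(a):not(:contains('Pricing'))
items = select(root, "li", has("a"), negate(contains("Pricing")))
print(["".join(li.itertext()) for li in items])  # ['Learn']

# .dropdown select option:not(:contains('Select colour'))
dropdowns = [d for d in root.iter("div") if "dropdown" in d.get("class", "").split()]
options = [
    o
    for d in dropdowns
    for s in d.iter("select")
    for o in select(s, "option", negate(contains("Select colour")))
]
print([o.text for o in options])  # ['Red', 'Green', 'Blue']
```

Each chained condition narrows the previous result, which is why the order of hierarchy (class, then select, then option) followed by the negated text check leaves only the three colour options.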
Summary
As you’ve seen here, CSS and jQuery selectors can be very useful for web scraping. We showed examples of some of the most valuable ones you’d typically use to set up a scraping job. By knowing both sets of selectors, you can maintain greater flexibility when it comes to selection while also having the power to isolate whatever text, images, links, or other elements you need.