Top 5 CSS Selectors You Need to Know.
November 06, 2020
Data, web scraping, Tutorial
Scraping can get hard when you're dealing with website structures that change frequently or that are simply hard to scrape with the point-and-click interface alone. No need to stress: with a working knowledge of CSS selectors, almost any website structure can be handled and almost any website scraped. Here are the top 5 CSS selectors that we use frequently and that might be of great use to you.
First and foremost: when creating selectors, you might have noticed that the drop-down menu of selector types has no “CSS selector” entry. This is because the CSS selector is not a separate type; it is always active and visible. It is the string that is filled in when you select elements on a web page and that describes the HTML elements selected.
Web Scraper uses CSS selectors to find HTML elements on the designated website. When elements are selected, Web Scraper makes a best guess at a CSS selector that matches them and at what to extract from them. That guess is not always the best fit, and this is where writing your own CSS selectors for specific situations comes in.
First - ":contains"
Consider a situation where only specific categories are needed, but the generated selector matches the whole list of categories. The ":contains" selector narrows the match to the elements whose text contains the string you specify.
To select only, for example, the categories of “Tops”, “Dresses”, and “Jeans”, we have to custom-write the CSS selector.
To the generated selector “[aria-hidden] a” we append “:contains("Tops")”, which restricts it to the category link whose text contains the string “Tops”. Now add the same for the Dresses and Jeans categories, with a comma between the individual selectors so that any one of them may match, and that's it - a link selector that selects only the 3 specific categories has been created.
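Written out, the combined selector described above might look like the sketch below. It assumes that the category links are matched by “[aria-hidden] a”, as in the text, and that the link texts are “Tops”, “Dresses”, and “Jeans”. The comment is an annotation only and is not part of the selector. Keep in mind that ":contains" is not part of standard CSS; it comes from jQuery-style selector engines, and its matching is case-sensitive.

```css
/* A comma means "match any of these", so up to three links are selected */
[aria-hidden] a:contains("Tops"),
[aria-hidden] a:contains("Dresses"),
[aria-hidden] a:contains("Jeans")
```

Because ":contains" looks for a substring, a link with a text such as “Tank Tops” would be matched by the first selector as well.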
Second - ":not(:contains)"
For situations where all listings are needed except those with a specific characteristic, most commonly when scraping product pages, ":not(:contains)" comes in handy. Just like ":contains", it matches against the text of the elements.
For example, all dresses are of interest except the knitted ones. To exclude that characteristic, we take the CSS selector that was generated when selecting the elements and append “:not(:contains("Knit"))” to it. This is similar to the selector we created for the specific categories; with the “not” part, however, the selector collects everything except the elements that contain the given text.
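A minimal sketch of such an exclusion is shown below. The base selector “.product-item” is a hypothetical stand-in for whatever the point-and-click interface generated, and it assumes that each listing is an element with that class and that the product name is part of its text.

```css
/* Hypothetical base selector: every listing is assumed to have the class "product-item" */
.product-item:not(:contains("Knit"))

/* The match is case-sensitive, so a second exclusion can be chained if needed */
.product-item:not(:contains("Knit")):not(:contains("knit"))
```

The two lines are alternatives, not one selector: the first excludes listings whose text contains “Knit”, the second also excludes the lowercase spelling.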
Third - ">"
Diving deeper into the possibilities of creating custom CSS selectors, we have the “>” (greater-than) selector, also known as the child combinator. It selects only the direct “child” elements of the element written on its left.
To better explain - this is how the CSS selector has guessed the elements we are looking to retrieve from selecting them with the point-and-click interface.
For the current version of the website, this works and retrieves the selected prices; with the “>” selector, however, we can make the scraper more precise. This comes in handy when the element you want carries an attribute that also appears in other positions on the page: the “>” selector ensures that the scraper extracts only the value whose position follows this particular structure.
To better explain what it really does, let's take a look at the website's HTML elements.
The written CSS selector works as follows: it starts at “product-item”, goes two levels down from there to “item-details”, and then one level further down to retrieve “item-price”.
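To make that path concrete, here is a sketch of a structure and a selector that would follow it. The tag names, the unnamed wrapper element, and the assumption that “product-item”, “item-details”, and “item-price” are class names are all illustrative; the real page may differ.

```css
/*
  Assumed structure (illustrative only):

  li.product-item
    div                  <- one level down
      div.item-details   <- two levels down
        span.item-price  <- one more level down
*/
.product-item > div > .item-details > .item-price
```

Because every step uses “>”, an “item-price” element nested at any other depth inside the same “product-item” would not be matched.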
However, for this specific website, we see that there is an even shorter, more precise, and more flexible way of retrieving the prices of the items, one that reduces the chance of the selector breaking when the structure of the website changes.
As simple as:
On some websites a longer CSS selector is required - it all depends on the layout of the website's elements.
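As an illustration of a shorter selector of this kind, and assuming again that “item-details” and “item-price” are class names, the path could be anchored only one step above the price:

```css
/* Matches an "item-price" only when it is a direct child of "item-details" */
.item-details > .item-price
```

Compared with a full path starting at “product-item”, a selector like this keeps working if wrapper elements are added or removed above “item-details”, at the cost of being less strict about where the price sits on the page.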
Fourth - ":has"
This selector matches an element based on what it contains, for instance a descendant with a particular attribute. For example, on our chosen website, we want to retrieve the dresses that are black or that have black as one of the available colors.
With just point-and-click, it would not be possible to select on this specific attribute of the item elements; therefore, this is a very useful and crucial selector for precise scraping.
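A sketch of the idea, under stated assumptions: each product is taken to be an element with the hypothetical class “product-item”, and each color swatch inside it is taken to carry a title attribute such as title="Black". Neither name is confirmed by the page described here.

```css
/* Selects the product element itself, but only if it contains
   some descendant whose title attribute is exactly "Black" */
.product-item:has([title="Black"])
```

For comparison, “.product-item [title="Black"]” (with a space instead of ":has") would select the black swatch rather than the product that contains it.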
Fifth - "~"
The “~” selector selects all the elements that come after a specific element, provided they are on the same structural level, that is, they share the same parent. For example, when selecting the characteristics of a product, we select the main element and use “~” to specify that all the elements following it on that level have to be retrieved.
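The sketch below shows two variants. It assumes a hypothetical element with the class “product-name” that is followed, under the same parent, by the elements describing the product; the two lines are alternatives, not one selector.

```css
/* Every element that follows ".product-name" under the same parent */
.product-name ~ *

/* Only the following siblings that are paragraphs */
.product-name ~ p
```

Elements that come before “.product-name”, or that are nested deeper inside one of its siblings, are not matched. If only the single element directly after it is needed, the “+” selector can be used instead of “~”.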
These are the selectors most used by our team. The main takeaway is that custom CSS selectors reduce the chances of the sitemap breaking whenever the website changes; however, it takes time to examine the page and figure out which selector best fits the specific layout of the website and the specific data that needs to be retrieved.
There are more CSS selectors and variations for more advanced and specific scraping needs; however, the five mentioned in this blog post should be very useful not only for advanced scrapers but also for beginners.
For more information and details on other CSS selectors, you might be interested in checking the CSS selector documentation: