Scraping E-commerce the Classical Way
September 16, 2020
Data, E-commerce, Tutorial
With the enormous growth of technology, and with data as the main driver of modern, fast-growing companies, online business has evolved rapidly in recent years. That comes as no surprise: ordering goods and reserving services online without leaving the house is a huge time saver, and it is accessible to anyone with a stable internet connection.
It seems like everything is centered around data nowadays; it is the fuel that moves businesses forward in this digital age. With that comes the importance of data collection. In this blog post, we will show the classical way of applying Web Scraper to your e-commerce data extraction needs.
The Classical Method
The classical method is the first and most primitive, but also the most intuitive, way of scraping with Web Scraper. You map the site with the point-and-click system, setting the path the scraper follows to reach and extract the target data: for example, category selectors first, then subcategories, then product links, and finally the prices, names, descriptions, and so on.
This might sound confusing, so let us walk through an example.
First and foremost, to start working with Web Scraper, we need to create a sitemap. We will then build it out with selectors that define which data needs to be retrieved and how.
To do that, click “Create new sitemap”, choose a custom sitemap name, and paste in the URL of the website that you would like to use as the starting point for the scraper. Then click “Save Sitemap”.
Now let us create the first selector. In our case this one is a bit harder because of the website we have chosen. To select the two categories, “women” and “men”, we are going to use the “Inspect” panel of the developer tools to work out a custom selector and enter it like this:
.accessible-navmenu > li[data-behavior="mega_menu"] > a:contains("Womens"), .accessible-navmenu > li[data-behavior="mega_menu"] > a:contains("Mens")
Make sure that this is a link selector and that the “multiple” option is checked, since we need more than one entry. We name our first selector “gender” and click “Save selector”.
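For readers who like to see what the point-and-click steps produce, here is a minimal sketch that writes the same sitemap, with only the “gender” link selector, as a Python dictionary and serializes it to JSON. The field names follow the layout that Web Scraper shows when a sitemap is exported, as far as we recall it, so treat them as an assumption and compare them with your own export. The sitemap name and the start URL are placeholders, not the site used in this tutorial.

```python
import json

# The custom CSS selector from the step above, split in two for readability.
GENDER_CSS = (
    '.accessible-navmenu > li[data-behavior="mega_menu"] > a:contains("Womens"), '
    '.accessible-navmenu > li[data-behavior="mega_menu"] > a:contains("Mens")'
)


def build_sitemap(name, start_url):
    """Return a sitemap dict that holds only the "gender" link selector."""
    return {
        "_id": name,                          # sitemap name (placeholder)
        "startUrl": [start_url],              # starting point (placeholder)
        "selectors": [
            {
                "id": "gender",
                "type": "SelectorLink",       # assumed type name for a link selector
                "parentSelectors": ["_root"],  # first selector sits at the root
                "selector": GENDER_CSS,
                "multiple": True,             # "multiple" checked: two categories
            }
        ],
    }


sitemap = build_sitemap("ecommerce-classical", "https://example.com/")
sitemap_json = json.dumps(sitemap, indent=2)
print(sitemap_json)
```

If the field names match your export, the printed JSON can be pasted into the “Import Sitemap” dialog instead of clicking the selector together by hand.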
Now, diving deeper, we click on the previously created “gender” link selector in the toolbar and, on the website, go to the “womens” section. This is necessary because we are now going to create a child selector, that is, a selector nested under a previous one that forms a new branch of the sitemap. It is another link selector, which will visit each of the subcategories, and we call it “category-url”.
Now, let’s visit the first subcategory of the website and, in the developer toolbar, select the previously created selector. We have arrived at the product-list page, and this is where we create our pagination selector. In general, a pagination link selector is created by selecting the pagination links, which usually sit below (footer) or above (header) the page content. This case is a little more difficult, so we are going to build our pagination link selector from information found in the “Inspect” panel.
Another very important part of creating the pagination selector is to designate it as a child selector of both “category-url” and itself (that is, it has two parent selectors). Do not forget to tick the “multiple” option, since several pages need to be scraped, then click “Save selector”.
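To see why the pagination selector needs itself as a second parent, consider the toy model below. It is only a sketch of the idea, not Web Scraper’s actual internals: the `PAGINATION_LINKS` table is a made-up site in which each listing page links only to the next page, and `pages_reached` is a hypothetical helper that follows those links.

```python
from collections import deque

# Hypothetical site: each listing page only links to the next page.
PAGINATION_LINKS = {
    "/dresses": ["/dresses?page=2"],
    "/dresses?page=2": ["/dresses?page=3"],
    "/dresses?page=3": [],
}


def pages_reached(category_url, pagination_is_own_parent):
    """Return the listing pages visited, in order, from a category page."""
    visited = [category_url]
    # Pagination links found on the page opened by "category-url".
    queue = deque(PAGINATION_LINKS[category_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.append(url)
        # Only when "pagination" is also its own parent are the pagination
        # links on an already paginated page followed in turn.
        if pagination_is_own_parent:
            queue.extend(PAGINATION_LINKS[url])
    return visited


print(pages_reached("/dresses", pagination_is_own_parent=False))
print(pages_reached("/dresses", pagination_is_own_parent=True))
```

With “category-url” as the only parent, the model stops at page 2, because nothing tells it to look for pagination links on a page that was itself reached through pagination. With the selector as its own parent, it reaches all three pages.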
Now, on the same product-list page and under “category-url”, we create the “product-url” selector. Here it is also important to select the “pagination” link selector as the second parent selector!
The third and final selector under “category-url” is a text selector that records which category the product comes from. It is important not to check the “multiple” option, since only one entry is needed. This selector also needs two parent selectors: “category-url” and “pagination”.
Now we visit the first product on the website and select the previously created “product-url” selector. We have reached the last steps for this e-commerce page. For the scraper to collect the product titles, we create a text selector. It is important that “multiple” is not checked, since each “product-url” visit needs to retrieve only the title of that one product.
Now we create another text selector to retrieve the price. Keep in mind that this needs to be a child selector of the “product-url” link selector, the same as the “product-title” selector. And, as with “product-title”, we leave “multiple” unticked.
As our last but not least step on this e-commerce web page, we create a text selector that will retrieve the color of the product.
Great! Now, to make sure that all the selectors are in the right places, we check the selector graph. The order of the selectors plays a crucial role when scraping with Web Scraper, especially with the classical method. It tells the scraper when a specific page has to be visited and what has to be retrieved from it.
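The same check can be sketched outside the browser. The snippet below lists each selector from this tutorial with its parent selectors and prints the hierarchy as an indented tree. The selector ids come from the steps above, except “category-name”, which is a hypothetical id for the category text selector (the tutorial does not name it); `children_of` and `render_tree` are illustrative helpers, not part of Web Scraper.

```python
# Selector id -> parent selector ids, as set up in the steps above.
PARENTS = {
    "gender": ["_root"],
    "category-url": ["gender"],
    "pagination": ["category-url", "pagination"],
    "product-url": ["category-url", "pagination"],
    "category-name": ["category-url", "pagination"],  # hypothetical id
    "product-title": ["product-url"],
    "product-price": ["product-url"],
    "product-color": ["product-url"],
}


def children_of(parent):
    """Return the ids of all selectors that list `parent` as a parent."""
    return [sid for sid, parents in PARENTS.items() if parent in parents]


def render_tree(node="_root", depth=0, path=()):
    """Return the tree below `node` as a list of indented lines."""
    lines = ["  " * depth + node]
    for child in children_of(node):
        if child == node or child in path:
            # A selector that is its own parent (pagination) would recurse
            # forever, so mark the repeat instead of descending again.
            lines.append("  " * (depth + 1) + child + " (repeats)")
            continue
        lines.extend(render_tree(child, depth + 1, path + (node,)))
    return lines


tree_lines = render_tree()
print("\n".join(tree_lines))
```

In the printed tree, “product-url” and the category text selector appear twice, once directly under “category-url” and once under “pagination”, which is exactly what having two parent selectors means.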
Having made sure that our selectors are in the right positions, we click “Scrape” and watch the scraper run and do the data collection for us.
That is all there is to scraping an e-commerce site with the classical method. However, since we love our data transformed and cleaned, we imported the sitemap into Cloud Scraper and applied the Parser feature. First, we simply deleted the unnecessary columns, such as “web-scraper-order” and “web-scraper-start-url”.
Then, with a simple “Replace text” parser, we removed the repeated word “Colour” from the “product-color” column, so that each cell shows the color and nothing more.
Lastly, we created another “Replace text” parser for the “product-price” column to delete the “£” currency symbol.
Now our data looks more consistent and neater.
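If you export the raw data instead of using the Parser feature, the same three clean-up steps can be approximated in a few lines of Python. This is only a sketch of equivalent logic, not how the Parser works internally. The sample row is invented for illustration, and the exact raw format of the color cell (here “Colour: Black”) is an assumption.

```python
# Columns that Web Scraper adds and that we do not need in the result.
DROP_COLUMNS = {"web-scraper-order", "web-scraper-start-url"}


def clean_row(row):
    """Drop helper columns and strip "Colour" and "£" from one scraped row."""
    cleaned = {key: value for key, value in row.items() if key not in DROP_COLUMNS}
    # Same idea as the first "Replace text" parser: remove the repeated word.
    # The trailing strip also removes a leftover colon or space (assumed format).
    cleaned["product-color"] = cleaned["product-color"].replace("Colour", "").strip(" :")
    # Same idea as the second "Replace text" parser: remove the currency symbol.
    cleaned["product-price"] = cleaned["product-price"].replace("£", "").strip()
    return cleaned


# Invented sample row, shaped like one line of the scraped output.
sample = {
    "web-scraper-order": "1600000000-1",
    "web-scraper-start-url": "https://example.com/",
    "product-title": "Example dress",
    "product-color": "Colour: Black",
    "product-price": "£25.00",
}

result = clean_row(sample)
print(result)
```

For the sample row this leaves only the title, the color “Black”, and the price “25.00”. The price stays a string here; convert it with `float()` if you need to calculate with it.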
That is all! The classical way of scraping data can be a hassle (depending on the webpage and the data that you are looking to gather); however, it can be quite interesting to watch and manually designate selectors.
For more information about the different selectors, visit the documentation.
For more information about the Parser feature or the tutorial video of this blog, see our YouTube channel.