Scraping E-commerce The Fastest Way
September 29, 2020
Data, E-commerce, Tutorial
In the previous posts, we covered the classical way of scraping and shared a tip on how to keep sitemaps from breaking when an e-commerce site moves products between categories. Now we are going to look at another method: scraping with the “Sitemap.xml Links” selector.
This is arguably the fastest way of gathering data from e-commerce sites. A “Sitemap.xml Links” selector goes through the pages listed in the website's sitemap.xml and collects the necessary information, which can save the time otherwise spent navigating the website and creating category, subcategory, and similar selectors manually.
Scraping through a website's sitemaps starts like every scraping project: by creating a scraper sitemap.
Now, the tricky part of this method is figuring out where on the website the list of sitemaps we need is located. There are three places where the sitemaps are likely to be found; however, in some cases they are hidden so well that the e-commerce site cannot be scraped with this method.
The three potential places are:
First, we check the “robots.txt” file by opening “/robots.txt” on the website.
The sitemaps are not listed here; however, the file does show that the sitemaps may lie under the “/sitemap.xml” file.
That is it. Not only does it list every sitemap, but the product sitemaps are here as well. We are going to use this for our selector.
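For readers who prefer to do this lookup outside the browser, the check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sitemap_candidates` and the `shop.example` domain are hypothetical, and falling back to “/sitemap.xml” is a common convention rather than something every website follows.

```python
from urllib.parse import urljoin


def sitemap_candidates(base_url: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs declared in robots.txt, else a conventional guess.

    `robots_txt` is the already-downloaded text of the robots.txt file;
    fetching it is left out so the sketch stays offline and testable.
    """
    declared = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first
        # colon only, because the value is a URL that contains colons.
        field, sep, value = line.split("#", 1)[0].partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            declared.append(urljoin(base_url, value.strip()))
    # Assumption: if nothing is declared, try the conventional location.
    return declared or [urljoin(base_url, "/sitemap.xml")]


# Hypothetical robots.txt contents for two hypothetical shops.
ROBOTS_WITHOUT_SITEMAP = "User-agent: *\nDisallow: /cart\n"
ROBOTS_WITH_SITEMAP = (
    "User-agent: *\n"
    "Sitemap: https://shop.example/sitemap_index.xml\n"
)

if __name__ == "__main__":
    print(sitemap_candidates("https://shop.example", ROBOTS_WITHOUT_SITEMAP))
    print(sitemap_candidates("https://shop.example", ROBOTS_WITH_SITEMAP))
```

Whatever the function returns still has to be opened and checked by hand, exactly as in the steps above, because a guessed URL may not exist on a given website.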
Let’s go back to the extension in the browser's developer tools and begin by creating the “Sitemap.xml Links” selector.
Now that we have created the selector, we need to create an element selector that tells the scraper on which pages to trigger and retrieve the information, so that it does not return “null” values for every page that is not a product page.
This is done by visiting any product page and creating the selector based on a feature that every product page has; usually this is the product title. We retrieve the necessary class with the “Inspect” feature. It is important that “multiple” is ticked; otherwise, the scraper will return “null” values for the empty pages.
Now the last step is creating text selectors for the information that needs to be retrieved from the product pages.
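The idea behind this element selector can be illustrated with a small Python sketch. It does not reproduce how the extension works internally; it only mimics the effect: a page yields records when an element with the product title class exists, and yields nothing (instead of a “null” row) when it does not. The class name `product-title` and all function and class names are hypothetical, and the HTML handling is deliberately simplified.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; simplified list for this sketch.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link", "source", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.matches: list[str] = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.css_class in classes:
            self._depth += 1
            if self._depth == 1:
                self.matches.append("")  # start a new match

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.matches[-1] += data


def records_for_page(html: str, title_class: str = "product-title") -> list[dict]:
    """Return one record per matching element, or an empty list if none."""
    parser = ClassTextCollector(title_class)
    parser.feed(html)
    return [{"title": text.strip()} for text in parser.matches]


# Hypothetical pages: one product page and one non-product page.
PRODUCT_PAGE = '<body><h1 class="product-title">Blue <b>Mug</b></h1><p>Nice.</p></body>'
ABOUT_PAGE = "<body><h1>About us</h1></body>"

if __name__ == "__main__":
    print(records_for_page(PRODUCT_PAGE))
    print(records_for_page(ABOUT_PAGE))
```

Returning an empty list for non-product pages is the design choice that matters here: pages without the marker element simply contribute no rows to the result.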
And that is all for creating the selectors. Let’s check the selector graph to make sure that everything is in place.
If no adjustments or changes are needed, go ahead and start scraping.
After a while, once the scraper finally navigates to the product pages, we see the scraped data.
The scraping process might take slightly longer, as the scraper iterates through all of the pages listed in the sitemap.xml file. If you want to shorten it, for this website you can paste only the product sitemaps (found in the “/sitemap.xml” file of the website) when creating the “Sitemap.xml Links” selector; that way, the scraper will not visit the miscellaneous pages on the website.
That is all on how to scrape with the “Sitemap.xml Links” selector. We hope this helped. If you are looking for more information on how to scrape various data, visit our blog page or take a look at the tutorial videos on our YouTube channel.
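Picking out the product sitemaps can also be done programmatically when the list is long. The sketch below assumes that “/sitemap.xml” is a sitemap index in the standard sitemaps.org format and that product sitemaps can be recognised by the word “product” in their URL; both are assumptions that need to be checked against the actual website. The sample XML and the `shop.example` URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def product_sitemaps(index_xml: str, keyword: str = "product") -> list[str]:
    """Return child sitemap URLs from a sitemap index that contain `keyword`."""
    root = ET.fromstring(index_xml)
    # A sitemap index lists its children as <sitemap><loc>URL</loc></sitemap>.
    locs = [
        el.text.strip()
        for el in root.findall("sm:sitemap/sm:loc", NS)
        if el.text
    ]
    # Assumption: product sitemaps carry the keyword somewhere in the URL.
    return [url for url in locs if keyword in url.lower()]


# Hypothetical sitemap index, already downloaded as text.
SAMPLE_INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://shop.example/sitemap_products_1.xml</loc></sitemap>
  <sitemap><loc>https://shop.example/sitemap_pages_1.xml</loc></sitemap>
  <sitemap><loc>https://shop.example/sitemap_products_2.xml</loc></sitemap>
</sitemapindex>
""".strip()

if __name__ == "__main__":
    for url in product_sitemaps(SAMPLE_INDEX):
        print(url)
```

The resulting URLs are the ones to paste into the “Sitemap.xml Links” selector in place of the full sitemap.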