Web Scraper 0.4.0 release
April 17, 2019
We are happy to announce that Web Scraper 0.4.0 has been released. This release contains a new selector, updates to other selectors, and an improved CSS selector generator. Starting with version 0.4.0, Web Scraper is also available in Firefox.
Sitemap.xml link selector
Many websites want to be crawled. For example, news outlets want their articles to appear in search engine results, and for that to happen a search engine has to crawl the entire site. The site can make the crawler's job more efficient by listing all of the relevant URLs in a sitemap.xml file, which also helps ensure that everything within the site gets indexed.
With the Sitemap.xml Link selector you can use this feature to access all of the relevant URLs in a site without having to build a path through the site using Link selectors for navigation and pagination. With a single selector you can access every product page in an e-commerce site. It is always worth checking whether the site has sitemap.xml files before creating other selectors, as this method can speed up scraper configuration significantly.
When using the Sitemap.xml Link selector, use the Add from robots.txt button to automatically discover sitemap.xml links. If no links are discovered, you can manually check whether an example.com/sitemap.xml page exists. Add child selectors under the Sitemap.xml Link selector to extract data from the URLs that the sitemap.xml file leads to.
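To make the discovery step more concrete, here is a minimal Python sketch of the general idea: read the Sitemap lines from a robots.txt body, then collect the page URLs from a sitemap.xml document. This is not Web Scraper's own code. The function names and the sample data are made up for illustration, and the sketch works on in-memory strings rather than fetching anything from a site.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return the URLs given in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "field: value" on the first colon.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> value found in a sitemap.xml document.

    For a sitemap index file the <loc> values point at further
    sitemap files instead of pages; this sketch does not follow them.
    """
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sample data standing in for a real site's files.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /cart
Sitemap: https://example.com/sitemap.xml
"""

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/1</loc></url>
  <url><loc>https://example.com/product/2</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(sitemaps_from_robots(SAMPLE_ROBOTS))
    print(urls_from_sitemap(SAMPLE_SITEMAP))
```

Each URL returned by the second function corresponds to a page that the child selectors would then extract data from.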
Element click selector
With this release you can add an Element click selector under another Element click selector. This lets you go through multiple product color/size variations within a single product page to get the SKU and the price for every variation.
You can also now use element click selector to click through options within a
Element scroll down selector
The Element scroll down selector now scrolls down with a smooth animation. It also tries a few tricks to trigger the data load event within the website. Generally, the Element scroll down selector isn't as reliable as Link selectors, but with this update it should work in some additional edge cases.
I'll start by saying a big thanks to the Firefox team. They have done a lot of work to bring the Web Extensions API into their browser. The most painful part of this was probably that they had to remove their previous add-on API, along with all of the add-ons that developers had been building for years. Despite this, it was a good choice. The Web Extensions API is compatible with other browsers and removes the overhead of developing the same solution for different platforms.
CSS Selector generator
When you select an element within a page, Web Scraper generates a CSS selector. In this release we made some improvements to the CSS selector generator. When generating a CSS selector, the generator will now also try to use element attributes and their values. It will also generate better CSS selectors for description lists using the :contains() selector. We made some further tweaks to reduce the use of the order-based selector :nth-of-type(), which frequently doesn't work well across multiple pages.
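The preference order described above can be illustrated with a small Python sketch. It is an assumption about the general approach, not the actual generator in Web Scraper: the function names, the rule of using every available attribute, and the exact shape of the description list selector are hypothetical. Note that :contains() is a jQuery-style extension rather than standard CSS.

```python
def css_string(value):
    """Quote a value for use inside a CSS selector."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_selector(tag, attributes, position):
    """Prefer attribute/value conditions; fall back to :nth-of-type().

    tag        -- element name, e.g. "span"
    attributes -- dict of attribute names to values (may be empty)
    position   -- 1-based index among siblings of the same type
    """
    if attributes:
        # Attribute conditions keep matching when the element's position shifts.
        conditions = "".join(
            f"[{name}={css_string(value)}]"
            for name, value in sorted(attributes.items())
        )
        return tag + conditions
    # Order-based fallback: breaks as soon as another page has a different layout.
    return f"{tag}:nth-of-type({position})"


def description_list_selector(label):
    """Select the <dd> that directly follows the <dt> containing `label`."""
    return f"dt:contains({css_string(label)}) + dd"


if __name__ == "__main__":
    print(build_selector("span", {"itemprop": "price"}, 3))
    print(build_selector("li", {}, 2))
    print(description_list_selector("SKU"))
```

The fallback branch shows why order-based selectors are fragile: a selector such as li:nth-of-type(2) only matches the intended element on pages where it happens to sit in the same position.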