Installation

You can install the extension from the Chrome Web Store. After installing it, restart Chrome to make sure the extension is fully loaded. If you don't want to restart Chrome, use the extension only in tabs that are created after the installation.

Requirements

The extension requires Chrome 31 or later. There are no OS limitations.

Open Web Scraper

Web Scraper is integrated into Chrome Developer tools. Figure 1 shows how you can open it. You can also use the shortcuts below to open Developer tools. After opening Developer tools, open the Web Scraper tab.

Shortcuts:

  • Windows, Linux: Ctrl+Shift+I, F12, or open Tools / Developer tools
  • Mac: Cmd+Opt+I, or open Tools / Developer tools

Fig. 1: Open Web Scraper

Scraping a site

Open the site that you want to scrape.

Create Sitemap

The first thing you need to do when creating a sitemap is to specify the start URL. This is the URL from which the scraping will start. You can also specify multiple start URLs if the scraping should start from multiple places. For example, if you want to scrape multiple search results, you could create a separate start URL for each search result.

Specify multiple urls with ranges

When a site uses numbering in its page URLs, it is much simpler to create a range start URL than to create Link selectors that navigate the site. To specify a range URL, replace the numeric part of the start URL with a range definition - [1-100]. If the site uses zero padding in URLs, add zero padding to the range definition - [001-100]. If you want to skip some URLs, you can also specify an increment like this: [0-100:10].

Use a range URL like http://example.com/page/[1-3] for links like these:

  • http://example.com/page/1
  • http://example.com/page/2
  • http://example.com/page/3

Use a range URL with zero padding like http://example.com/page/[001-100] for links like these:

  • http://example.com/page/001
  • http://example.com/page/002
  • http://example.com/page/003

Use a range URL with an increment like http://example.com/page/[0-100:10] for links like these:

  • http://example.com/page/0
  • http://example.com/page/10
  • http://example.com/page/20
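
The range notation above can be expanded mechanically. The following is a minimal Python sketch of that expansion, not the extension's own code. The function name expand_range_url is made up for this illustration, and the rule that a leading zero on the lower bound sets the padding width is an assumption based on the [001-100] example.

```python
import re

# Matches a range definition such as [1-100], [001-100] or [0-100:10].
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_range_url(url):
    """Return the list of URLs described by a range start URL."""
    match = RANGE_PATTERN.search(url)
    if match is None:
        return [url]  # no range definition, nothing to expand

    start_text, end_text, step_text = match.groups()
    start, end = int(start_text), int(end_text)
    step = int(step_text) if step_text else 1
    # Assumption: a leading zero on the lower bound requests zero padding
    # to the width of that bound (e.g. "001" pads to three digits).
    padded = len(start_text) > 1 and start_text.startswith("0")
    width = len(start_text) if padded else 0

    prefix, suffix = url[:match.start()], url[match.end():]
    return [prefix + str(number).zfill(width) + suffix
            for number in range(start, end + 1, step)]


for expanded in expand_range_url("http://example.com/page/[0-100:10]")[:3]:
    print(expanded)
```

The upper bound is treated as inclusive, which matches the three example lists above.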

Create selectors

After you have created the sitemap you can add selectors to it. In the Selectors panel you can add new selectors, modify them and navigate the selector tree. Selectors are organized in a tree structure, and the Web Scraper executes them in the order in which they appear in that tree. For example, suppose there is a news site and you want to scrape all articles whose links are available on the first page. Figure 1 shows this example site.

Fig. 1: News site

To scrape this site you can create a Link selector that extracts all article links on the first page. Then, as a child selector, you can add a Text selector that extracts the articles from the pages that the Link selector found links to. Figure 2 illustrates how the sitemap should be built for the news site.

Fig. 2: News site sitemap
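
To make the parent and child relationship concrete, here is a small Python sketch that models the news site sitemap as a nested structure and walks it parent-first. It is only an illustration: the field names (start_url, id, type, css, children) and the CSS selectors are hypothetical and do not describe how the extension stores sitemaps.

```python
# Hypothetical in-memory model of the news site sitemap described above.
sitemap = {
    "start_url": "http://example.com/",
    "selectors": [
        {
            "id": "article-link",
            "type": "Link",
            "css": "a.article",  # assumed CSS selector
            "multiple": True,
            "children": [
                {
                    "id": "article-text",
                    "type": "Text",
                    "css": "div.article-body",  # assumed CSS selector
                    "multiple": False,
                    "children": [],
                },
            ],
        },
    ],
}


def execution_order(selectors, depth=0):
    """Walk the tree parent-first: a parent runs before its children."""
    order = []
    for selector in selectors:
        order.append((depth, selector["id"], selector["type"]))
        order.extend(execution_order(selector["children"], depth + 1))
    return order


for depth, selector_id, selector_type in execution_order(sitemap["selectors"]):
    print("  " * depth + f"{selector_id} ({selector_type})")
```

The Link selector is listed first and the Text selector is nested under it, which mirrors the rule that child selectors run on the pages their parent Link selector leads to.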

Note that when creating selectors you should use the Element preview and Data preview features to make sure that you have selected the correct elements with the correct data.

More information about building the selector tree is available in the selector documentation. You should at least read about these core selectors:

Inspect selector tree

After you have created selectors for the sitemap you can inspect the tree structure of the selectors in the Selector graph panel. Figure 3 shows an example selector graph.

Fig. 3: News site selector graph

Scrape the site

After you have created selectors for the sitemap you can start scraping. Open the Scrape panel and start scraping. A new popup window will open in which the scraper loads pages and extracts data from them. When the scraping is done, the popup window closes and you are notified with a popup message. You can view the scraped data in the Browse panel and export it in the Export data as CSV panel.

Selectors

Web Scraper has multiple selectors that can be used for different types of data extraction and for different kinds of interaction with the website. The selectors can be divided into three groups:

  • Data extraction selectors for data extraction.
  • Link selectors for site navigation.
  • Element selectors for selecting the elements that separate multiple records.

Data extraction selectors

Data extraction selectors simply return data from the selected element. For example, the Text selector extracts text from the selected element. These selectors can be used as data extraction selectors:

Link selectors

Link selectors extract URLs from links that can later be opened for data extraction. For example, if a sitemap tree contains a Link selector that has 3 child Text selectors, the Web Scraper will extract all URLs with the Link selector, then open each link and use the child data extraction selectors to extract data. A Link selector can also have Link selectors as child selectors. These child Link selectors are then used for further page navigation. These are the currently available Link selectors:

Element selectors

Element selectors select elements that contain multiple data elements. For example, an Element selector might be used to select a list of items on an e-commerce site. The selector returns each selected element as a parent element to its child selectors. The child selectors of an Element selector extract data only within the element that the Element selector gave them. These are the currently available Element selectors:

Selector configuration options

Each selector has configuration options. The most common ones are listed here. Configuration options that are specific to a selector are described in that selector's documentation.

  • selector - CSS selector that selects the element the selector will work on.
  • multiple - should be checked when multiple records (data rows) are going to be extracted with this selector. Data extracted from two or more selectors with multiple checked won't be merged into a single record.
  • delay - delay before the selector is used.
  • parent selectors - configure parent selectors for this selector to build the selector tree.

Note! A common mistake when using the multiple configuration option is to create two sibling selectors with multiple checked and expect the scraper to join their values in pairs. For example, if you selected pagination links and navigation links, these links couldn't be logically joined in pairs. The correct way is to select a wrapper element with an Element selector and add the data selectors as child selectors of the Element selector with the multiple option not checked.
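
The following Python sketch illustrates why pairing by position is unreliable and why a wrapper element per record avoids the problem. The item data is invented for the illustration and is not scraper output.

```python
# Three items on a page; the second one has no price.
titles = ["Lamp", "Chair", "Desk"]  # values from one selector with multiple checked
prices = ["10.00", "45.00"]         # values from another selector with multiple checked

# Pairing the two flat lists by position attaches the wrong price to "Chair"
# and silently drops "Desk".
paired_by_position = list(zip(titles, prices))
print(paired_by_position)

# With one wrapper element per item, each record keeps its own values together,
# and a missing value stays attached to the right item.
wrappers = [
    {"title": "Lamp", "price": "10.00"},
    {"title": "Chair", "price": None},
    {"title": "Desk", "price": "45.00"},
]
records = [(item["title"], item["price"]) for item in wrappers]
print(records)
```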

Text selector

Text selector is used for text selection. The Text selector extracts text from the selected element and from all of its child elements. HTML is stripped and only text is returned. The selector ignores text within <script> and <style> tags. Line break <br> tags are replaced with newline characters. You can additionally apply a regular expression to the resulting data.
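
The extraction rules above can be approximated with Python's standard html.parser module. This is a sketch of the described behavior under simple assumptions, not the extension's implementation. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, skipping <script>/<style> and turning <br> into newlines."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return the text of an HTML fragment with all tags stripped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<div>Hello <b>world</b><br>next<script>var x = 1;</script></div>"
print(repr(extract_text(sample)))
```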

Configuration options

  • selector - CSS selector for the element from which data will be extracted.
  • multiple - multiple records are being extracted. Usually should not be checked. If you want to use several Text selectors with multiple checked on one page, you probably need an Element selector instead.
  • regex - regular expression to extract a substring from the result.

Regex

The regular expression attribute can be used to extract a substring of the text that the selector extracts. When a regular expression is used, the whole match (group 0) is returned as the result. www.regexr.com is a great site where you can learn about regular expressions and try them out.

Here are some examples that you might find useful:

text               regex                          result
price: 14.99$      [0-9]+\.[0-9]+                 14.99
id: H83JKDX4       [A-Z0-9]{8}                    H83JKDX4
date: 2014-08-20   [0-9]{4}\-[0-9]{2}\-[0-9]{2}   2014-08-20
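
The examples above can be tried out with a few lines of Python. Python's re module is used here only as a stand-in for the extension's regular expression engine. The helper name apply_regex and the choice to return None when nothing matches are assumptions of this sketch.

```python
import re


def apply_regex(text, pattern):
    """Return the whole match (group 0), or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


examples = [
    ("price: 14.99$", r"[0-9]+\.[0-9]+"),
    ("id: H83JKDX4", r"[A-Z0-9]{8}"),
    ("date: 2014-08-20", r"[0-9]{4}\-[0-9]{2}\-[0-9]{2}"),
]
for text, pattern in examples:
    print(text, "->", apply_regex(text, pattern))
```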

Use cases

Extract one record per page with multiple text selectors

For example, you are scraping a news site that has one article per page. The page might contain the article, its title, the publication date and the author. A Link selector can navigate the scraper to each of these article pages. Multiple Text selectors can extract the title, date, author and article. The multiple option should be left unchecked for the Text selectors because only one record is extracted from each page.

Fig. 1: Multiple text selectors per page

Extract multiple items with multiple text selectors per page

E-commerce sites usually have multiple items per page. If you want to scrape these items you will need an Element selector that selects item wrapper elements and multiple text selectors that select data within each item wrapper element.

Fig. 2: Multiple elements with text selectors. Some arrows are skipped.

Extract multiple text records per page

For example, you want to extract the comments for an article. There are multiple comments on a single page and you only need the comment text (if you need other comment attributes, see the example above). You can use a Text selector to extract these comments. The Text selector's multiple attribute should be checked because you will be extracting multiple records.

Fig. 3: Text selector selects multiple comments

Link selector

Link selector is used for link selection and website navigation. If you use a Link selector without any child selectors, it extracts the link and the href attribute of the link. If you add child selectors to a Link selector, these child selectors are used on the page that the link leads to. If you are selecting multiple links, check the multiple property.

Note! Link selector works only with <a> tags with href attribute. If the link selector is not working for you then you can try these workarounds:

  1. Check that the URL in the address bar changes after clicking an item (a change only after the hash sign # doesn't count). If the URL doesn't change, the site is probably using AJAX for data loading. Instead of the Link selector you should use the Element click selector.
  2. If the site opens a popup, you should use the Link popup selector.
  3. The site might be using JavaScript window.location to change the URL. Web Scraper cannot handle this kind of navigation right now.
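
As a rough illustration of what a Link selector extracts, and of why only <a> tags with an href attribute qualify, the Python sketch below collects the text and href of each link in an HTML fragment. It uses the standard html.parser module and is not the extension's code. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs for <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.current_href = None  # set while inside an <a href="..."> element
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.current_href = href
                self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((text, self.current_href))
            self.current_href = None


def extract_links(html):
    """Return the list of (text, href) pairs found in an HTML fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<a href="/article/1">First</a>'
    '<a name="top">anchor without href</a>'
    '<span onclick="go()">JavaScript navigation</span>'
    '<a href="/article/2">Second</a>'
)
print(extract_links(page))
```

The anchor without href and the clickable <span> are skipped, which corresponds to the cases where the workarounds listed above apply.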

Configuration options

  • selector - CSS selector for the link element from which the link for navigation will be extracted.
  • multiple - multiple records are being extracted. Usually should be checked.

Use cases

Navigate through multiple levels of navigation

For example, an e-commerce site has multi-level navigation - categories -> subcategories. To scrape data from all categories and subcategories you can create two Link selectors. One selector selects the category links and the other selects the subcategory links that are available on the category pages. The subcategory Link selector should be made a child of the category Link selector. The selectors for data extraction from subcategory pages should be made child selectors of the subcategory selector.

Fig. 1: Multiple link selectors for category navigation

Handle pagination

For example, an e-commerce site has multiple categories. Each category has a list of items and pagination links. Some pages are not directly reachable from the category page but only from other pagination pages (you can see pagination links 1-5, but not 6-8). You can start by building a sitemap that visits each category and extracts items from the category page. This sitemap will extract items only from the first pagination page. To extract items from all pagination pages, including the ones that are not visible at the beginning, you need to create another Link selector that selects the pagination links. Figure 2 shows how the Link selector should be created in the sitemap. When the scraper opens a category link it extracts the items that are available on the page. After that it finds the pagination links and visits those as well. If the pagination Link selector is made a child of itself, it will recursively discover all pagination pages. Figure 3 shows a selector graph where you can see how pagination links discover more pagination links and more data.

Fig. 2: Sitemap with Link selector for pagination

Fig. 3: Selector graph with pagination
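
The recursive discovery of pagination pages can be sketched as a simple traversal. The Python example below runs against a made-up in-memory site (the SITE dictionary and its URLs are hypothetical) and assumes that each page is visited only once. It illustrates the idea and is not the scraper's actual algorithm.

```python
from collections import deque

# Hypothetical site: each page lists its items and the pagination links visible on it.
SITE = {
    "/category?page=1": {"items": ["a", "b"], "pagination": ["/category?page=2", "/category?page=3"]},
    "/category?page=2": {"items": ["c", "d"], "pagination": ["/category?page=1", "/category?page=3"]},
    "/category?page=3": {"items": ["e"], "pagination": ["/category?page=2", "/category?page=4"]},
    "/category?page=4": {"items": ["f"], "pagination": ["/category?page=3"]},
}


def scrape_category(start_url):
    """Collect items from every pagination page reachable from start_url."""
    seen = {start_url}  # assumption: a page that was already queued is not visited again
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        page = SITE[url]
        items.extend(page["items"])
        # The pagination selector is a child of itself: links found on a
        # pagination page are followed with the same selector again.
        for link in page["pagination"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items


print(scrape_category("/category?page=1"))
```

Page 4 is not linked from page 1, yet its item is still collected because page 3 links to it.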

Link popup selector

Link popup selector works similarly to Link selector. It can be used for URL extraction and site navigation. The only difference is that the Link popup selector should be used when clicking a link makes the site open a new window (popup) instead of loading the URL in the same tab or opening it in a new tab. This selector catches the popup creation event and extracts the URL. If the site creates a visual popup but not a real window, you should try the Element click selector.

Note! When selecting these link elements you can move the mouse over the element and press "S" to select it. This prevents the link from opening a popup.

Use cases

See Link selector use cases.

Image selector

Image selector can extract src attribute (URL) of an image. Optionally you can also store the images. The images will be stored in your downloads directory:

Downloads/<sitemap-id>/<selector-id>/<image filename.jpg>
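
As a small illustration of this layout, the Python sketch below builds such a path from a sitemap id, a selector id and an image URL. Taking the file name from the last path segment of the URL is an assumption of the sketch, and the ids and URL are invented examples.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def image_download_path(sitemap_id, selector_id, image_url):
    """Build Downloads/<sitemap-id>/<selector-id>/<image filename>."""
    # Assumption: the file name is the last path segment of the image URL,
    # with any query string ignored.
    filename = PurePosixPath(urlsplit(image_url).path).name
    return str(PurePosixPath("Downloads") / sitemap_id / selector_id / filename)


print(image_download_path("shop", "product-image", "http://example.com/img/lamp.jpg?size=large"))
```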

Note! When you select the CSS selector for an Image selector, all images on the page are moved to the top. If this feature breaks the site's layout, please report it as a bug.

Configuration options

  • selector - CSS selector for the image element.
  • multiple - multiple records are being extracted. Usually should not be checked for Image selector.
  • download image - downloads and stores images on the local drive. When the CouchDB storage backend is used, the image is also stored locally.

Use cases

See Text selector use cases.

Table selector

Table selector can extract data from tables. Table selector has 3 configurable CSS selectors. The first one, selector, selects the table. After you have set it, the Table selector will try to guess the selectors for the header row and the data rows. You can click Element preview on those selectors to see whether the Table selector found the table header and data rows correctly. The header row selector is used to identify table columns when data is extracted from multiple pages. You can also rename table columns. Figure 1 shows what you should select when extracting data from a table.

Fig. 1: Selectors for table selector
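
The role of the header row can be illustrated with a short Python sketch that turns data rows into records keyed by column name, with optional renaming. The function name, the sample table and the renaming map are hypothetical. The sketch starts from already extracted cell texts rather than from HTML.

```python
def rows_to_records(header_cells, data_rows, renamed_columns=None):
    """Turn table rows into records keyed by (optionally renamed) column names."""
    renamed_columns = renamed_columns or {}
    columns = [renamed_columns.get(name, name) for name in header_cells]
    return [dict(zip(columns, row)) for row in data_rows]


header = ["Name", "Price (EUR)"]
rows = [["Lamp", "10.00"], ["Chair", "45.00"]]
records = rows_to_records(header, rows, {"Price (EUR)": "price"})
print(records)
```

Because every record is keyed by the header text, rows extracted from different pages line up in the same columns as long as the header row is the same.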

Configuration options

  • selector - CSS selector for the table element.
  • header row selector - CSS selector for table header row.
  • data rows selector - CSS selector for table data rows.
  • multiple - multiple records are being extracted. Usually should be checked for Table selector because you are extracting multiple rows.

Use cases

See Text selector use cases.

Element attribute selector

Element attribute selector can extract the value of an attribute of an HTML element. For example, you could use this selector to extract the title attribute from this link: <a href="#" title="my title">link</a>.

Configuration options

  • selector - CSS selector for the element.
  • multiple - multiple records are being extracted.
  • attribute name - the attribute that is going to be extracted. For example title, data-id.

Use cases

See Text selector use cases.

HTML selector

HTML selector can extract HTML and text within the selected element. Only the inner HTML of the element will be extracted.

Configuration options

  • selector - CSS selector for the element whose inner HTML will be extracted.
  • multiple - multiple records are being extracted.

Use cases

See Text selector use cases.

Grouped selector

Grouped selector can group text data from multiple elements into one record. The extracted data will be stored as JSON.

Configuration options

  • selector - CSS selector for the elements whose text will be extracted and stored in JSON format.
  • attribute name - optionally this selector can extract an attribute of the selected element. If specified the extractor will also add this attribute to the resulting JSON.

Use cases

Extract article references

For example, you are extracting a news article that might have multiple reference links. If you selected these links with a Link selector with multiple checked, you would get duplicate articles in the result set, where each record contains one reference link. Using a Grouped selector you can serialize all of these reference links into one record. To do that, select all reference links and set attribute name to href to also extract the links to these sites.
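
The idea of serializing several links into one record can be sketched in Python as follows. The key names (text and href), the record layout and the sample links are assumptions made for the illustration. The exact JSON layout the extension produces is not specified here.

```python
import json

# Hypothetical reference links selected on one article page: (text, href).
reference_links = [
    ("Source report", "http://example.com/report"),
    ("Related article", "http://example.com/related"),
]


def group_links(links, attribute_name="href"):
    """Serialize all selected links into one JSON string for a single record."""
    grouped = [{"text": text, attribute_name: value} for text, value in links]
    return json.dumps(grouped)


# One record per article, regardless of how many references it has.
record = {"title": "Example article", "references": group_links(reference_links)}
print(record["references"])
```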

Element selector

Element selector selects elements that contain multiple data elements. For example, an Element selector might be used to select a list of items on an e-commerce site. The selector returns each selected element as a parent element to its child selectors. The child selectors of an Element selector extract data only within the element that the Element selector gave them.

Note! If the page dynamically loads new items after scrolling down or clicking a button, you should try the Element scroll down selector or the Element click selector.

Configuration options

  • selector - CSS selector for the wrapper elements that will be used as parent elements for child selectors.
  • multiple - multiple records are being extracted (almost always should be checked). Multiple option for child selectors usually should not be checked.

Use cases

Select multiple e-commerce items from a page

For example, an e-commerce site has a page with a list of items. With an Element selector you can select the elements that wrap these items and then add multiple child selectors to it to extract data within each item's wrapper element. Figure 1 shows how an Element selector could be used in this situation.

Fig. 1: Multiple items selected with element selector

Extract data from tables

Similarly to e-commerce item selection, you can also select table rows and add child selectors for data extraction from table cells. However, the Table selector might be a much better solution.

Element scroll down selector

This is another Element selector. It works similarly to the Element selector, but additionally it scrolls down the page multiple times to find the elements that are added when the page is scrolled to the bottom. Use the delay attribute to configure the waiting interval between scrolling and element search. Scrolling stops when no new elements are found. If the page can scroll infinitely, this selector will be stuck in an infinite loop.
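
The stop condition described above can be sketched as a loop that keeps scrolling until a scroll yields no new elements. The Python example below simulates this with a fake page. The names collect_with_scrolling and FakePage are hypothetical, and the max_scrolls limit is an extra safeguard added in this sketch. The text above does not describe such a limit in the extension.

```python
def collect_with_scrolling(load_more, max_scrolls=None):
    """Keep scrolling until a scroll yields no new elements.

    load_more() stands in for "scroll to the bottom, wait for the delay,
    then select the elements" and returns all elements currently on the page.
    max_scrolls is a safety limit (an assumption of this sketch) for pages
    that would otherwise scroll forever.
    """
    seen = []
    scrolls = 0
    while max_scrolls is None or scrolls < max_scrolls:
        elements = load_more()
        new_elements = [element for element in elements if element not in seen]
        if not new_elements:
            break  # nothing new appeared, so scrolling stops
        seen.extend(new_elements)
        scrolls += 1
    return seen


class FakePage:
    """Simulated page that reveals two more items per scroll."""

    def __init__(self, total):
        self.total = total
        self.visible = 0

    def scroll(self):
        self.visible = min(self.visible + 2, self.total)
        return [f"item-{i}" for i in range(1, self.visible + 1)]


page = FakePage(total=6)
print(collect_with_scrolling(page.scroll))
```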

Configuration options

  • selector - CSS selector for the element.
  • multiple - multiple records are being extracted (almost always should be checked). Multiple option for child selectors usually should not be checked.
  • delay - delay before element selection and delay between scrolling. This should usually be specified because the data won't be loaded immediately from the server after scrolling down. More than 2000 ms might be a good choice if you don't want to lose data because the server didn't respond fast enough.

Use cases

See Element selector use cases.

Element click selector

Element click selector works similarly to Element selector. Its main purpose is also to select elements that are given as parent elements to its child selectors. The only difference is that the Element click selector can interact with the web page by clicking on buttons to load new elements. For example, a page might use JavaScript and AJAX for pagination or item loading.

Note! When selecting clickable elements you should select them by moving the mouse over the element and pressing "S". Selecting them this way avoids triggering the button's events.

Configuration options

  • selector - CSS selector for the wrapper elements that will be used as parent elements for child selectors.
  • click selector - CSS selector for the buttons that need to be clicked to load more elements.
  • click type - how the selector determines when there will be no new elements and clicking should stop.
  • click element uniqueness - how the selector determines which buttons have already been clicked.
  • multiple - multiple records are being extracted (almost always should be checked). Multiple option for child selectors usually should not be checked.
  • delay - delay before element selection and delay between clicking. This should usually be specified because the data won't be loaded immediately from the server. More than 2000 ms might be a good choice if you don't want to lose data because the server didn't respond fast enough.
  • Discard initial elements - the selector will not return the elements that were available before clicking for the first time. This might be useful for duplicate removal.

Click type

Click Once

Click Once type will click on the buttons only once. If a new button appears that can be selected it will be also clicked. For example pagination links might show pages 1 to 5 but pages 6 to 10 would appear some time later. The selector will also click on those buttons.

Click More

Click More type makes the selector click on the given buttons multiple times until no new elements appear. An element is considered new if its text content is unique.

Click element uniqueness

When using Click Once only unique buttons will be clicked. When using Click More this helps to ignore buttons that don't generate more elements.

  • Unique Text - buttons with identical text content are considered equal
  • Unique HTML+Text - buttons with identical HTML and text content are considered equal
  • Unique HTML - buttons with identical HTML and stripped text content are considered equal
  • Unique CSS Selector - buttons with identical CSS Selector are considered equal
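
To illustrate how button uniqueness interacts with the Click Once type, the Python sketch below simulates pagination buttons that are compared by their text (the Unique Text option). Each distinct button is clicked once, including buttons that only appear after an earlier click. The function names and the simulated page are hypothetical, and the sketch is not the extension's implementation.

```python
def click_once(initial_buttons, click):
    """Click every distinct button exactly once, including ones that appear later.

    Buttons are represented by their text, so two buttons with identical text
    count as the same button. click(button) stands in for the real click and
    returns the buttons visible afterwards.
    """
    clicked = []
    pending = list(initial_buttons)
    while pending:
        button = pending.pop(0)
        if button in clicked:
            continue
        clicked.append(button)
        for candidate in click(button):
            if candidate not in clicked and candidate not in pending:
                pending.append(candidate)
    return clicked


def fake_click(button):
    """Simulated pagination: buttons 6-8 become visible after clicking 5 or later."""
    if button in {"5", "6", "7", "8"}:
        return ["4", "5", "6", "7", "8"]
    return ["1", "2", "3", "4", "5"]


print(click_once(["1", "2", "3", "4", "5"], fake_click))
```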

Use cases

Navigate pagination using "Click once" selector type

For example, there is a site that displays a list of items and has some pagination buttons that reload these items dynamically (after clicking a button the URL doesn't change; a change only after the hash sign # doesn't count). Using an Element click selector you can select these items and the buttons that need to be clicked. During the scraping phase the scraper will click these buttons to extract all elements. You also need to add child selectors to the Element click selector that select data within each element. Figure 1 shows how to configure the Element click selector to extract data from the described site.

Fig. 1: Sitemap when using Click once type

Load more items in an e-commerce site by clicking "More" button

This example is similar to the one above. The only difference is that on this site items are loaded by clicking a single button multiple times. In this case the Element click selector should be configured to use the "Click more" click type. Figure 2 shows how to configure the Element click selector to extract data from this site.

Fig. 2: Sitemap when using Click more type

CSS selector

Web Scraper uses CSS selectors to find HTML elements in web pages and to extract data from them. When you select an element, the Web Scraper makes its best guess at what the CSS selector for the selected elements might be. You can also write the selector yourself and test it by clicking "Element preview". You can use CSS selectors that are available in CSS versions 1-3 and also the pseudo selectors that are additionally available in jQuery. Here are some documentation links that might help you:

Additional Web Scraper selectors

It is possible to add new pseudo CSS selectors to Web Scraper. Right now there is only one CSS selector added.

Parent selector

CSS Selector _parent_ allows a child selector of an Element selector to select the element that was returned by the Element selector. For example this CSS selector could be used in a case where you need to extract an attribute from the element that the Element selector returned.

Storage backends

Web Scraper can be configured to use either local storage or CouchDB. By default all data is stored in the local storage.

Local storage

Local storage backend uses the browser's built-in database to store data. This data is not replicated from one Chrome instance to another.

CouchDB

CouchDB is a RESTful NoSQL JavaScript database. You can configure the extension to store sitemaps and scraped data in this database. The data is then accessible from all your Chrome instances. To do that, configure it in the options page, which you can open by right-clicking the extension's icon and selecting Options. There you can switch between storage backends. For CouchDB you need to configure the database where sitemaps will be stored and the CouchDB server where scraped data will be stored. For example, you can configure it like this:

  • sitemap db - http://localhost:5984/scraper-sitemaps
  • data db - http://localhost:5984/