Web Scraper Cloud Parser feature release
December 23, 2019
Web Scraper Cloud, Data post processing, Parser
We are happy to finally introduce a Parser feature for Web Scraper Cloud.
Usually, post-processing data means writing a custom script or spending extra time editing the data manually in spreadsheet software. The Parser takes care of this and makes the process easier.
Its modular design allows the user to create, chain and configure multiple parsers for each column, making it easy to build the most suitable post-processing method, from very simple to more sophisticated.
The Parser includes the following parser types:
- RegEx Match;
- Replace Text;
- Remove Whitespaces;
- Strip HTML.
Each parser type handles a different kind of data processing. The “Replace Text” parser, as the name indicates, replaces or removes a string. The “Remove Whitespaces” parser cleans up fields scraped by the Text Selector by removing extra white space and unnecessary new lines from the text. A particularly useful trick is that multiple parsers can be created for the same column, so they can be combined into whatever processing method suits the data best.
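To make the four parser types more concrete, here is a minimal Python sketch of what each one conceptually does and how several can be chained on one column. This is not how Web Scraper Cloud implements the Parser; the function names, the `apply_chain` helper and the sample value are all hypothetical and exist only for illustration.

```python
import re
from html import unescape


def regex_match(text, pattern):
    """Return the first match of pattern in text, or an empty string."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


def replace_text(text, old, new=""):
    """Replace every occurrence of old with new (empty string removes it)."""
    return text.replace(old, new)


def remove_whitespaces(text):
    """Trim the text and collapse spaces, tabs and new lines into single spaces."""
    return " ".join(text.split())


def strip_html(text):
    """Drop anything that looks like an HTML tag (a naive regex, not a full HTML parser)."""
    return unescape(re.sub(r"<[^>]+>", "", text))


def apply_chain(value, parsers):
    """Run the value through each parser in order, like chained parsers on one column."""
    for parser in parsers:
        value = parser(value)
    return value


# Hypothetical scraped value, invented for this example.
raw = "<b>  Price:\n  $1,299.00 </b>"
chain = [
    strip_html,
    remove_whitespaces,
    lambda t: replace_text(t, "Price: ", ""),
    lambda t: replace_text(t, "$", ""),
]
cleaned = apply_chain(raw, chain)
print(cleaned)
```

Each step in the chain is deliberately trivial; the result comes from the order in which the steps are applied.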
Aside from the parser types, the Parser feature also lets the user create a virtual column, which combines information from two or more source columns and can have parsers applied to it. There is also a “Remove Column” function, which removes columns so that irrelevant data does not end up in the final scraped data.
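The virtual column and “Remove Column” ideas can be sketched in a few lines of Python. The sketch assumes the scraped data is a list of dictionaries, one per row; the helper names, the separator and the sample row are hypothetical and are not part of Web Scraper Cloud.

```python
def add_virtual_column(rows, name, sources, separator=" "):
    """Return new rows with an extra column built by joining the source columns."""
    return [
        {**row, name: separator.join(str(row[s]) for s in sources)}
        for row in rows
    ]


def remove_columns(rows, names):
    """Return new rows without the listed columns."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


# Hypothetical row, invented for this example.
rows = [{"brand": "Acme", "model": "X200", "web-scraper-order": "1577-1"}]
rows = add_virtual_column(rows, "product", ["brand", "model"])
rows = remove_columns(rows, {"web-scraper-order"})
print(rows)
```

Because the virtual column is an ordinary column once it exists, any further clean-up step could be applied to “product” in the same way as to a scraped column.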
Let’s take an example. After scraping multiple pages of an e-commerce website, the data comes out looking something like this:
Data like this is not practical for further analysis; it is hard to read and review. To make the data easier to use, in this particular example we start by removing all the unnecessary columns, such as “web-scraper-order” and “web-scraper-start-url”, by applying the “Remove Column” function of the Parser feature.
Then, with the “Replace Text” parser, we make sure that the “price” data output has no dollar sign.
And finally, we create a simple “RegEx Match” parser to extract only numbers from the “reviews” column.
After these steps, all the necessary data processing is done, and the output is simpler and easier to use and analyze.
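For readers who think in code, the three steps of this example can be approximated in Python as follows. This is only a sketch of the equivalent operations, not the Parser itself; the sample row, its values and the `clean_row` helper are invented for illustration, and only the column names “web-scraper-order”, “web-scraper-start-url”, “price” and “reviews” come from the example above.

```python
import re

# Hypothetical scraped row; the values are made up for this sketch.
rows = [
    {
        "web-scraper-order": "1",
        "web-scraper-start-url": "https://example.com/",
        "name": "Sample laptop",
        "price": "$295.99",
        "reviews": "14 reviews",
    },
]

UNNECESSARY = {"web-scraper-order", "web-scraper-start-url"}


def clean_row(row):
    # Step 1: the "Remove Column" function drops the unnecessary columns.
    kept = {k: v for k, v in row.items() if k not in UNNECESSARY}
    # Step 2: a "Replace Text" step removes the dollar sign from the price.
    kept["price"] = kept["price"].replace("$", "")
    # Step 3: a "RegEx Match" step keeps only the first run of digits in reviews.
    match = re.search(r"\d+", kept["reviews"])
    kept["reviews"] = match.group(0) if match else ""
    return kept


cleaned = [clean_row(row) for row in rows]
print(cleaned)
```

The pattern `\d+` is one plausible choice for extracting only the numbers; the actual expression used in a “RegEx Match” parser may differ.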
The Parser feature is easy to use. Basic knowledge of RegEx is useful but not required. The modular design allows multiple parsers to be chained together, so the user can create, for example, several very simple “Replace Text” operations rather than a single parser with a more complicated configuration.