Data Cleaning with OpenRefine
August 14, 2020
OpenRefine, Data, Data transformation, Tutorial
Data transformation is a step for preparing data sets for AI training and analysis. Clean and transformed data is a vital part of precise and correct data reports and analysis. In this blog series part, we will look at what importance it is to detect and delete blank, inconsistent and duplicate data entries.
NOTE! You can follow the process by downloading the example data we have provided here, or/and watch the YouTube tutorial video here. To read the first part of the "Data Transformation with OpenRefine" visit our previous blog post.
Before we begin the transformation, we have to point out two very important features that OpenRefine provides:
When data transformation is not going your way and you have made changes to data that do not compel with the goal structure, OpenRefine provides a feature of forgiveness and allows you to fix the problems that were caused by wrong transformations. The Undo/Redo feature, located on the left side-tool, works like a “time machine”. Whenever you make a change to the data, it logs it in the feed, to redo or undo specific changes, all you have to do is find it in the log, click on it and OpenRefine will transform the data back.
Faceting is the feature that brings the most detective work of data with OpenRefine. It allows the user to split up data for a specific character. What a facet does is that it groups the values that are in a column, and enables the user to filter through and edit the values across various cells simultaneously, working as a browsing mechanism for seeing a bigger picture of your data and detecting any inconsistencies. Faceting will be the one feature we will be using the most in these blog series, especially in this part.
Let us start our data transformation by making sure that there are no blank entries that could create inconsistencies in our data.
The first step is to decide which unique metric column (a metric column like IDs, SKUs, Surnames, emails, etc, such that can not be categorized) we are going to take as our indicator for luring out any blank entries.
In the case of our data, the “product-id” column will be the indicator.
Select the column “product-id” -> Facet -> Numeric facet.
A log on the left-hand side will open up; however, will show that there are no numeric values present.
To fix this, select the column again -> Edit cells -> Common transforms -> To number.
Now, on the left-hand side, the tab will show a data ratio graph with the number of numeric, non-numeric, blank, error entries.
As it shows, our data consists of two non-numeric “product-id” rows. To see the records, un-tick the “Numeric” box of the facet log, and it shows that there are two blank entries with non-consistent “product-id” values. Such entries, especially in larger volumes, can clutter the data; therefore we are going to delete the two blank records.
To do so, leave only the “non-numeric” tab ticked,
Click on All (column) -> Edit rows -> Remove matching rows.
Delete the facet by clicking on the “x” on the upper left side of its tab and the complete data, now without the blanks, will show.
That is how blank, illogical data entries can be disposed of in OpenRefine.
Now we are going to take a look at duplicates. To delete duplicate entries, again - select the column that data should contain unique values (that is not category data, color descriptions, etc. Do select such data as ID numbers, Names, Addresses, etc), in this case, it is “product-id”.
Click to Sort the column.
Then select “Sort cell values as NUMBERS” and click “OK”.
Since the “sort” function works as a filter, it is necessary to make the sort filter permanent for the duplicate removal to be successful; therefore, select “Reorder rows permanently” in the upper toolbar.
Now, we are going to delete the repeated values.
That can be done by the “blank down” feature.
Click on the column -> Edit cells -> Blank down.
To remove the duplicates entirely, select the column -> Facet -> Customized facets -> Facet by blank (null or empty string).
In the facet tab, select “true”, then from the “All” column -> Edit rows -> Remove matching rows.
This data transformation step might take a while for Open Refine to process since we are working with big data that consist of over one million rows of data.
Once the duplicates have been deleted - close the facet bar.
The unique-indicator is left, to delete it we flag it.
Now create a facet by flag from the "All" column.
Select "true" of the facet log, and then "Remove All Matching rows" from the "All" column.
That is it!
As can be observed - more than a million blanks and duplicates have been deleted. That is a lot of unnecessary, inconsistent data, and thanks to data transformation, we now have cleaner and more accurate data to work with.
In the following blog-series of data transformation with OpenRefine, we will take a look at clustering, data entry transformation, and data merging.
We have also created a tutorial-type videos showing everything in motion on our YouTube channel.