Introduction to OpenRefine
July 28, 2020
Data, Data transformation
As the saying goes, “garbage in, garbage out”. The same applies to data analysis done without data transformation beforehand. With bad, messy data, only lousy, chaotic analysis can be done. With data transformation, however, not only does the analysis become easier, but its precision also increases.
OpenRefine, formerly known as Google Refine, is an open-source tool that will bring your data analysis to the next level without any extra costs. Using a web browser as a graphical interface, OpenRefine works as a desktop application that allows the user to transform, enrich, join, and unclutter data.
Incomplete records, spelling mistakes, unnecessary strings, and duplicate records are examples of problems where OpenRefine works as a near-perfect transformation tool.
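To make these problem types concrete, here is a minimal Python sketch of the same kinds of cleanup done by hand. It is only an illustration of the ideas, not how OpenRefine works internally; the sample rows and the helper names (`normalize`, `clean`, `incomplete`) are hypothetical and are not taken from the tutorial's sample file.

```python
# Hypothetical sample rows, invented for illustration only.
rows = [
    {"name": "Anna Smith", "city": "Riga"},
    {"name": "  anna smith ", "city": "Riga"},  # stray whitespace and casing
    {"name": "John Doe", "city": ""},           # incomplete record
    {"name": "John Doe", "city": ""},           # exact duplicate
]


def normalize(row):
    """Trim whitespace, collapse inner spaces, and title-case each value."""
    return {key: " ".join(value.split()).title() for key, value in row.items()}


def clean(rows):
    """Normalize every row, then keep only the first copy of each record."""
    seen = set()
    cleaned = []
    for row in map(normalize, rows):
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # duplicate record, skip it
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


def incomplete(rows):
    """Return the rows that still have at least one empty value."""
    return [row for row in rows if any(value == "" for value in row.values())]


cleaned = clean(rows)
print(cleaned)
print(incomplete(cleaned))
```

In OpenRefine, comparable steps are done through the graphical interface rather than by writing code, which is what makes it a gentle introduction to data transformation.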
With the rapid growth in the amount of data, many data transformation tools are being developed or are already available online; however, most of them are quite costly and complicated. OpenRefine can work as a great introduction to how data can be cleaned and transformed, and to the advantages data transformation can bring.
Later on in the series, we will take a look at the differences between the data transformation capabilities of Excel and OpenRefine to better show and explain why, for data transformation purposes, OpenRefine performs better with big data.
To access OpenRefine, visit the OpenRefine webpage and follow the installation instructions.
NOTE: As mentioned previously, OpenRefine runs in a browser; however, it does not require an internet connection to run. It will automatically open in your default browser.
With OpenRefine installed, it is finally time to approach the actual practice of data transformation. We have provided an example file so that you can follow along with our upcoming tutorial series, in which we are going to dive deeper into the possibilities of data cleanup and transformation with OpenRefine.
To download the sample file with the generated data, created especially for this tutorial series, click here.
Let us begin by creating a project.
Open up OpenRefine, import a file by clicking the “Browse…” button, and select the necessary file.
Click “Next”, and a screen with possible file configuration options will appear. It is important to make sure that the correct column separator option is selected. For the data we are using in the example, no other configuration is needed.
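To see why the column separator matters, the following minimal sketch uses Python's standard `csv` module to guess the separator of a small text sample and then split it into columns. This is not OpenRefine's own detection logic; the sample text and the `detect_separator` helper are assumptions made up for illustration.

```python
import csv


def detect_separator(sample, candidates=",;\t|"):
    """Guess which of the candidate characters separates the columns."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical sample; the tutorial's file may use a different separator.
sample = "name;city;age\nAnna;Riga;34\nJohn;London;28\n"

separator = detect_separator(sample)
rows = list(csv.reader(sample.splitlines(), delimiter=separator))

print(repr(separator))
for row in rows:
    print(row)
```

If the wrong separator were chosen, each line would end up in a single column instead of three, which is the kind of problem the configuration screen lets you catch before the project is created.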
Change the project name in the upper right corner if necessary. Also, it is possible to create tags for your projects so it is easier to filter through them in the future. Click “Create Project” in the upper right corner to launch the project.
For some, the layout can seem quite straightforward; for others, it appears difficult and complex. Click around and get acquainted with the layout and interface.
In the next part of the series, we are going to dive into faceting, reversing changes, and deleting blanks and duplicates.