CLEAN AND PREPARE DATA FOR ANALYSIS ON BIGQUERY USING SQL
When we talk about Big Data, we naturally think of Volume, Velocity and Variety, referring to the incredible quantity of data we generate, process, store and read every day (about 2.5 quintillion bytes created daily on the web alone). We think less often about Veracity and Value. The challenge in data science today is to draw insights from big data and turn them into Value. The greatest risk of failure is a lack of trust in the data, in other words a lack of Veracity. This makes data quality a genuine challenge.
Sensors, smartphones, GPS receivers and other devices can generate quality data, but also a lot of garbage. In addition, collected data typically contain missing or “null” values, fake values, duplicate records, outliers and so on. This means the data must be prepared before any analysis.
The cartoon above, from kdnuggets.com, suggests that robots can do this hard work for us. Unfortunately, that is not yet the case, even if Artificial Intelligence (AI) is gradually narrowing the gap between insights drawn from structured and unstructured data.
Today, data scientists and analysts spend around 80% of their time transforming, cleaning and preparing data, because they know it is impossible to guarantee a consistent and relevant predictive model built on corrupted data. (The figure below, from Forbes.com, shows the share of data preparation in a data scientist’s work.)
SQL, the wrangler-in-chief on BigQuery
BigQuery is one of the serverless services available on Google Cloud Platform. It is a scalable data warehouse designed to increase the productivity of data scientists: with no infrastructure to manage and no database administrator to hire, you can focus on finding insights for your company.
That is the promise, but what does it look like in practice?
In many organizations, data from numerous source systems, such as sales, human resources, finance, logistics, web activity and IoT, is stored on BigQuery. These data are rarely ready to use: they are often raw and do not always fit new business models.
· The use case: as an example, let’s use a denormalized dataset on BigQuery containing the archives of users’ events on the GitHub platform. (Part of its schema is shown below.)
When we preview the data, as below, we can see different formats mixed in the table. This is a semi-structured data model.
· Basic count query
Let’s presume we would like to know the most frequent event types in this large dataset. To do so, we just need to write a query: a SELECT with a COUNT, grouped by event type and ordered by the count, and the job is done. We add a filter to avoid counting fake or invalid user profiles. (A sketch of such a query is shown below.)
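Here is a minimal sketch of such a count query in BigQuery standard SQL. The table name and the exact filter are assumptions for illustration, not the original query:

SELECT
  type AS event_type,                      -- e.g. PushEvent, WatchEvent, ...
  COUNT(*) AS nb_events
FROM `my_project.github_archive.events`   -- hypothetical table name
WHERE actor IS NOT NULL
  AND actor != ''                          -- skip fake or non-valid profiles
GROUP BY event_type
ORDER BY nb_events DESC
LIMIT 10;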
Thanks to the speed of BigQuery, the query scans these 4 terabytes of data and completes in less than 10 seconds, giving us the 10 most popular event types, as shown below.
OK, that is great. But what if we want to run a more complex analysis and not just count events? How can we guarantee the truthfulness of the results when the stored data are raw, with “null” or missing values and fake or invalid values?
· Extract data
Now we would like to get more specific insights from our data. For instance, it would be nice to know which language each actor uses when pushing a commit or opening a pull request, as a proxy for the users’ skills.
Unfortunately, there is no column named language in this big table, and we have the same problem if we want to know the name or location of the actors. This is because the stored data are semi-structured; BigQuery accepts many data formats. When we explore row by row, this information appears somewhere inside a JSON string.
Let’s look at this example of a JSON line in the payload column.
{"repository":{"description":"mirror of dwm http://hg.suckless.org/dwm","homepage":"http://www.suckless.org/dwm/","watchers":38,"stargazers":38,"forks":15,"fork":false,"size":2744,"owner":"sr","private":false,"open_issues":2,"has_issues":true,"has_downloads":true,"has_wiki":true,"language":"C","created_at":"2008-08-25T15:04:52-07:00","pushed_at":"2009-11-25T06:00:13-08:00","master_branch":"master"},"actor_attributes":{"type":"User","name":"Cecile Tonglet","company":"","blog":"http://cecton.com","location":"Belgium","email":"233746a4f3b6febf0745dc738312967ec4324d40@gmail.com"},"url":"https://github.com/cecton/dwm"}
We can easily find the language, the user name and the location. Now we can extract this information into a structured table.
To extract this information, we use the JSON_EXTRACT function, as below.
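Here is a sketch of the extraction in BigQuery standard SQL, assuming the table exposes a type column, an actor column holding the login and the payload column containing the JSON shown above; the table name and JSON paths are assumptions for illustration:

SELECT
  actor AS actor_login,
  JSON_EXTRACT(payload, '$.repository.language') AS language,
  JSON_EXTRACT(payload, '$.actor_attributes.name') AS actor_name,
  JSON_EXTRACT(payload, '$.actor_attributes.location') AS location
FROM `my_project.github_archive.events`   -- hypothetical table name
WHERE type IN ('PushEvent', 'PullRequestEvent');

Note that JSON_EXTRACT returns string values still wrapped in double quotes (for example "C" instead of C), which is one of the special characters we will remove in the next step; JSON_EXTRACT_SCALAR would return the bare value.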
The extracted data, shown below, still contain “null” or missing values. We need to continue cleaning.
· Replace “null” and remove special characters
After extraction, the result still contains “null” values, missing values and special characters that are not supported by some analysis tools or applications. To prepare the data for analysis, we will replace a missing actor_name with the actor_login value, replace a missing location with “-”, and remove special characters such as /, \ and quotation marks.
We will use IFNULL and a regular expression function (REGEXP_REPLACE), as below, to complete the cleaning.
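Here is a minimal cleaning sketch, assuming the output of the extraction above has been saved to a hypothetical table my_project.github_archive.extracted_events:

SELECT
  actor_login,
  -- replace a missing name by the login, then strip quotes, backslashes and slashes
  REGEXP_REPLACE(IFNULL(actor_name, actor_login), r'["\\/]', '') AS actor_name,
  -- replace a missing location by "-" and strip the same special characters
  REGEXP_REPLACE(IFNULL(location, '-'), r'["\\/]', '') AS location,
  REGEXP_REPLACE(language, r'["\\/]', '') AS language
FROM `my_project.github_archive.extracted_events`;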
The result is shown below: the “null” values are replaced and the special characters are removed. The dataset is now prepared and ready to use.
CONCLUSION
With a few lines of SQL, we can extract, clean and prepare raw data for analytics, data science or machine learning. You just need people with good SQL skills to keep that productivity gain, since Google charges you for every query you run on your data.
Another architectural option is to treat BigQuery purely as a data source. In that case, you use client drivers to import your data into open-source tools such as Python or R frameworks and libraries.