What is data cleaning?
Data cleaning is the process of preparing data for analysis by modifying or removing data that is inaccurate, incomplete, meaningless, duplicated, or improperly formatted. Such data is usually not helpful for analysis: it can disrupt later processing steps and produce inaccurate or irrelevant results. There are several data cleaning methods, and the right one depends on the nature of the data.
The purpose of data cleaning is not merely to erase information to make room for fresh data, but to find a way to improve the accuracy of a data set without necessarily deleting information.
Data cleaning therefore involves more than removing data: it includes fixing spelling and syntax errors, standardizing data sets, correcting problems such as empty fields and missing codes, and identifying duplicate data points. It is considered a foundational element of data science because it plays an important role in analysis and in uncovering the reliable answers hidden in the data.
What are the techniques of data cleaning?
a) Handling (filling) missing values: Missing values in data can be handled using any of the techniques below:
i. Ignoring/Dropping: In some cases it is better to ignore or drop a tuple that contains a missing value than to fill it. This is generally practiced on large datasets, where excluding some tuples does not affect the information conveyed by the data. It is discouraged for small datasets, where it may lead to the loss of important information.
ii. Filling missing values manually: You can also fill in missing values manually, based on an understanding of the nature of the data. This is usually done on small datasets, since it becomes too time-consuming on large ones.
iii. Filling central values (mean/median) in missing values: This technique is often preferable to the two above, since it keeps every tuple and needs no manual effort. Here we insert the mean or median of the respective attribute in place of each missing value. For better results, first group the data on the basis of similar attributes, then apply the technique within each group.
iv. Interpolation: This is one of the more reliable and principled ways of filling a missing value. In this technique, we first model the relationship among the attributes and then predict the most probable value for each missing entry.
This can be achieved by regression, Bayesian formulation, or decision tree induction.
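The techniques above (except manual filling) can be sketched in plain Python. This is only an illustrative sketch: the records, the field names (dept, years, salary), and the helper function names are hypothetical and not taken from any particular library, and simple least-squares regression on a single attribute stands in for the model-based prediction described in technique iv.

```python
from statistics import mean, median

# Hypothetical records; None marks a missing value.
rows = [
    {"dept": "A", "years": 1, "salary": 30},
    {"dept": "A", "years": 2, "salary": None},
    {"dept": "A", "years": 3, "salary": 50},
    {"dept": "B", "years": 4, "salary": 60},
    {"dept": "B", "years": 5, "salary": None},
    {"dept": "B", "years": 6, "salary": 80},
]

def drop_missing(rows, key):
    """Technique i: drop every tuple whose `key` is missing."""
    return [r for r in rows if r[key] is not None]

def fill_central(rows, key, center=median):
    """Technique iii: fill missing `key` values with the mean or median."""
    value = center([r[key] for r in rows if r[key] is not None])
    return [{**r, key: value} if r[key] is None else r for r in rows]

def fill_central_by_group(rows, key, group_key, center=median):
    """Technique iii, grouped: compute the central value per group first."""
    groups = {}
    for r in rows:
        if r[key] is not None:
            groups.setdefault(r[group_key], []).append(r[key])
    centers = {g: center(vals) for g, vals in groups.items()}
    return [
        {**r, key: centers[r[group_key]]} if r[key] is None else r
        for r in rows
    ]

def fill_by_regression(rows, key, predictor):
    """Technique iv: fit key = a + b * predictor on complete rows, then predict."""
    known = [(r[predictor], r[key]) for r in rows if r[key] is not None]
    x_bar = mean(x for x, _ in known)
    y_bar = mean(y for _, y in known)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in known)
        / sum((x - x_bar) ** 2 for x, _ in known)
    )
    intercept = y_bar - slope * x_bar
    return [
        {**r, key: intercept + slope * r[predictor]} if r[key] is None else r
        for r in rows
    ]

print(len(drop_missing(rows, "salary")))
print(fill_central(rows, "salary")[1]["salary"])
print(fill_central_by_group(rows, "salary", "dept")[1]["salary"])
print(fill_by_regression(rows, "salary", "years")[1]["salary"])
```

Note how the grouped fill uses only the values from the same department, so a missing salary in group A is not pulled toward the higher salaries in group B.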
b) Removing noise (smoothing) from data: What is noise in data? Noise is any kind of random error or variance in a measured attribute. Outliers present in the data can also be regarded as noise. Noise can strongly affect mining results (that is, predictions), so noisy data is not considered good for mining purposes and should be removed as far as possible. Before removing noise, we need to know how to detect it. There are many noise detection techniques, but one of the most informative is visualization: plotting different attributes of the data as graphs or plots. Informative plots include scatter plots, box plots, etc.
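As a rough sketch of what a box plot shows, the quantities it draws can also be computed directly. The function name box_plot_summary, the sample values, and the 1.5 × IQR whisker rule (a common plotting convention, not something stated above) are assumptions made for illustration.

```python
from statistics import quantiles

def box_plot_summary(values, k=1.5):
    """Compute the quartiles and whisker limits a box plot would display.

    Values beyond q1 - k*IQR or q3 + k*IQR are the points a box plot
    draws individually, which makes them candidates for noise.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    suspects = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": q2, "q3": q3,
            "low": low, "high": high, "suspects": suspects}

# Hypothetical measurements with one suspicious reading.
summary = box_plot_summary([7, 9, 14, 15, 17, 19, 22, 25, 90])
print(summary["suspects"])
```

A flagged value is only a candidate: whether it is genuine noise or a valid extreme reading still has to be judged from the nature of the data.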
One of the most popular methods for smoothing (removing noise from) data is the binning method. Binning smooths a sorted value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of bins (groups or buckets). This is also called local smoothing, because each value is smoothed using only its neighbors.
Let’s see what binning actually means with an example.
We have the sorted data 7, 9, 14, 15, 17, 19, 22, 25, which we split into two bins of four values each:
Bin 1 = 7, 9, 14, 15
Bin 2 = 17, 19, 22, 25
Smoothing by bin means: We replace each member of a bin with the mean of that bin. This gives:
Bin 1 = 11.25, 11.25, 11.25, 11.25
Bin 2 = 20.75, 20.75, 20.75, 20.75
Smoothing by bin boundaries: We replace each value with the nearest boundary value (the minimum or maximum) of its bin. This gives:
Bin 1 = 7, 7, 15, 15
Bin 2 = 17, 17, 25, 25
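The binning example can be reproduced with a short Python sketch. The function names are hypothetical, the bins are assumed to hold an equal number of values, and ties between the two boundaries are sent to the lower one (an assumption, since the example contains no ties).

```python
from statistics import mean

def make_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size`."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every member of a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every member of a bin with the nearer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

data = [7, 9, 14, 15, 17, 19, 22, 25]
bins = make_bins(data, bin_size=4)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```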
Smoothing can also be done by removing outliers. When similar values are clustered (grouped), the values that fall outside the clusters are called outliers.
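A very simple way to illustrate this idea on one-dimensional data is to group sorted values that lie close together and treat values left in tiny groups as outliers. This is only a sketch: the gap-based grouping, the function names, and the max_gap and min_cluster_size parameters are assumptions chosen for illustration, and real clustering algorithms are considerably more sophisticated.

```python
def split_into_clusters(values, max_gap):
    """Group sorted values; a new cluster starts when the gap exceeds `max_gap`."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough to the previous value
        else:
            clusters.append([v])     # too far away: start a new cluster
    return clusters

def remove_outliers(values, max_gap, min_cluster_size=2):
    """Keep values in clusters of at least `min_cluster_size`; report the rest."""
    clusters = split_into_clusters(values, max_gap)
    kept = [v for c in clusters if len(c) >= min_cluster_size for v in c]
    outliers = [v for c in clusters if len(c) < min_cluster_size for v in c]
    return kept, outliers

# Hypothetical data: the earlier example values plus one distant reading.
kept, outliers = remove_outliers([7, 9, 14, 15, 17, 19, 22, 25, 90], max_gap=5)
print(kept)
print(outliers)
```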