# Easy Explanation of Data Normalization/Standardization in Machine Learning (using Python)

What is Data Normalization or Standardization in Machine Learning?

Data normalization or standardization is the process of rescaling the original data without changing its behavior or nature. We define a new range (most commonly [0, 1] or [-1, 1]) and convert the data accordingly. Normalization is useful for algorithms that involve neural networks or distance computations (e.g. KNN, K-means).

**a) Min-Max Normalization/standardization:** It performs a linear transformation on the original data. Let $(X_1, X_2)$ be the minimum and maximum of an attribute and $(Y_1, Y_2)$ be the new range to which we are normalizing. Then for a value $V_i$ of the attribute, the normalized value $U_i$ is given as,

$$U_i = \frac{V_i - X_1}{X_2 - X_1}\,(Y_2 - Y_1) + Y_1$$

Min-max normalization preserves the relationships among the original data values. If a future input value falls outside the original minimum-to-maximum range of the attribute, its normalized value falls outside the new range; this is known as an “out-of-bound error.”

Let’s see an example: Suppose the minimum and maximum values for the price of a house are $125,000 and $925,000 respectively, and we need to normalize prices to the range [0, 1]. We can use min-max normalization to transform any value between them (say, $300,000). In this case we use the min-max formula with,

$V_i$ = 300,000

$X_1$ = 125,000

$X_2$ = 925,000

$Y_1$ = 0

$Y_2$ = 1

which gives $U_i$ = (300,000 - 125,000) / (925,000 - 125,000) × (1 - 0) + 0 = 0.21875.
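The house-price example can be checked with a few lines of plain Python. This is a minimal sketch rather than a library routine: the function name `min_max_scale`, the out-of-range price of $1,000,000, and the second target range [-1, 1] are assumptions made for illustration.

```python
def min_max_scale(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map value from [old_min, old_max] onto [new_min, new_max]."""
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# House-price example from the text: minimum 125,000, maximum 925,000, target [0, 1]
scaled_price = min_max_scale(300_000, 125_000, 925_000)
print(scaled_price)  # 0.21875

# Hypothetical later price above the original maximum: the result leaves [0, 1]
out_of_range = min_max_scale(1_000_000, 125_000, 925_000)
print(out_of_range)  # 1.09375

# The same price mapped onto the range [-1, 1] instead
scaled_symmetric = min_max_scale(300_000, 125_000, 925_000, -1.0, 1.0)
print(scaled_symmetric)  # -0.5625
```

The second call shows the out-of-bound case: a price beyond the original maximum maps to a value above 1. One possible remedy is to clip such results to the target range.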

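**In Python:** the scikit-learn example below scales a toy data matrix to the `[0, 1]` range:

```python
from sklearn import preprocessing
import numpy as np

# Toy data: 3 samples (rows) x 3 features (columns)
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# MinMaxScaler rescales each column to [0, 1] by default
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)
```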

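**output:** every column is rescaled to [0, 1] independently, so the three rows become `[0.5, 0, 1]`, `[1, 0.5, 0.333]` and `[0, 1, 0]`.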

**b) Z-score Normalization/standardization (Zero mean normalization/standardization):** In this technique, the values are normalized based on the mean and standard deviation of attribute A. For a value $V_i$ of attribute A, the normalized value $U_i$ is given as,

$$U_i = \frac{V_i - \mathrm{Avg}(A)}{\mathrm{Std}(A)}$$

where Avg(A) and Std(A) represent the average and standard deviation of the values of attribute A, respectively.

Let’s see an example: Suppose that the mean and standard deviation of the values for attribute *income* are $54,000 and $16,000 respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
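The same arithmetic can be sketched in plain Python using only the standard library. The function names `z_score` and `standardize` and the small sample list are hypothetical; the income value of $73,600 is the one that yields 1.225 with the stated mean and standard deviation. The sketch uses the population standard deviation (`statistics.pstdev`), which is an assumption about which definition of Std(A) is intended.

```python
from statistics import mean, pstdev


def z_score(value, avg, std):
    """Z-score of a single value given the attribute's mean and standard deviation."""
    return (value - avg) / std


def standardize(values):
    """Z-score every value in a list, using the list's own mean and population std."""
    avg = mean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [z_score(v, avg, std) for v in values]


# Income example: mean 54,000, standard deviation 16,000
income_z = z_score(73_600, 54_000, 16_000)
print(income_z)  # 1.225

# Hypothetical sample with mean 5 and population standard deviation 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = standardize(sample)
print(standardized)
```

After standardization the sample has mean 0 and standard deviation 1, which is what "zero mean normalization" refers to.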

**In Python:**

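```python
from sklearn.preprocessing import StandardScaler

# 2 samples (rows) x 7 features (columns)
X = [[101, 105, 222, 333, 225, 334, 556],
     [105, 105, 258, 354, 221, 334, 556]]
print("Before standardisation X values are ", X)

# StandardScaler standardizes each column to zero mean and unit variance
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
print("After standardisation X values are ", X)
```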

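**output:** with only two rows, each column is standardized to -1 and 1 (or to 0 where both rows hold the same value), so after standardisation the rows become `[-1, 0, -1, -1, 1, 0, 0]` and `[1, 0, 1, 1, -1, 0, 0]`.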

**c) Decimal Normalization/standardization:** In this method, we normalize the given value by moving its decimal point. The number of places to move is determined by the maximum absolute value in the data set. If $V_i$ is a value of attribute A, then the normalized value $U_i$ is given as,

$$U_i = \frac{V_i}{10^{j}}$$

where $j$ is the smallest integer such that $\max(|U_i|) < 1$.

Let’s understand it with an example: Suppose we have a data set in which the values range from -9900 to 9877. In this case the maximum absolute value is 9900. To perform decimal normalization, we divide each value in the data set by 10,000, i.e. j = 4, since 10,000 is the smallest power of ten greater than 9900. The values -9900 and 9877 then become -0.99 and 0.9877.
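The search for j can be written as a short loop. The following is a minimal sketch in plain Python; the function names `decimal_scaling_exponent` and `decimal_scale` are hypothetical, and the data is just the two endpoints from the example.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    # Keep increasing j until 10**j is strictly larger than the largest magnitude
    while max_abs >= 10 ** j:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, where j comes from decimal_scaling_exponent."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


data = [-9900, 9877]
j = decimal_scaling_exponent(data)
print(j)  # 4

scaled = decimal_scale(data)
print(scaled)  # [-0.99, 0.9877]
```

Note that a maximum absolute value of exactly 10,000 would need j = 5, because the scaled magnitude must be strictly less than 1.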

## Why is data normalization important?

Let’s understand it with an example. Suppose we are building a predictive model using a dataset that contains the net worth of the citizens of a country. In this dataset the values vary over a very large range. If we feed the raw data to a model, it may produce undesirable results, because an attribute with large values can outweigh attributes measured on smaller scales. To avoid that, we opt for normalization.
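For a distance-based method such as KNN, the effect can be illustrated with a small sketch. Everything in it is hypothetical: the three records, the assumed attribute ranges used for min-max scaling, and the helper name `scale`. It only shows how the nearest neighbour of one record can change once both attributes are brought to the same scale.

```python
import math

# Hypothetical records: (net worth in dollars, age in years)
person_a = (50_000, 25)
person_b = (51_000, 65)
person_c = (60_000, 26)

# Assumed minimum and maximum of each attribute, used for min-max scaling to [0, 1]
NET_WORTH_RANGE = (0, 1_000_000)
AGE_RANGE = (0, 100)


def scale(record):
    """Min-max scale both attributes of a record to [0, 1]."""
    net_worth, age = record
    nw_min, nw_max = NET_WORTH_RANGE
    age_min, age_max = AGE_RANGE
    return (
        (net_worth - nw_min) / (nw_max - nw_min),
        (age - age_min) / (age_max - age_min),
    )


# Euclidean distances on the raw data: net worth dominates, age barely matters
raw_ab = math.dist(person_a, person_b)
raw_ac = math.dist(person_a, person_c)
print("raw:", raw_ab, raw_ac)

# Euclidean distances after scaling: both attributes contribute comparably
scaled_ab = math.dist(scale(person_a), scale(person_b))
scaled_ac = math.dist(scale(person_a), scale(person_c))
print("scaled:", scaled_ab, scaled_ac)
```

On the raw data, B looks closest to A even though their ages differ by 40 years; after scaling, C (almost the same age, slightly higher net worth) becomes the nearest neighbour.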

I wrote this article with reference to the book ‘Data Mining: Concepts and Techniques’ by Jiawei Han, Micheline Kamber, and Jian Pei.
