Implementation of Naive Bayes in Python (using sklearn)

Naive Bayes Classifier is a classification algorithm based on Bayes' Theorem of probability. It rests on the assumption that the predictors are independent of each other. In other words, the Naive Bayes classifier assumes that the presence of a particular feature in a class is independent of the presence of any other feature in the same class.
 
Let's understand this concept with an example: a fruit may be considered an orange if it is orange in color, approximately round, and about 2.5 inches in diameter. All of these properties contribute independently to the probability that the fruit is an orange, even if in reality the features depend on each other. This is why the algorithm is called 'Naive'.
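
For intuition, the naive independence assumption lets us score a class by multiplying a prior with per-feature likelihoods. Below is a minimal sketch for the orange example; all of the probability values are made up purely for illustration:

    # Unnormalized naive Bayes score: P(class) * product of P(feature | class).
    # These numbers are hypothetical, chosen only to illustrate the computation.
    p_orange = 0.30                # prior P(orange)
    p_color_given_orange = 0.80    # P(color is orange | orange)
    p_round_given_orange = 0.90    # P(shape is round | orange)
    p_size_given_orange = 0.70     # P(diameter is about 2.5 in | orange)

    score_orange = (p_orange * p_color_given_orange
                    * p_round_given_orange * p_size_given_orange)
    print(score_orange)  # compare with the same score computed for other fruits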
 
The naive Bayes algorithm is simple to understand and easy to build. It does not involve any complicated iterative parameter estimation, and it can be used on small datasets as well as on large datasets with highly sophisticated classification problems.
To read more on the Naive Bayes classifier, visit – Naive Bayes Classifier tutorial
 
In this article, we will implement the naive Bayes classifier using Python (scikit-learn) for a simple classification problem based on the iris dataset. You can download the iris dataset here. We will classify the different iris species based on the length and width of their sepals and petals. Thus the feature vector will be ('sepal.length', 'sepal.width', 'petal.length', 'petal.width') and the corresponding label will be the species/variety.
 
Now let's build the naive Bayes classifier step by step in Python using scikit-learn (sklearn):
 
  • Step 1: Load the data file into the program, giving the location of your CSV file.

    import pandas as pd
    dataset = pd.read_csv("E:/input/iris.csv")  # give the location of your CSV file
    print(dataset.head())  # prints the first five rows of the data
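
    If you do not have the CSV file locally, the same data also ships with scikit-learn; the following optional alternative (the column names here come from sklearn, not from the CSV) produces an equivalent DataFrame:

    from sklearn.datasets import load_iris
    import pandas as pd

    iris = load_iris()
    dataset = pd.DataFrame(iris.data, columns=iris.feature_names)
    dataset['variety'] = [iris.target_names[i] for i in iris.target]  # species names
    print(dataset.head())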

  • Step 2: Divide the data column-wise into a feature vector and labels.

    X = dataset.iloc[:, :-1].values  # every column except the last one forms the feature matrix X
    y = dataset.iloc[:, 4].values    # the last column holds the labels, stored in y
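
    As a quick optional sanity check, you can confirm the split looks right (the shapes below assume the full 150-row iris dataset):

    print(X.shape)  # expected: (150, 4), one row per flower, one column per feature
    print(y[:5])    # the first five labels, e.g. ['Setosa' 'Setosa' ...]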

 
 
  • Step 3: Divide the dataset into two subsets: a training set and a test set. The training subset is used to train the model, and the trained model is then evaluated on the test subset. This hold-out split lets us check how accurately our classification model predicts on data it has never seen.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
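
    By default the split is random, so results vary from run to run. For a reproducible split that also preserves the class proportions in both subsets, you can optionally pass random_state and stratify:

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0, stratify=y)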

  • Step 4: This step normalizes the data. Normalization is not required in every case, but when the data contains high variance or the features have very different scales it is highly recommended; without it, the model may in some cases produce undesired or faulty predictions.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(X_train)  # learn the mean and standard deviation from the training set only

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)  # apply the same transformation to the test set
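
    A side note: GaussianNB estimates a separate mean and variance per feature and class, so it works even without scaling. If you do scale, a Pipeline is a convenient way to keep the scaler and classifier together; a minimal sketch, assuming X_train/X_test are the unscaled arrays from Step 3:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.naive_bayes import GaussianNB

    model = make_pipeline(StandardScaler(), GaussianNB())
    model.fit(X_train, y_train)         # the scaler is fit on training data only
    print(model.score(X_test, y_test))  # accuracy on the test set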

  • Step 5: Once the data is ready, we import the naive Bayes model from the scikit-learn (sklearn) library and initialize the classifier by creating an instance of it (here named 'classifier'). The fit(X, y) method trains the model; it takes the feature vectors and their labels as arguments (e.g. classifier.fit(X_train, y_train)). To perform prediction, the predict() method takes the test data as its argument and returns the predicted labels (e.g. y_pred = classifier.predict(X_test)).

    from sklearn.naive_bayes import GaussianNB
    classifier = GaussianNB()            # Gaussian naive Bayes suits continuous features
    classifier.fit(X_train, y_train)     # train on the training subset
    y_pred = classifier.predict(X_test)  # predict labels for the test subset
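
    GaussianNB also exposes class probabilities via predict_proba(), which can be more informative than the hard labels alone:

    probs = classifier.predict_proba(X_test[:3])  # probabilities for the first 3 test samples
    print(classifier.classes_)  # the column order of the probabilities
    print(probs)                # one row per sample, one column per class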

  • Step 6: We can check the performance of the classifier with the help of various classification metrics such as accuracy, precision, recall, and F1 score. The classification_report and confusion_matrix functions compute these metrics. For more on classification metrics and the confusion matrix, visit here.

    from sklearn.metrics import classification_report, confusion_matrix
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
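
    If you only need a single summary number, accuracy_score gives the fraction of correctly predicted test samples:

    from sklearn.metrics import accuracy_score
    print(accuracy_score(y_test, y_pred))  # e.g. 1.0 for the run shown below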


    Testing the model by supplying random data:

    x_random = [[-1.56697667, 1.22358774, -1.56980273, -1.33046652],
                [-2.21742620, 3.08669365, -1.29593102, -1.07025858]]

    y_random = classifier.predict(x_random)
    print(y_random)
 
    The output of the above program is:

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
[[16  0  0]
 [ 0  8  0]
 [ 0  0  6]]
              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        16
  Versicolor       1.00      1.00      1.00         8
   Virginica       1.00      1.00      1.00         6

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

['Setosa' 'Setosa']
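
Note that x_random above is expressed in the standardized space produced by Step 4, not in raw centimeters. To classify raw measurements, pass them through the same scaler first; for example (the measurement values here are illustrative):

    x_raw = [[5.1, 3.5, 1.4, 0.2]]  # raw sepal/petal measurements in cm
    print(classifier.predict(scaler.transform(x_raw)))  # expected: ['Setosa']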

Also read:

  1. Implementation of KNN from scratch
  2. Implementation of KNN using Scikit learn
  3. Basic of SVM

 
