- Step 1: Load the data file into the program by giving the location of your CSV file.
import pandas as pd
dataset = pd.read_csv("E:/input/iris.csv")
print(dataset.head())  # prints the first five rows of the data
- Step 2: Split the data column-wise into a feature vector and labels.
X = dataset.iloc[:, :-1].values  # every column except the last one becomes the feature matrix X
y = dataset.iloc[:, 4].values  # the last column (index 4) holds the labels y
- Step 3: Split the dataset into two subsets: a training set and a test set. The training subset is used to train the model, and the trained model is then evaluated on the test subset. This hold-out split lets us check how accurately our classifier predicts on data it has not seen during training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
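To see what test_size=0.20 does in practice, here is a quick check of the split proportions on a small synthetic array (the data below is illustrative, not the Iris file):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y_demo = np.array([0, 1] * 25)

# test_size=0.20 reserves 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20)
print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2): 80% train, 20% test
```

By default the rows are shuffled before splitting, so each run produces a different partition unless a random_state is fixed.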
- Step 4: Normalize the data. Normalization is not required in every case, but when the features vary widely in scale it is strongly recommended; using unscaled data can lead to poor or unstable predictions in some models.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data, then transform it
X_test = scaler.transform(X_test)  # transform the test data with the training-set statistics
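StandardScaler rescales each column to zero mean and unit variance, i.e. z = (x - mean) / std. The toy data below (not the Iris features) makes this visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = StandardScaler().fit_transform(data)  # fit and transform in one call

print(scaled.mean(axis=0))  # each column's mean is now ~0
print(scaled.std(axis=0))   # each column's std is now 1
```

This is also why the scaler must be fitted on the training set only: the test set is transformed with the training set's mean and standard deviation, so the model never sees test-set statistics.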
- Step 5: Once the data is ready, we import the Naive Bayes model from the scikit-learn (sklearn) library and initialize the classifier by creating an instance of it (here named 'classifier'). The fit(X, y) method trains the model; it takes the feature vectors and their labels as arguments (e.g. classifier.fit(X_train, y_train)). To perform prediction, the predict() method takes test data as its argument and returns the predicted labels (e.g. y_pred = classifier.predict(X_test)).
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)  # train the model on the training set
y_pred = classifier.predict(X_test)  # predict labels for the test set
- Step 6: We can check the performance of the classifier with various classification metrics such as accuracy, precision, recall, and F1 score. The classification_report and confusion_matrix functions compute these metrics. For more on classification metrics and the confusion matrix, visit here.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
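To see how confusion_matrix pairs actual and predicted labels, here is a toy example with made-up labels (not the Iris results):

```python
from sklearn.metrics import confusion_matrix

y_true = ["Setosa", "Setosa", "Versicolor", "Virginica"]
y_hat = ["Setosa", "Versicolor", "Versicolor", "Virginica"]

# rows = actual class, columns = predicted class, in the order given by labels=
cm = confusion_matrix(y_true, y_hat, labels=["Setosa", "Versicolor", "Virginica"])
print(cm)
# [[1 1 0]
#  [0 1 0]
#  [0 0 1]]
```

Here one actual Setosa was misclassified as Versicolor, which shows up as the off-diagonal 1 in the first row; a perfect classifier puts all counts on the diagonal.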
Testing the model by supplying random data:
x_random = [[-1.56697667, 1.22358774, -1.56980273, -1.33046652],
            [-2.21742620, 3.08669365, -1.29593102, -1.07025858]]
print(classifier.predict(x_random))  # note: these sample values are already on the standardized scale
Output of dataset.head():

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa

Confusion matrix:

[[16  0  0]
 [ 0  8  0]
 [ 0  0  6]]

Classification report:

              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        16
  Versicolor       1.00      1.00      1.00         8
   Virginica       1.00      1.00      1.00         6

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
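Putting Steps 1 through 6 together, a minimal end-to-end sketch looks like the following. For portability it loads the Iris data bundled with scikit-learn instead of the E:/input/iris.csv file; with the CSV, X and y would come from pd.read_csv exactly as in Steps 1 and 2.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Step 1-2: load features and labels (bundled data stands in for the CSV)
X, y = load_iris(return_X_y=True)

# Step 3: hold out 20% of the rows for testing (random_state fixes the split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Step 4: standardize using training-set statistics only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: train the Gaussian Naive Bayes classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Step 6: evaluate on the held-out test set
acc = accuracy_score(y_test, classifier.predict(X_test))
print(acc)
```

The exact accuracy depends on which rows land in the test split, so it may differ from the perfect scores shown above, but on Iris it is typically well above 90%.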