KNN (k-nearest neighbors) is one of the simplest yet effective supervised learning algorithms, used for both classification and regression. It is most commonly used for classification: given data points that are already labeled with classes, it predicts the class of a new sample point. It is a non-parametric, lazy learning algorithm: it makes no assumption about the underlying data distribution, and it defers computation until prediction time instead of building an explicit model during training. It classifies a point based on a similarity measure, usually a distance such as Euclidean distance.
|fig: Classification for k=3|
K-NN works well with a small number of input variables (features), but prediction error tends to grow when the number of inputs becomes very large, because distances become less informative in high-dimensional spaces (the curse of dimensionality).
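Commonly used distance measures between two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ are:
- Euclidean Distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
|fig: Euclidean Distance|
- Manhattan Distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
|fig: Manhattan Distance|
- Minkowski Distance: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
|fig: Minkowski Distance|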
Here p is the order of the Minkowski distance. For p=1 we get the Manhattan distance, and for p=2 we get the Euclidean distance, so the Minkowski distance is a generalization of both.
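The relationship between the three measures can be checked with a few lines of Python. This is a minimal sketch: the function name `minkowski_distance` and the two sample points are made up for illustration and are not part of the example further down.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum the p-th powers of the absolute coordinate differences,
    # then take the p-th root of the total.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Illustrative points (hypothetical values).
a = [1.0, 2.0]
b = [4.0, 6.0]

manhattan = minkowski_distance(a, b, p=1)  # |1-4| + |2-6| = 7
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```

Changing only `p` switches between the Manhattan and Euclidean results, which is what makes Minkowski the general form.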
Algorithm for K-NN:
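A common way to state the procedure is: compute the distance from the new point to every training point, keep the k closest ones, and assign the class that occurs most often among them. The sketch below follows those three steps, assuming Euclidean distance and a plain majority vote. The function name `knn_predict` and the toy points are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # 1. Compute the Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # 2. Sort by distance and keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Return the most common label among those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction_near_a = knn_predict(train_points, train_labels, (1.2, 1.4), k=3)
prediction_near_b = knn_predict(train_points, train_labels, (6.4, 6.2), k=3)
print(prediction_near_a, prediction_near_b)
```

Note that nothing is learned ahead of time: all the work happens inside `knn_predict`, which is why K-NN is called a lazy learner.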
Decision Boundary for K-NN:
|fig: Decision boundary for classification using the K-NN algorithm|
Example of KNN (using scikit-learn)
We are going to classify the iris data into its species using four features: sepal length, sepal width, petal length, and petal width. There are 150 observations (tuples) in total, and we will build a KNN classification model from them. Link to download the iris dataset: iris.csv
Step-1: First, we load the dataset with pandas.
import pandas as pd
dataset = pd.read_csv("E:/input/iris.csv")
print(dataset.head())  # prints the first five tuples of the data
Step-2: Now we split the data column-wise into attributes/features and their corresponding labels.
X = dataset.iloc[:, :-1].values  # array X holds the attributes (every column except the last)
y = dataset.iloc[:, 4].values  # array y holds the corresponding labels (the fifth column)
Step-3: In this step, we divide the dataset into two subsets: one is used to train the model and the other to test it. We split the data 80:20, i.e. a randomly chosen 80% of the rows becomes training data and the remaining 20% becomes test data (train_test_split shuffles the rows by default). Both attributes and labels are split the same way. We hold out a test set so that we can measure the accuracy of the model on data it has not seen. This is called a hold-out (train/test) split; cross-validation is a related technique that repeats such a split several times and averages the results.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
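Because the split is random, the reported metrics can change from run to run. A hedged variation, assuming you want a repeatable split with the same class proportions in both subsets, is to pass `random_state` and `stratify`. The sketch uses scikit-learn's bundled `load_iris` as a stand-in for the CSV file, and the seed 42 is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for the CSV file: the same 150 iris observations, bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the class balance in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

With stratification, each of the three species contributes 10 rows to the 30-row test set.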
Step-4: In this step, we perform normalization/standardization. It is the process of re-scaling the data so that features measured on different scales contribute comparably to the distance computation. We use z-score normalization here: each feature is shifted to zero mean and scaled to unit variance, with the mean and standard deviation computed from the training data only.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
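X_train = scaler.fit_transform(X_train)  # learn mean and std from the training data, then scale it
X_test = scaler.transform(X_test)  # scale the test data with the training statistics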
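To make the z-score step concrete, here is a minimal NumPy sketch of the same idea: the mean and standard deviation come from the training rows and are then reused for the test rows. The small arrays are made-up values, not iris measurements.

```python
import numpy as np

# Made-up values for illustration: three training rows, one test row, two features.
train = np.array([[5.0, 3.0], [6.0, 2.0], [7.0, 4.0]])
test = np.array([[6.0, 3.0]])

# Statistics are computed on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - mean) / std, applied to both subsets with the same statistics.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test rows keeps information about the test set from leaking into preprocessing.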
Step-5: Now it's time to define our KNN model. We create the model, fit it on the training subset, and supply the attributes of the test subset for prediction.
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=9)  # define a KNN classifier with k=9
classifier.fit(X_train, y_train)  # "learning" step: supply the training data to the model
y_pred = classifier.predict(X_test)  # store the predictions for the test subset in y_pred
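The value k=9 is a choice, not a rule. One hedged way to compare several values of k is k-fold cross-validation with scaling wrapped in a pipeline, so that each fold is standardized using its own training part. In the sketch below, `load_iris` stands in for the CSV file, and the candidate values of k and the 5 folds are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the CSV file used in the article.
X, y = load_iris(return_X_y=True)

mean_accuracy = {}
for k in (1, 3, 5, 7, 9, 11):  # candidate values chosen only for illustration
    # The pipeline refits the scaler inside every fold before fitting KNN.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print("best k:", best_k)
```

On a dataset as small as iris the differences between neighbouring values of k are often tiny, so the "best" k should be read as a rough guide.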
Step-6: Since the test data was held out from the original labeled dataset, we know the actual labels for it. In this step, we compare the predictions with those labels and compute classification metrics such as precision, recall, and F1-score.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
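To see where precision, recall, and F1-score come from, they can be derived by hand from a confusion matrix. This is a minimal sketch: `per_class_metrics` is a hypothetical helper, the first matrix is the one printed for the test data in this article, and the second matrix is an invented variation with one misclassified sample.

```python
def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple per class for a square confusion matrix.

    Rows are true classes and columns are predicted classes, which is the
    layout scikit-learn's confusion_matrix uses.
    """
    results = []
    for i in range(len(matrix)):
        true_positive = matrix[i][i]
        predicted_as_i = sum(row[i] for row in matrix)  # column sum
        actually_i = sum(matrix[i])  # row sum
        precision = true_positive / predicted_as_i if predicted_as_i else 0.0
        recall = true_positive / actually_i if actually_i else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        results.append((precision, recall, f1))
    return results


perfect = [[11, 0, 0], [0, 9, 0], [0, 0, 10]]  # matrix from this article's output
one_mistake = [[11, 0, 0], [0, 8, 1], [0, 0, 10]]  # hypothetical: one Versicolor predicted as Virginica

print(per_class_metrics(perfect))
print(per_class_metrics(one_mistake))
```

A single off-diagonal entry lowers recall for the true class (its row) and precision for the predicted class (its column).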
Step-7: Supply new, unseen data to the model.
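# testing the model by supplying random data (the values are assumed to be already standardized)
x_random = [[-1.56697667, 1.22358774, -1.56980273, -1.33046652],
            [-2.21742620, 3.08669365, -1.29593102, -1.07025858]]
print(classifier.predict(x_random))  # predicted species for the two samples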
First five rows of the dataset:
   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
Confusion matrix for test data:
[[11  0  0]
 [ 0  9  0]
 [ 0  0 10]]
Classification metrics for test data:
              precision    recall  f1-score   support
      Setosa       1.00      1.00      1.00        11
  Versicolor       1.00      1.00      1.00         9
   Virginica       1.00      1.00      1.00        10
   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
For actual test data: