Supervised Machine Learning
What Is Supervised Learning?
Supervised learning is a type of machine learning in which an algorithm learns from labeled data. Once it has been trained, the algorithm decides which label to assign to new, unlabeled data by matching it against the patterns it learned.
Supervised learning algorithms fall into two categories: Classification and Regression.
Classification predicts the class or category to which the data belongs.
e.g.: Spam Filtering and Detection, Churn Prediction, Sentiment Analysis, Image Classification.
Regression predicts a numerical value based on previously observed data.
e.g.: House Price Prediction, Stock Price Prediction.
Classification is one of the most widely used techniques for determining the class to which a dependent variable belongs, based on one or more independent variables. Put simply, a classification algorithm draws a decision boundary between data points (feature vectors), separating similar data points from dissimilar ones.
Some of the most common classification algorithms are discussed briefly below:
1. K-Nearest Neighbors (K-NN)
It is one of the simplest yet most effective supervised learning algorithms, used for classification as well as regression. It is most commonly used to predict the class of a new sample from training data points that already belong to several known classes. It is a non-parametric, lazy learning algorithm. It classifies data points based on a similarity measure (e.g. a distance measure, most often Euclidean distance).
In this algorithm ‘K’ refers to the number of neighbors to consider for classification. For binary classification an odd value is usually chosen so that votes cannot tie. The value of ‘K’ must be selected carefully, otherwise the model may perform poorly. If ‘K’ is small, the model has low bias and high variance, i.e. it overfits. In the same way, if ‘K’ is very large, the model has high bias and low variance, i.e. it underfits. Much research has been done on selecting the right value of K; in many cases, taking ‘K’ = square root of the total number of data points ‘n’ is a reasonable starting point.
KNN works well with a small number of input variables (p), but its predictions tend to degrade as the number of inputs becomes very large.
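As a rough illustration of the procedure described above, the following Python sketch classifies a query point by majority vote among its K nearest neighbors, using Euclidean distance. The function names (`euclidean`, `knn_predict`) and the toy data are made up for this example; it is a minimal sketch, not a tuned implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # "Lazy" learning: nothing is fitted in advance; all work happens at query time.
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2-D.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (1.2, 1.4), k=3))
print(knn_predict(train_X, train_y, (6.4, 6.2), k=3))
```

Sorting every training point for each query keeps the sketch short, but it also shows why K-NN becomes slow on large datasets.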
2. Support Vector Machine (SVM)
Support Vector Machine (SVM) is a mathematically complex supervised learning algorithm used for both regression and classification. It is based on the concept of decision planes (most commonly called hyperplanes) that define the decision boundaries for classification. A decision plane is one that separates sets of data with different class memberships.
It performs classification by finding the optimal hyperplane that maximizes the margin between the two classes with the help of support vectors.
For linearly separable data, learning is done by finding an optimal hyperplane between the classes.
For non-linearly separable data, kernels are used. Kernels can be thought of as functions that take the data as input and transform it into the required form.
|Effect of Kernel Function|
A kernel SVM uses a kernel function to map the data into a higher-dimensional space in which the classes can be separated.
Some of the most common types of kernel function are:
Linear kernel: $K(X_i, X_j) = X_i \cdot X_j$
Polynomial kernel: $K(X_i, X_j) = (\gamma\, X_i \cdot X_j + C)^d$, where $d$ is the degree of the polynomial, which must be specified.
RBF kernel: $K(X_i, X_j) = \exp(-\gamma\, \lVert X_i - X_j \rVert^2)$, used for non-linearly separable data. The distance metric is the squared Euclidean distance.
Sigmoid kernel: $K(X_i, X_j) = \tanh(\gamma\, X_i \cdot X_j + C)$, which resembles the sigmoid function used in logistic regression and is used for binary classification.
The kernel trick uses the kernel function to work in a higher-dimensional feature space, without computing the transformation explicitly, so that a linear separation becomes possible for classification.
So, it is better to use linear SVMs for linear problems, and non-linear kernels such as the sigmoid kernel or the Radial Basis Function (RBF) kernel for non-linear problems.
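To make the kernel formulas listed above concrete, here is a small Python sketch that evaluates each one on a pair of vectors. The function names and the default values chosen for gamma, C and d are assumptions made for this illustration, not recommended settings.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, c=1.0, d=2):
    return (gamma * dot(x, y) + c) ** d


def rbf_kernel(x, y, gamma=0.5):
    # Squared Euclidean distance between x and y.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=0.1, c=0.0):
    return math.tanh(gamma * dot(x, y) + c)


# Hypothetical example vectors.
xi, xj = (1.0, 2.0), (3.0, 4.0)

print(linear_kernel(xi, xj))      # dot product of xi and xj
print(polynomial_kernel(xi, xj))  # (gamma * dot + c) ** d
print(rbf_kernel(xi, xj))         # close to 0 for distant points
print(rbf_kernel(xi, xi))         # exactly 1 for identical points
print(sigmoid_kernel(xi, xj))     # always between -1 and 1
```

Each function returns a single similarity score for a pair of points, which is all an SVM needs from a kernel.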
3. Naive Bayes Classifier
The Naive Bayes classifier is based on Bayes' theorem of probability. According to Bayes' theorem, the probability that we want to calculate, P(A|B), can be given in terms of P(A), P(B|A) and P(B) as
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
The Naive Bayes classifier assumes that, given the class, the value of each feature is independent of the value of every other feature. A Naive Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets.
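The independence assumption means a class score is simply the prior P(class) multiplied by one P(feature value | class) term per feature. The Python sketch below shows this for categorical features. The function names and the toy weather data are hypothetical, and the add-one (Laplace) smoothing is an extra assumption, not something the text above describes. The scores are proportional to the posterior because the shared denominator P(B) is omitted.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict_naive_bayes(model, row):
    """Return (best label, unnormalized score per label) for one sample."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # P(value | class) with add-one smoothing (an assumption of this sketch).
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("sunny", "hot")))
print(predict_naive_bayes(model, ("rainy", "cool")))
```

Training is a single counting pass over the data, which is why no iterative parameter estimation is needed.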
4. Decision Tree Classification
Decision trees are among the most powerful yet simple supervised learning algorithms, used for classification or regression in the form of a tree structure. Because they handle both tasks, they are often referred to as CART (Classification and Regression Trees).
A decision tree resembles a flowchart in which each internal node represents a ‘test’ on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The split at each node can be determined with the Iterative Dichotomiser 3 (ID3) algorithm.
|Decision tree components|
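The flowchart structure described above can be sketched in Python as a nested dictionary: internal nodes name the attribute they test, branches are keyed by the test outcome, and leaves are plain class labels. The tree below is hand-built and hypothetical (it was not learned from data), and the `classify` helper is likewise just an illustration.

```python
# Hypothetical hand-built tree: internal nodes are dicts, leaves are label strings.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}


def classify(node, sample):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]  # run the node's test on the sample
        node = node["branches"][outcome]     # move along the matching branch
    return node                              # leaf node: the class label


sample = {"outlook": "sunny", "humidity": "high", "windy": "false"}
print(classify(tree, sample))
```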
The ID3 algorithm uses Entropy and Information Gain to construct a decision tree.
In layman's terms, entropy is a measure of disorder or uncertainty. In Machine Learning, entropy is used to measure the homogeneity of a sample: the lower the entropy of a sample, the higher its homogeneity. In other words, entropy describes how predictable an event is. It is denoted by H(S) or E(S).
The mathematical formula to calculate the entropy is as follows:
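$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
where $p_i$ is the proportion of samples in $S$ that belong to class $i$, and $c$ is the number of classes.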
Information gain is the key measure that decision tree algorithms use to construct a decision tree. The algorithm always tries to maximize information gain: the attribute with the highest information gain is tested (split on) first. Information gain is measured using the following formula:
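$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \sum_{v \in \mathrm{values}(X)} \frac{|T_v|}{|T|}\, \mathrm{Entropy}(T_v)$$
Here Gain(T, X) is the information gain obtained by applying feature X. Entropy(T) is the entropy of the entire set, while the second term is the entropy after applying feature X: the weighted average of the entropies of the subsets $T_v$ in which X takes the value $v$.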
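As a minimal sketch of how entropy and information gain can be computed, assuming categorical features and class labels held in plain Python lists, consider the following. The function names and the toy data are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    total = len(labels)
    # Group the labels by the value the feature takes for each sample.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    after_split = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - after_split


# Hypothetical toy data: 4 samples, two candidate features.
labels = ["yes", "yes", "no", "no"]
perfect_feature = ["a", "a", "b", "b"]  # separates the classes completely
useless_feature = ["a", "b", "a", "b"]  # tells us nothing about the class

print(entropy(labels))
print(information_gain(perfect_feature, labels))
print(information_gain(useless_feature, labels))
```

An ID3-style tree builder would evaluate `information_gain` for every candidate feature and split on the one with the largest value.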