Cross-validation is a technique for evaluating predictive models by splitting the original training data into a training set, used to fit the model, and a test set, used to evaluate it. It is a resampling procedure that is especially useful when the amount of available data is limited, and it is one of the most widely used techniques for testing the effectiveness of machine learning models. To perform cross-validation, we set aside a portion of the given data as the training dataset, on which the machine learning model is trained, and use the remaining portion as the test dataset for testing/validation.
Cross-validation is also known as rotation estimation.
There are many methods of cross-validation. A few of them are as follows:
- Train-test split: In this method, we randomly split the complete data into training and test datasets, then train the model on the training set and use the test set for validation. The data is most commonly split 70:30 or 80:20 (train:test). With a limited amount of data, this method carries a high risk of bias, because the model never sees the information contained in the portion held out for testing. If the data is large and the train and test samples follow the same distribution (i.e., the train data and test data have almost the same nature), this approach is acceptable.
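As a minimal sketch of the train-test split described above, the following pure-Python helper (the function name `train_test_split` and the 80:20 default are illustrative choices, not from the original text) shuffles the indices and carves off a test portion:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Randomly split a dataset into a train portion and a test portion."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    indices = list(range(len(data)))
    rng.shuffle(indices)               # random split, as described above
    n_test = int(len(data) * test_ratio)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

data = list(range(10))
train, test = train_test_split(data)   # 80:20 split of 10 samples
print(len(train), len(test))
```

In practice a library routine such as scikit-learn's `train_test_split` would typically be used instead of hand-rolling this.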
- K-fold cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, …, Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration the subsets D2, …, Dk collectively serve as the training set to obtain the first model, which is then tested on D1; the second iteration is trained on subsets D1, D3, …, Dk and tested on D2; and so on. Each sample is thus used k − 1 times for training and exactly once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
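The fold-generation step above can be sketched as follows (the helper name `k_fold_indices` is illustrative; real use would normally rely on a library such as scikit-learn's `KFold`). Each of the k iterations yields one fold as the test set and the remaining folds as the training set:

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k roughly equal folds and yield
    (train_indices, test_indices) for each of the k iterations."""
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]          # fold Di is the test set
        train_idx = indices[:start] + indices[start + size:]  # remaining folds train
        yield train_idx, test_idx
        start += size

# 10 samples, k = 3: folds of size 4, 3, 3.
for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), len(test_idx))
```

(In practice the indices would be shuffled first, as the text notes the partitioning is random; shuffling is omitted here for clarity.) The overall accuracy estimate is then obtained by summing the correct classifications across all k test folds and dividing by n.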
Leave-one-out is the special case of k-fold cross-validation in which k is set to the number of initial tuples; that is, only one sample is “left out” at a time to form the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
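The stratification idea can be sketched with a small helper (the name `stratified_folds` is illustrative): samples are grouped by class, and each class's members are dealt round-robin across the k folds, so every fold receives roughly the same class proportions as the full dataset.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index a fold number 0..k-1 such that each
    class's samples are spread as evenly as possible across folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)          # group sample indices by class
    fold_of = [0] * len(labels)
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            fold_of[i] = j % k         # deal this class round-robin over folds
    return fold_of

# 6 samples of class 'a' and 3 of class 'b', k = 3:
# each fold ends up with two 'a's and one 'b', matching the 2:1 overall ratio.
labels = ['a'] * 6 + ['b'] * 3
print(stratified_folds(labels, 3))
```

Leave-one-out corresponds to running ordinary k-fold with k equal to the number of samples, so no separate mechanism is needed for it; library implementations such as scikit-learn's `StratifiedKFold` and `LeaveOneOut` cover both cases.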