Complete Tutorial on Cross Validation with Implementation in Python Using Scikit-learn

@Machine Learning #Cross Validation

CV concepts, types & practical implications.

Deeksha Singh · Published in Geek Culture · 6 min read · Feb 25, 2022


In this article we will look at the theoretical concepts behind cross validation, its different types, and finally its practical implementation in Python using scikit-learn.

But before that, why do we need cross validation? Let's understand.

Before building any ML model with the given data, we split our dataset into a train set and a test set; the exact percentage depends on how much data is available. Typically the test set gets 20–30% of the data and the train set 70–80%, and the accuracy/performance of the model is checked on the test set. However, this 70–30% split is a random selection out of all data points, which leads to fluctuation in accuracy. This randomness is controlled by assigning a definite value to the random_state variable.

random_state decides how the data is split into train and test sets; using a particular fixed value (it can be any non-negative integer) ensures the same split, and hence the same results, are reproduced every time. Different random_state values produce different train/test splits, so the accuracy obtained will differ, which is what causes the fluctuation in accuracy.
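To see this effect concretely, here is a minimal sketch (it uses sklearn's built-in breast cancer dataset as a stand-in, since the article's CSV has not been loaded at this point, and the seeds chosen are arbitrary):

# Minimal sketch: accuracy fluctuates with the random_state used for the split
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_breast_cancer(return_X_y=True)

for seed in [0, 1, 42]:
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=seed)
    model = DecisionTreeClassifier(random_state=0)
    model.fit(x_train, y_train)
    # a different split gives a (slightly) different test accuracy
    print(seed, model.score(x_test, y_test))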

We need to validate the accuracy of our ML model, and this is where cross validation comes in: it is a technique for evaluating the accuracy of ML models by training the model on different subsets of the data over a certain number of iterations. The final output of the model is the average over all iterations. It also mitigates the effect of overfitting.

Broadly, there are 6 types of CV methods:

  1. Hold Out Method
  2. Leave One Out Cross Validation
  3. K Fold Cross Validation
  4. Stratified K Fold Cross Validation
  5. Time Series Cross Validation
  6. Repeated Random Test-Train Splits or Monte Carlo cross-validation

Let's look at them one by one.

1. Hold Out Method

This is simply splitting the data into a training set and a test set, with the training set getting the larger share. The model is trained on the training set, and the remaining test set is used for error estimation.

Disadvantage: there is a chance of high variance, because any random sample of data, and the patterns associated with it, may end up in the test set. Since we validate the model on that test data, the measured accuracy and the generalization of the model can be negatively affected.

2. Leave One Out Cross Validation (LOOCV)

In this method, out of all the data points, one data point is left out as test data and the rest are used as training data. So for n data points, we have to perform n iterations to cover every data point.

Leave-P-Out Cross Validation is the more general case, leaving p data points out for testing and validation and using the remaining n-p to train the model (see the sketch below).
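Leave-P-Out is not implemented in the code section later, so here is a minimal sketch using sklearn's LeavePOut; the tiny toy array is only an assumption for illustration:

# Minimal sketch of Leave-P-Out: every combination of p points becomes a test set
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)   # 4 toy samples
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))       # C(4, 2) = 6 possible train/test splits
for train_idx, test_idx in lpo.split(X):
    print("train:", train_idx, "test:", test_idx)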


Shortcomings:

i) High computing power is required, since for a large dataset one iteration is needed for every single data point.

ii) Since n-1 data points are used as training data, overfitting happens: bias is low, but the model is not generalized, resulting in high error and low accuracy on unseen data. It was used long back; nowadays it is rarely used.

3. K Fold Cross Validation

Here the whole dataset of n points is divided into k parts with n/k = p, and in each iteration one block of p points is taken as test data, then the next block in the next iteration, and so on for k iterations. For example, with 20 data points and 5-fold cross validation, 20/5 = 4, so the dataset is divided as illustrated in the sketch below. The folds are different in each iteration; every data point appears once in a test set and k-1 times in a training set, which enhances the effectiveness of this method. Each fold gives a different accuracy, and the final accuracy is the average of these 5 accuracies. We can also obtain the minimum and maximum accuracy of this particular model across folds.
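In place of the original figure, here is a small sketch of that 20-point, 5-fold example, printing which indices fall into the test fold each time:

# Minimal sketch: 20 data points, 5 folds -> each fold holds out 4 points as test data
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)                # 20 toy data points (indices 0..19)
kf = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}: test indices {test_idx}")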


Advantages:

i) Efficient use of data, as each data point is used for both training and testing.

ii) Low bias, because most of the data is used for training in every iteration.

iii) Low variance, because almost every data point is also used in a test set.

iv) Accuracy is generally higher and more stable.

Ideally a value between 5 and 10 is preferred for K, but it can take any value. A higher value of K leads to accuracy similar to the LOOCV method.

Disadvantages:

i) An imbalanced dataset results in low accuracy with this method. For example, in a binary classification problem, if the test fold happens to contain mostly instances of class 1, it will not give an accurate picture of the model. Similarly, in price prediction, if all the data selected for the test set has high prices, accuracy will again be affected.

To overcome this, we use Stratified Cross Validation.

4. Stratified K Fold Cross Validation

Here the random samples drawn into the train and test sets are arranged so that each class appears in good proportion in every train/test split (yes and no, 0 and 1, highs and lows), so that the model gives a reliable accuracy estimate.
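A minimal sketch of this behaviour, assuming a small imbalanced toy label array purely for illustration, shows that StratifiedKFold keeps the class proportions roughly the same in every test fold:

# Minimal sketch: StratifiedKFold preserves class proportions in each test fold
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((20, 1))                 # 20 dummy samples (features don't matter here)
y = np.array([0] * 15 + [1] * 5)      # imbalanced toy labels: 75% class 0, 25% class 1
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    print("test labels:", y[test_idx])  # each fold gets roughly 3 zeros and 1 one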

5. Time Series Cross Validation

This method is specifically for time series data, such as stock price prediction or sales prediction. Data is added sequentially to the training set, as shown in the sketch below.
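Time series CV is not implemented in the code section later, so here is a minimal sketch using sklearn's TimeSeriesSplit; the toy array is only an assumption for illustration. Note how the training window only ever grows forward in time:

# Minimal sketch: TimeSeriesSplit adds observations sequentially to the training set
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10)                 # 10 toy time-ordered observations
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)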


6. Repeated Random Test-Train Splits (Monte Carlo Cross Validation)

This method combines the traditional train test split with K-fold CV. The dataset is randomly split into train and test sets, and this splitting and performance measurement is repeated as many times as we specify; the scores are then aggregated as in cross validation.


Disadvantages:

i) It is not suitable for an imbalanced dataset.

ii) There is a chance that some samples are never selected for either the train or the test set.

Now we will implement the above techniques using Python and sklearn to build a simple ML model. This is purely to illustrate the cross validation techniques, so the other hyperparameters of the classifier are left at their default values.

We are using a cancer dataset to predict the type of cancer, Benign (B) or Malignant (M), on the basis of various features.

import pandas as pd
data=pd.read_csv(r'/content/drive/MyDrive/cancer_dataset.csv')
data.head()
# Checking for null values
data.isnull().sum()

# The last column has all NaN values, so we can drop it
data1=data.drop(['Unnamed: 32'],axis='columns')

# Dividing the dataset into dependent & independent features
# diagnosis is the output; the rest are input features
x=data1.iloc[:,2:]
y=data1.iloc[:,1]

# Check whether the dataset is balanced or not
y.value_counts()

Now we will build ML models using the different CV techniques.

1. Hold Out Validation

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=0)
model=DecisionTreeClassifier()
model.fit(x_train,y_train)
mod_score1=model.score(x_test,y_test)
mod_score1

2. Leave One Out Cross Validation (LOOCV)

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
model=DecisionTreeClassifier()
leave_val=LeaveOneOut()
mod_score2=cross_val_score(model,x,y,cv=leave_val)
print(np.mean(mod_score2))

3. K Fold Cross Validation

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
model=DecisionTreeClassifier()
kfold_validation=KFold(10)
mod_score3=cross_val_score(model,x,y,cv=kfold_validation)
print(mod_score3)
# Overall accuracy of the model will be the average of all fold scores
print(np.mean(mod_score3))

4. Stratified K-Fold Cross Validation

from sklearn.model_selection import StratifiedKFold
sk_fold=StratifiedKFold(n_splits=5)
model=DecisionTreeClassifier()
mod_score4=cross_val_score(model,x,y,cv=sk_fold)
print(np.mean(mod_score4))
print(mod_score4)

5. Repeated Random Test-Train Split

from sklearn.model_selection import ShuffleSplit
model=DecisionTreeClassifier()
s_split=ShuffleSplit(n_splits=10,test_size=0.30)
mod_score5=cross_val_score(model,x,y,cv=s_split)
print(mod_score5)
print(np.mean(mod_score5))

With this, we have covered almost every point of cross validation.

Thanks for reading !!


FAQs

How would you implement cross-validation in Python?

Here are the steps involved in cross validation:
  1. Reserve a sample data set.
  2. Train the model using the remaining part of the dataset.
  3. Test the model on the reserved (validation) sample. This will help you gauge the effectiveness of your model's performance.

How does cross-validation work in sklearn?

Cross-validation is a statistical method for evaluating the performance of machine learning models. It involves splitting the dataset into two parts: a training set and a validation set. The model is trained on the training set, and its performance is evaluated on the validation set.

What is the cross_val_score function?

The cross_val_score function evaluates the model's performance on each fold, providing a better understanding of the model's behavior and weaknesses. For example, if the number of cross validations is set to 5, cross_val_score splits the data set into training and testing data five times and returns one score per split.
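A minimal sketch of this, again using sklearn's built-in breast cancer data as a stand-in dataset:

# Minimal sketch: cross_val_score with cv=5 returns one score per fold
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # 5 accuracy values, one per fold
print(scores.mean())   # overall cross-validated accuracy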

How to use cross_val_predict?

Steps to implement cross_val_predict (a minimal sketch follows the list):
  1. Import the necessary libraries. Before we can use cross_val_predict, we need to import the required libraries from sklearn: ...
  2. Load and prepare the data. ...
  3. Create an estimator. ...
  4. Generate cross-validated predictions. ...
  5. Analyze the predictions.
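Sketched out with the same stand-in breast cancer dataset, those steps look roughly like this:

# Minimal sketch of cross_val_predict: one out-of-fold prediction per sample
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)        # load and prepare the data
model = DecisionTreeClassifier(random_state=0)    # create an estimator
y_pred = cross_val_predict(model, X, y, cv=5)     # generate cross-validated predictions
print(accuracy_score(y, y_pred))                  # analyze the predictions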

What is the best cross-validation method?

K-fold cross-validation is one of the most widely used cross-validation methods. It divides your data into k equal-sized folds and uses one of them as the test set and the rest as the training set. This process is repeated k times, each time using a different fold as the test set.

Can you explain how cross-validation works?

Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.

How to use cross-validation to improve a model?

There is a way to overcome this problem: by not using the entire dataset during model training. You can remove some of the data and then train your model on the rest of the data. Once a model is trained, you can use the data to test your model that was removed earlier. This is the basic principle of cross validation.

What is the cross_validate function in Python?

The cross_validate function differs from cross_val_score in two ways: It allows specifying multiple metrics for evaluation. It returns a dict containing fit-times, score-times (and optionally training scores, fitted estimators, train-test split indices) in addition to the test score.
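A minimal sketch of cross_validate, using the same stand-in dataset and two example metrics:

# Minimal sketch of cross_validate: multiple metrics plus fit/score times in a dict
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
results = cross_validate(DecisionTreeClassifier(random_state=0), X, y, cv=5,
                         scoring=['accuracy', 'f1'], return_train_score=True)
print(results.keys())              # fit_time, score_time, test_accuracy, test_f1, ...
print(results['test_accuracy'])    # one accuracy value per fold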

What's a good cross-validation score?

It depends on the problem and the metric, but as an example, an average score of approximately 0.91 over 5 folds of cross-validation (giving 5 individual scores) suggests a strong performance. Stratified k-fold cross validation is a method of cross-validation that ensures that the proportion of samples for each class is roughly the same in each fold.

How to calculate cross-validation accuracy?

Split the dataset into five folds. For each fold, train the model on four folds and evaluate it on the remaining fold. The average performance across all five folds is the estimated out-of-sample accuracy.

Does cross-validation improve accuracy?

Conclusion. If you're looking to improve the accuracy and reliability of your statistical analysis, cross-validation is a crucial technique to learn.

How is cross-validation implemented?

Cross-validation includes resampling and sample splitting methods that use different portions of the data to test and train a model on different iterations. It is often used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.


How to implement leave-one-out cross-validation in Python?

Implementing leave-one-out-cross-validation can be done using cross_val_score(). You only need to set the parameter cv equal to the number of observations in your dataset. We can find the number of observations by looking at the shape of the X dataset.
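As a minimal sketch, assuming sklearn's built-in diabetes regression dataset (for a classifier, an integer cv is stratified by class, so passing a LeaveOneOut() splitter as in the article's code above is the safer route); mean absolute error is used as the metric, since R² is undefined on single-sample test folds:

# Minimal sketch: LOOCV by setting cv to the number of observations
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
n_obs = X.shape[0]                 # number of observations, found from the shape of X
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                         cv=n_obs, scoring='neg_mean_absolute_error')
print(len(scores))                 # one score per observation
print(-scores.mean())              # average absolute error across all leave-one-out folds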

How do you use cross-validation to avoid overfitting in Python?

Cross-validation is a robust means of preventing overfitting. The complete data set is divided into parts. Standard K-fold cross-validation requires the data to be split into k folds; then the algorithm is iteratively trained on k-1 folds, using the remaining holdout fold as the test set.
