Cross-Validation in Sklearn

Data scientists can benefit from cross-validation while working with machine learning models in two key aspects: it can assist in minimizing the amount of data needed and ensuring the machine learning model is reliable. Cross-validation accomplishes that at the expense of resource use; thus, it's critical to comprehend how it operates before deciding to use it.

In this article, we'll quickly go over the advantages of cross-validation, and then we will go through its application using a wide range of techniques from the well-known Python Scikit-learn package.

What is Cross-validation?

A fundamental error is training the model to make a prediction function and then using the same data to test the model and get a validation score. A model that simply repeats the labels of the samples it has just examined would receive a perfect score but be unable to make predictions about data that has not yet been seen. Overfitting is the term used to describe this circumstance. To avoid this, it is customary to reserve a portion of the given data as a test set (X test, y test) when conducting a (supervised) machine learning study. Because machine learning sometimes begins as an experiment in business contexts, we should note that the word "experiment" does not just refer to academic application.

How does Cross Validation Solve the Problem of Overfitting?

We create numerous micro train-test splits during cross-validation using our initial training dataset. To fine-tune our model, use these splits. For instance, we divide the dataset into k subgroups for the usual k-fold cross-validation. The remaining dataset is then used as the test dataset after the model has been successively trained on the k-1 dataset. We may test the model on a new dataset in this manner. We will learn about the seven most popular cross-validation approaches in this tutorial. The code samples for each method are also included.

The train test split function in scikit-learn can be used to swiftly generate a randomized split into the train and test datasets. To train a linear regression model on the iris dataset, we will import the dataset first.

Code

# Python program to show what overfitting means

# Importing the required libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import datasets

# Loading the dataset
X, Y = datasets.load_iris( return_X_y = True )

# Printing the shape of the dataset
print(X.shape, Y.shape)

# Splitting the dataset in training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

# Printing the shape of training and test data
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

# Training the model
lr = LinearRegression()
lr.fit(X_train, Y_train)

# Printing the score of the model
score = lr.score(X_test, Y_test)
print(score)

Output:

(150, 4) (150,)
(90, 4) (90,)
(60, 4) (60,)
0.888564580463006

There is still a chance of overfitting the test dataset when comparing various settings for estimators. This is why We can adjust the parameters until the estimator works at its best. The model may "leak" information about the test dataset in this method, and evaluation measures may no longer reflect generalization performance. We can resolve this issue by holding out a different portion of the dataset as a "validation set": training is conducted on the training dataset, followed by evaluation on the validation dataset, and when it appears that the experiment has succeeded, we can perform a final assessment on the test set.

Data Size Reduction

Usually, the data is divided into three sets.

Training: used to hone the hyperparameters of the machine learning model and train the model.

Testing: used to ensure that the improved model performs well when applied to new data and that the model generalizes correctly.

Validation: We execute the last check on utterly unreliable data because, when optimizing, some knowledge about the test dataset seeps into the model due to the choice of parameters.

Because we can train and test using the same dataset, adding cross-validation to the workflow helps you eliminate the requirement for the validation dataset.

Robust Process

Even though sklearn's train test split method uses a stratified split, which ensures that the target variable's distribution is the same in both the train and test sets, it's still possible to unintentionally train on a subset that doesn't accurately represent the real world.

Methods of Cross-Validation with Sklearn

HoldOut Cross Validation or Train-Test Split

This cross-validation procedure randomly divides the entire dataset into a training dataset and a validation dataset. Generally, approximately 70% of the whole dataset is utilized as a training set, and the leftover 30% is taken as a validation dataset.

The advantage of this method is that we only need to divide the dataset into the training and validation sets once. The machine learning model will only need to be trained once based on the training dataset, allowing for quick execution.

This method is not appropriate for an unbalanced dataset. Consider an unbalanced dataset with classes "0" and "1". Let's assume that 80% of the data falls under class '0' and the rest 20% falls under class '1' upon performing a train-test splitting, with the training dataset making up to 80% of the dataset and the test data making up 20%. The training dataset may contain 100% of the class "0" data, and the test dataset has 100% of the class "1" data. Since our model has never previously encountered class "1" data, it will not generalize well to our testing data.

Code

# Python program to show how to perform holdout cross-validation using train_test_split function 

# Importing the required library
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Loading the dataset
iris = load_iris()

# Getting dependent and independent features
X = iris.data
Y = iris.target
print("Size of Dataset is: ", len(X))

# Creating a logistic regression object 
logreg = LogisticRegression()

# Seperating the train and test dataset
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=41)

# Performing the logistic regression on train dataset
logreg.fit(x_train,y_train)

# Predicting target values using trained model and test dataset
predict = logreg.predict(x_test)

# Printing accuracy score
print("Accuracy score for the training dataset is: ", accuracy_score(logreg.predict(x_train), y_train))
print("Accuracy score for the testing dataset is: ", accuracy_score(predict, y_test))

Output:

Size of Dataset is: 150
Accuracy score for the training dataset is:  0.9904761904761905
Accuracy score for the testing dataset is:  0.9111111111111111

K-Fold Cross Validation

The entire dataset is divided into K equal-sized pieces using the K-Fold cross-validation procedure. Each division is referred to as a "Fold." We refer to it as K-Folds because there are K pieces. The other K-1 folds are utilized as the training dataset, while One of the Fold is employed as a validation dataset.

Until each fold is employed as a validation dataset and the leftover folds are the training datasets, the procedure is repeated K times.

The average accuracy of the k number of models of the validation dataset is used to calculate the model's final accuracy.

Code

# Python program to show how to perform k-fold cross-validation test

# Importing the required library
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

# Loading the dataset
iris = load_iris()

# Getting dependent and independent features
X = iris.data
Y = iris.target
print("Size of Dataset is: ", len(X))

# Creating a logistic regression object 
logreg = LogisticRegression()

# Performing the logistic regression on train dataset
logreg.fit(X, Y)

# Performing K-fold cross-validation test
kf = KFold(n_splits = 5)
score = cross_val_score(logreg, X, Y, cv=kf)

# Printing accuracy scores
print("K-fold Cross Validation Scores are: ", score)
print("Mean Cross Validation score is: ", score.mean())

Output:

Size of Dataset is: 150
K-fold Cross Validation Scores are:  [1.         1.         0.86666667 0.93333333 0.83333333]
Mean Cross Validation score is:  0.9266666666666665

This technique should not evaluate unbalanced datasets. All the folds of the training dataset may only contain samples from class "0" and not any samples from class "1," as was mentioned in the context of HoldOut cross-validation. And validation dataset will include a sample having class "1".

Additionally, we cannot use time series data with this-the ordering of the samples matters for Time Series data. As opposed to this, samples are chosen at random for K-Fold Cross-Validation.

Stratified K-Fold Cross Validation

The improved K-Fold cross-validation method known as stratified K-Fold is typically applied to unbalanced datasets. The entire dataset is split into K-folds of the same size, just like K-fold.

However, in this method, each fold will contain the same proportion of target variable occurrences as the entire dataset.

This method performs correctly with unbalanced data. In stratified cross-validation, each fold will represent all data classes in the same proportion across the entire dataset.

The ordering of the samples matters for time series data; hence this method is not appropriate for time series. Stratified Cross-Validation, however, chooses samples in random order.

Code

# Python program to show how to perform stratified k-fold cross-validation test

# Importing the required library
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Loading the dataset
iris = load_iris()

# Getting dependent and independent features
X = iris.data
Y = iris.target
print("Size of Dataset is: ", len(X))

# Creating a logistic regression object 
logreg = LogisticRegression()

# Performing the logistic regression on train dataset
logreg.fit(X, Y)

# Performing stratified K-fold cross-validation test
stratified_kfold = StratifiedKFold(n_splits = 5)
score = cross_val_score(logreg, X, Y, cv = stratified_kfold )

# Printing accuracy scores
print("Stratified k-fold Cross Validation Scores are: ", score )
print("Average Cross Validation score is: ", score.mean())

Output:

Size of Dataset is: 150
Stratified k-fold Cross Validation Scores are:  [0.96666667 1.         0.93333333 0.96666667 1.        ]
Average Cross Validation score is:  0.9733333333333334

Leave P Out Cross Validation

A thorough cross-validation method called LeavePOut cross-validation uses the n-p samples for training the model and the leftover p-samples as the validation dataset.

Assume that the dataset contains 100 samples. If we choose p=10, 10 data entries will be utilized as the validation dataset for each iteration, while the rest 90 samples will constitute the training dataset.

This procedure is repeatedly executed until the entire dataset has been split into an n-p training sample dataset and a validation dataset of p-samples.

Every data sample is utilized for training and validation.

High computing time is required by this method. The method mentioned above will take more time to compute because it will keep executing itself until all samples have been used as a validation dataset.

This approach is not appropriate for unbalanced datasets. Similar to K-Fold Cross-validation, our model may not be generalized for the validation dataset if the training dataset contains samples from only one class.

Code

# Python program to show how to perform Leave P Out cross-validation test

# Importing the required library
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, LeavePOut

# Loading the dataset
iris = load_iris()

# Getting dependent and independent features
X = iris.data
Y = iris.target
print("Size of Dataset is: ", len(X))

# Splitting p dataset for LeavePOut cross-validation test
leave_pout = LeavePOut(p = 2)
leave_pout.get_n_splits(X)

# Creating a Random forest classifier object and computing the accuracy score
tree = RandomForestClassifier(n_estimators = 10, max_depth = 5, n_jobs = -1)
score = cross_val_score(tree, X, Y, cv = leave_pout)

# Printing accuracy scores
print("LeavePOut Cross Validation Scores are: ", score)
print("Average Cross Validation score is: ", score.mean())

Output:

Size of Dataset is:  150
LeavePOut Cross Validation Scores are:  [1. 1. 1. ... 1. 1. 1.]
Average Cross Validation score is:  0.9494854586129754

Leave One Out Cross Validation

This exhaustive cross-validation method, known as "LeaveOneOut cross-validation", uses the n-1 samples as the training dataset and the remaining 1 sample point as the validation dataset.

Assume that the dataset contains 100 samples. Following then, one value will be utilised as a validation dataset for each iteration, while the rest 99 samples will serve as the training dataset. As a result, the procedure is repeated until each sample in the dataset has served as a validation sample.

With p=1, it is equivalent to LeavePOut cross-validation.

Code

# Python program to show how to perform Leave One Out cross-validation test

# Importing the required library
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut

# Loading the dataset
iris = load_iris()

# Getting dependent and independent features
X = iris.data
Y = iris.target
print("Size of Dataset is: ", len(X))

# Splitting p dataset for LeavePOut cross-validation test
leave_oneout = LeaveOneOut()

# Creating a Random forest classifier object and computing the accuracy score
tree = RandomForestClassifier(n_estimators = 10, max_depth = 5, n_jobs = -1)
score = cross_val_score(tree, X, Y, cv = leave_oneout)

# Printing accuracy scores
print("LeavePOut Cross Validation Scores are: ", score)
print("Average Cross Validation score is: ", score.mean())

Output:

Size of Dataset is:  150
LeavePOut Cross Validation Scores are:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
Average Cross Validation score is:  0.9466666666666667

Monte Carlo Cross Validation(Shuffle Split)

A remarkably versatile cross-validation technique is Monte Carlo cross-validation, often referred to as Shuffle Split cross-validation. The datasets are arbitrarily divided into datasets to train and validate models in this method.

We have chosen how many portions of the dataset will serve as a training dataset and which part will serve as a validation dataset. The leftover dataset is not used in the training and validation datasets if the combined percentage of the training and the validation dataset size does not equal 100.

Let's assume we have 100 samples, of which 60% will be utilized as training data, and 20% will be used for validating the model. The remaining 20% (100 - (60 + 20)) will not be used.

We must specify how many times the splitting must occur.

The length of the datasets to train and validate the model is up to us.
We are not dependent on the number of splits for cycles and can pick the number of cycles.

An issue is that the model might not choose some samples in either the training or validation dataset.

Additionally, this method is unsuitable for datasets with imbalances. After the length of the training dataset and the validation, the dataset has been determined; all samples are chosen at random. As a result, it is possible that the training dataset does not contain the same class of data as the dataset for validation, making it impossible for the model to generalize to new data.

Code

# Python program to show how to perform Monte Carlo cross-validation test

# Importing the required library
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, ShuffleSplit

# Loading the dataset
iris = load_iris()

# Getting dependent and independent features
X = iris.data
Y = iris.target
print("Size of Dataset is: ", len(X))

# Creating a logistic regression object 
logreg = LogisticRegression()

# Performing the logistic regression on train dataset
logreg.fit(X, Y)

# Performing Shuffle Split cross-validation test
shuffle_split = ShuffleSplit(test_size = 0.3, train_size = 0.5, n_splits = 10)
score = cross_val_score(logreg, X, Y, cv = shuffle_split)

# Printing accuracy scores
print("Shuffle Split Cross Validation Scores are: ", score)
print("Mean Cross Validation score is: ", score.mean())

Output:

Size of Dataset is:  150
Shuffle Split Cross Validation Scores are:  [0.93333333 0.91111111 0.91111111 0.93333333 1.         0.97777778 0.93333333 0.93333333 0.95555556 0.93333333]
Mean Cross Validation score is:  0.9422222222222223

Time Series Cross Validation

What is a Time Series Data?

Data collected over a time period in a series is referred to as a time series. There is a chance for a correlation within observations because the data points were collected at close intervals. This is an essential time series feature, differentiating it from standard data.

We've already stated that time series data cannot be analyzed using earlier techniques. Therefore, we shall now cross-validate time series data.

Cross Validation for Time Series Data

Since we cannot use future values to predict values for the past, we are unable to select random samples of the time series data to assign them for training or validation datasets.

We divided the data into training and validating the model according to time using the "Forward chaining" approach, also known as rolling cross-validation, because the data chronology is crucial for time series-related tasks.

For the training dataset, we begin with a small part of the dataset. We make predictions for subsequent data points using that dataset and verify the accuracy.

The succeeding training dataset then contains the predicted values, and further samples are predicted.

This is one of the best methods for time series.

Other types of data cannot be validated using this. In this strategy, the sequence of the dataset is crucial, but in different methods, the algorithm picks random samples for the training or validation dataset.

Code

# Python program to show how to perform cross-validation test on Time Series data

# Importing the required library
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, TimeSeriesSplit

# Creating the dataset
X = np.array([[1, 3], [4, 5], [6, 8], [9, 10], [11, 12], [13, 14], [15, 17], [18, 20], [21, 22]])
Y = np.array([1, 2, 5, 8, 2, 8, 9, 3, 3, 7])

# Splitting the dataset using the TimeSeriesSplit function
time_series_split = TimeSeriesSplit(6)
print(time_series_split)

scores = []
# Printing the dataset divided for training and validation of the model and training the model
for train_indices, test_indices in time_series_split.split(X):
    print("TRAIN:", train_indices, "TEST:", test_indices)
    X_train, X_test = X[train_indices], X[test_indices]
    Y_train, Y_test = Y[train_indices], Y[test_indices]
    logreg = LogisticRegression()
    logreg.fit(X_train, Y_train)
    preds = logreg.predict(X_test)

    # accuracy score for the current fold 
    score = logreg.score(X_test, Y_test)

    scores.append(score)

# Printing accuracy scores
print("TimeSeriesSplit Cross Validation Scores are: ", scores)

Output:

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=6, test_size=None)
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
TRAIN: [0 1 2 3 4 5] TEST: [6]
TRAIN: [0 1 2 3 4 5 6] TEST: [7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]
TimeSeriesSplit Cross Validation Scores are:  [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]

Comparison of Cross-validation to Train/Test Split

Test/training split: A ratio of 70:30, 80:20, etc., is used to divide the input dataset into a training dataset and test dataset portions. One of its main drawbacks is the considerable variance it offers.

Training Data: The dependent feature is known, and the training dataset is utilized for training our model.

Test Data: The model, which has already been trained, makes predictions using the test dataset. Although not a component, this has similar characteristics to the training dataset.

By dividing our dataset into sets of train/test splits and averaging the results, the cross-validation dataset is utilized to address the drawback of the train/test split. It can be employed if we wish to improve the performance of our model after it has been trained using the training dataset. Since every data point is used for training and testing the model, it is more effective than a train/test split.

Limitations of Cross-Validation

The cross-validation method has some drawbacks, some of which are listed below:

It delivers the best results under the best circumstances. However, the contradictory data could lead to a dramatic outcome. There is uncertainty over the type of data used in machine learning, which is one of the significant drawbacks of cross-validation.

Because data in predictive modelling changes over time, there may be variations between the training and validation datasets. For instance, if we develop a stock market value prediction model and the dataset is trained on the stock prices from the previous five years. Still, the plausible future stock prices for the following five years could be very different. It is challenging to anticipate the appropriate result in such circumstances.

Applications of Cross-Validation

We can use this approach to evaluate how various predictive modelling approaches work.
It has a lot of potential for medical study.
As data scientists are already using it in medical statistics, we can also employ it for meta-analysis.

Visualizing Cross Validation Methods

Code

# Python program to visualize splits of the dataset in various cross-validation methods

# Importing required libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    TimeSeriesSplit,
    ShuffleSplit,
    StratifiedShuffleSplit
)

# Creating a list of cross-validation methods offered by sklearn
cvm = [
    KFold,
    StratifiedKFold,
    TimeSeriesSplit,
    ShuffleSplit,
    StratifiedShuffleSplit
]

rng = np.random.RandomState(1475)
n_splits = 4
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm


# Creating random datasets 
n_data = 100
X = rng.randn( 100, 10 )

# Creating classes for dependent feature y
percentiles_classes = [0.1, 0.3, 0.6]
y = np.hstack([ [ind] * int(100 * percentile) for ind, percentile in enumerate( percentiles_classes ) ])

# Generate uneven groups of data
group = rng.dirichlet( [2] * 10 )
groups = np.repeat(np.arange(10), rng.multinomial(100, group))

# Creating a function to plot the indices of data by a certain cross-validation method
def plot_indices(cvm, X, y, p_group, axis, n_splits, lw = 10):
    """This function will create a dummy plot of the indices of the cross-validation method cvm"""

    # Creating the training/testing visualization for each cross validation split of the dataset
    for ind, (r, t) in enumerate( cvm.split(X = X, y = y, groups = p_group) ):
        # Giving the indices the training / test group
        index = np.array( [np.nan] * len(X) )
        index[t] = 1
        index[r] = 0

        # Visualizing the results of training/testing data
        axis.scatter( range(len(index)), [ind + 0.5] * len(index), c = index, marker = "_", lw = lw, cmap = cmap_cv, vmin = -0.2, vmax = 1.2 )

    # Ploting the classes and groups of our dataset on the same graph
    axis.scatter( range(len(X)), [ind + 1.5] * len(X), c = y, marker = "_", lw = lw, cmap = cmap_data )

    axis.scatter( range(len(X)), [ind + 2.5] * len(X), c = p_group, marker = "_", lw = lw, cmap = cmap_data )

    # Formatting the graphs to make them more readable 
    yticklabels = list(range(n_splits)) + ["class", "group"]
    axis.set( yticks = np.arange(n_splits + 2) + 0.5, yticklabels = yticklabels, xlabel = "Sample index", ylabel = "CV iteration", ylim = [n_splits + 2.2, -0.2], xlim = [0, 100] )
    axis.set_title( f"{type(cvm).__name__}", fontsize = 15 )
    return axis


for cv in cvm:
    current_cv = cv(n_splits = n_splits)
    figure, axis = plt.subplots(figsize=(8, 4))
    
    
    plot_indices(current_cv, X, y, groups, axis, n_splits)

    axis.legend( [Patch(color = cmap_cv(0.8)), Patch(color = cmap_cv(0.02))], ["Testing dataset", "Training dataset"], loc = (1.02, 0.8) )
    # Adjusting the position of legends of the graphs
    plt.tight_layout()
    figure.subplots_adjust(right = 0.5)
    plt.show()

Output:

Conclusion

Many machine learning tasks use the train-test split basic idea; however, if we have enough resources available to us, we should think about using cross-validation to solve our problem. An inconsistent score throughout the many folds would indicate that we are missing a crucial relationship inside the data we have. This approach also helps us in using fewer data.

← prev next →