Sklearn Model Selection

Sklearn's model_selection module provides tools to cross-validate a model, tune an estimator's hyperparameters, and produce validation and learning curves.

Below is a list of the classes and functions provided in this module. Later, we will look at the theory behind them and their use with code examples.

Splitter Classes

model_selection.GroupKFold([n_splits]) — K-Fold variant with non-overlapping groups; the same group never appears in both the training and the test set.
model_selection.GroupShuffleSplit([...]) — Shuffle-Group(s)-Out cross-validation iterator.
model_selection.KFold([n_splits, shuffle, ...]) — Standard K-Fold cross-validator.
model_selection.LeaveOneGroupOut() — Leave One Group Out cross-validator.
model_selection.LeavePGroupsOut(n_groups) — Leave P Groups Out cross-validator.
model_selection.LeaveOneOut() — Leave One Out cross-validator.
model_selection.LeavePOut(p) — Leave P Out cross-validator, a generalisation of Leave One Out.
model_selection.PredefinedSplit(test_fold) — Cross-validator that splits the data according to a user-supplied test_fold array.
model_selection.RepeatedKFold(*[, n_splits, ...]) — Repeats K-Fold cross-validation several times with different randomisation in each repetition.
model_selection.RepeatedStratifiedKFold(*[, ...]) — Repeats stratified K-Fold cross-validation several times.
model_selection.ShuffleSplit([n_splits, ...]) — Random-permutation cross-validator that shuffles the dataset before each split.
model_selection.StratifiedKFold([n_splits, ...]) — Stratified K-Fold cross-validator that preserves class proportions in each fold.
model_selection.StratifiedShuffleSplit([...]) — Stratified shuffle-split cross-validator.
model_selection.StratifiedGroupKFold([...]) — Stratified K-Fold cross-validator with non-overlapping groups.
model_selection.TimeSeriesSplit([n_splits, ...]) — Cross-validator for time-series data.

Splitter Functions

model_selection.check_cv([cv, y, classifier]) — Input-checker utility that builds a cross-validator from the cv argument.
model_selection.train_test_split(*arrays[, ...]) — Splits matrices or arrays into random training and test subsets.

Hyper-parameter optimizers

model_selection.GridSearchCV(estimator, ...) — Exhaustive, cross-validated search over a specified parameter grid for an estimator.
model_selection.HalvingGridSearchCV(...[, ...]) — Search over the given parameters using successive halving.
model_selection.ParameterGrid(param_grid) — Grid of parameters, each with a discrete list of values; iterating over it yields every combination.
model_selection.ParameterSampler(...[, ...]) — Generator of parameter samples drawn from the given distributions.
model_selection.RandomizedSearchCV(...[, ...]) — Randomized search over hyper-parameters.
model_selection.HalvingRandomSearchCV(...[, ...]) — Randomized search over hyper-parameters using successive halving.

Cross-validation: assessing the performance of the estimator

A fundamental error data scientists make while building a model is to learn the parameters of a prediction function and then evaluate the model on the same data. A model that simply memorises the labels of the samples it was trained on would score perfectly but be unable to predict anything useful about data it has not yet seen. This situation is called overfitting.

To avoid this problem, it is common practice in a machine learning experiment to hold out part of the available data as a validation or test set (X_test, y_test). Note that the word "experiment" does not refer only to academic work; even in commercial settings, machine learning usually starts out experimentally.

Code
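The listing below is a minimal sketch consistent with the output that follows, assuming a DecisionTreeClassifier scored with 5-fold KFold on the iris dataset; the estimator and random_state are assumptions, so the exact scores may differ.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset (150 samples, 3 classes)
X, y = load_iris(return_X_y=True)
print("Size of Dataset is:", X.shape[0])

# 5-fold cross-validation; the choice of estimator is an assumption
k_folds = KFold(n_splits=5)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=k_folds)

print("K-fold Cross Validation Scores are: ", scores)
print("Mean Cross Validation score is: ", scores.mean())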

Output

Size of Dataset is: 150
K-fold Cross Validation Scores are:  [1.         1.         0.86666667 0.93333333 0.83333333]
Mean Cross Validation score is:  0.9266666666666665

Here is another example of cross-validation, this time using the LeavePOut splitter.

Code
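A minimal sketch consistent with the output below, assuming LeavePOut with p=2 and a DecisionTreeClassifier on the iris dataset; both choices are assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
print("Size of Dataset is: ", X.shape[0])

# Leave-P-Out with p=2 (assumed) generates C(150, 2) = 11175 splits,
# so it is computationally expensive on anything larger than a toy dataset
lpo = LeavePOut(p=2)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=lpo)

print("LeavePOut Cross Validation Scores are: ", scores)
print("Average Cross Validation score is: ", scores.mean())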

Output

Size of Dataset is:  150
LeavePOut Cross Validation Scores are:  [1. 1. 1. ... 1. 1. 1.]
Average Cross Validation score is:  0.9494854586129754

Cross-validated Metrics Calculation

The most straightforward way to compute cross-validated metrics is to call the cross_val_score helper on the estimator and the dataset.

The following example shows how to split the data, build a model and evaluate it with cross-validation, computing the score five times in a row (with a different split each time) to estimate the accuracy of a support vector machine on sklearn's iris dataset.

Code
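A minimal sketch consistent with the output below, assuming a linear SVC on the iris dataset with a 40% hold-out split followed by 5-fold cross-validation; the kernel, C value and random_state are assumptions.

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Hold out 40% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Estimate accuracy five times with 5-fold cross-validation on the full dataset
scores = cross_val_score(svm.SVC(kernel='linear', C=1), X, y, cv=5)
print(scores)
print(scores.mean())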

Output

0.9666666666666667
[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001

Tuning the Hyper-Parameters of an Estimator

Hyper-parameters are parameters that an estimator does not learn from the data; they are supplied to the constructor of the estimator class. Common examples are C, kernel and gamma for the Support Vector Classifier, alpha for Lasso, and so on.

It is possible, and recommended, to search the hyper-parameter space for the configuration that gives the best cross-validation score.

Any parameter supplied when constructing an estimator can be optimised in this way. To obtain the names and the current values of all the parameters of a given estimator, use:

estimator.get_params()

A search consists of: an estimator (a regressor or classifier, such as sklearn.svm.SVC()); a parameter space; a method for searching or sampling candidates; a cross-validation scheme; and a scoring function.

GridSearchCV implements "fit" and "score" methods. It also implements "predict", "predict_proba", "decision_function", "transform", "inverse_transform" and "score_samples" if the estimator used in the grid search supports them.

Grid search exhaustively generates candidates from a parameter grid and uses cross-validation to optimise the parameters of the estimator it is applied to.

Code
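A minimal sketch consistent with the output below; the parameter grid is taken from the printed GridSearchCV object, while the dataset (iris) is an assumption.

from pprint import pprint
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Parameter grid matching the output below
parameters = {'C': [1, 20], 'kernel': ('linear', 'poly')}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

print(clf)                              # the fitted GridSearchCV object
pprint(sorted(clf.cv_results_.keys()))  # the recorded cross-validation result fields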

Output

GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 20], 'kernel': ('linear', 'poly')})
['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_C',
 'param_kernel',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']

We can improve the efficiency of GridSearchCV by using the successive-halving grid search algorithm, HalvingGridSearchCV.

This search strategy starts by evaluating all candidates with a small amount of resources and repeatedly selects the best candidates, granting them progressively more resources.

Code
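A minimal sketch consistent with the output below, assuming a RandomForestClassifier on a synthetic classification dataset with n_estimators used as the halving resource; the dataset and parameter grid are assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: enables HalvingGridSearchCV
from sklearn.model_selection import HalvingGridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# Candidate parameters; n_estimators is the "resource" that successive halving increases
param_grid = {'max_depth': [3, None], 'min_samples_split': [5, 10]}
search = HalvingGridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                             resource='n_estimators', max_resources=10,
                             random_state=0).fit(X, y)
print(search.best_params_)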

Output

{'max_depth': None, 'min_samples_split': 10, 'n_estimators': 9}

Points to Remember while Performing Parameter Search

Defining an Objective Metric

By default, a parameter search evaluates each parameter configuration using the estimator's score function: sklearn.metrics.accuracy_score for classification and sklearn.metrics.r2_score for regression. Other scoring metrics may be more appropriate for a particular application.

Defining More than One Metric for Evaluation

We can specify multiple scoring metrics with GridSearchCV and RandomizedSearchCV.

Multimetric scoring can be given either as a list of strings of predefined scorer names or as a dictionary mapping scorer names to scoring functions and/or predefined scorer names.
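For instance, either form might look like this brief sketch; the scorer names chosen here are illustrative:

from sklearn.metrics import accuracy_score, make_scorer

# As a list of predefined scorer names ...
scoring = ['accuracy', 'precision_macro']

# ... or as a dictionary mapping names to predefined scorers and/or callables
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'acc_callable': make_scorer(accuracy_score)}

When GridSearchCV is given multiple metrics, its refit parameter must either name the metric used to select the best parameters or be set to False.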

Nested Estimators and Parameter Spaces

Using the <estimator>__<parameter> syntax (with a double underscore), GridSearchCV and RandomizedSearchCV can search over the parameters of nested estimators such as Pipeline, VotingClassifier, CalibratedClassifierCV or ColumnTransformer.
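As a sketch, a grid search over the parameters of an SVC placed inside a Pipeline addresses them through the step name; the step names, dataset and grid below are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Parameters of the nested estimator are addressed as <step name>__<parameter>
param_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']}
search = GridSearchCV(pipe, param_grid).fit(X, y)
print(search.best_params_)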

Parallelism

The parameter search tools evaluate each parameter combination on each data fold independently, so the work parallelises naturally. Setting n_jobs=-1 runs these computations in parallel on all available processors.
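A minimal illustration; the estimator and grid here are placeholders:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# n_jobs=-1 evaluates the candidate parameter combinations in parallel on all CPU cores
search = GridSearchCV(SVC(), {'C': [1, 10, 100]}, n_jobs=-1)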

Robustness to Failure

Some parameter settings may make it impossible to fit one or more folds of the data. By default, this causes the entire search to fail, even though other parameter settings could be fully evaluated. Setting error_score=0 (or error_score=np.nan) makes the search robust to such failures: a warning is issued, the score for that fold is set to 0 (or NaN), and the search runs to completion.
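A brief sketch of how this option is passed; the estimator and grid are placeholders:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# error_score=np.nan records NaN (with a warning) for any fold whose fit fails,
# instead of aborting the whole search
search = GridSearchCV(SVC(), {'C': [1, 10, 100]}, error_score=np.nan)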

Calculating the Prediction Quality

We can use three different APIs to evaluate the quality of a model's predictions:

  • Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. This is covered in each estimator's documentation.
  • Scoring parameter: Cross-validation model-evaluation tools such as sklearn.model_selection.cross_val_score and sklearn.model_selection.GridSearchCV rely on an internal scoring strategy, controlled by the scoring parameter.
  • Metric functions: The sklearn.metrics module provides functions that measure prediction error for specific purposes.

Examples of Model Evaluation Techniques

Tools such as model_selection.GridSearchCV and model_selection.cross_val_score provide a scoring parameter to choose the scoring method.

Using sklearn.metrics to Include Different Scoring Methods

Code
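A minimal sketch consistent with the output below, assuming a linear SVC on the iris dataset evaluated with 7-fold cross-validation; the estimator and fold count are assumptions.

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Score the same model with two different scoring methods
for method in ('balanced_accuracy', 'f1_micro'):
    scores = cross_val_score(clf, X, y, cv=7, scoring=method)
    print(f"Accuracy score for the scoring method '{method}' : \n", scores)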

Output

Accuracy score for the scoring method 'balanced_accuracy' : 
 [0.95238095 1.         1.         0.95238095 0.95238095 0.95238095
 1.        ]
Accuracy score for the scoring method 'f1_micro' : 
 [0.95454545 1.         1.         0.95238095 0.95238095 0.95238095
 1.        ]

Additionally, Scikit-learn's GridSearchCV, RandomizedSearchCV, and cross_validate all support evaluation on multiple metrics simultaneously.

Code
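A minimal sketch consistent with the output below, assuming a linear SVC on iris with 7-fold cross-validation and two metrics named 'accuracy' and 'prec'; the estimator, fold count and precision averaging are assumptions, and the two calls illustrate scoring given as names versus callables.

from sklearn import datasets, svm
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import cross_validate

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Metrics given as predefined scorer names
print(cross_validate(clf, X, y, cv=7,
                     scoring={'accuracy': 'accuracy', 'prec': 'precision_macro'}))

# The same metrics, with one given as a scorer callable built with make_scorer
print(cross_validate(clf, X, y, cv=7,
                     scoring={'accuracy': 'accuracy',
                              'prec': make_scorer(precision_score, average='macro')}))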

Output

{'fit_time': array([0.00099993, 0.00099993, 0.0150001 , 0.00200009, 0.00199986,
       0.00100017, 0.00100017]), 'score_time': array([0.00099993, 0.00099993, 0.00099993, 0.00200033, 0.00099993,
       0.00099993, 0.00099993]), 'test_accuracy': array([0.95454545, 1.        , 1.        , 0.95238095, 0.95238095,
       0.95238095, 1.        ]), 'test_prec': array([0.9331307 , 1.        , 1.        , 0.92857143, 0.92857143,
       0.92857143, 1.        ])}
{'fit_time': array([0.00099993, 0.00099993, 0.0150001 , 0.00200009, 0.00199986,
       0.00100017, 0.00100017]), 'score_time': array([0.00099993, 0.00099993, 0.00099993, 0.00200033, 0.00099993,
       0.00099993, 0.00099993]), 'test_accuracy': array([0.95454545, 1.        , 1.        , 0.95238095, 0.95238095,
       0.95238095, 1.        ]), 'test_prec': array([0.9331307 , 1.        , 1.        , 0.92857143, 0.92857143,
       0.92857143, 1.        ])}

Implementing Our Own Scoring Object

We can build even more flexible scoring methods by constructing our own scoring object from scratch, without using the make_scorer factory. For a callable to be a valid scorer, it must follow the two rules below (a minimal sketch follows the list):

  • It can be called with the parameters (estimator, X, y), where estimator is the model to be evaluated, X is validation data and y is the ground-truth target.
  • It returns a floating-point number that quantifies the quality of the estimator's predictions on X with respect to y.
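A minimal sketch of such a scorer callable; the estimator and dataset are arbitrary choices for illustration:

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

def fraction_correct(estimator, X, y):
    """Scorer callable: accepts (estimator, X, y) and returns a single float."""
    y_pred = estimator.predict(X)
    return float(np.mean(y_pred == y))

X, y = datasets.load_iris(return_X_y=True)
print(cross_val_score(svm.SVC(kernel='linear'), X, y, cv=5, scoring=fraction_correct))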

The sklearn.metrics module also provides a set of simple functions that measure prediction error given the true and predicted values:

  • Functions ending in _score return a value to maximise: the higher the score, the better the model.
  • Functions ending in _error or _loss return a value to minimise: the lower the value, the better the model. When converting such a function into a scorer object with make_scorer, set the greater_is_better parameter to False, as in the sketch below.
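A brief sketch of wrapping a loss function into a scorer; the estimator and dataset are illustrative choices:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# greater_is_better=False negates the loss so that higher scorer values mean a better model
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

X, y = load_diabetes(return_X_y=True)
print(cross_val_score(Lasso(), X, y, cv=5, scoring=neg_mse_scorer))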

Validation Curves and Learning Curves

Every estimator we build has its strengths and weaknesses. Its generalisation error can be decomposed into bias, variance and noise. The bias of an estimator is its average error across different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data itself.

Since bias and variance are inherent properties of estimators, we usually have to select the learning algorithm and tune its hyperparameters so that both bias and variance are as low as possible. Another widely used way to reduce an estimator's variance is to use more training data. However, collecting more training data only helps if the true function is too complex to be approximated by an estimator with lower variance.

Validation Curve

To validate a model we need a scoring function, for example accuracy for classifiers. The proper way to choose multiple hyperparameters of an estimator is grid search or a similar method that selects the hyperparameters with the highest score on a validation set. Note that if we tune the hyperparameters based on a validation score, that score is biased and no longer a reliable estimate of generalisation. To get a proper estimate of generalisation, the score must be computed on a separate test set.

Code
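A minimal sketch consistent with the shape of the output below (three parameter values, ten folds), assuming validation_curve applied to a Ridge regressor on a shuffled iris dataset over three values of alpha; the dataset, parameter range and fold count are assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

np.random.seed(0)
X, y = load_iris(return_X_y=True)

# Shuffle the samples so the classes are mixed across folds
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X, y = X[indices], y[indices]

# Score Ridge regression for three values of alpha with 10-fold CV (assumed settings)
train_scores, valid_scores = validation_curve(
    Ridge(), X, y, param_name="alpha",
    param_range=np.logspace(-7, 7, 3), cv=10)

print(train_scores)
print(valid_scores)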

Output

[[0.93973174 0.9494471  0.94188484 0.9435467  0.94419797 0.94252983
  0.94608803 0.9460067  0.9423957  0.94253182]
 [0.93944487 0.94914898 0.94164577 0.94328439 0.9439338  0.94224685
  0.94584605 0.9457347  0.94212464 0.94227221]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]]
[[ 0.97300409  0.87636051  0.95494321  0.9417682   0.92512841  0.93081973
   0.91016821  0.91853245  0.94939447  0.9475685 ]
 [ 0.96865934  0.8803329   0.95417121  0.9445263   0.92533163  0.93193911
   0.90834795  0.91843293  0.95295644  0.94553797]
 [-0.01991239 -0.20576132 -0.09876543 -0.01991239 -0.09876543 -0.51981806
  -0.02805836  0.         -0.01991239 -0.01991239]]

If both the training score and the validation score are low, the estimator is underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise, it is working fairly well. A low training score combined with a high validation score is usually not possible.

Learning Curve

A learning curve shows an estimator's training and validation scores for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a bias error or a variance error. Look at the example below, which shows the learning curves of a Lasso regressor.

Code
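A minimal sketch consistent with the shape of the output below (three training sizes, ten folds), assuming learning_curve applied to a Lasso regressor on the diabetes dataset; the dataset, alpha, training sizes and fold count are assumptions, so the exact scores may differ.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import learning_curve

X, y = load_diabetes(return_X_y=True)

# Evaluate Lasso at three training-set sizes with 10-fold cross-validation (assumed settings)
train_sizes, train_scores, valid_scores = learning_curve(
    Lasso(alpha=0.1), X, y, train_sizes=np.linspace(0.3, 1.0, 3), cv=10)

print(train_scores)
print(valid_scores)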

Output

[[0.50150218 0.43690063 0.5154695  0.51452348 0.51452348 0.51452348
  0.51452348 0.51452348 0.51452348 0.51452348]
 [0.48754102 0.46153892 0.50115233 0.47687368 0.53209581 0.4838794
  0.4838794  0.4838794  0.4838794  0.4838794 ]
 [0.47761141 0.45787213 0.48397355 0.46741065 0.49652965 0.47942832
  0.50579115 0.48615822 0.45051371 0.48083313]]
[[0.43915574 0.33445944 0.38829399 0.50764841 0.45173949 0.15496657
  0.40651899 0.48801649 0.56571766 0.4732054 ]
 [0.44653145 0.42970004 0.4145817  0.4872139  0.43139094 0.21609031
  0.42580156 0.48481259 0.55030939 0.4521308 ]
 [0.43844149 0.40086229 0.43313405 0.46494005 0.38326539 0.23284754
  0.46030724 0.48905027 0.51427695 0.42897388]]




