Repeated k-Fold Cross-Validation for Model Evaluation in Python
The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm or configuration on a dataset.
A single run of the k-fold cross-validation procedure may result in a noisy estimate of model performance. Different splits of the data may produce very different results.
Repeated k-fold cross-validation provides a way to improve the estimate of a machine learning model's performance. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, and the uncertainty in that estimate can be quantified using the standard error.
In this tutorial, you will discover repeated k-fold cross-validation for model evaluation.
After completing this tutorial, you will know:
The mean performance reported from a single run of k-fold cross-validation may be noisy.
Repeated k-fold cross-validation provides a way to reduce the error in the estimate of mean model performance.
How to evaluate machine learning models using repeated k-fold cross-validation in Python.
Let’s get started.
Repeated k-Fold Cross-Validation for Model Evaluation in Python
Photo by lina smith, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
k-Fold Cross-Validation
Repeated k-Fold Cross-Validation
Repeated k-Fold Cross-Validation in Python
k-Fold Cross-Validation
It is common to evaluate machine learning models on a dataset using k-fold cross-validation.
The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets, and the mean performance is reported.
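To make the splitting step concrete, the short sketch below shows how the rows of a dataset might be partitioned into folds. It is only an illustration and makes a few assumptions that are not part of the description above: a toy dataset of ten row indices, k=5, and shuffling with a fixed seed. It uses the KFold class from scikit-learn and checks that every row is held back for testing exactly once.

```python
# minimal sketch of how k-fold cross-validation partitions a dataset
import numpy as np
from sklearn.model_selection import KFold

# ten toy samples identified only by their row index (illustrative assumption)
data = np.arange(10)

# k=5 folds, shuffled with a fixed seed so the split is repeatable (assumed values)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# count how many times each row lands in a test set
test_counts = np.zeros(len(data), dtype=int)
fold_sizes = []
for fold, (train_ix, test_ix) in enumerate(kfold.split(data), start=1):
    # the train and test rows of a single fold never overlap
    assert len(np.intersect1d(train_ix, test_ix)) == 0
    test_counts[test_ix] += 1
    fold_sizes.append((len(train_ix), len(test_ix)))
    print('fold %d: train=%s, test=%s' % (fold, train_ix, test_ix))

# every row is used as test data exactly once across the k folds
print('times each row was held out:', test_counts)
```

With ten rows and five folds, each fold trains on eight rows and tests on the remaining two. In a real evaluation, a model would be fit on the training rows of each fold and scored on the test rows, and the k scores would then be averaged.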
For more on the k-fold cross-validation procedure, see the tutorial:
A Gentle Introduction to k-fold Cross-Validation
The k-fold cross-validation procedure can be implemented easily using the scikit-learn machine learning library.
First, let’s define a synthetic classification dataset that we can use as the basis of this tutorial.
The make_classification() function can be used to create a synthetic binary classification dataset. We will configure it to generate 1,000 samples, each with 20 input features, 15 of which are informative for the target variable and 5 of which are redundant.
The example below creates and summarizes the dataset.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,...