How to Configure k-Fold Cross-Validation
The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm on a dataset.
A common value for k is 10, but how do we know that this configuration is appropriate for our dataset and our algorithms?
One approach is to explore the effect of different k values on the estimate of model performance and compare this to an ideal test condition. This can help to choose an appropriate value for k.
Once a value for k is chosen, it can be used to evaluate a suite of different algorithms on the dataset, and the distribution of results can be compared to an evaluation of the same algorithms under an ideal test condition to see whether they are highly correlated. If they are, it confirms that the chosen configuration is a robust approximation of the ideal test condition.
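As a rough sketch of this idea, the harness-vs-ideal comparison can be expressed in code. The snippet below is illustrative, not the tutorial's implementation: it approximates the "ideal" test condition with leave-one-out cross-validation, picks three arbitrary algorithms, and measures agreement with a Pearson correlation.

```python
# Illustrative sketch: compare a 10-fold CV harness against an "ideal" test
# condition (approximated here by leave-one-out CV) across a few algorithms,
# then check how strongly the two sets of scores correlate.
from numpy import mean
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# synthetic dataset (parameters chosen for illustration)
X, y = make_classification(n_samples=100, n_features=20, n_informative=15,
                           random_state=1)
models = [LogisticRegression(solver='liblinear'),
          DecisionTreeClassifier(random_state=1),
          GaussianNB()]

cv_scores, ideal_scores = [], []
for model in models:
    # score with the chosen test harness (10-fold CV)
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    cv_scores.append(mean(cross_val_score(model, X, y, scoring='accuracy', cv=cv)))
    # score with the ideal test condition (LOOCV)
    ideal_scores.append(mean(cross_val_score(model, X, y, scoring='accuracy',
                                             cv=LeaveOneOut())))

corr, _ = pearsonr(cv_scores, ideal_scores)
print('Correlation: %.3f' % corr)
```

A high positive correlation suggests the cheaper 10-fold harness ranks algorithms much as the ideal condition would.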
In this tutorial, you will discover how to configure and evaluate configurations of k-fold cross-validation.
After completing this tutorial, you will know:
How to evaluate a machine learning algorithm using k-fold cross-validation on a dataset.
How to perform a sensitivity analysis of k-values for k-fold cross-validation.
How to calculate the correlation between a cross-validation test harness and an ideal test condition.
Let’s get started.
This tutorial is divided into three parts; they are:
k-Fold Cross-Validation
Sensitivity Analysis for k
Correlation of Test Harness With Target
It is common to evaluate machine learning models on a dataset using k-fold cross-validation.
The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.
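The splitting described above can be seen directly with scikit-learn's KFold class. The toy example below (a 10-sample stand-in dataset with k=5, values chosen only for illustration) prints which samples land in each train and test fold.

```python
# Minimal sketch of how k-fold splits a dataset into k non-overlapping folds.
from numpy import arange
from sklearn.model_selection import KFold

data = arange(10)  # stand-in for a small dataset
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
splits = list(kfold.split(data))
for i, (train_ix, test_ix) in enumerate(splits):
    # each sample appears in exactly one test fold across the k splits
    print('fold %d: train=%s test=%s' % (i, data[train_ix], data[test_ix]))
```

Note that the test folds are disjoint and together cover every sample exactly once.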
For more on the k-fold cross-validation procedure, see the tutorial:
A Gentle Introduction to k-fold Cross-Validation
The k-fold cross-validation procedure can be implemented easily using the scikit-learn machine learning library.
First, let’s define a synthetic classification dataset that we can use as the basis of this tutorial.
The make_classification() function can be used to create a synthetic binary classification dataset. We will configure it to generate 100 samples each with 20 input features, 15 of which contribute to the target variable.
The example below creates and summarizes the dataset.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset (the original call is truncated; n_redundant=5 and
# random_state=1 are assumptions consistent with the description above)
X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
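With the dataset in hand, a model can be evaluated with the common k=10 configuration via cross_val_score. Logistic regression is used purely as an example algorithm here, and the n_redundant and random_state arguments are assumptions, since the original call above is truncated.

```python
# Sketch: evaluate an example model on the synthetic dataset with 10-fold CV.
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

# same synthetic dataset as above (some parameters assumed)
X, y = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
# the common configuration: k=10
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# fit and score k models, one per held-back fold
scores = cross_val_score(LogisticRegression(solver='liblinear'), X, y,
                         scoring='accuracy', cv=cv, n_jobs=-1)
# report mean and standard deviation of the k scores
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

The mean accuracy is the single performance estimate the procedure reports; the standard deviation indicates how much that estimate varies across folds.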