Train-Test Split for Evaluating Machine Learning Algorithms


The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.
It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms on your predictive modeling problem. Although it is simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset. There are also situations where additional configuration is required, such as when it is used for classification and the dataset is not balanced.
In this tutorial, you will discover how to evaluate machine learning models using the train-test split.
After completing this tutorial, you will know:

The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or require a good estimate of model performance quickly.
How to use the scikit-learn machine learning library to perform the train-test split procedure.
How to evaluate machine learning algorithms for classification and regression using the train-test split.

Let’s get started.

Train-Test Split for Evaluating Machine Learning Algorithms. Photo by Paul VanDerWerf, some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

Train-Test Split Evaluation
    When to Use the Train-Test Split
    How to Configure the Train-Test Split
Train-Test Split Procedure in Scikit-Learn
    Repeatable Train-Test Splits
    Stratified Train-Test Splits
Train-Test Split to Evaluate Machine Learning Models
    Train-Test Split for Classification
    Train-Test Split for Regression

Train-Test Split Evaluation
The train-test split is a technique for evaluating the performance of a machine learning algorithm.
It can be used for classification or regression problems and with any supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, its input elements are provided to the model, and the resulting predictions are compared to the expected values. This second subset is referred to as the test dataset.

Train Dataset: Used to fit the machine learning model.

Test Dataset: Used to evaluate the fitted machine learning model.
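
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive recipe: it assumes scikit-learn is installed, and the synthetic dataset, the 67/33 split ratio, the logistic regression model, and the accuracy metric are all choices made here for the example, not requirements of the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification dataset (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Divide the dataset into two subsets: a training dataset and a test dataset.
# The 33% test size is an illustrative choice, not a rule.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Fit the model on the training dataset only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Provide the test inputs to the fitted model and compare the
# predictions to the expected values.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Train rows: {X_train.shape[0]}, test rows: {X_test.shape[0]}")
print(f"Accuracy on the test dataset: {accuracy:.3f}")
```

The test rows never reach `fit()`, so the reported accuracy is computed on data the model has not seen during training.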

The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.
This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or...
