How to Avoid Data Leakage When Performing Data Preparation


Data preparation is the process of transforming raw data into a form that is appropriate for modeling.
A naive approach to preparing data applies the transform to the entire dataset before evaluating the performance of the model. This results in a problem referred to as data leakage, where knowledge of the hold-out test set leaks into the dataset used to train the model. This can result in an incorrect estimate of model performance when making predictions on new data.
A careful application of data preparation techniques is required in order to avoid data leakage, and this varies depending on the model evaluation scheme used, such as train-test splits or k-fold cross-validation.
In this tutorial, you will discover how to avoid data leakage during data preparation when evaluating machine learning models.
After completing this tutorial, you will know:

Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
Data preparation methods must be fit on the training set only in order to avoid data leakage.
How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.

Let’s get started.


Tutorial Overview
This tutorial is divided into three parts; they are:

Problem With Naive Data Preparation
Data Preparation With Train and Test Sets
    Train-Test Evaluation With Naive Data Preparation
    Train-Test Evaluation With Correct Data Preparation
Data Preparation With k-fold Cross-Validation
    Cross-Validation Evaluation With Naive Data Preparation
    Cross-Validation Evaluation With Correct Data Preparation

Problem With Naive Data Preparation
The manner in which data preparation techniques are applied to data matters.
A common approach is to first apply one or more transforms to the entire dataset. Then either the dataset is split into train and test sets, or k-fold cross-validation is used, to fit and evaluate a machine learning model.

1. Prepare Dataset
2. Split Data
3. Evaluate Models

Although this is a common approach, it is dangerously incorrect in most cases.
The problem with applying data preparation techniques before splitting data for model evaluation is that it can lead to data leakage and, in turn, will likely result in an incorrect estimate of a model’s performance on the problem.
Data leakage refers to a problem where information about the holdout dataset, such as a test or validation dataset, is made available to the model in the training dataset. This leakage is often small and subtle but can have a marked effect on performance.
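To make the mechanism concrete, here is a minimal sketch using min-max scaling on a single feature. The toy values and the helper names fit_min_max and apply_min_max are hypothetical, introduced only for illustration, and the last two values are assumed to play the role of the hold-out test set.

```python
# Toy single-feature dataset (hypothetical values, for illustration only).
# The last two values are assumed to be the hold-out test set.
values = [2.0, 4.0, 6.0, 8.0, 10.0, 50.0]
train, test = values[:4], values[4:]


def fit_min_max(data):
    """Learn the scaling statistics (minimum and maximum) from the given data."""
    return min(data), max(data)


def apply_min_max(data, lo, hi):
    """Rescale data using statistics that were learned earlier."""
    return [(x - lo) / (hi - lo) for x in data]


# Naive: statistics are learned from the whole dataset, so the test rows
# influence how the training rows are scaled.
naive_lo, naive_hi = fit_min_max(values)
naive_train = apply_min_max(train, naive_lo, naive_hi)

# Leak-free: statistics are learned from the training rows only and then
# reused, unchanged, on the test rows.
lo, hi = fit_min_max(train)
clean_train = apply_min_max(train, lo, hi)
clean_test = apply_min_max(test, lo, hi)

print("naive train:", naive_train)
print("clean train:", clean_train)
print("clean test: ", clean_test)
```

With the naive approach, the largest training value is scaled to 0.125 because the maximum of 50.0 came from a test row. With training-only statistics it is scaled to 1.0, and the test values fall outside the range 0 to 1, as genuinely unseen data can.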
… leakage means that information is revealed to...
