How to Choose Data Preparation Methods for Machine Learning

Data preparation is an important part of a predictive modeling project.
Correct application of data preparation will transform raw data into a representation that allows learning algorithms to get the most out of the data and make skillful predictions. The problem is that choosing a transform or sequence of transforms that results in a useful representation is very challenging, so much so that it may be considered more of an art than a science.
In this tutorial, you will discover strategies that you can use to select data preparation techniques for your predictive modeling datasets.
After completing this tutorial, you will know:

Data preparation techniques can be chosen based on detailed knowledge of the dataset and algorithm; this is the most common approach.
Data preparation techniques can be grid searched as just another hyperparameter in the modeling pipeline.
Data transforms can be applied to a training dataset in parallel to create many extracted features on which feature selection can be applied and a model trained.

Let’s get started.

Photo by StockPhotosforFree, some rights reserved.

Tutorial Overview
This tutorial is divided into four parts; they are:

Strategies for Choosing Data Preparation Techniques
Approach 1: Manually Specify Data Preparation
Approach 2: Grid Search Data Preparation Methods
Approach 3: Apply Data Preparation Methods in Parallel

Strategies for Choosing Data Preparation Techniques
The performance of a machine learning model is only as good as the data used to train it.
This puts a heavy burden on the data and the techniques used to prepare it for modeling.
Data preparation refers to the techniques used to transform raw data into a form that best meets the expectations or requirements of a machine learning algorithm.
It is a challenge because we cannot know in advance which representation of the raw data will result in good or best performance of a predictive model.
However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance.
-- Page xii, Feature Engineering and Selection, 2019.
Instead, we must use controlled experiments to systematically evaluate data transforms on a model in order to discover what works well or best.
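As a rough illustration of what such a controlled experiment can look like, the sketch below evaluates the same model under two data preparation choices, using the same cross-validation folds for both. Everything in it is an assumption made for illustration: the synthetic dataset, the 1-nearest-neighbor model, and the helper names (make_data, identity_fit, minmax_fit, cross_val_accuracy) are hypothetical and do not come from the tutorial. It uses only the Python standard library; in practice a library such as scikit-learn would typically handle the pipeline and cross-validation.

```python
import random


def make_data(n=200, seed=1):
    """Synthetic two-class data (an assumption for this sketch):
    feature 0 is informative, feature 1 is noise on a much larger scale."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(label * 2.0, 1.0), rng.gauss(0.0, 1000.0)])
        y.append(label)
    return X, y


def identity_fit(X_train):
    """Baseline: no data preparation."""
    return lambda row: row


def minmax_fit(X_train):
    """Learn per-column min and range from the training rows only."""
    cols = list(zip(*X_train))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid divide-by-zero
    return lambda row: [(v - lo) / s for v, lo, s in zip(row, lows, spans)]


def predict_1nn(X_train, y_train, row):
    """Return the label of the closest training row (squared Euclidean)."""
    nearest = min(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], row)),
    )
    return y_train[nearest]


def cross_val_accuracy(X, y, fit_transform, k=5, seed=1):
    """Mean k-fold accuracy; the transform is fit inside each training fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # same seed gives the same folds
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        transform = fit_transform([X[i] for i in train])
        X_tr = [transform(X[i]) for i in train]
        y_tr = [y[i] for i in train]
        correct = sum(
            predict_1nn(X_tr, y_tr, transform(X[i])) == y[i] for i in fold
        )
        scores.append(correct / len(fold))
    return sum(scores) / k


X, y = make_data()
candidates = [("none", identity_fit), ("minmax", minmax_fit)]
results = {name: cross_val_accuracy(X, y, fit) for name, fit in candidates}
for name, acc in results.items():
    print(f"{name}: mean accuracy = {acc:.3f}")
```

The point of the design is that only the data preparation step changes between runs, while the data, the folds, and the model stay fixed, so any difference in the score can be attributed to the transform. Fitting the transform inside each training fold also keeps information from the held-out rows from leaking into the preparation step.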
As such, on a predictive modeling project, there are three main strategies we...
