Tour of Data Preparation Techniques for Machine Learning

Predictive modeling machine learning projects, such as classification and regression, always involve some form of data preparation.
The data preparation required for a dataset depends on the data itself, such as the variable types, and on the algorithms that will be used to model it, which may impose expectations or requirements on the data.
Nevertheless, there is a collection of standard data preparation algorithms that can be applied to structured data (e.g. data that forms a large table like in a spreadsheet). These data preparation algorithms can be organized or grouped by type into a framework that can be helpful when comparing and selecting techniques for a specific project.
In this tutorial, you will discover the common data preparation tasks performed in a predictive modeling machine learning project.
After completing this tutorial, you will know:

Techniques such as data cleaning can identify and fix errors in data like missing values.
Data transforms can change the scale, type, and probability distribution of variables in the dataset.
Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.
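
To make these three ideas concrete, here is a minimal sketch in plain Python. The toy table, the function names, and the specific choices (mean imputation, min-max scaling, dropping constant columns) are assumptions made for illustration only; they are just one simple instance of each kind of technique, and a real project would more likely use a library.

```python
from statistics import mean

# Hypothetical toy table: rows are examples, columns are input variables.
# None marks a missing value; the third column never changes.
rows = [
    [1.0, 200.0, 5.0],
    [2.0, None, 5.0],
    [3.0, 400.0, 5.0],
]

def impute_mean(table):
    """Data cleaning: replace each missing value with its column mean."""
    columns = list(zip(*table))
    means = [mean([v for v in col if v is not None]) for col in columns]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in table]

def min_max_scale(table):
    """Data transform: rescale each column to the range [0, 1]."""
    columns = list(zip(*table))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    # A constant column has no spread, so it is mapped to 0.0.
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in table]

def drop_constant_columns(table):
    """Feature selection: keep only columns that take more than one value."""
    columns = list(zip(*table))
    keep = [j for j, col in enumerate(columns) if len(set(col)) > 1]
    return [[row[j] for j in keep] for row in table]

cleaned = impute_mean(rows)
scaled = min_max_scale(cleaned)
selected = drop_constant_columns(scaled)
print(selected)
```

Each function takes a table and returns a new one, so the steps can be reordered or swapped for other techniques without touching the rest.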

Let’s get started.

Tour of Data Preparation Techniques for Machine Learning. Photo by Nicolas Raymond, some rights reserved.

Tutorial Overview
This tutorial is divided into six parts; they are:

Common Data Preparation Tasks
Data Cleaning
Feature Selection
Data Transforms
Feature Engineering
Dimensionality Reduction

Common Data Preparation Tasks
We can define data preparation as the transformation of raw data into a form that is more suitable for modeling.
Nevertheless, there are steps in a predictive modeling project before and after the data preparation step that are important and inform the data preparation that is to be performed.
The process of applied machine learning consists of a sequence of steps.
We may jump back and forth between the steps for any given project, but all projects have the same general steps; they are:

Step 1: Define Problem.
Step 2: Prepare Data.
Step 3: Evaluate Models.
Step 4: Finalize Model.

We are concerned with the data preparation step (Step 2), which has a set of common or standard tasks that you may use or explore in a machine learning project.
The types of data preparation performed depend on your data, as you might expect.
Nevertheless, as you work through multiple predictive modeling projects, you will see and require the same types of data preparation tasks again and again.
These tasks include:

Data Cleaning: Identifying and correcting mistakes or errors in the data.

Feature Selection: Identifying those input variables that...