Why Data Preparation Is So Important in Machine Learning


On a predictive modeling project, machine learning algorithms learn a mapping from input variables to a target variable.
The most common form of predictive modeling project involves so-called structured data or tabular data. This is data as it looks in a spreadsheet or a matrix, with rows of examples and columns of features for each example.
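As a minimal sketch of what rows of examples and columns of features look like in code, the snippet below builds a tiny table in plain Python. The column names, the values, and the helper `column` are made up for illustration and do not come from any real dataset or library.

```python
# A tiny, made-up table: each row is one example, each column is one feature.
columns = ["age", "income", "city"]
rows = [
    [25, 48000.0, "Paris"],
    [41, 72000.0, "Lyon"],
    [33, 55000.0, "Paris"],
]

n_rows = len(rows)      # number of examples
n_cols = len(columns)   # number of features per example


def column(name):
    """Return every value of one feature, from the first row to the last."""
    index = columns.index(name)
    return [row[index] for row in rows]


print(f"{n_rows} examples x {n_cols} features")
print("age column:", column("age"))
```

In practice a table like this would usually be loaded from a file into a dedicated structure such as an array or data frame; the nested list is used here only to keep the sketch free of dependencies.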
We cannot fit and evaluate machine learning algorithms on raw data; instead, we must transform the data to meet the requirements of individual machine learning algorithms. More than that, we must choose a representation of the data that best exposes the unknown underlying structure of the prediction problem to the learning algorithms. This is how we get the best performance from a predictive modeling project given our available resources.
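To make the idea of transforming raw data concrete, here is a hedged sketch in plain Python that turns one text column into integer codes and rescales one numeric column to the range 0 to 1. The helper names `ordinal_encode` and `min_max_scale` and the sample values are hypothetical, and these two transforms are only examples; which transforms a project needs depends on the algorithm and the data.

```python
def ordinal_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    codes = {value: code for code, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


def min_max_scale(values):
    """Rescale numbers linearly so the smallest is 0.0 and the largest is 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Made-up raw columns, for illustration only.
raw_city = ["Paris", "Lyon", "Paris"]
raw_income = [48000.0, 72000.0, 60000.0]

city_codes, mapping = ordinal_encode(raw_city)
income_scaled = min_max_scale(raw_income)

print(mapping)
print(city_codes)
print(income_scaled)
```

Integer codes imply an ordering that city names do not have, so a one-hot encoding is often preferred for columns like this one. The integer version is used here only to keep the sketch short.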
Given that we have standard implementations of highly parameterized machine learning algorithms in open source libraries, fitting models has become routine. As such, the most challenging part of each predictive modeling project is how to prepare the one thing that is unique to the project: the data used for modeling.
In this tutorial, you will discover the importance of data preparation for each machine learning project.
After completing this tutorial, you will know:

Structured data in machine learning consists of rows and columns in one large table.
Data preparation is a required step in each machine learning project.
Because fitting machine learning algorithms has become routine, the majority of effort on each project is spent on data preparation.

Let’s get started.

Why Data Preparation Is So Important in Machine Learning. Photo by lwtt93, some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

What Is Data in Machine Learning
Raw Data Must Be Prepared
    Machine Learning Algorithms Expect Numbers
    Machine Learning Algorithms Have Requirements
    Model Performance Depends on Data
Predictive Modeling Is Mostly Data Preparation

What Is Data in Machine Learning
Predictive modeling projects involve learning from data.
Data refers to examples or cases from the domain that characterize the problem you want to solve. In supervised learning, data is composed of examples where each example has an input element that will be provided to a model and an output or target element that the model is expected to predict.
What we call data are observations of real-world phenomena. […] Each piece of data provides a small window into a limited aspect of reality.
— Page 1, Feature Engineering for Machine Learning , 2018.
Classification is an example of a supervised learning problem where the target is a label, and regression is an example of a supervised learning problem where the target is a number.
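The split of each example into an input element and a target element can be sketched as follows, assuming the target is stored in the last column of each row. The data and the helper `split_inputs_target` are hypothetical, chosen only to show one label target (classification) and one numeric target (regression).

```python
def split_inputs_target(rows):
    """Split rows into inputs X (all but the last column) and target y (last column)."""
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y


# Classification: the target in the last column is a label (made-up values).
flowers = [
    [5.1, 3.5, "setosa"],
    [6.2, 2.9, "versicolor"],
]

# Regression: the target in the last column is a number (made-up values).
houses = [
    [3, 120.0, 250000.0],
    [2, 80.0, 180000.0],
]

X_cls, y_cls = split_inputs_target(flowers)
X_reg, y_reg = split_inputs_target(houses)

print("classification target:", y_cls)
print("regression target:", y_reg)
```

The same split applies to both problem types; only the kind of value held in the target differs.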
The input data may have many forms, such as an image,...
