What Is Data Preparation in a Machine Learning Project


Data preparation may be one of the most difficult steps in any machine learning project.
The reason is that each dataset is different and highly specific to the project. Nevertheless, there are enough commonalities across predictive modeling projects that we can define a loose sequence of steps and subtasks that you are likely to perform.
This process provides a context for thinking about the data preparation a project requires. That preparation is informed both by the project definition that comes before it and by the evaluation of machine learning algorithms that comes after it.
In this tutorial, you will discover how to consider data preparation as a step in a broader predictive modeling machine learning project.
After completing this tutorial, you will know:

Each predictive modeling project with machine learning is different, but there are common steps performed on each project.
Data preparation involves best exposing the unknown underlying structure of the problem to learning algorithms.
The steps before and after data preparation in a project can inform what data preparation methods to apply, or at least explore.

Let’s get started.

Photo by dashll, some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

Applied Machine Learning Process
What Is Data Preparation
How to Choose Data Preparation Techniques

Applied Machine Learning Process
Each machine learning project is different because the specific data at the core of the project is different.
You may be the first person (ever!) to work on the specific predictive modeling problem. That does not mean that others have not worked on similar prediction tasks or perhaps even the same high-level task, but you are the first to use the specific data that you have collected (unless you are using a standard dataset for practice).
… the right features can only be defined in the context of both the model and the data; since data and models are so diverse, it’s difficult to generalize the practice of feature engineering across projects.
— Page vii, Feature Engineering for Machine Learning, 2018.
This makes each machine learning project unique. No one can tell you what the best results are or might be, or what algorithms to use to achieve them. You must establish a baseline in performance as a point of reference against which to compare all of your models, and you must discover which algorithm works best for your specific dataset.
You are not alone. The vast literature on applied machine learning that has come before can suggest techniques for robustly evaluating your models, as well as algorithms worth evaluating.
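To make the idea of a performance baseline concrete, here is a minimal sketch that uses only the Python standard library. Everything in it is an assumption made for illustration and does not come from the tutorial: the synthetic two-class dataset, the helper names (make_dataset, k_fold_indices, cross_val_accuracy), the choice of a majority-class predictor as the baseline, and the 1-nearest-neighbor model as the candidate. In a real project you would more likely reach for a library such as scikit-learn.

```python
import random
from collections import Counter
from statistics import mean


def make_dataset(n=200, seed=1):
    """Hypothetical two-class data: class 1 is shifted by +2 on both features."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so the classes are balanced
        features = [rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)]
        rows.append((features, label))
    rng.shuffle(rows)
    return rows


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_class_model(train):
    """Baseline: always predict the most common training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common


def nearest_neighbor_model(train):
    """Candidate: predict the label of the closest training row (1-NN)."""
    def predict(features):
        def sq_dist(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], features))
        return min(train, key=sq_dist)[1]
    return predict


def cross_val_accuracy(build_model, rows, k=5):
    """Mean accuracy of a model builder over k held-out folds."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        predict = build_model(train)
        correct = sum(predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return mean(scores)


if __name__ == "__main__":
    data = make_dataset()
    baseline = cross_val_accuracy(majority_class_model, data)
    candidate = cross_val_accuracy(nearest_neighbor_model, data)
    print(f"baseline accuracy:  {baseline:.3f}")
    print(f"candidate accuracy: {candidate:.3f}")
```

The baseline score is the point of reference: a candidate model is only interesting if it clearly beats it under the same evaluation procedure. Using the same folds for both models keeps the comparison fair.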
Even though your project is unique, the steps on the path to a good or even the best result are generally...
