How to Use Discretization Transforms for Machine Learning


Numerical input variables may have a highly skewed or non-standard distribution.
This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.
Many machine learning algorithms prefer or perform better when numerical input variables have a standard probability distribution.
The discretization transform provides an automatic way to map a numeric input variable onto discrete, ordered categories, changing its distribution in a way that can be used as input to a predictive model.
In this tutorial, you will discover how to use discretization transforms to map numerical values to discrete categories for machine learning.
After completing this tutorial, you will know:

Many machine learning algorithms prefer or perform better when numerical variables with non-standard probability distributions are made discrete.
Discretization transforms are a technique for transforming numerical input or output variables to have discrete ordinal labels.
How to use the KBinsDiscretizer to change the structure and distribution of numeric variables to improve the performance of predictive models.
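As a preview, the basic workflow with the KBinsDiscretizer is a fit/transform pattern. The sketch below is illustrative: the bin count, encoding, and synthetic Gaussian data are assumptions for demonstration, not values prescribed by this tutorial.

```python
# Minimal sketch: discretize a numeric variable into ordinal bins
# using scikit-learn's KBinsDiscretizer.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# synthetic Gaussian input variable (illustrative data)
rng = np.random.RandomState(1)
data = rng.randn(1000, 1)

# map each value to one of 5 equal-width bins, labeled 0..4
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
binned = disc.fit_transform(data)

print(binned.min(), binned.max())  # labels span 0.0 to 4.0
```

The `strategy` argument controls how bin edges are chosen ('uniform', 'quantile', or 'kmeans'), which is the focus of the later sections.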

Let’s get started.

Photo by Kate Russell, some rights reserved.

Tutorial Overview
This tutorial is divided into six parts; they are:

Change Data Distribution
Discretization Transforms
Sonar Dataset
Uniform Discretization Transform
K-means Discretization Transform
Quantile Discretization Transform

Change Data Distribution
Some machine learning algorithms may prefer or require categorical or ordinal input variables, such as some decision tree and rule-based algorithms.
Some classification and clustering algorithms deal with nominal attributes only and cannot handle ones measured on a numeric scale.
— Page 296, Data Mining: Practical Machine Learning Tools and Techniques , 4th edition, 2016.
Further, the performance of many machine learning algorithms degrades for variables that have non-standard probability distributions.
This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.
Some input variables may have a highly skewed distribution, such as an exponential distribution where the most common observations are bunched together. Some input variables may have outliers that cause the distribution to be highly spread.
These concerns and others, like non-standard distributions and multi-modal distributions, can make a dataset challenging to model with a range of machine learning models.
As such, it is often desirable to transform each input variable to have a standard probability distribution.
One approach is to use a transform of the numerical variable...
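To make the motivation concrete, the sketch below generates a highly skewed (exponential) variable and discretizes it with a quantile strategy, which assigns roughly equal numbers of observations to each bin and so removes the skew. The data and parameter choices are illustrative assumptions, not taken from the tutorial's dataset.

```python
# Sketch: a skewed exponential variable mapped to discrete ordinal bins.
# With strategy='quantile', each bin receives roughly the same number
# of observations, flattening the skewed distribution.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# synthetic exponential variable: most values bunched near zero
rng = np.random.RandomState(1)
skewed = rng.exponential(scale=1.0, size=(1000, 1))

disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
labels = disc.fit_transform(skewed)

# count observations per bin: approximately 100 in each of the 10 bins
counts = np.bincount(labels.ravel().astype(int))
print(counts)
```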
