How to Use Power Transforms With scikit-learn


Machine learning algorithms like Linear Regression and Gaussian Naive Bayes assume the numerical variables have a Gaussian probability distribution.
Your data may not have a Gaussian distribution and instead may have a Gaussian-like distribution (e.g. nearly Gaussian but with outliers or a skew) or a totally different distribution (e.g. exponential).
As such, you may be able to achieve better performance on a wide range of machine learning algorithms by transforming input and/or output variables to have a Gaussian or more-Gaussian distribution. Power transforms like the Box-Cox transform and the Yeo-Johnson transform provide an automatic way of performing these transforms on your data and are provided in the scikit-learn Python machine learning library.
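As a quick orientation, scikit-learn exposes both transforms through a single class, PowerTransformer. A minimal sketch, using illustrative synthetic data (the variable names and the exponential sample are assumptions for the example, not part of any particular dataset):

```python
# Minimal sketch: apply a power transform with scikit-learn's PowerTransformer.
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Illustrative skewed data: 100 samples, 1 feature, drawn from an exponential.
rng = np.random.RandomState(1)
data = rng.exponential(size=(100, 1))

# method can be "yeo-johnson" (works with any values, the default)
# or "box-cox" (requires strictly positive values).
# standardize=True additionally zero-means and unit-scales the output.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(data)

print(transformed.shape)  # (100, 1)
print(pt.lambdas_)        # the fitted lambda parameter for each feature
```

The transform fits one lambda parameter per feature by maximum likelihood, so the same fitted object can be reused to transform new data consistently via pt.transform().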
In this tutorial, you will discover how to use power transforms in scikit-learn to make variables more Gaussian for modeling.
After completing this tutorial, you will know:

Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian probability distribution.
Power transforms are a technique for transforming numerical input or output variables to have a Gaussian or more-Gaussian-like probability distribution.
How to use the PowerTransformer in scikit-learn to apply the Box-Cox and Yeo-Johnson transforms when preparing data for predictive modeling.

Let’s get started.

How to Use Power Transforms With scikit-learn. Photo by Ian D. Keating, some rights reserved.

Tutorial Overview
This tutorial is divided into five parts; they are:

Make Data More Gaussian
Power Transforms
Sonar Dataset
Box-Cox Transform
Yeo-Johnson Transform

Make Data More Gaussian
Many machine learning algorithms perform better when the distribution of variables is Gaussian.
Recall that the observations for each variable may be thought of as being drawn from a probability distribution. The Gaussian is a common distribution with the familiar bell shape. It is so common that it is often referred to as the “normal” distribution.
For more on the Gaussian probability distribution, see the tutorial:

Continuous Probability Distributions for Machine Learning

Some algorithms like linear regression and logistic regression explicitly assume the real-valued variables have a Gaussian distribution. Other nonlinear algorithms may not have this assumption, yet often perform better when variables have a Gaussian distribution.
This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.
There are data preparation techniques that can be used to transform each variable to make the distribution Gaussian, or if not Gaussian, then more Gaussian like.
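One way to see the effect is to measure the skew of a variable before and after transforming it. The sketch below uses an exponential (strongly right-skewed) sample as a stand-in for real data; the sample and variable names are illustrative assumptions:

```python
# Sketch: verify that a power transform makes a skewed variable more Gaussian-like
# by comparing skewness before and after. The exponential sample is illustrative.
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(7)
raw = rng.exponential(scale=2.0, size=(1000, 1))  # right-skewed, strictly positive

# Box-Cox is applicable here because the data is strictly positive.
pt = PowerTransformer(method="box-cox")
gaussian_like = pt.fit_transform(raw)

print(round(float(skew(raw.ravel())), 2))            # strongly positive skew
print(round(float(skew(gaussian_like.ravel())), 2))  # much closer to zero
```

A skewness near zero after the transform indicates the distribution has become more symmetric, one of the hallmarks of a Gaussian shape.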
These transforms are most effective when...
