How to Scale Data With Outliers for Machine Learning
Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.
This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.
Standardizing is a popular scaling technique that subtracts the mean from values and divides by the standard deviation, transforming the probability distribution for an input variable to a standard Gaussian (zero mean and unit variance). Standardization can become skewed or biased if the input variable contains outlier values.
To overcome this, the median and interquartile range can be used when standardizing numerical input variables, generally referred to as robust scaling.
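The idea can be sketched directly in NumPy: subtract the median and divide by the interquartile range (the difference between the 75th and 25th percentiles). The data values below are hypothetical, chosen so that one obvious outlier is present.

```python
import numpy as np

# hypothetical 1D data containing one extreme outlier (1000.0)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

# robust scaling: center on the median, scale by the interquartile range
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
scaled = (data - median) / iqr
```

Because the median and IQR ignore the extreme value, the bulk of the data lands in a small range around zero, while the outlier remains large but no longer distorts the scaling of the other values.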
In this tutorial, you will discover how to use robust scaler transforms to standardize numerical input variables for classification and regression.
After completing this tutorial, you will know:
Many machine learning algorithms prefer or perform better when numerical input variables are scaled.
Robust scaling techniques that use percentiles can be used to scale numerical input variables that contain outliers.
How to use the RobustScaler to scale numerical input variables using the median and interquartile range.
Let’s get started.
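As a quick preview, scikit-learn provides this technique as the RobustScaler class; a minimal sketch on a small hypothetical dataset might look like this:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# hypothetical single-column dataset with one outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# by default, RobustScaler centers on the median and
# scales by the interquartile range (25th-75th percentiles)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```

Here the median value (3.0) is mapped to zero, and the non-outlier values fall in a narrow range around it.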

This tutorial is divided into four parts; they are:
Robust Scaling Data
Robust Scaler Transforms
IQR Robust Scaler Transform
Explore Robust Scaler Range
Robust Scaling Data
It is common to scale data prior to fitting a machine learning model.
This is because data often consists of many different input variables or features (columns) and each may have a different range of values or units of measure, such as feet, miles, kilograms, dollars, etc.
If there are input variables that have very large values relative to the other input variables, these large values can dominate or skew some machine learning algorithms. The result is that the algorithms pay most of their attention to the large values and ignore the variables with smaller values.
This includes algorithms that use a weighted sum of inputs like linear regression, logistic regression, and artificial neural networks, as well as algorithms that use distance measures between examples, such as k-nearest neighbors and support vector machines.
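The effect on distance-based algorithms is easy to demonstrate. In the sketch below, the feature values (an income-like column in the tens of thousands and an age-like column in the tens) are hypothetical, chosen only to show how the large-scale feature dominates the Euclidean distance.

```python
import numpy as np

# two hypothetical examples: feature 1 on a large scale, feature 2 on a small scale
a = np.array([50000.0, 25.0])
b = np.array([51000.0, 60.0])

# Euclidean distance between the two examples
dist = np.sqrt(np.sum((a - b) ** 2))

# squared contribution of each feature to the distance:
# feature 1 contributes 1000**2 = 1,000,000
# feature 2 contributes 35**2  = 1,225
```

The large-scale feature accounts for nearly all of the distance, so a nearest-neighbor model would effectively ignore the second feature until both are scaled to a common range.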
As such, it is normal to scale input variables to a common range as a data preparation technique prior to fitting a model.
One approach to data scaling involves calculating the mean and standard deviation of each variable and using these values to scale the values to have a mean of zero and a standard deviation of one, a so-called “standard normal”...
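This standard scaling can be sketched in a few lines of NumPy; the data values here are hypothetical and contain no outliers, which is the case where this approach works well.

```python
import numpy as np

# hypothetical data with no outliers
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# standardization: subtract the mean, divide by the standard deviation
standardized = (data - data.mean()) / data.std()
```

After the transform, the values have zero mean and unit standard deviation. The weakness, as discussed above, is that a single extreme value inflates both the mean and the standard deviation, distorting the scaling of every other value.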