How to Create Custom Data Transforms for Scikit-Learn
The scikit-learn Python library for machine learning offers a suite of data transforms for changing the scale and distribution of input data, as well as removing input features (columns).
Many simple data cleaning operations, such as removing outliers and removing columns with few observations, are often performed manually on the data and require custom code.
The scikit-learn library provides a standard way to wrap these custom data transforms so they can be used just like any other transform, either directly on data or as part of a modeling pipeline.
In this tutorial, you will discover how to define and use custom data transforms for scikit-learn.
After completing this tutorial, you will know:
That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
How to develop and apply a custom transform to remove columns with few unique values.
How to develop and apply a custom transform that replaces outliers for each column.
Let’s get started.
Photo by Berit Watkin, some rights reserved.
Tutorial Overview
This tutorial is divided into four parts; they are:
Custom Data Transforms in Scikit-Learn
Oil Spill Dataset
Custom Transform to Remove Columns
Custom Transform to Replace Outliers
Custom Data Transforms in Scikit-Learn
Data preparation refers to changing the raw data in some way that makes it more appropriate for predictive modeling with machine learning algorithms.
The scikit-learn Python machine learning library offers many different data preparation techniques directly, such as techniques for scaling numerical input variables and changing the probability distribution of variables.
These transforms can be fit and then applied to a dataset, or used as part of a predictive modeling pipeline. A pipeline allows a sequence of transforms to be applied correctly, without data leakage, when evaluating model performance with data sampling techniques such as k-fold cross-validation.
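To make the pipeline idea concrete, here is a minimal sketch rather than a definitive recipe. The synthetic dataset, the choice of MinMaxScaler and LogisticRegression, and the step names "scale" and "model" are all assumptions made for illustration. Because the scaler sits inside the pipeline, it is fit only on the training folds of each cross-validation split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic classification data (assumed here only for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# chain a built-in transform and a model; the step names are arbitrary
pipeline = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# evaluate the whole pipeline with 5-fold cross-validation;
# the scaler is re-fit on the training portion of every fold
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print("Mean accuracy: %.3f" % scores.mean())
```

Fitting the scaler on the full dataset before splitting would let statistics from the test folds leak into training, which is the problem the pipeline avoids.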
Although the data preparation techniques available in scikit-learn are extensive, your project may require additional data preparation steps.
Typically, these additional steps are performed manually prior to modeling and require writing custom code. The risk is that these data preparation steps may be performed inconsistently.
The solution is to create a custom data transform in scikit-learn using the FunctionTransformer class.
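As a minimal sketch of how FunctionTransformer can wrap custom code, the example below uses a hypothetical function, log_transform, and a small made-up array. Neither comes from the tutorial's datasets; they are assumptions chosen only to show the mechanics.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# hypothetical custom operation: log(1 + x) applied to every value
def log_transform(X):
    return np.log1p(X)

# toy input data (assumed for illustration)
X = np.array([[0.0, 1.0], [9.0, 99.0]])

# wrap the function so it exposes the standard fit/transform interface
transformer = FunctionTransformer(log_transform)
X_new = transformer.fit_transform(X)
print(X_new)
```

The wrapped object can then be applied to data directly, as above, or added as a step in a Pipeline in the same way as a built-in transform.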
This class allows you to specify a function that is called to transform the data. You can define the function and perform any valid change, such as...