How to Perform Feature Selection With Numerical Input Data


Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.
Feature selection is often straightforward when working with real-valued input and output data, such as using Pearson’s correlation coefficient, but it can be challenging when working with numerical input data and a categorical target variable.
The two most commonly used feature selection methods for numerical input data when the target variable is categorical (e.g. classification predictive modeling) are the ANOVA f-test statistic and the mutual information statistic.
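Both statistics are available in scikit-learn through the same `SelectKBest` interface, so switching between them is a one-line change. As a minimal sketch (using a synthetic dataset in place of the real one):

```python
# Sketch: scoring numerical features against a categorical target with
# scikit-learn's SelectKBest, using the ANOVA f-test (f_classif) and
# mutual information (mutual_info_classif) scoring functions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# synthetic data standing in for a real dataset: 8 numerical inputs
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=1)

# ANOVA f-test: keep the 4 highest-scoring features
fs_anova = SelectKBest(score_func=f_classif, k=4)
X_anova = fs_anova.fit_transform(X, y)

# Mutual information: same interface, different statistic
fs_mi = SelectKBest(score_func=mutual_info_classif, k=4)
X_mi = fs_mi.fit_transform(X, y)

print(X.shape, X_anova.shape, X_mi.shape)  # → (200, 8) (200, 4) (200, 4)
```

The choice of `k=4` here is arbitrary; tuning the number of selected features is covered at the end of the tutorial.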
In this tutorial, you will discover how to perform feature selection with numerical input data for classification.
After completing this tutorial, you will know:

The diabetes predictive modeling problem with numerical inputs and a binary classification target variable.
How to evaluate the importance of numerical features using the ANOVA f-test and mutual information statistics.
How to perform feature selection for numerical data when fitting and evaluating a classification model.

Let’s get started.

How to Perform Feature Selection With Numerical Input Data Photo by Susanne Nilsson , some rights reserved.

Tutorial Overview
This tutorial is divided into four parts; they are:

Diabetes Numerical Dataset
Numerical Feature Selection

ANOVA f-test Feature Selection
Mutual Information Feature Selection

Modeling With Selected Features

Model Built Using All Features
Model Built Using ANOVA f-test Features
Model Built Using Mutual Information Features

Tune the Number of Selected Features

Diabetes Numerical Dataset
As the basis of this tutorial, we will use the so-called “diabetes” dataset that has been widely studied as a machine learning dataset since 1990.
The dataset classifies patients’ data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem.
A naive model can achieve an accuracy of about 65 percent on this dataset. A good score is about 77 percent +/- 5 percent. We will aim for this region but note that the models in this tutorial are not optimized; they are designed to demonstrate feature selection schemes.
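The naive baseline here corresponds to always predicting the majority class. A quick sketch of how to confirm such a baseline with scikit-learn’s `DummyClassifier` (using placeholder data with a roughly 65/35 class split, since the real file is assumed but not loaded here):

```python
# Sketch: estimating a majority-class baseline with DummyClassifier.
# The arrays below are placeholders; in the tutorial proper, X and y
# would come from the loaded diabetes dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((768, 8))                          # placeholder features
y = rng.choice([0, 1], size=768, p=[0.65, 0.35])  # ~65/35 class balance

# always predict the most frequent class in the training folds
model = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(model, X, y, scoring="accuracy", cv=10)
print("Baseline accuracy: %.3f" % scores.mean())
```

Any model worth keeping should beat this baseline accuracy.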
You can download the dataset and save the file as “pima-indians-diabetes.csv” in your current working directory.

Diabetes Dataset (pima-indians-diabetes.csv)
Diabetes Dataset Description (pima-indians-diabetes.names)

Looking at the data, we can see that all eight input variables are numerical; the ninth column is the binary class label.
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...
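The file has no header row, so loading it and splitting inputs from the target is straightforward with pandas. A self-contained sketch using the sample rows shown above (in practice you would pass the filename “pima-indians-diabetes.csv” to `read_csv` instead):

```python
# Sketch: loading the CSV and splitting the eight input columns from
# the class label. A StringIO of the sample rows keeps this runnable
# without the downloaded file.
from io import StringIO
import pandas as pd

sample = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)
data = pd.read_csv(sample, header=None)  # the file has no header row
X = data.values[:, :-1]   # eight numerical input columns
y = data.values[:, -1]    # binary class label in the last column
print(X.shape, y.shape)   # → (5, 8) (5,)
```

With the full file loaded, X would have shape (768, 8) and y shape (768,).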
We...
