Linear Discriminant Analysis for Dimensionality Reduction in Python


Reducing the number of input variables for a predictive model is referred to as dimensionality reduction.
Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data.
Linear Discriminant Analysis, or LDA for short, is a predictive modeling algorithm for multi-class classification. It can also be used as a dimensionality reduction technique, providing a projection of a training dataset that best separates the examples by their assigned class.
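As a quick taste of what this looks like in practice, the sketch below uses scikit-learn's LinearDiscriminantAnalysis as a transform on a synthetic dataset; the dataset sizes and random seed are illustrative choices, not values from this tutorial.

```python
# A minimal sketch of LDA as a class-aware projection (assumes scikit-learn
# is installed; the dataset parameters below are arbitrary illustrative values).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic 3-class classification dataset with 10 input features.
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)

# LDA can project onto at most (n_classes - 1) components, here 2.
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)
print(X_proj.shape)  # 100 rows are kept, but each now has only 2 columns
```

Note that, unlike unsupervised techniques such as PCA, the projection is fit using the class labels, which is why fit_transform receives both X and y.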
The ability to use Linear Discriminant Analysis for dimensionality reduction often surprises practitioners.
In this tutorial, you will discover how to use LDA for dimensionality reduction when developing predictive models.
After completing this tutorial, you will know:

Dimensionality reduction involves reducing the number of input variables or columns in modeling data.
LDA is a technique for multi-class classification that can be used to automatically perform dimensionality reduction.
How to evaluate predictive models that use an LDA projection as input and make predictions with new raw data.

Let’s get started.

Linear Discriminant Analysis for Dimensionality Reduction in Python. Photo by Kimberly Vardeman, some rights reserved.

Tutorial Overview
This tutorial is divided into four parts; they are:

Dimensionality Reduction
Linear Discriminant Analysis
LDA Scikit-Learn API
Worked Example of LDA for Dimensionality Reduction

Dimensionality Reduction
Dimensionality reduction refers to reducing the number of input variables for a dataset.
If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable. Input variables are also called features.
We can consider the columns of data as representing dimensions of an n-dimensional feature space and the rows of data as points in that space. This is a useful geometric interpretation of a dataset.
In a dataset with k numeric attributes, you can visualize the data as a cloud of points in k-dimensional space …
— Page 305, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
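This geometric view can be made concrete with a tiny sketch (assuming NumPy; the values are made up for illustration):

```python
# Rows are points, columns are dimensions of the feature space.
import numpy as np

# A toy dataset: 4 examples (rows) described by 3 features (columns),
# i.e. a cloud of 4 points in 3-dimensional space.
data = np.array([[2.1, 0.5, 3.3],
                 [1.8, 0.7, 2.9],
                 [5.0, 1.2, 0.4],
                 [4.7, 1.1, 0.6]])

n_points, n_dims = data.shape
print(n_points, n_dims)  # 4 points in a 3-dimensional feature space
```

Dimensionality reduction, then, is about shrinking the number of columns while keeping every row as a point in the smaller space.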
Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that we have in that space (rows of data) often represent a small and non-representative sample.
This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as the “curse of dimensionality.”
Therefore, it is often desirable to reduce the number of input features. This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”
