Ordinal and One-Hot Encodings for Categorical Data

Many machine learning models require all input and output variables to be numeric.
This means that if your data contains categorical variables, you must encode them as numbers before you can fit and evaluate a model.
The two most popular techniques are ordinal encoding and one-hot encoding.
In this tutorial, you will discover how to use encoding schemes for categorical machine learning data.
After completing this tutorial, you will know:

Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
How to use ordinal encoding for categorical variables that have a natural rank ordering.
How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Let’s get started.

Ordinal and One-Hot Encoding Transforms for Machine Learning. Photo by Felipe Valduga, some rights reserved.

Tutorial Overview
This tutorial is divided into six parts; they are:

Nominal and Ordinal Variables
Encoding Categorical Data
    Ordinal Encoding
    One-Hot Encoding
    Dummy Variable Encoding
Breast Cancer Dataset
OrdinalEncoder Transform
OneHotEncoder Transform
Common Questions

Nominal and Ordinal Variables
Numerical data, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values.
Categorical data are variables that contain label values rather than numeric values.
The number of possible values is often limited to a fixed set.
Categorical variables are often called nominal.
Some examples include:

A “pet” variable with the values: “dog” and “cat”.
A “color” variable with the values: “red”, “green”, and “blue”.
A “place” variable with the values: “first”, “second”, and “third”.

Each value represents a different category.
Some categories may have a natural relationship to each other, such as a natural ordering.
The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.
A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
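
As a rough illustration of discretization, the sketch below bins values between 1 and 10 into the five ordered labels from the example. The function name discretize and the choice to return both a bin index and a label are assumptions made for illustration; they are not part of any library.

```python
# A minimal sketch of discretization using only the standard library.
# Assumption: values lie between 1 and 10 and the five bins match the
# labels used in the text (1-2, 3-4, 5-6, 7-8, 9-10).
from bisect import bisect_right

# Lower edges of the second through fifth bins.
edges = [3, 5, 7, 9]
labels = ["1-2", "3-4", "5-6", "7-8", "9-10"]


def discretize(value):
    """Return (bin_index, bin_label) for a value between 1 and 10.

    The bin index preserves the ordinal relationship between bins:
    a larger index always corresponds to a larger range of values.
    """
    if not 1 <= value <= 10:
        raise ValueError("value must be between 1 and 10")
    # bisect_right counts how many edges are <= value, which is the bin index.
    index = bisect_right(edges, value)
    return index, labels[index]


values = [1, 2, 3, 6, 9, 10]
binned = [discretize(v) for v in values]
print(binned)
```

Because the bin indices increase with the underlying values, the resulting variable keeps a ranked ordering and can be treated as ordinal.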

Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.

Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.
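
To make the distinction concrete, here is a small sketch in plain Python that encodes the ordinal “place” variable as ranked integers and the nominal “color” variable as one binary column per category. The rank order of the places and the alphabetical ordering of the color columns are assumptions chosen for illustration, and the helper one_hot is hypothetical rather than a library function.

```python
# Ordinal variable: the categories have a rank order, so integer codes
# that preserve that order carry meaning.
place_order = ["first", "second", "third"]  # assumed rank order
place_to_rank = {label: rank for rank, label in enumerate(place_order)}

places = ["third", "first", "second", "first"]
ordinal_encoded = [place_to_rank[p] for p in places]
print(ordinal_encoded)  # [2, 0, 1, 0]

# Nominal variable: there is no order between categories, so each category
# gets its own binary column instead of a single ranked integer.
color_categories = sorted({"red", "green", "blue"})  # ['blue', 'green', 'red']


def one_hot(value, categories):
    """Return a list with a 1 in the position of the matching category."""
    return [1 if value == category else 0 for category in categories]


colors = ["red", "green", "blue"]
one_hot_encoded = [one_hot(c, color_categories) for c in colors]
print(one_hot_encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Comparing the integer codes of two places (for example, 0 < 2) reflects a real ranking, whereas the column order of the color vectors is arbitrary and implies no ranking.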

Some algorithms can work with categorical data directly.
