kNN Imputation for Missing Values in Machine Learning

Datasets may have missing values, and this can cause problems for many machine learning algorithms.
As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.
A popular approach to missing data imputation is to use a model to predict the missing values. This requires a model to be created for each input variable that has missing values. Although any one among a range of different models can be used to predict the missing values, the k-nearest neighbor (KNN) algorithm has proven to be generally effective, often referred to as “ nearest neighbor imputation .”
In this tutorial, you will discover how to use nearest neighbor imputation strategies for missing data in machine learning.
After completing this tutorial, you will know:

Missing values must be marked with NaN values and can be replaced with nearest neighbor estimated values.
How to load a CSV file with missing values and mark the missing values with NaN values and report the number and percentage of missing values for each column.
How to impute missing values with nearest neighbor models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.

Let’s get started.

kNN Imputation for Missing Values in Machine Learning Photo by portengaround , some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

k-Nearest Neighbor Imputation
Horse Colic Dataset
Nearest Neighbor Imputation With KNNImputer

KNNImputer Data Transform
KNNImputer and Model Evaluation
KNNImputer and Different Number of Neighbors
KNNImputer Transform When Making a Prediction

k-Nearest Neighbor Imputation
A dataset may have missing values.
These are rows of data where one or more values or columns in that row are not present. The values may be missing completely or they may be marked with a special character or value, such as a question mark “ ? “.
Values could be missing for many reasons, often specific to the problem domain, and might include reasons such as corrupt measurements or unavailability.
Most machine learning algorithms require numeric input values, and a value to be present for each row and column in a dataset. As such, missing values can cause problems for machine learning algorithms.
It is common to identify missing values in a dataset and replace them with a numeric value. This is called data imputing, or missing data imputation.
… missing data can be imputed. In this case, we can use information in the training set predictors to, in essence, estimate the values of other predictors.
— Page 42, Applied Predictive Modeling , 2013.
An effective...