Handling Missing Data For Advanced Machine Learning
Throughout this article, you will learn to spot, understand, and impute missing data. We demonstrate several imputation techniques on a real-world logistic regression task in Python. Handling missing data properly improves both inferences and predictions, so it should not be ignored.
The first part of this article presents the framework for understanding missing data. Later we demonstrate the most popular strategies in dealing with missingness on a classification task to predict the onset of diabetes.
Missing data is hard to avoid
A considerable part of any data science or machine learning job is data cleaning, and data collected in the real world often arrives with missing values.
To understand why data goes missing, let's simulate a dataset with two predictors, x1 and x2, and a response variable y.
We will then artificially remove some of its values to illustrate the reasons why many real-world datasets contain missing entries.
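As a concrete setup, a minimal simulation might look like the following (the specific distributions and coefficients are illustrative assumptions, chosen only so that x1, x2, and y exist for the later snippets):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)          # fixed seed for reproducibility

n = 100
x1 = rng.normal(0.0, 1.0, size=n)        # continuous predictor
x2 = rng.choice([0, 1], size=n)          # binary predictor
# response: a simple linear signal plus noise (illustrative choice)
y = 2.0 * x1 + 3.0 * x2 + rng.normal(0.0, 0.5, size=n)
```

Any data-generating process with one continuous and one binary predictor would serve equally well here.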
There are three major types of missing values to be aware of.
Missing Completely at Random (MCAR)
MCAR occurs when the probability of missing values in a variable is the same for all samples.
For example, in a survey, values may be lost at random during data entry, or a respondent may skip a question for reasons unrelated to any variable in the data.
MCAR does not bias the inferences made by a model trained on such data; it only reduces the effective sample size.
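A quick numerical sketch of this point: dropping values completely at random leaves summary estimates essentially unchanged (simulated data with illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=1.0, size=100_000)
keep = rng.random(x.size) < 0.5   # MCAR: each value kept with probability 0.5

# the mean over the remaining values estimates the full-sample mean without bias
print(round(x.mean(), 3), round(x[keep].mean(), 3))
```

The two printed means agree closely; only the variance of the estimate grows as data is discarded.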
To illustrate MCAR, we randomly remove roughly half of the values for x1 as follows.
## Missing Completely at Random (MCAR)
# randomly mark roughly half of the x1 samples as missing (MCAR),
# independent of any information recorded in the dataset
idx_mcar = np.random.choice([0, 1], size=(100,)) == 1
plt.scatter(x1[idx_mcar], y[idx_mcar], label='missing (MCAR)', color='red')
plt.title('Missing Completely at Random (MCAR)');
The red-marked x1 samples are missing completely at random (MCAR): their missingness depends neither on the values of x1 itself nor on any other measured variable.
Missing at Random (MAR)
MAR occurs when the probability of a value being missing depends only on the observed information in the other predictors.
For example, when men and women respond to the question “have you ever taken parental leave?”, men may skip the question at a different rate than women, so the missingness in the answer depends on the observed gender variable.
MAR values can be handled by using the information in the other predictors to build a model that imputes the missing entries.
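As a minimal sketch of this idea (group-wise mean imputation is one simple instance; the simulated data and the 50% missingness rate are assumptions for illustration), missing x1 values are filled with the mean of the observed x1 values that share the same x2 group:

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.choice([0, 1], size=100)
x1 = 2.0 * x2 + rng.normal(size=100)      # x1 depends on x2
x1_obs = x1.copy()
# MAR: values go missing more often when x2 == 1
x1_obs[(x2 == 1) & (rng.random(100) < 0.5)] = np.nan

x1_imp = x1_obs.copy()
for g in (0, 1):
    grp = x2 == g
    fill = np.nanmean(x1_obs[grp])        # mean of observed values in the group
    x1_imp[grp & np.isnan(x1_obs)] = fill
```

More sophisticated variants fit a regression model of x1 on the other predictors, but the principle is the same: the observed predictors carry the information needed to recover plausible values.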
We simulate MAR by removing x1 values based on x2: whenever x2 equals 1, the corresponding x1 value is missing.
## Missing at Random (MAR)
# mark x1 samples as missing (MAR) wherever x2 equals 1