Dealing with missing data is a challenge in statistical modeling. Whether the data come from a designed experiment or from a business's operational systems, most data sets have missing values. Missing values are problematic because most analyses discard any row in which a variable used in the model has a missing value. Even a small number of missing values can therefore lead to enormous data loss. Many analysts prefer to impute, that is, to fill in missing values with (hopefully) reasonable proxies. In some cases, researchers simply impute a fixed value, such as the mean or median of the nonmissing values, for continuous variables. In other situations, more sophisticated techniques such as cluster imputation, regression imputation, or multiple imputation are used. In this post, I will describe the situations in which different imputation methods are appropriate and demonstrate simple and quick fixed-value imputation techniques that may be appropriate for predictive modelers. In a follow-up post, I will demonstrate imputation techniques geared toward statistical inference practitioners.
The default approach to handling missing values: complete case analysis
Most statistical and machine learning models use complete case analysis (CCA, also called listwise deletion), meaning only rows with complete data for the predictors and the response are used in the analysis. In many situations, this results in an unacceptable loss of data. Even with a small percentage of missing data, say 1% of the individual values, CCA can sometimes result in two-thirds of the rows being dropped from the analysis. Additionally, CCA may bias the results, depending on why the data are missing. So why is this the default approach for statistical modeling? When values are missing completely at random (MCAR, described below), complete case analysis yields unbiased parameter estimates. But even with data being MCAR (a strong assumption about the data), CCA reduces the sample size and therefore increases standard errors, widens confidence and prediction intervals, and reduces statistical power. Under other types of missingness, CCA will typically yield biased parameter estimates.
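Before deciding how to handle missing data, it helps to know how much of it you have. Here is a minimal, hedged sketch of that kind of check; the data set and variable names are hypothetical stand-ins, not from this post.

/* Hypothetical example: per-variable missing counts */
proc means data=work.my_data n nmiss;
   var var1 var2 var3;
run;

/* Hypothetical example: how many rows would survive complete case analysis */
data _null_;
   set work.my_data end=eof;
   if cmiss(of var1 var2 var3) = 0 then n_complete + 1;   /* sum statement retains the count */
   if eof then put "Rows with complete data: " n_complete;
run;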
Imputing missing values
Imputation is often a better alternative to complete case analysis for handling missing data. Unless there is a very small amount of missing data, imputation is generally necessary, particularly if the goal is scoring new data. For predictive modeling, even if the model development data are nearly complete, most models cannot make predictions if the data to be scored are missing any values, so imputation will likely be necessary. For explanatory modeling, imputation can be a better alternative to CCA when data are missing at random (described below), a situation in which CCA can produce biased parameter estimates and inflated standard errors. Despite appropriate imputation methods being beneficial to explanatory modelers, several authors have noted that published research analyses don't always acknowledge the presence of missing data or how it was handled, suggesting widespread uncertainty about how to impute data. This is probably because imputation for inference typically requires more sophisticated techniques such as regression imputation or multiple imputation.
When is imputation unnecessary?
Some models do not need imputation because they do not use CCA. Decision trees and tree-based models such as random forests and gradient boosting have built-in ways of handling missing data. Some analysts feel that decision trees are not as interpretable as regression models, but their built-in methods for handling missing values can make them a good alternative. The SAS Viya procedure PROC TREESPLIT can be used for fitting decision tree models.
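To illustrate, here is a minimal sketch of a PROC TREESPLIT call. The in-memory table mycas.my_data, the categorical target, and the predictor names are hypothetical stand-ins; the point is simply that rows with missing predictor values can still contribute to the tree rather than being discarded.

/* Hypothetical example: fit a classification tree in SAS Viya */
proc treesplit data=mycas.my_data;
   class target cat_var;                    /* categorical target and predictor */
   model target = cat_var var1 var2 var3;   /* var1-var3 are continuous */
   prune costcomplexity;                    /* cost-complexity pruning */
run;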
Some models can estimate parameters through maximum likelihood without discarding incomplete rows. They use all the available data to find the parameters most likely to have generated the observed data, without actually imputing the missing values. This involves jointly maximizing the likelihood contributions of the complete rows and of the incomplete rows. PROC MIXED and PROC GLIMMIX can produce these kinds of maximum likelihood estimates when the response variable has missing values but the predictors do not. For data with missing predictors, PROC MI can use an expectation-maximization (EM) algorithm to produce maximum likelihood estimates without imputing or discarding data. PROC MI will be discussed in a follow-up post on multiple imputation.
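As a preview, here is a minimal sketch of asking PROC MI for EM estimates without creating any imputed data sets. The data set and variable names are hypothetical; NIMPUTE=0 suppresses imputation, and the EM statement writes the maximum likelihood estimates of the means and covariances to an output table.

/* Hypothetical example: EM (maximum likelihood) estimates from incomplete data, no imputations */
proc mi data=work.my_data nimpute=0;
   em outem=work.em_estimates;   /* EM estimates of means and covariance matrix */
   var var1 var2 var3;
run;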
When data are missing not at random (described below), the bias in analyses based on imputed data (including multiply imputed data) may be larger than the bias introduced by CCA.
Appropriate imputation depends on the analysis goal and the missing data mechanism
Is your work focused on understanding the risk factors and their relative importance for Type-2 diabetes? Or is your work focused on predicting who is likely to donate to your charity when they receive a solicitation? The best approach for imputing missing values will differ between these situations. It will depend on your research goal as well as the missing data mechanisms.
So, I’m going to dichotomize the goals of statistical modeling into prediction and statistical inference (also called “explanatory modeling”). Business analysts are typically interested in getting the most accurate predictions possible from their models. Examples might include predicting which transactions are fraudulent, or which customers are likely to churn and cancel services with a particular business. Parameter estimates and p-values are of secondary importance, if they are of interest at all.
Statistical inference practitioners are typically focused on calculating unbiased parameter estimates to explain the relationships among variables. These explanatory modelers are often interested in constructing confidence intervals around these parameter estimates and carrying out significance tests. While prediction may be an eventual goal, that comes much later than parameter estimation and hypothesis testing and is often in service of establishing causal relationships.
Why does this distinction matter for imputation? Imputation serves two purposes: it avoids the data loss (and the accompanying loss of statistical power and ability to score new records) that comes with complete case analysis, and it can reduce the bias in parameter estimates that missingness would otherwise introduce.
So, it's worth considering whether the focus of your work is unbiased parameter estimation or pure prediction. Predictive modelers will likely be most concerned with data loss. Explanatory modelers will probably be most concerned with unbiased parameter estimates, but data loss, and therefore statistical power, may also be an important concern.
There are several common imputation techniques used for these goals, including fixed-value imputation (for example, the mean or median), cluster imputation, regression imputation, and multiple imputation.
The goal of modeling (prediction or inference) along with the type of missingness will determine which of these imputation approaches is a good choice for your work. So next, I'll explain the types of missing values. Warning: the terminology is non-intuitive and, in my opinion, poorly chosen, but unfortunately it is well established. Also, there is usually no good way to know with certainty which of these missingness mechanisms applies to your variables with missing values; you need to know your data well to make good guesses. Keep in mind that the type of missingness for one variable in your data may differ from the type of missingness for another, so the missing data mechanisms should be evaluated separately for each variable.
Missing Completely at Random (MCAR). This is the easiest missingness to deal with. MCAR means the chance a value is missing from your data set is not related to its (unknowable) value or to any other variable inside or outside of your data set. The loss of the data is 100% random and losing these data will therefore not bias any parameters (but will reduce sample size and therefore power). For many kinds of data, MCAR may be an unrealistic assumption. Several traditional missing data techniques such as CCA are valid only if the MCAR assumption is met. If the observed data are a random subset of the full data, CCA gives essentially the same results as the full data set would have.
Examples of MCAR
For example, a blood sample is dropped in the lab, a survey page is lost in the mail, or a data-entry error wipes out a handful of values: the chance of losing the value has nothing to do with the value itself or with anything else you measured.
Missing at Random (MAR). This is neither the easiest nor the hardest situation to deal with. MAR means there is a relationship between the missingness and other variables in your data set (but not to unobserved data). In other words, the missingness can be predicted from other variables in your data. Under MAR, there is no relationship between the missing values themselves and the chance they are missing (after adjusting for the observed variables). For many kinds of missing data, MAR is considered more plausible than MCAR. Some researchers have suggested a simple check: use a t-test to compare rows with and without the missing value on the other continuous predictors. Logistic regression with the missing status of a variable as the target can also be used to detect predictor-missingness associations that suggest data are MAR.
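Here is a minimal sketch of that logistic regression check, using a hypothetical data set in which income is the variable with missing values and age and yearsonjob are fully observed. A strong relationship between the missingness flag and the observed variables is evidence against MCAR and consistent with MAR.

/* Hypothetical example: model the missingness of income from observed variables */
data work.check_missing;
   set work.my_data;
   income_missing = missing(income);   /* 1 if income is missing, 0 otherwise */
run;

proc logistic data=work.check_missing;
   model income_missing(event='1') = age yearsonjob;
run;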
Since MAR is easier to deal with than MNAR (next section), it can be beneficial to have other variables that can help predict the missingness. These variables can be used in various imputation models such as regression imputation or multiple imputation. These imputation techniques (as well as the maximum likelihood and EM algorithm approaches) can reduce parameter estimation bias when data are MAR. These are all good approaches for explanatory modelers with MAR data.
Examples of MAR
For example, older respondents in a survey might be less likely to report their income; if age is recorded for everyone, the missingness of income can be explained by an observed variable.
Missing Not at Random (MNAR). This is the most complicated situation to deal with. In MNAR, the chance a value is missing depends on that value itself or on other variables that are not contained in your data. In other words, there is a definite pattern to the missing values, but that pattern cannot be fully explained by the observed data. This can heavily bias your parameter estimates and standard errors. Imputing for MNAR generally requires advanced methods that depend heavily on assumptions about the unknown information, and there is no way to verify that those assumptions are correct. Because of this, sensitivity analyses are often run under several different assumptions to show how much the results depend on them. When data are MNAR, researchers may simply need to collect new or additional data.
Examples of MNAR
For example, people with very high incomes may decline to report their income, or heavy drinkers may skip questions about alcohol use: the probability of a value being missing depends on the value itself.
Fixed-value imputation for predictive modelers
For predictive modelers, fixed-value imputation (e.g., the mean) is often an acceptable imputation approach. When the data are MCAR, this will bias standard errors downward but will not bias parameters such as the sample mean. Regression coefficients may be biased toward zero because the response variable was not used in the imputation (the imputed mean of X can be paired with any value of Y, weakening the X-Y relationship). When data are MAR, mean imputation can lead to biased parameter estimates and can weaken or distort the relationships between variables. It also inflates the confidence in parameter estimates because it doesn't account for the uncertainty introduced by imputing missing data. These problems are worse when data are MNAR.
But when the goal is pure prediction, biased parameters are not as serious a problem as they are for explanatory modeling. In fact, predictive modelers often employ shrinkage methods such as LASSO and elastic net regression that intentionally bias parameters to reduce prediction error. With that in mind, the benefits of fixed-value imputation may outweigh the costs.
Fixed-value imputation is also very fast, freeing up the analyst's time for other parts of modeling or for developing additional models. This is a practical rather than a statistical advantage. For some business needs, a good model produced today is better than an excellent model produced next month. And while more sophisticated approaches such as multiple imputation may reduce bias in parameters (when data are MAR), these approaches can drastically increase the computation needed for the large data sets typically used in predictive modeling.
Fixed-value imputation can be paired with missing indicator variables. These constructed binary variables each correspond to a predictor with missing values: when X1 is missing a value, the missing indicator for X1 takes a value of 1; otherwise it is 0. These missing indicators can potentially capture associations between the missingness of the predictors and the target. Missing indicators are created automatically by the SAS Viya regression procedures GENSELECT, LOGSELECT, PHSELECT, QTRSELECT, and REGSELECT when the INFORMATIVE option is added to the MODEL statement. INFORMATIVE also automatically imputes the mean for missing continuous variables.
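Here is a minimal sketch of the INFORMATIVE option using PROC LOGSELECT. The in-memory table, target, and predictor names are hypothetical; the option asks the procedure to create missing indicators and impute means for missing continuous predictors rather than dropping incomplete rows.

/* Hypothetical example: informative missingness in a SAS Viya regression procedure */
proc logselect data=mycas.my_data;
   model target(event='1') = var1 var2 var3 / informative;
run;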
This post has focused on imputation for continuous predictors. For nominal predictors with missing values, a common approach is to treat missing as a valid analysis level. So a 2-level predictor (e.g., gender) with missing values would be treated as a three-level nominal variable.
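As a quick illustration, here is one hedged way to do that in a DATA step; the variable names are hypothetical. The idea is simply to recode missing character values as an explicit level before modeling.

/* Hypothetical example: treat missing values of a character predictor as their own level */
data work.my_data_cat;
   set work.my_data;
   length gender_cat $8;
   gender_cat = coalescec(gender, "Missing");   /* keeps observed levels, adds "Missing" */
run;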
How to do fixed-value imputation
There are several ways to do fixed-value imputation in SAS. One approach is to use the SAS 9 procedure PROC STDIZE. The code below shows how to impute the median for missing values of variables var1, var2, and var3. Other options besides median imputation include imputing the mean, midrange, minimum, or values referenced from a separate data set. The REPONLY option causes the procedure to only replace missing values with the median and not to standardize the data.
proc stdize data=work.my_data method=median reponly out=work.my_data_imputed;
   var var1 var2 var3;
run;
For imputing in-memory data in SAS Viya, PROC VARIMPUTE is a good choice. Below is some example code to do the trick. The CTECH= option specifies the continuous variable imputation technique; besides the median, it can impute the mean, a random value between the minimum and maximum, or a specific value.
proc varimpute data=mycas.my_data;
   input var1 var2 var3 / ctech=median;
   output out=mycas.my_data_imputed copyvars=(_all_);
run;
Imputed variables in the output data mycas.my_data_imputed will have the prefix IM_ (e.g., IM_var1). I especially like that PROC VARIMPUTE can impute random values uniformly distributed between the minimum and maximum. If I’m missing lots of values for one variable, this approach prevents a big spike at the median, which can be problematic for some models.
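For instance, a minimal sketch of that random-value imputation (same hypothetical table and variables as above) might look like this.

/* Hypothetical example: impute uniform random values between each variable's minimum and maximum */
proc varimpute data=mycas.my_data;
   input var1 var2 var3 / ctech=random;
   output out=mycas.my_data_imputed_rand copyvars=(_all_);
run;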
Cluster imputation
Cluster imputation imputes a cluster-specific fixed value, such as the cluster mean, instead of the overall value. These imputed values may be less biased than the unconditional fixed value. While there are many ways to come up with clusters (e.g., SAS 9: PROC CLUSTER; SAS Viya: PROC KCLUS), here is a simple way of “clustering” using PROC RANK. Let's say you want to find clusters of individuals based on age and years on the job. You could use PROC RANK to form three age groups and three years-on-the-job groups, for a total of nine categories or clusters:
proc rank data=work.my_data out=work.my_data_rank groups=3;
   var age yearsonjob;
   ranks age_group yearsonjob_group;   * makes 9 combinations of these groups;
run;
Then you could use either of the two fixed-value imputation methods along with a BY statement to impute within each cluster. Because BY processing requires sorted data, sort by the group variables first:
proc sort data=work.my_data_rank;
   by age_group yearsonjob_group;
run;

proc stdize data=work.my_data_rank method=median reponly out=work.my_data_imputed_clus;
   by age_group yearsonjob_group;
   var var1 var2 var3;
run;
In a follow-up post, I'll demonstrate regression imputation and multiple imputation approaches. For more information on the maximum likelihood estimation that I briefly mentioned, see the SAS Global Forum 2012 paper “Handling Missing Data by Maximum Likelihood” by Paul Allison. The book “Regression Modeling Strategies” by Frank Harrell has a helpful chapter on analyses with missing data.
- Tarek Elnaccash
Find more articles from SAS Global Enablement and Learning here.