How to Avoid Data Leakage When Imputing for Predictive Modeling

Most predictive modelers are familiar with data splitting and honest assessment. When developing a predictive model, data is typically split into training, validation, and test sets for developing and assessing models. Despite having knowledge of partitioning, many modelers make the mistake of imputing in a manner that allows data to leak from the validation and test sets into the training data. In this post, I’ll describe the problem of data leakage, explain how to impute to avoid leakage, and give advice for using SAS tools that make imputation easy to accomplish as well as some options to avoid when preparing data and modeling.

Data partitioning

First, let’s review data partitioning. The goal of predictive modeling is generalization, that is, making predictions on (scoring) new data. So predictive modelers need to build models that perform well on data they have not seen, and data splitting is what best allows one to evaluate a model’s ability to do so. When developing a predictive model, the researcher will typically split the data into a set for training the model (training), a set for choosing a model from among the candidates (validation), and a set for a final unbiased assessment of the champion model’s performance (test). Without data splitting, we can only assess how well the model predicts the historical data used to train it, and that assessment is optimistically biased: the model will have been chosen because it does well on the particular data set used for training, which may have features not shared by data to be scored later. That’s not what we want. We want to know how accurately we can expect the model to predict when applied to previously unseen data. Data splitting allows honest assessment, that is, an unbiased assessment of the chosen model’s performance. It allows the modeler to simulate applying a model built today to new data seen tomorrow. For this reason, the modeler must ensure that information in the validation and test sets is excluded from every fitting and pre-modeling processing step, including imputation and transformations.

For a review of data partitioning using SAS Visual Analytics/Visual Statistics, see Beth Ebersol’s blog [1]. For steps on partitioning data using DATA step code and PROC SURVEYSELECT, see Rick Wicklin’s blog [2].
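
As a minimal sketch of the PROC SURVEYSELECT approach (not taken from either blog), a 50/50 random split might look like the following. The table work.have and the seed value are hypothetical. The OUTALL option keeps every row and adds a Selected flag, which is treated here as the training indicator.

```sas
/* Hypothetical input table: work.have */
proc surveyselect data=work.have out=work.parted
                  method=srs samprate=0.5 seed=12345 outall;
run;

/* Split on the Selected flag: 1 = training, 0 = validation */
data work.training work.validation;
   set work.parted;
   if Selected = 1 then output work.training;
   else output work.validation;
run;
```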

In some cases, researchers use only training and validation sets, omitting the test set. When this occurs, it is with the understanding that the assessment measures computed on the validation data should be considered upper bounds on model performance. For simplicity, I’ll consider data that has been partitioned into only training and validation sets for the rest of this post. I’ll also assume mean imputation is the chosen imputation approach.

Data leakage

Data leakage, also called information leakage, occurs when information from the validation data contaminates the training data. Information introduced this way during training would not be accessible in a real-world application of the predictive model. This is different from a related issue, sometimes called target leakage, in which information about the target is used as a predictor. Data leakage can lead to optimistically biased assessments of model performance, possibly making unacceptable model performance appear satisfactory. If any information from the validation data is used during training, the model gains an unfair advantage that won’t exist when it is deployed in a real-world scenario. To avoid data leakage, all data preprocessing, including imputation of missing values, must be done using only the training data. This simulates as closely as possible the conditions under which the model will be used in practice.

So how does leakage actually occur? Often a researcher will impute missing values prior to partitioning their data. This seems reasonable at first because imputing after partitioning requires imputing twice instead of once. But this results in leakage, with the model being trained on data that contains information that the model should not be able to see. It is effectively cheating, showing the model some of the answers it is trying to predict. Kapoor and Narayanan [3] show that several kinds of data leakage are surprisingly common in published literature across several scientific disciplines.

The best practice for imputing to avoid leakage is as follows. First, split the data into training and validation sets. For mean imputation, find the training data means, then impute the training means for missing values in both the training and validation sets.

For example, suppose a data set of 1 million observations was partitioned 50% each into training and validation sets, and the means of one variable were:

Data set	Whole data (n = 1M)	Training (n = 500K)	Validation (n = 500K)
Variable mean	75	80	70

we would use 80 to replace missing values in both the training data and the validation data. When researchers impute missing values before partitioning, they are using 75 for the whole data set. Because 75 is computed partly from the validation observations, the imputed training values carry information from the validation data that the model would not have in practice. The model is cheating by seeing information about data that should be unseen.
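
The same logic can be sketched in a few lines of SAS code. This is only an illustration, assuming a hypothetical table work.parted that holds a numeric variable var1 and a flag Selected (1 = training, 0 = validation). The mean is computed from the training rows only and then applied to missing values in every row.

```sas
/* Compute the mean of var1 from training rows only */
proc means data=work.parted noprint;
   where Selected = 1;
   var var1;
   output out=work.train_means(keep=var1_mean) mean=var1_mean;
run;

/* Apply the training mean to missing values in BOTH partitions */
data work.parted_imputed;
   if _n_ = 1 then set work.train_means;  /* var1_mean is retained on every row */
   set work.parted;
   if missing(var1) then var1 = var1_mean;
   drop var1_mean;
run;
```

With the numbers in the table above, var1_mean would be 80 rather than 75, and the validation rows would never contribute to the imputed value.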

After partitioning and then imputing, the training and validation sets may need to be recombined. Many SAS Viya statistical and machine learning modeling procedures will compute fit statistics on all data partitions if they are contained within the same data set.
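
One possible way to recombine the partitions is a DATA step that stacks them and records which table each row came from. The table names and the indicator variable _PartInd_ below are hypothetical.

```sas
/* Stack the imputed partitions and flag the source of each row */
data work.all_imputed;
   set work.training_imputed(in=in_train)
       work.validation_imputed;
   /* Hypothetical partition indicator: 1 = training, 0 = validation */
   _PartInd_ = in_train;
run;
```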

How to impute while avoiding leakage

There are several ways to impute in SAS, and the two methods that I use most are shown below. PROC VARIMPUTE is a nice procedure for imputation using in-memory data in SAS Viya. If you’re using SAS 9, PROC STDIZE will do the job. Both allow you to impute the same values on the validation data that were used for imputing the training data.

PROC VARIMPUTE is easy to use through the point-and-click SAS Studio Imputation task. You can find it under the SAS Viya Prepare and Explore Data tasks. After identifying the variables to impute and the data sets involved, the task-generated code will need to be edited to include a CODE statement. This will produce SAS code that can impute the training means onto other data such as the validation or test sets. The SAS code produced can then be referenced in a DATA step using %include to impute the training means. PROC VARIMPUTE can also be used to impute medians, random numbers between the minimum and maximum values, or custom values.

proc varimpute data=mycas.training;
   input var1 var2 var3 / ctech=mean;
   output out=mycas.training_imputed copyvars=(_all_);
   code file='/home/student/imputed_vars.sas';
run;

data mycas.validation_imputed;
   set mycas.validation;
   %include '/home/student/imputed_vars.sas';
run;

PROC STDIZE has an OUTSTAT= option, which produces a data set containing location and scale measures as well as other computed statistics. The REPONLY option in the code below replaces missing values with the variable means, without altering non-missing values. A second PROC STDIZE step then uses the METHOD=IN() option to replace missing validation data values with the training data means. PROC STDIZE has many additional options for transforming and imputing values.

/* Impute training means; OUTSTAT= saves them for reuse on other data */
proc stdize data=work.training method=mean reponly
            out=work.training_imputed outstat=training_means;
   var var1 var2 var3;
run;

proc stdize data=work.validation reponly method=in(training_means)
            out=work.validation_imputed;
   var var1 var2 var3;
run;

Further recommendations

Many of the SAS Viya regression procedures have an informative missingness option that can automatically apply mean imputation for missing values. These procedures include GENSELECT, LOGSELECT, PHSELECT, QTRSELECT, and REGSELECT, and the option is applied by specifying INFORMATIVE in their MODEL statements. Applying the informative missingness option does three things. First, it imputes the mean for missing values of continuous predictors. Second, it creates missing indicator variables, which can potentially capture the relationship between missingness and the target. Third, it treats missing values for categorical predictors as a legitimate level for analysis. These procedures also have PARTITION statements, which allow random partitioning by the procedure or specification of a partition indicator variable if one is already present in the data. Using PARTITION results in fit statistics being calculated on both training and validation sets, and it allows a champion model to be chosen automatically based on validation data performance.

I recommend not using the PARTITION statement and the INFORMATIVE option in the same PROC step for these procedures. The informative missingness option will use the whole data set means for imputation, then randomly partition the data into training and validation sets. This leads to leakage as explained earlier. Instead, use the informative missingness option when working only with training data, not a data set that contains all partitions. When I use the PARTITION statement to identify partitions and get fit statistics on each partition, it is with data that has already been imputed, using the training means for the whole data set.
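
The two safer patterns might look like the sketch below, shown with PROC LOGSELECT. The table names, the binary target variable target, and the partition indicator _PartInd_ (1 = training, 0 = validation) are all hypothetical.

```sas
/* Pattern 1: INFORMATIVE on a table that holds training rows only */
proc logselect data=mycas.training;
   model target(event='1') = var1 var2 var3 / informative;
run;

/* Pattern 2: PARTITION on a table that was already imputed
   with the training means, without the INFORMATIVE option */
proc logselect data=mycas.all_imputed;
   partition rolevar=_PartInd_(train='1' validate='0');
   model target(event='1') = var1 var2 var3;
run;
```

In the second pattern the procedure only reads the existing partition indicator, so no imputation happens inside the PROC step and the validation rows cannot influence the imputed values.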

In summary, for predictive modeling, we don’t want the unconditional variable means imputed for the whole data set. Instead, we want the training data means imputed for the whole data set. Imputation needs to be done after partitioning, not on the unpartitioned data. Also, in several SAS Viya regression procedures, it is best to avoid using the PARTITION statement and the INFORMATIVE option together. The INFORMATIVE option makes sense when modeling the training data only.

Learning more

Are you interested in learning more about data preparation for predictive modeling including partitioning and imputation? Then consider taking the SAS 9 class Predictive Modeling Using Logistic Regression, which covers all the topics mentioned here (and many more!) in more detail. Not only will you learn about the care needed in data preparation, this class is great preparation for attaining the SAS Statistical Business Analyst credential (https://www.sas.com/en_us/certification/credentials/advanced-analytics/statistical-business-analyst....). If you’re modeling using SAS Viya, Supervised Machine Learning Procedures Using SAS Viya in SAS Studio is a great option too.

See you in the next SAS class!

References

[1] Beth Ebersol (2019) “Training, Validation, and Testing for Supervised Machine Learning Models” (sas.com)

[2] Rick Wicklin (2019) “Create training, validation, and test data sets in SAS” The DO Loop

[3] Kapoor and Narayanan (2022) “Leakage and the Reproducibility Crisis in ML-based Science” https://doi.org/10.48550/arXiv.2207.07048
