
In my opinion, missing value imputation should NOT create new variables. The original variable names are good enough. The process should only put the imputed values in place of the missing values of these variables, and NOT create 'IMP_' variables.

 

From an analyst's point of view, there is no need to have BOTH versions of a variable, imputed and non-imputed, in the analysis. You only ever use one of the two.

2 Comments
ballardw
Super User

If that is a problem for you, a data step can address it. You may not want both, but other users may want to compare the imputed values with the actual values to see how well the imputation works, or run the data through different imputation steps to test different options or compare models built on either version.

 

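data junk;
   set imputed;
   /* numeric variable: keep the actual value, fall back to the imputed one */
   var = coalesce(var, imputedvar);
   /* or, if the variable is character, use this line instead: */
   /* var = coalescec(var, imputedvar); */
   drop imputedvar;
run;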

This creates a new data set with just the base variable name, with missing values replaced by the imputed version.
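
With several variables, the same idea can be sketched with arrays. This is only an illustration under stated assumptions: the data set name `imputed`, the numeric originals `x1`-`x3`, and the imputed copies `IMP_x1`-`IMP_x3` are hypothetical names, not taken from any particular imputation tool's output.

```sas
/* Sketch only: imputed, want, x1-x3 and IMP_x1-IMP_x3 are hypothetical names. */
/* Assumes all listed variables are numeric.                                  */
data want;
   set imputed;
   array orig {*} x1 x2 x3;               /* original variables            */
   array imp  {*} IMP_x1 IMP_x2 IMP_x3;   /* matching imputed variables    */
   do i = 1 to dim(orig);
      /* keep the actual value when present, else take the imputed one */
      orig{i} = coalesce(orig{i}, imp{i});
   end;
   drop i IMP_: ;                          /* remove the loop index and all IMP_ variables */
run;
```

Character variables would need their own pair of arrays and `coalescec` in place of `coalesce`.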

JussiV
SAS Employee
I was not referring to the skills needed to do it, but rather to the usability of our solution. The "IMP_" naming standard is useless and only creates confusion.
I'm also concerned about the size of the tables we need to generate to achieve this trivial trick. If both variables are stored, we double the size of the dataset. And in the example you gave me with the coalesce function, we practically have two copies of the same values in the dataset: after the coalesce statement, both variables end up holding the same values.
And even in the use case where analysts would like to compare the results, duplicating the variables is a waste of space. Analysts who want to do that would first run the analysis without the imputation, then simply add the Imputation task to the process and rerun a similar analysis for that part of the analysis path/pipeline.
There is no need for analysts to have BOTH imputed AND non-imputed variables in the dataset within the same task.