Values used for imputation of missing values

pvareschi — Sun, 24 May 2020 18:16:49 GMT

Re: Applied Analytics Using SAS Enterprise Miner

Taking as an example the process flow on page 4-20 of the course text, my understanding is that the imputed values (e.g. means or medians) are calculated from the training dataset and then applied to the validation/test/score datasets.

However, if oversampling is used, are those values biased? If so, should they be adjusted for oversampling, or is it valid to use them as they are, since those are the same records used to fit the model?

Re: Values used for imputation of missing values

gcjfernandez — Mon, 25 May 2020 19:35:50 GMT

My response:

Please see my previous responses related to oversampling: oversampling doesn't interfere with model selection or model estimation. Only the estimated posterior probabilities need adjustment (an intercept shift) based on the prior probabilities.
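
To make the intercept shift concrete, here is a minimal sketch in plain Python (not Enterprise Miner code). It assumes a binary target and a logistic-type model; the function and argument names are hypothetical, `prior_event` stands for the event proportion in the population, and `sample_event` for the event proportion in the oversampled training data.

```python
import math

def adjust_posterior(p_sample, prior_event, sample_event):
    """Rescale a posterior probability estimated on oversampled data
    so that it reflects the population prior instead."""
    num = p_sample * prior_event / sample_event
    den = num + (1.0 - p_sample) * (1.0 - prior_event) / (1.0 - sample_event)
    return num / den

def intercept_offset(prior_event, sample_event):
    """Amount to subtract from the intercept (on the logit scale)
    of a logistic model fitted on the oversampled data."""
    return math.log(
        (sample_event * (1.0 - prior_event))
        / ((1.0 - sample_event) * prior_event)
    )

# Made-up illustration: events are 5% of the population but 50% of the sample.
p_adj = adjust_posterior(0.5, prior_event=0.05, sample_event=0.5)
shift = intercept_offset(prior_event=0.05, sample_event=0.5)
```

Both functions describe the same correction: subtracting `shift` from the fitted logit and converting back to a probability gives the same result as `adjust_posterior`. The slope coefficients, and anything done to the inputs before fitting, are left unchanged.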

Therefore, if missing values are imputed with the training data mean or median (a constant), there is no need to adjust for oversampling.
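
As a rough illustration of constant imputation, the sketch below (plain Python, not what the Impute node runs internally) computes a median once on the training data and reuses it on a validation set. The variable names and numbers are made up, and `None` stands in for a missing value.

```python
from statistics import median

def fit_median(values):
    """Median of the non-missing training values (None marks a missing value)."""
    observed = [v for v in values if v is not None]
    return median(observed)

def impute(values, fill):
    """Replace every missing value with the constant `fill`."""
    return [fill if v is None else v for v in values]

# Hypothetical input variable in the (possibly oversampled) training data.
train_income = [40.0, None, 55.0, 62.0, None, 48.0]
valid_income = [None, 51.0, None]

# The constant is computed once, from the training data only ...
fill_value = fit_median(train_income)

# ... and then applied unchanged to training, validation, test and score data.
train_imputed = impute(train_income, fill_value)
valid_imputed = impute(valid_income, fill_value)
```

The point of the sketch is only that the same constant is used everywhere; the prior adjustment is applied later, to the predicted probabilities, not to the imputed inputs.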

Several non-constant missing value imputation methods are also available in SAS Enterprise Miner (tree-based methods, weighted regression methods such as Huber and Tukey), and users can easily test these methods and pick the most suitable ones based on their data.

topic Re: Values used for imputation of missing values in SAS Academy for Data Science
