Solved: Values used for imputation of missing values

pvareschi · Posted 05-24-2020 02:16 PM

Re: Applied Analytics Using SAS Enterprise Miner

Taking as an example the process flow on page 4-20 of the course text, my understanding is that values imputed (e.g. means or medians) are calculated based on the training dataset and used on the validation/test/score datasets.

However, if oversampling is used, are those values biased? If so, should they not be adjusted for oversampling or is it valid/correct to use them as they are because those are the records also used for fitting the model?

gcjfernandez · Posted 05-25-2020 03:35 PM

Taking as an example the process flow on page 4-20 of the course text, my understanding is that values imputed (e.g. means or medians) are calculated based on the training dataset and used on the validation/test/score datasets.

However, if oversampling is used, are those values biased? If so, should they not be adjusted for oversampling or is it valid/correct to use them as they are because those are the records also used for fitting the model?

My response:

Please see my previous responses related to oversampling: oversampling doesn't interfere with model selection or model estimation. Only the estimated posterior probabilities need adjustment (an intercept shift) based on the prior probabilities.
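To make the intercept shift concrete, here is a minimal Python sketch (not Enterprise Miner code) of the standard prior correction for a binary target. The function names and the arguments `sample_rate` (event proportion in the oversampled training data) and `prior` (event proportion in the population) are assumptions chosen for illustration. Both functions should give the same result, since rescaling the probabilities is equivalent to subtracting a constant offset from the logit.

```python
import math

def adjust_posterior(p, sample_rate, prior):
    """Rescale an event probability from an oversampled model to the population prior."""
    w1 = prior / sample_rate                  # weight for the event class
    w0 = (1.0 - prior) / (1.0 - sample_rate)  # weight for the non-event class
    return p * w1 / (p * w1 + (1.0 - p) * w0)

def adjust_by_intercept_shift(p, sample_rate, prior):
    """Same correction expressed as an offset subtracted from the logit (intercept)."""
    logit = math.log(p / (1.0 - p))
    offset = math.log((sample_rate * (1.0 - prior)) / ((1.0 - sample_rate) * prior))
    return 1.0 / (1.0 + math.exp(-(logit - offset)))

# Hypothetical numbers: model trained on a 50/50 sample, true event rate 10%.
p_adjusted = adjust_posterior(0.8, sample_rate=0.5, prior=0.1)
p_shifted = adjust_by_intercept_shift(0.8, sample_rate=0.5, prior=0.1)
```

Note that only the predicted probabilities change here; the input variables, including any imputed values, are not touched by this correction.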

Therefore, if missing values are imputed with the training data mean or median (a constant), there is no need to adjust for oversampling.
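As an illustration of constant imputation, the following Python sketch (again, not Enterprise Miner code) computes the median on the training data only and reuses that same constant on the validation data. The helper names `fit_median` and `impute` and the sample values are hypothetical.

```python
from statistics import median

def fit_median(train_values):
    """Compute the imputation constant from the non-missing training values only."""
    observed = [v for v in train_values if v is not None]
    return median(observed)

def impute(values, fill):
    """Replace missing entries (None) with the constant learned from training."""
    return [fill if v is None else v for v in values]

# Hypothetical data; None marks a missing value.
train = [3.0, None, 5.0, 9.0]
valid = [None, 4.0]

fill = fit_median(train)              # learned once, from training data
train_imputed = impute(train, fill)
valid_imputed = impute(valid, fill)   # same constant reused, no re-estimation
```

The validation, test, and score data never contribute to the constant; they only receive it.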

Several non-constant missing value imputation methods are also available in SAS Enterprise Miner (tree-based methods and weighted regression methods such as Huber and Tukey). Users can easily test these methods and pick the ones that suit their data.
