Taking as an example the process flow on page 4-20 of the course text, my understanding is that imputation values (e.g., means or medians) are calculated from the training dataset and then applied to the validation/test/score datasets.
However, if oversampling is used, are those values biased? If so, should they be adjusted for oversampling, or is it valid to use them as they are, since those are also the records used to fit the model?
My response:
Please see my previous responses related to oversampling: oversampling does not interfere with model selection or model estimation. Only the estimated posterior probabilities need adjustment (an intercept shift) based on the population prior.
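For reference, here is a minimal Python sketch of that prior-based correction (generic code, not Enterprise Miner output; the function name and the example priors are illustrative). It shifts the model's logit by the oversampling offset, which rescales the probabilities but leaves their rank order, and hence model selection, unchanged.

```python
import numpy as np

def adjust_for_oversampling(p_oversampled, sample_prior, population_prior):
    """Correct posteriors estimated on an oversampled training set.

    The standard intercept-shift (offset) correction: subtract
    ln(rho1*(1-pi1) / (pi1*(1-rho1))) from the logit, where rho1 is the event
    proportion in the oversampled sample and pi1 is the true population prior.
    """
    p = np.asarray(p_oversampled, dtype=float)
    rho1, pi1 = sample_prior, population_prior
    # Work on the log-odds scale and remove the oversampling offset.
    logit = np.log(p / (1 - p)) - np.log((rho1 * (1 - pi1)) / (pi1 * (1 - rho1)))
    return 1 / (1 + np.exp(-logit))

# Example: model trained on a 50/50 oversampled set, true event rate 5%.
# A 0.50 score on the balanced data maps back to roughly 0.05.
print(adjust_for_oversampling([0.30, 0.50, 0.80], sample_prior=0.5, population_prior=0.05))
```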
Therefore, if missing values are imputed using the training-data mean or median (a constant), there is no need to adjust for oversampling.
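Outside Enterprise Miner, the same fit-on-training, apply-everywhere pattern from page 4-20 looks like this in scikit-learn (a sketch only; the toy arrays are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy partitions; np.nan marks missing values.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_valid = np.array([[np.nan, 25.0], [5.0, np.nan]])

# Fit the imputation constants (medians here) on the training partition only.
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)

# Reuse the same training-derived constants on validation/test/score data;
# because they are constants, no oversampling adjustment is needed.
X_valid_imputed = imputer.transform(X_valid)
print(imputer.statistics_)   # the medians learned from the training data
print(X_valid_imputed)
```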
Several non-constant missing-value imputation methods are also available in SAS Enterprise Miner (tree-based methods, robust/weighted regression methods such as Huber and Tukey), and users can easily test these methods and pick the ones best suited to their data.
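As a rough open-source analogue of the tree-based option (not Enterprise Miner's actual implementation), one can have scikit-learn's IterativeImputer model each incomplete variable from the other inputs with a small decision tree; again, the toy data are made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_valid = np.array([[np.nan, 25.0], [5.0, np.nan]])

# Impute each variable from the others with a shallow tree, then apply the
# fitted imputer (trained on the training partition) to the other partitions.
tree_imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=3),
                                random_state=0)
X_train_imputed = tree_imputer.fit_transform(X_train)
X_valid_imputed = tree_imputer.transform(X_valid)
print(X_valid_imputed)
```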