pvareschi
Quartz | Level 8

Re: Applied Analytics Using SAS Enterprise Miner

Taking as an example the process flow on page 4-20 of the course text, my understanding is that values imputed (e.g. means or medians) are calculated based on the training dataset and used on the validation/test/score datasets.

However, if oversampling is used, are those values biased? If so, should they not be adjusted for oversampling or is it valid/correct to use them as they are because those are the records also used for fitting the model?

1 ACCEPTED SOLUTION

gcjfernandez
SAS Employee


Please see my previous responses on oversampling: oversampling does not interfere with model selection or model estimation. Only the estimated posterior probabilities need adjustment (an intercept shift) based on the prior probabilities.
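To make the intercept shift concrete, here is a minimal Python sketch (not SAS Enterprise Miner code) of the standard prior correction for a binary target. The function names, and the parameters `sample_rate` (event proportion in the oversampled training data) and `prior` (event proportion in the population), are my own hypothetical labels for illustration.

```python
import math

def adjust_posterior(p_sample, sample_rate, prior):
    """Rescale a probability estimated on oversampled data to the population prior."""
    # Weight each class by (population share / sample share), then renormalize.
    event = p_sample * prior / sample_rate
    nonevent = (1.0 - p_sample) * (1.0 - prior) / (1.0 - sample_rate)
    return event / (event + nonevent)

def intercept_offset(sample_rate, prior):
    """Equivalent shift on the logit scale (added to the model intercept)."""
    return math.log(prior * (1.0 - sample_rate) / ((1.0 - prior) * sample_rate))

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical numbers: events are 5% of the population but 50% of the training sample.
p_model = 0.8
p_direct = adjust_posterior(p_model, sample_rate=0.5, prior=0.05)
p_shifted = inv_logit(logit(p_model) + intercept_offset(sample_rate=0.5, prior=0.05))
```

Both routes give the same adjusted probability, which is why the correction can be described as an intercept shift: the slope estimates are left as they are and only the constant term on the logit scale moves.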

Therefore, if missing values are imputed with the training data mean or median (a constant), there is no need to adjust for oversampling.
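As a rough illustration of the workflow described in the question, the sketch below (plain Python, not Enterprise Miner code) computes the median once on the training partition and reuses that constant on the validation partition. The variable names and the data values are made up for the example, and `None` stands in for a missing value.

```python
from statistics import median

def fit_median(values):
    """Median of the non-missing training values (None marks a missing value)."""
    observed = [v for v in values if v is not None]
    return median(observed)

def impute(values, fill):
    """Replace each missing value with the constant learned from training."""
    return [fill if v is None else v for v in values]

# Hypothetical partitions of one input variable.
train_income = [40.0, None, 55.0, 62.0, None, 48.0]
valid_income = [None, 70.0, 51.0]

fill = fit_median(train_income)          # learned from training data only
train_imputed = impute(train_income, fill)
valid_imputed = impute(valid_income, fill)  # same constant reused, not recomputed
```

The point of the design is that the validation, test, and score data never contribute to the imputation constant; they only receive it.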

Several non-constant missing value imputation methods are also available in SAS Enterprise Miner (tree-based methods and weighted regression methods such as Huber and Tukey), and users can easily test these methods and pick the ones that suit their data.

 

