Suppose we have a case with a low event rate, say 1%. In a 1:1 oversample we select all of the events (1% of the data) and 1% of the non-events. Then there are two important questions:
1. How can we guarantee that this 1% of non-events is a representative sample of the remaining 98% of non-events?
2. What procedures can be used to test that the distributions of the predictors are the same in the 1% and the 98% non-event samples?
This is what I usually do when developing models within the financial sector (mainly scorecards to predict risk of default):
1. Use random sampling to select the 1% of non-events; usually that guarantees you get a representative sample.
2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...
3. Usually I do the sampling with Proc SurveySelect (see the sketch after this list).
4. I usually check the 1% sample against the entire "non-events" population; one simple approach is to compare the distributions on key variables, based on a similar rationale as in step (2). Differences can be assessed either with a Chi-square test or with Information Value.
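To make steps 1-3 concrete, here is a minimal Proc SurveySelect sketch. The dataset WORK.NONEVENTS, the strata variable REGION, the seed, and the 1% rate are illustrative assumptions, not something from the original posts:

/* Step 1: simple random sample of the non-events */
proc surveyselect data=work.nonevents out=work.nonevents_srs
                  method=srs samprate=0.01 seed=12345;
run;

/* Step 2 (alternative): stratify on a key segmentation variable. */
/* The input must be sorted by the strata variables.              */
proc sort data=work.nonevents; by region; run;

proc surveyselect data=work.nonevents out=work.nonevents_strat
                  method=srs samprate=0.01 seed=12345;
   strata region;
run;

If the non-event count must match the event count exactly for a strict 1:1 split, SAMPSIZE=n can be used instead of SAMPRATE=.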
Thanks for your suggestion.
2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...
Stratification on predictors is a tricky thing. At the variable selection stage of modelling there can be 50 or more predictors. Assume you bin each of them into 10 bins: full cross-stratification then produces 10^50 cells, so the 1% sample would need at least 10^50 observations to populate them, which is an astronomical number.
So it is more practical to compare the distributions in the 1% and 98% samples and to resample if one of the predictors does not pass the test.
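As a minimal sketch of such a test, assuming the 1% sample and the remaining 98% are stacked into one dataset WORK.STACKED with an indicator IN_SAMPLE (1 = in the sample) and a hypothetical binned predictor AGE_GROUP:

/* H0: AGE_GROUP has the same distribution in the sample and the rest */
proc freq data=work.stacked;
   tables in_sample*age_group / chisq;
run;

Run this for each predictor (binned, in the case of continuous inputs) and resample if any test flags a material difference.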
Of course, a high number of predictors would make things very difficult to manage. Just to give you an idea, I have worked on projects with between 500 and 1000 predictors!
In those situations, I still find it useful to apply stratified sampling with only 3, maximum 5, key variables (i.e. inputs that are known/expected to be very important, not only statistically but from a business/problem context point of view).
Something I forgot to mention in my first reply: before running statistical tests, I would recommend comparing the distributions visually, using simple histograms/bar charts... if the charts look similar then it might well be enough (i.e. you need to worry only when the distributions are significantly different).
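For a quick visual comparison, something like the following could work. The stacked dataset WORK.STACKED, the IN_SAMPLE flag, and the numeric predictor INCOME are illustrative assumptions (GROUP= on the HISTOGRAM statement needs a reasonably recent SAS release):

/* Overlaid histograms of INCOME for the sample vs. the rest */
proc sgplot data=work.stacked;
   histogram income / group=in_sample transparency=0.5;
run;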
Last thing: sampling issues are most likely to affect predictors with skewed distributions, especially categorical ones. Therefore, before doing the sampling, make a note of the categorical inputs with rare levels. Possible workarounds are:
1. Collapse levels before sampling (see the sketch after this list)
2. Increase the % sampled: instead of using a 50-50 split, you may want to take all of the events (the 1%) and 2% or 3% of non-events (i.e. a 1:2 or 1:3 sampling ratio)
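A minimal sketch of workaround (1), assuming a hypothetical character input REGION and a cutoff of 1% for what counts as a rare level:

proc sort data=work.full; by region; run;

/* One row per level with its PERCENT of records */
proc freq data=work.full noprint;
   tables region / out=work.region_freq(keep=region percent);
run;

/* Collapse levels seen in less than 1% of records into one bucket */
data work.full_collapsed;
   merge work.full work.region_freq;
   by region;
   length region_c $32;
   if percent < 1 then region_c = 'OTHER';
   else region_c = region;
   drop percent;
run;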
1) Since the 1% is a simple random sample from the 98%, I would imagine the 1% represents the 98%.
But I would like to sample more data from the 98%, like 3% to 5% (i.e. good:bad = 3:1 up to 5:1).
I remember a paper that tries to use Cluster Analysis to select a 1% sample that better represents the 98%.
2) That is why you need the PProb= option of the MODEL statement, to adjust the predicted probabilities (a sketch of the adjustment follows).
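A minimal sketch of the oversampling correction in Proc Logistic. Note that I am using the PRIOREVENT= option of the SCORE statement here, which is the mechanism I know for rescaling posterior probabilities after oversampling; the dataset and variable names, and the 1% population event rate, are illustrative assumptions:

/* Fit on the oversampled data; BAD=1 is the event */
proc logistic data=work.oversampled;
   model bad(event='1') = x1 x2 x3;
   /* PRIOREVENT= rescales the posterior probabilities back
      to the true 1% event rate in the population */
   score data=work.holdout out=work.scored priorevent=0.01;
run;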
Thanks for the information.
I remember a paper that tries to use Cluster Analysis to select a 1% sample that better represents the 98%.
Could you please give a reference to this paper?