SPR
Quartz | Level 8

Suppose we have a case with a low event rate, say 1%. In a 1:1 oversample we select all events (1% of the data) and an equally sized 1% sample of non-events. Then there are two important questions:

How can we guarantee that this 1% of non-events is a representative sample of the remaining 98% of non-events?

What procedures can be used to test that the distributions of the predictors are the same in the 1% and 98% non-event samples?


7 REPLIES
pvareschi
Quartz | Level 8

This is what I usually do when developing models within the financial sector (mainly scorecards to predict risk of default):

1. Use random sampling to select the 1% non-events; usually that guarantees you get a representative sample.

2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...

3. Usually I do the sampling with PROC SURVEYSELECT (see the sketch after this list).

4. I usually check the 1% sample against the entire "non-events" population; one simple approach is to compare the distributions on key variables, based on a similar rationale as in step (2). Differences can be assessed either with a chi-square test or with Information Value.
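To make steps 1, 3, and 4 concrete, here is a minimal sketch; the dataset and variable names (full_data, target, region) are placeholders of mine, not from the post. The OUTALL option of PROC SURVEYSELECT keeps every non-event and adds a Selected (0/1) flag, which makes the sample-versus-rest comparison easy:

/* Steps 1 and 3: simple random sample of the non-events, sized at ~1% of the
   full file (1% of the file / 99% non-events ~ 1.01% of the non-events).
   OUTALL keeps all non-events, flagged by the Selected variable. */
proc surveyselect data=full_data(where=(target=0))
                  out=nonevents_flagged
                  method=srs
                  samprate=0.0101
                  seed=12345
                  outall;
run;

/* Step 4: chi-square test of sampled (Selected=1) vs remaining (Selected=0)
   non-events on a key variable */
proc freq data=nonevents_flagged;
   tables Selected*region / chisq;
run;

A non-significant chi-square suggests the sampled and remaining non-events are similarly distributed on that variable; repeating the TABLES request for each key variable from step (2) covers the full check.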

SPR
Quartz | Level 8

Thanks for your suggestion.

2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...

Stratification on predictors is a tricky thing. At the variable-selection stage of modeling there can be 50 or more predictors. Assume you bin them into 10 bins per variable; a fully crossed stratification would then have 10^50 cells, so the 1% sample would need at least 10^50 observations to cover every cell, which is an astronomical number.

So it is more practical to compare the distributions in the 1% and 98% samples and to resample if one of the predictors does not pass the test (a sketch of this loop is below).
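As an illustration of that resample-and-test idea, here is a hedged macro sketch; the dataset/variable names, the 0.05 threshold, and the seed scheme are all placeholders of mine, not a standard recipe:

%macro resample_until_ok(maxtries=10);
   %local i ok;
   %let ok = 0;
   %let i  = 0;
   %do %until(&ok = 1 or &i = &maxtries);
      %let i = %eval(&i + 1);
      /* draw a fresh 1% sample of non-events with a new seed */
      proc surveyselect data=nonevents out=flagged outall
                        method=srs samprate=0.0101 seed=%eval(1000 + &i);
      run;
      /* compare sampled vs remaining non-events on a key predictor */
      ods output ChiSq=chisq_res;
      proc freq data=flagged;
         tables Selected*region / chisq;
      run;
      /* pass if the chi-square p-value exceeds 0.05 */
      data _null_;
         set chisq_res(where=(Statistic='Chi-Square'));
         call symputx('ok', (Prob > 0.05));
      run;
   %end;
%mend resample_until_ok;

%resample_until_ok(maxtries=10)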

pvareschi
Quartz | Level 8
(Accepted Solution)

Of course, a high number of predictors would make things very difficult to manage. Just to give you an idea, I have worked on projects with between 500 and 1000 predictors!

In those situations, I still find it useful to apply stratified sampling with only 3, maximum 5, key variables (i.e. inputs that are known/expected to be very important, not only statistically but also from a business/problem context point of view).

Something I forgot to mention in my first reply: before running statistical tests, I would recommend comparing the distributions visually, using simple histograms/bar charts... if the charts look similar, then that might well be enough (i.e. you need to worry only when the distributions are significantly different).

Last thing: issues with sampling are most likely to affect predictors with skewed distributions, especially categorical ones. Therefore, before doing the sampling, make a note of categorical inputs with rare levels. Possible workarounds (see the sketch after this list) are:

1. Collapse levels before sampling.

2. Increase the % sampled: instead of a 50-50 split, you may want to take all 1% of events and 2% or 3% of non-events (i.e. a 1:2 or 1:3 sampling ratio).
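A minimal sketch of both workarounds combined, assuming region and customer_type are among the 3-5 key variables and that 'TYPE_X'/'TYPE_Y' are the rare levels; all names are placeholders:

/* Workaround 1: collapse rare levels before sampling */
data full_data2;
   set full_data;
   if customer_type in ('TYPE_X', 'TYPE_Y') then customer_type = 'OTHER';
run;

/* Workaround 2: stratified 1:3 sample, i.e. all events plus ~3% of the file
   as non-events (3% of the file / 99% non-events ~ 3.03%).
   The STRATA statement requires the input sorted by the strata variables. */
proc sort data=full_data2(where=(target=0)) out=nonevents;
   by region customer_type;
run;

proc surveyselect data=nonevents out=nonevent_sample
                  method=srs samprate=0.0303 seed=42;
   strata region customer_type;   /* proportional allocation within strata */
run;

data oversample_1to3;   /* all events + stratified non-event sample */
   set full_data2(where=(target=1)) nonevent_sample;
run;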

Ksharp
Super User

1) Since the 1% is a simple random sample from the 98%, I would expect the 1% to represent the 98%.
But I would like to sample more data from the 98%, like 5% (i.e. good:bad = 4:1 or 3:1).
I remember there is a paper that tries to use cluster analysis to sample a better 1% to represent the 98%.

2) That is why you need to adjust the predicted probabilities back to the population event rate, e.g. with the PRIOREVENT= option of the SCORE statement in PROC LOGISTIC; a sketch is below. (PPROB= on the MODEL statement only sets classification cutoffs.)
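For reference, a minimal sketch of that adjustment; the dataset and predictor names are placeholders. PRIOREVENT= rescales the predicted probabilities from the oversampled event rate back to the true 1% population rate:

proc logistic data=oversample_1to3;
   class region / param=ref;
   model target(event='1') = region x1 x2;
   /* rescale predicted probabilities to the true 1% event rate */
   score data=holdout out=scored priorevent=0.01;
run;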

SPR
Quartz | Level 8

Thanks for your information.

I remember there is a paper that tries to use cluster analysis to sample a better 1% to represent the 98%.

Could you please give a reference to this paper?

Ksharp
Super User

Sorry, I lost it.

If I remember right, it just samples the 1% from the centers of the clusters (see the sketch below).

Or @StatDave or @Rick_SAS may have some clue.
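Since the paper is lost, the following is only a hypothetical sketch of the cluster-centre idea, not the paper's method; the names and the cluster/size numbers are placeholders. PROC FASTCLUS writes CLUSTER and DISTANCE variables (distance from each observation to its assigned cluster seed) to its OUT= data set, so the points nearest each centre can be kept:

/* cluster the non-events on a few numeric predictors */
proc fastclus data=nonevents maxclusters=50 out=clustered;
   var x1 x2 x3;
run;

proc sort data=clustered;
   by cluster distance;
run;

/* keep the 200 observations nearest each of the 50 centres (~10,000 rows) */
data center_sample;
   set clustered;
   by cluster;
   if first.cluster then n = 0;
   n + 1;
   if n <= 200;
run;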

SPR
Quartz | Level 8

Do you mean clustering observations or variables?
I can imagine the following approach: cluster the predictors and select the best (minimum 1-R^2 ratio) predictor from each cluster, which could dramatically reduce the number of potential predictors. Then create the 1% sample of non-events stratified by those best predictors. A sketch of the variable-clustering step is below.
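A minimal sketch of that idea with PROC VARCLUS; the ODS table and column names (RSquare, RSquareRatio) as well as the data/variable names are my assumptions, so check them against the procedure's output. Within each cluster, the variable with the smallest 1-R^2 ratio is the best single representative:

/* cluster the candidate predictors */
proc varclus data=full_data maxeigen=0.7 short;
   var x1-x500;
   ods output rsquare=rsq;
run;

/* pick the variable with the minimum 1-R**2 ratio in each cluster */
proc sort data=rsq;
   by Cluster RSquareRatio;
run;

data best_per_cluster;
   set rsq;
   by Cluster;
   if first.Cluster;
run;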
