SPR
Quartz | Level 8

Suppose we have a case with a low event rate, say 1%. In a 1:1 oversample we select all events (1% of the data) and an equally sized 1% sample of non-events. Then there are two important questions:

How can we guarantee that this 1% of non-events is a representative sample of the remaining 98% of non-events?

What procedures can be used to test that the distributions of the predictors are the same in the 1% and 98% non-event samples?


7 REPLIES
pvareschi
Quartz | Level 8

This is what I usually do when developing models within the financial sector (mainly scorecards to predict risk of default):

1. Use random sampling to select the 1% non-events; usually that guarantees you get a representative sample.

2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...

3. Usually I do the sampling with PROC SURVEYSELECT (see the sketch after this list).

4. I usually check the 1% sample against the entire "non-events" population; one simple approach is to compare the distributions on key variables, based on a similar rationale as in step (2). Differences can be assessed either with a chi-square test or with Information Value.
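To make steps 1, 3, and 4 concrete, here is a minimal sketch; the dataset and variable names (full_data, target, region) are placeholders of mine, not from the post. The OUTALL option of PROC SURVEYSELECT keeps every non-event and adds a Selected (0/1) flag, which makes the sample-versus-rest comparison easy:

/* Steps 1 and 3: simple random sample of the non-events, sized at ~1% of the
   full file (1% of the file / 99% non-events ~ 1.01% of the non-events).
   OUTALL keeps all non-events, flagged by the Selected variable. */
proc surveyselect data=full_data(where=(target=0))
                  out=nonevents_flagged
                  method=srs
                  samprate=0.0101
                  seed=12345
                  outall;
run;

/* Step 4: chi-square test of sampled (Selected=1) vs remaining (Selected=0)
   non-events on a key variable */
proc freq data=nonevents_flagged;
   tables Selected*region / chisq;
run;

A non-significant chi-square suggests the sampled and remaining non-events are similarly distributed on that variable; repeating the TABLES request for each key variable from step (2) covers the full check.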

SPR
Quartz | Level 8

Thanks for your suggestion.

2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...

Stratification on predictors is a tricky thing. At the variable-selection stage of modeling there can be 50 or more predictors. Assume you bin them into 10 bins per variable; a fully crossed stratification would then have 10^50 cells, so the 1% sample would need at least 10^50 observations to cover every cell, which is an astronomical number.

So it is more practical to compare the distributions in the 1% and 98% samples and to resample if one of the predictors does not pass the test (a sketch of this loop is below).
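As an illustration of that resample-and-test idea, here is a hedged macro sketch; the dataset/variable names, the 0.05 threshold, and the seed scheme are all placeholders of mine, not a standard recipe:

%macro resample_until_ok(maxtries=10);
   %local i ok;
   %let ok = 0;
   %let i  = 0;
   %do %until(&ok = 1 or &i = &maxtries);
      %let i = %eval(&i + 1);
      /* draw a fresh 1% sample of non-events with a new seed */
      proc surveyselect data=nonevents out=flagged outall
                        method=srs samprate=0.0101 seed=%eval(1000 + &i);
      run;
      /* compare sampled vs remaining non-events on a key predictor */
      ods output ChiSq=chisq_res;
      proc freq data=flagged;
         tables Selected*region / chisq;
      run;
      /* pass if the chi-square p-value exceeds 0.05 */
      data _null_;
         set chisq_res(where=(Statistic='Chi-Square'));
         call symputx('ok', (Prob > 0.05));
      run;
   %end;
%mend resample_until_ok;

%resample_until_ok(maxtries=10)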

pvareschi
Quartz | Level 8
(Accepted Solution)

Of course, a high number of predictors would make things very difficult to manage. Just to give you an idea, I have worked on projects with between 500 and 1000 predictors!

In those situations, I still find it useful to apply stratified sampling with only 3, maximum 5, key variables (i.e. inputs that are known/expected to be very important, not only statistically but also from a business/problem context point of view).

Something I forgot to mention in my first reply: before running statistical tests, I would recommend comparing the distributions visually, using simple histograms/bar charts... if the charts look similar, then that might well be enough (i.e. you need to worry only when the distributions are significantly different).

Last thing: issues with sampling are most likely to affect predictors with skewed distributions, especially categorical ones. Therefore, before doing the sampling, make a note of categorical inputs with rare levels. Possible workarounds (see the sketch after this list) are:

1. Collapse levels before sampling.

2. Increase the % sampled: instead of a 50-50 split, you may want to take all 1% of events and 2% or 3% of non-events (i.e. a 1:2 or 1:3 sampling ratio).
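A minimal sketch of both workarounds combined, assuming region and customer_type are among the 3-5 key variables and that 'TYPE_X'/'TYPE_Y' are the rare levels; all names are placeholders:

/* Workaround 1: collapse rare levels before sampling */
data full_data2;
   set full_data;
   if customer_type in ('TYPE_X', 'TYPE_Y') then customer_type = 'OTHER';
run;

/* Workaround 2: stratified 1:3 sample, i.e. all events plus ~3% of the file
   as non-events (3% of the file / 99% non-events ~ 3.03%).
   The STRATA statement requires the input sorted by the strata variables. */
proc sort data=full_data2(where=(target=0)) out=nonevents;
   by region customer_type;
run;

proc surveyselect data=nonevents out=nonevent_sample
                  method=srs samprate=0.0303 seed=42;
   strata region customer_type;   /* proportional allocation within strata */
run;

data oversample_1to3;   /* all events + stratified non-event sample */
   set full_data2(where=(target=1)) nonevent_sample;
run;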

Ksharp
Super User

1) Since the 1% is a simple random sample from the 98%, I would expect the 1% to represent the 98%.
But I would like to sample more data from the 98%, like 5% (i.e. good:bad = 4:1 or 3:1).
I remember there is a paper that tries to use cluster analysis to sample a better 1% to represent the 98%.

2) That is why you need to adjust the predicted probabilities back to the population event rate, e.g. with the PRIOREVENT= option of the SCORE statement in PROC LOGISTIC; a sketch is below. (PPROB= on the MODEL statement only sets classification cutoffs.)
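For reference, a minimal sketch of that adjustment; the dataset and predictor names are placeholders. PRIOREVENT= rescales the predicted probabilities from the oversampled event rate back to the true 1% population rate:

proc logistic data=oversample_1to3;
   class region / param=ref;
   model target(event='1') = region x1 x2;
   /* rescale predicted probabilities to the true 1% event rate */
   score data=holdout out=scored priorevent=0.01;
run;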

SPR
Quartz | Level 8

Thanks for your information.

I remember there is a paper that tries to use cluster analysis to sample a better 1% to represent the 98%.

Could you please give a reference to this paper?

Ksharp
Super User

Sorry, I lost it.

If I remember right, it just samples the 1% from the centers of the clusters (see the sketch below).

Or @StatDave or @Rick_SAS may have some clue.
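Since the paper is lost, the following is only a hypothetical sketch of the cluster-centre idea, not the paper's method; the names and the cluster/size numbers are placeholders. PROC FASTCLUS writes CLUSTER and DISTANCE variables (distance from each observation to its assigned cluster seed) to its OUT= data set, so the points nearest each centre can be kept:

/* cluster the non-events on a few numeric predictors */
proc fastclus data=nonevents maxclusters=50 out=clustered;
   var x1 x2 x3;
run;

proc sort data=clustered;
   by cluster distance;
run;

/* keep the 200 observations nearest each of the 50 centres (~10,000 rows) */
data center_sample;
   set clustered;
   by cluster;
   if first.cluster then n = 0;
   n + 1;
   if n <= 200;
run;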

SPR
Quartz | Level 8

Do you mean clustering observations or variables?
I can imagine the following approach: cluster the predictors and select the best (minimum 1-R^2 ratio) predictor from each cluster, which could dramatically reduce the number of potential predictors. Then create the 1% sample of non-events stratified by those best predictors. A sketch of the variable-clustering step is below.
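A minimal sketch of that idea with PROC VARCLUS; the ODS table and column names (RSquare, RSquareRatio) as well as the data/variable names are my assumptions, so check them against the procedure's output. Within each cluster, the variable with the smallest 1-R^2 ratio is the best single representative:

/* cluster the candidate predictors */
proc varclus data=full_data maxeigen=0.7 short;
   var x1-x500;
   ods output rsquare=rsq;
run;

/* pick the variable with the minimum 1-R**2 ratio in each cluster */
proc sort data=rsq;
   by Cluster RSquareRatio;
run;

data best_per_cluster;
   set rsq;
   by Cluster;
   if first.Cluster;
run;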
