## Logistic Regression large number of combinations

Occasional Contributor
Posts: 6

Hi,

I'm using logistic regression in Enterprise Miner (14.1) to predict a binary outcome.

I have a population of about 500,000 and 20 explanatory variables.

All explanatory variables are nominal with about 3 to 10 possible values each.

Because of the large number of combinations (3 values of var1 * 5 values of var2 * 8 values of var3 * …), each combination contains a very small sub-population, sometimes no more than 2 to 5 observations. These small groups affect the predictions and the stability of the model.
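To see which combinations are affected, one option is to count the observations in each observed combination and flag those below a chosen minimum. This is a minimal sketch in plain Python rather than Enterprise Miner; the helper `small_cells`, the toy rows, and the threshold of 5 are all assumptions made for illustration.

```python
from collections import Counter

def small_cells(rows, min_obs):
    """Return the combinations of variable values seen fewer than min_obs times."""
    counts = Counter(tuple(row) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_obs}

# Toy data (hypothetical): each row holds (var1, var2) for one observation.
rows = (
    [("A", "x")] * 6
    + [("A", "y")] * 2
    + [("B", "x")] * 5
    + [("B", "y")] * 1
)

# Flag every combination with fewer than 5 observations.
flagged = small_cells(rows, min_obs=5)
print(flagged)
```

On real data the rows would come from the modelling table, and the flagged combinations show which levels are candidates for merging.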

Is there a way to force a minimum number of observations in a subgroup (= a combination of variable values) in logistic regression, similar to the minimum number of observations in a leaf of a decision tree?

How should I deal with this problem? Overall, the number of affected observations is small, but since this model is used for credit scoring, these exceptions raise questions from the salespeople who use the model's output.

Best regards
Moshe

Super User
Posts: 19,868

## Re: Logistic Regression large number of combinations

Collapse levels in your variables to ensure you have larger groups. I would do this by looking at which variables have levels that make sense to collapse, rather than relying on a statistical method.

SAS Super FREQ
Posts: 306

## Re: Logistic Regression large number of combinations

You could use the Interactive Binning node to combine all rare levels (based on the frequency percentage you specify) into a single level, or use the Filter node to drop the observations with rare levels.
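The idea of merging rare levels by frequency percentage can be sketched outside Enterprise Miner as well. The Python below is only an illustration of the concept, not how the Interactive Binning node is implemented; the function `collapse_rare_levels`, the `"OTHER"` label, the toy data, and the 5% cutoff are assumptions.

```python
from collections import Counter

def collapse_rare_levels(values, min_pct, other_label="OTHER"):
    """Replace every level whose share of observations is below min_pct percent."""
    counts = Counter(values)
    total = len(values)
    rare = {level for level, n in counts.items() if 100.0 * n / total < min_pct}
    return [other_label if v in rare else v for v in values]

# Toy nominal variable (hypothetical): levels C and D are rare (3% and 2%).
values = ["A"] * 60 + ["B"] * 35 + ["C"] * 3 + ["D"] * 2

collapsed = collapse_rare_levels(values, min_pct=5.0)
print(Counter(collapsed))
```

Here C and D are merged into a single level with 5 observations, while A and B are left unchanged.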

Super User
Posts: 11,343

## Re: Logistic Regression large number of combinations

I saw 20 explanatory variables. Even with just two levels per variable you have roughly 1 million combinations (2^20 = 1,048,576), or about twice the sample size at just one observation per combination. That tells me the design has other issues related to planning. Was this designed to analyze the interactions among 20 variables, or did 20 just happen to be what was available and someone said "dump them all into the model and see what we get"?
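The arithmetic behind that estimate, as a quick check in Python. The sample size of 500,000 comes from the original post; two levels per variable is the most optimistic case, since the actual variables have 3 to 10 levels each.

```python
n_variables = 20
levels_per_variable = 2   # most optimistic case; the real variables have 3 to 10
sample_size = 500_000     # population size given in the original post

# Number of distinct combinations of variable values.
combinations = levels_per_variable ** n_variables
print(combinations)                 # 1048576

# Combinations per available observation.
ratio = combinations / sample_size
print(round(ratio, 2))              # 2.1
```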

Occasional Contributor
Posts: 6

## Re: Logistic Regression large number of combinations

Hi, and thanks for the reply.

It's closer to the second choice.

Is there a method to choose the number of explanatory variables and the number of groups to split each variable into?
Is there a minimum limit for the smallest combination?
Thanks again
Moshe

Super User
Posts: 19,868

## Re: Logistic Regression large number of combinations

moshe_ke wrote:

Is there a method to choose the number of explanatory variables

Yes, there are various selection methods available for model building, and each has its benefits. I would consider doing some introductory modelling in Base SAS before venturing into EM. There are a lot of options available in Base or EG, and they're more user-friendly than EM. Additionally, they have a larger user base, so you'll get questions answered faster.

moshe_ke wrote:

the number of groups to split each variable into?

No. This is always subject- and data-specific. You can use automated tools for binning, but remember that since this is being used with people, you will have to be able to explain the result. 20 variables isn't that many, and you can explore each distribution individually to help make your decisions.

moshe_ke wrote:

Is there a minimum limit for the smallest combination?

No, but there's a rule of thumb: ~25 observations per level of a variable. So if you have 20 variables with 5 levels each, that's 20 * 5 * 25 = 2,500 observations minimum required. BUT that's a generalization, and if your subgroups get smaller when combined, then there are issues with a sparse matrix. If the event you're observing is rare, the numbers need to be bigger.
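That rule of thumb is easy to apply to the real level counts rather than a flat 5 per variable. A small Python sketch follows; the function name `min_observations` is hypothetical, and the default of 25 is just the rough figure quoted above, not a hard requirement.

```python
def min_observations(levels_per_variable, obs_per_level=25):
    """Rough lower bound: obs_per_level observations for every level of every variable."""
    return sum(levels_per_variable) * obs_per_level

# 20 variables with 5 levels each, as in the example above.
required = min_observations([5] * 20)
print(required)  # 2500
```

This counts levels one variable at a time, so it says nothing about how thin the data gets once variables are combined.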

I'm not sure how familiar you are with logistic regression in general, but this is a basic walkthrough:

http://www.ats.ucla.edu/stat/sas/dae/logit.htm

Occasional Contributor
Posts: 6

## Re: Logistic Regression large number of combinations

Thanks again,

I used several of your comments to improve the model.

The best model for my population turned out to be the forest, with AUC = 0.81 and misclassification rate = 0.058.

Do these results reflect good predictive ability? I don't have a benchmark.

Best regards,

Moshe
