moshe_ke
Fluorite | Level 6

Hi,

I'm using logistic regression in Enterprise Miner (14.1) to predict a binary outcome.

I have a population of about 500,000 and 20 explanatory variables.

All explanatory variables are nominal with about 3 to 10 possible values each.

Due to the large number of combinations (3 values of var1 * 5 values of var2 * 8 values of var3 * ….), each combination contains a very small sub-population, sometimes no more than 2 to 5 observations. These small groups affect the predictions and the stability of the model.

Is there a way to force a minimum number of observations in a subgroup (= a combination of variable values) in logistic regression, something similar to the minimum number of observations in a leaf of a decision tree?

How should I deal with this problem? Overall, the number of affected observations is small, but since this model is used for credit scoring, these exceptions do raise questions from the salespeople who use the model's output.

Best regards
Moshe

Reeza
Super User

Collapse levels in your variables to ensure you have larger groups. I would apply this by looking at vars that make sense to collapse rather than a statistical method. 

WendyCzika
SAS Employee

You could use the Interactive Binning node to combine all rare levels (based on the freq. percentage you specify) into a single level, or use the Filter node to drop the obs. with rare levels.  
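To illustrate the idea of combining rare levels by a frequency percentage, here is a minimal sketch in plain Python. It is not the Interactive Binning node's algorithm; the function name `collapse_rare_levels`, the `_OTHER_` label, the 5% cutoff, and the `region` data are all assumptions made up for the example.

```python
from collections import Counter

def collapse_rare_levels(values, min_pct=5.0, other_label="_OTHER_"):
    """Replace every level whose share of rows is below min_pct with other_label."""
    counts = Counter(values)
    n = len(values)
    # Levels that account for less than min_pct percent of all rows.
    rare = {level for level, c in counts.items() if 100.0 * c / n < min_pct}
    return [other_label if v in rare else v for v in values]

# Hypothetical nominal variable: "east" (3%) and "west" (2%) are rare.
region = ["north"] * 50 + ["south"] * 45 + ["east"] * 3 + ["west"] * 2
collapsed = collapse_rare_levels(region, min_pct=5.0)
print(Counter(collapsed))
```

In practice the cutoff is a judgment call, and merging levels that belong together for business reasons is usually easier to explain than a purely frequency-based merge.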

ballardw
Super User

You have 20 explanatory variables. Even with just two levels per variable that gives roughly 1 million combinations (2^20 = 1,048,576), or about twice your sample size at only one observation per combination. That tells me the design has other issues related to planning. Was this designed to analyze the interactions among 20 variables, or did 20 just happen to be what was available and someone said "dump them all into the model and see what we get"?
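The arithmetic behind this point can be checked with a short Python sketch. The sample size and the 3-to-10 range of levels come from the original post; the variable names are just for illustration.

```python
n_obs = 500_000      # population size from the original post
n_vars = 20          # number of nominal explanatory variables

# Hypothetical best case: every variable has only two levels.
cells_binary = 2 ** n_vars
print(cells_binary)            # 1048576
print(cells_binary / n_obs)    # cells per observation, about 2.1

# The post says each variable has 3 to 10 levels, so the real
# number of possible combinations lies between these two bounds.
cells_min = 3 ** n_vars
cells_max = 10 ** n_vars
print(cells_min, cells_max)
```

Even the lower bound is far larger than the number of observations, so most combinations are necessarily empty or nearly empty.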

 

 

moshe_ke
Fluorite | Level 6

Hi, and thanks for the reply,

It's closer to the second case.


Is there a method to choose the number of explanatory variables and the number of groups to split each variable into?
Is there a minimum size for the smallest combination?
Thanks again
Moshe

Reeza
Super User

@moshe_ke wrote:

 


Is there a method to choose the number of explanatory variables


Yes, there are various selection methods available for model building, and each has its benefits. I would consider doing some introductory modelling in Base SAS before venturing into EM. There are a lot of options available in Base or EG, and they're more user-friendly than EM. Additionally, they have a larger user base, so you'll get questions answered faster.

 


@moshe_ke wrote:

the number of groups to split each variable into?


No. This is always subject- and data-specific. You can use automated tools for binning, but remember that since the model is being used on people, you're going to have to be able to explain the bins. Twenty variables isn't that many, so you can explore each distribution individually to help make your decisions.

 


@moshe_ke wrote:


Is there a minimum size for the smallest combination?


No, but there's a rule of thumb: about 25 observations per level of a variable. So if you have 20 variables with 5 levels each, that's 20 * 5 * 25 = 2500 observations minimum. But that's a generalization: if your subgroups get smaller when the variables are combined, you run into sparse-matrix issues, and if the event you're modelling is rare, the numbers need to be bigger.
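As a rough illustration of the rule of thumb and of why it can still leave sparse combinations, here is a small Python sketch. The function names, the two-variable toy data, and the reuse of 25 as a cell threshold are assumptions for the example, not part of any SAS procedure.

```python
from collections import Counter

def rule_of_thumb_min_obs(levels_per_variable, obs_per_level=25):
    """Minimum sample size under the ~25-observations-per-level rule of thumb."""
    return sum(levels_per_variable) * obs_per_level

# 20 variables with 5 levels each, as in the reply above.
print(rule_of_thumb_min_obs([5] * 20))  # 2500

def sparse_cells(rows, min_count=25):
    """Return observed combinations of values with fewer than min_count rows.

    Combinations that never occur in the data are not listed.
    """
    counts = Counter(tuple(r) for r in rows)
    return {cell: c for cell, c in counts.items() if c < min_count}

# Hypothetical data: every single level has at least 25 rows
# (a=70, b=30, x=68, y=32), yet one combination has only 2.
rows = ([("a", "x")] * 40 + [("a", "y")] * 30 +
        [("b", "x")] * 28 + [("b", "y")] * 2)
print(sparse_cells(rows))  # {('b', 'y'): 2}
```

The toy data satisfies the per-level rule but still contains a two-observation cell, which is the situation described in the original question.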

 

I'm not sure how familiar you are with logistic regression in general, but this is a basic walkthrough:

http://www.ats.ucla.edu/stat/sas/dae/logit.htm

moshe_ke
Fluorite | Level 6

Thanks again,

 

I used several of your comments to improve the model.

 

The best model for my population was the forest, with AUC = 0.81 and misclassification rate = 0.058.

 

Do these results reflect good predictive ability of the model? I don't have a benchmark.

 

Best regards,

 

Moshe