Ronein
Onyx | Level 15
Hello
Let's say that I want to build a logistic regression model that fits the PD (probability of default) for a specific population in a bank. Let's say that the bank's policy is that all explanatory variables must be categorical.
Let's say that I have a list of potential explanatory variables; some of them are numerical and some are categorical.
Question 1:
I want to categorize the numerical explanatory variables.
What technique is recommended for determining how many categories there should be and how to define each category?
Question 2:
After categorizing the potential explanatory variables, I want to put all of them in and find the combination that provides the best model.
Let's say that the criterion for the best model is Gini. How can I do this in SAS (give SAS all the potential explanatory variables and have SAS tell me the combination that gives the maximum Gini)?

Thank you
1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

I would pick “4 categories” if they have the same IV.
More groups keep more detail about the variable, i.e. you retain more information about the variable (less information is lost).

P.S. Fewer groups lose more information about the variable, which is the reason statisticians refuse to bin variables, as @Rick_SAS said before. But for a scorecard, binning makes the model easier to explain.
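
To make the "fewer groups lose information" point concrete, here is a minimal sketch of the usual IV calculation (sum over categories of (%good − %bad) × WOE, with WOE = ln(%good / %bad)). It is plain Python rather than SAS, the good/bad counts are made up for illustration, and the function name `information_value` is my own. It compares a 4-category grouping with the same data after merging the two middle categories.

```python
import math

def information_value(bins):
    """Compute IV from a list of (n_good, n_bad) counts, one pair per category.

    Assumes every count is > 0 (a zero count would make WOE undefined).
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        pct_good = good / total_good   # share of all goods in this category
        pct_bad = bad / total_bad      # share of all bads in this category
        woe = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe
    return iv

# Hypothetical (goods, bads) counts for one variable cut into 4 categories.
four_bins = [(100, 40), (200, 30), (300, 20), (400, 10)]
# The same data with the two middle categories merged into one.
three_bins = [(100, 40), (500, 50), (400, 10)]

iv4 = information_value(four_bins)
iv3 = information_value(three_bins)
print(f"IV with 4 categories: {iv4:.4f}")
print(f"IV with 3 categories: {iv3:.4f}")
```

Merging categories can leave IV unchanged or lower it, but never raise it, so when two groupings report a similar IV the finer one has lost no more information than the coarser one. Whether the extra category is worth the added scorecard complexity is still a judgment call.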

PaigeMiller
Diamond | Level 26

There's no such thing as a universally accepted method here, there's no such thing as "best", and while there might be theoretically such a thing as the model that has the highest "Gini", you may never find it, as there are too many possibilities so that you can't try them all.
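
For reference, Gini for a binary target is usually taken as 2 × AUC − 1 (for a binary outcome this is the same quantity as the Somers' D that PROC LOGISTIC reports next to the c statistic). The sketch below is plain Python, not SAS; the scores and default flags are made up, and both function names are hypothetical. It computes Gini by brute-force pair comparison and counts how many variable subsets an exhaustive search would have to fit.

```python
def gini_from_scores(scores, defaults):
    """Gini = 2 * AUC - 1, with AUC from comparing every (defaulter, non-defaulter) pair.

    scores: predicted PDs; defaults: 1 if the account defaulted, else 0.
    """
    bad_scores = [s for s, d in zip(scores, defaults) if d == 1]
    good_scores = [s for s, d in zip(scores, defaults) if d == 0]
    concordant = 0.0
    for bad in bad_scores:
        for good in good_scores:
            if bad > good:
                concordant += 1.0   # defaulter was scored riskier: concordant pair
            elif bad == good:
                concordant += 0.5   # ties count as half
    auc = concordant / (len(bad_scores) * len(good_scores))
    return 2.0 * auc - 1.0

def count_candidate_models(n_variables):
    """Number of non-empty subsets of n_variables candidate variables."""
    return 2 ** n_variables - 1

# Hypothetical predicted PDs and observed default flags.
pd_scores = [0.05, 0.10, 0.20, 0.30, 0.60, 0.80]
defaulted = [0, 0, 1, 0, 1, 1]

gini = gini_from_scores(pd_scores, defaulted)
print(f"Gini: {gini:.4f}")
print(f"Subsets of 30 variables: {count_candidate_models(30):,}")
```

With 30 candidate variables there are already more than a billion subsets to compare, and that is before multiplying by the different ways each variable could be binned.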

 

Each step of the way produces too many choices/options that you can try, and so you can't realistically try them all. For example, each step requires decisions:

 

  • What to do about missing values
  • What to do about outliers
  • What binning method
  • What options in the binning
  • What modeling method (stepwise, not-stepwise, decision trees, gradient boosting, random forest, neural network, Partial Least Squares)
  • What options within the method

Recently, I was able to fit 12 different models, because in SAS Enterprise Miner or SAS Viya Model Studio you can do this relatively quickly. Once you learn the interface, the selection of different modeling methods and options goes relatively quickly. I was able to do this in about 2.5 hours (including creating the diagram, detecting and handling outliers, imputing missing values, running all the models and then comparing them). I then added two models using Logistic Partial Least Squares (which is not available in SAS). But ... although I fit 14 models in total, perhaps the 15th one that I didn't try would have been better. I will never know. It is impossible to know.

 

I wound up choosing simple outlier strategies and simple missing value strategies (I didn't do binning, but if you are going to, make a choice and go with it). For all these decisions, select one or two methods and go with them. Don't try to model every possible choice of binning, outliers, missing values and stepwise or other options.

 

To do the binning, you can try PROC HPBIN (or if you have Enterprise Miner or Model Studio, there is an equivalent node), but you have to select the proper method of binning and the proper options within that method.
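
As an illustration of what one common binning choice does, here is a minimal sketch of equal-frequency (quantile) binning. It is plain Python, not a reproduction of PROC HPBIN or its options; the income values are made up, and the helper names `quantile_cut_points` and `assign_bin` are hypothetical.

```python
import bisect
import statistics

def quantile_cut_points(values, n_bins):
    """Return n_bins - 1 cut points that split values into roughly equal-sized groups."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")

def assign_bin(value, cut_points):
    """Return the 0-based bin index; a value equal to a cut point goes to the lower bin."""
    return bisect.bisect_left(cut_points, value)

# Hypothetical numeric explanatory variable (e.g. income in thousands).
incomes = [12, 15, 18, 22, 25, 31, 36, 40, 47, 55, 63, 90]

cuts = quantile_cut_points(incomes, 4)
binned = [assign_bin(x, cuts) for x in incomes]

print("Cut points:", cuts)
print("Bin per observation:", binned)
```

Equal-frequency bins are only one option; equal-width bins or bins chosen against the default flag would give different cut points, which is exactly the kind of choice you have to make and stick with.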

 

Unlike the question of which modeling method to use (where there is no universal agreement), I think there is almost universal agreement that you should NOT put all variables into the model. There needs to be some variable selection/reduction step, unless you use something like Stepwise or Logistic Partial Least Squares, in which case a separate variable selection step is not needed. Stepwise, however, has its own set of issues; if you search for "problems with Stepwise Regression" you will see what I mean.

 

I mentioned Enterprise Miner and Viya Model Studio. If you are going to program this yourself ... not recommended. You might as well block off the next three months to get all this programmed yourself, and plan to work through lunch and pull your hair out.

 

In a recent thread, someone else asked why is SAS still used and not Python or R. This is an example where SAS has major advantages over programming languages such as Python and R (not mentioned in that thread).

 

 

 

 

--
Paige Miller
Ronein
Onyx | Level 15
I want to categorize a specific potential explanatory variable.
The criterion that I check is IV (Information Value).
Is it better to have more categories or fewer categories (if both provide similar IV)? For example, grouping into 3 categories or 4 categories provides similar IV. Which option is better to use as an explanatory variable in the model?

PaigeMiller
Diamond | Level 26

@Ronein wrote:

Is it better to have more categories or fewer categories (if both provide similar IV)? For example, grouping into 3 categories or 4 categories provides similar IV. Which option is better to use as an explanatory variable in the model?


I don't think there is a general answer to this question. I think the answer depends on the data, and so you should try it different ways.

--
Paige Miller