ammarhm
Lapis Lazuli | Level 10

Hi everyone,

We frequently use logistic regression or log-binomial regression to fit models with binary outcome variables.

I am interested in building a scoring system based on a logistic regression model fitted with a number of independent variables (some continuous, some categorical).

From my reading, you assign weights based on the odds ratios of the independent predictors.

But there are many other aspects to consider: model fitting, model performance (pseudo R-squared?), and how best to split continuous variables into categorical ones, to mention a few.

Ultimately, I am trying to use a model based on observed data to predict outcomes in external data.

Could anyone please suggest good reading on the topic, or share sample code or a macro showing how to do this?

Thanks. 

 

1 ACCEPTED SOLUTION
Reeza
Super User

Based on your response, my initial answer was correct. Logistic regression itself will generate a probability of developing lung disease. Assuming you have the same variables for new individuals that you used to build the model, you can then 'score' those new people and determine their probability of developing lung disease.

 

Different ways of scoring data after your model is developed are covered here:

https://blogs.sas.com/content/iml/2014/02/19/scoring-a-regression-model-in-sas.html

 

An alternative approach is a decision tree, as it gives you splits that read more like human-readable rules.

If you have access to the HP (high-performance) procedures, check out this post. I'd actually recommend working through it if you can.

https://blogs.sas.com/content/sgf/2020/08/27/build-a-decision-tree-in-sas/

 


@ammarhm wrote:
I see your point. So you prefer using the predictions from the logistic model and entering continuous variables as is? How do you save the model for future use for prediction?

Good Luck!

16 REPLIES
Ksharp
Super User

Calling @Rick_SAS 

Here is my paper about WOE (weight of evidence) grouping.

The attachment is the famous German credit card dataset used in my paper.


"Get Better Weight of Evidence for Scorecards Using a Genetic Algorithm"
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1808-2018.pdf

PaigeMiller
Diamond | Level 26

"Intelligent Credit Scoring" by Naeem Siddiqi

--
Paige Miller
PaigeMiller
Diamond | Level 26

Also, SAS has a Credit Scoring product (that might not be the exact name) written by Naeem Siddiqi based on his book, so you wouldn't have to write your own code to do this.

--
Paige Miller
Ksharp
Super User
Yes. It is under SAS Enterprise Miner (SAS/EM), but you need to buy it separately.
Reeza
Super User

From my reading you assign weights based on the odds ratios of the independent predictors.

I'm not sure what that means. The model gives you parameter estimates that you then use to create the prediction. 
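
To make the link between parameter estimates, odds ratios, and the prediction concrete, here is a minimal Python sketch. The intercept, the coefficient values, and the predictor names are invented for illustration and do not come from any fitted model; the only facts relied on are that the predicted probability is the inverse logit of the linear predictor and that an odds ratio is the exponential of a coefficient.

```python
import math

# Hypothetical parameter estimates on the log-odds scale (not from a real model).
INTERCEPT = -4.0
COEFFICIENTS = {"age": 0.05, "female": 0.3, "family_history": 0.9}


def predicted_probability(person):
    """Return P(outcome = 1) for one person via the inverse logit."""
    linear_predictor = INTERCEPT + sum(
        beta * person[name] for name, beta in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def odds_ratios():
    """Odds ratio for a one-unit increase in each predictor: exp(beta)."""
    return {name: math.exp(beta) for name, beta in COEFFICIENTS.items()}


person = {"age": 60, "female": 1, "family_history": 1}
print(round(predicted_probability(person), 3))
print({name: round(value, 2) for name, value in odds_ratios().items()})
```

So the odds ratios are just a re-expression of the same estimates; the prediction itself is built from the coefficients and the intercept.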

 

You're looking for a tutorial on logistic regression in SAS?

My suggestion would be first to work through the basic examples (the first 3 or 4) in the Logistic Regression documentation to make sure your code is working and you get the same results.

https://documentation.sas.com/?docsetId=statug&docsetVersion=15.1&docsetTarget=statug_logistic_examp...

 

Then search lexjansen.com for papers on logistic regression. 

 

Last but not least, the first SAS Statistics e-course is free and covers logistic regression (I think; don't hold me to that, but the table of contents should list it).

 



PaigeMiller
Diamond | Level 26

@Reeza wrote:

From my reading you assign weights based on the odds ratios of the independent predictors.

I'm not sure what that means. The model gives you parameter estimates that you then use to create the prediction. 

 

You're looking for a tutorial on logistic regression in SAS?



Well, that's not my impression from reading the original post. I think the OP is talking about Credit Scoring, which involves logistic regression (or other modeling, like Genetic Algorithm, Partial Least Squares, etc.), after which another methodology is applied to the model to obtain the scoring model. I think the OP is asking for help with the Credit Scoring part, not the logistic regression part.

--
Paige Miller
Reeza
Super User
You're probably right. Is the credit scoring assumption based on other questions?
PaigeMiller
Diamond | Level 26

Let's let @ammarhm clarify the scope of his question.

--
Paige Miller
ammarhm
Lapis Lazuli | Level 10
Thanks everyone for your help.
I guess credit scoring is similar to what I am trying to do, but in a medical research area. Basically, I have a dataset with 150 observations; the outcome is lung disease (yes/no), and the independent predictors include age, sex, weight, height, family history, and a genetic risk factor. I am comfortable fitting a logistic regression model, but I was then thinking of using the parameter estimates or odds ratios to build a scoring system to predict lung disease in external cases.
Ideally, I thought you might be able to come up with a point-based formula (such as 0 points for age below 30, 1 point for age 30 to 60, 2 points for age above 60, 0 points for male, 1 point for female, etc.), then sum the points across the independent predictors to get a score, and have a cutoff (a number like 7) above which the model predicts lung disease.
I hope this makes it clearer?
Part of the question relates to how to split continuous variables for such a scoring system, how to assess the predictive performance of the scoring system, and how to find the cutoff for the total score that indicates the presence of lung disease.
Thanks everyone again
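
A rough Python sketch of the point-based idea described above, for anyone following along. Everything in it is an assumption made for illustration: the age bands (age exactly 30 is placed in the 1-point band), the 2 points for family history, the six made-up cases, and the choice of Youden's J (sensitivity + specificity - 1) as the criterion for picking the cutoff. A score at or above the cutoff is treated as predicting disease.

```python
def points(person):
    """Sum hypothetical points for one person; bands mirror the example above."""
    total = 0
    if person["age"] > 60:
        total += 2
    elif person["age"] >= 30:
        total += 1
    if person["sex"] == "F":
        total += 1
    if person["family_history"]:
        total += 2  # assumed weight, not taken from the thread
    return total


def sensitivity_specificity(scores, outcomes, cutoff):
    """Classify score >= cutoff as disease and compare with observed outcomes."""
    pairs = list(zip(scores, outcomes))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    tn = sum(1 for s, y in pairs if s < cutoff and y == 0)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores, outcomes):
    """Return the observed score value that maximises Youden's J."""
    def youden(cutoff):
        sens, spec = sensitivity_specificity(scores, outcomes, cutoff)
        return sens + spec - 1
    return max(sorted(set(scores)), key=youden)


# Made-up cases: 1 = lung disease, 0 = no lung disease.
people = [
    {"age": 25, "sex": "M", "family_history": False},
    {"age": 45, "sex": "F", "family_history": False},
    {"age": 35, "sex": "M", "family_history": True},
    {"age": 65, "sex": "M", "family_history": False},
    {"age": 70, "sex": "F", "family_history": True},
    {"age": 50, "sex": "M", "family_history": False},
]
outcomes = [0, 0, 1, 0, 1, 1]
scores = [points(p) for p in people]

cutoff = best_cutoff(scores, outcomes)
print(scores, cutoff, sensitivity_specificity(scores, outcomes, cutoff))
```

With real data, the cutoff would need to be checked on cases that were not used to choose it, since a cutoff tuned on 150 observations can look better than it is.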
PaigeMiller
Diamond | Level 26

That's certainly the same as what Credit Scoring does. As I am not involved in medical research, I am not aware whether this type of scoring is done in that field.

 

But ... coming from a traditional statistics background, I always felt somewhat uneasy about the credit scoring model, as it seems to create artificial boundaries. For example, if your income is above $75K you get so many points, but if it is under $75K you get fewer points, and so a person whose income is $74,999 is treated noticeably differently from someone whose income is $75,001, rather than these two people being treated virtually identically. The same would hold true in a medical application, where instead of income some other variable that might make sense in a medical study (e.g. weight) has a boundary. I still think that using the logistic regression (assuming it fits reasonably well) would be better than these artificial boundaries, and so the person who earns $74,999 and the person who earns $75,001 would have almost equal predicted probabilities.

 

I think the credit scoring model, with artificial boundaries, is accepted because it is easier for most people to understand than a logistic regression prediction, and thus I see a benefit. Nevertheless, I once asked Naeem Siddiqi whether he was uncomfortable with these artificial boundaries, and he said that Credit Scoring was all about setting boundaries, and I didn't press him on this point. I can see a bank (like the one I work at) deciding that it is not comfortable lending money to people whose incomes are less than X thousand dollars a year; that makes perfect sense, using business judgment to set a boundary. But data analysis creates a different kind of boundary (the one I discussed above), where two people with almost identical credit (in terms of income) are treated differently. I don't see a reason for data analysis to create this boundary. And I don't see a convincing reason to create such boundaries in the medical field, but like I said, I have no experience with data analysis in the medical field.
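
The boundary effect for incomes of $74,999 and $75,001 can be illustrated with a small Python sketch. The intercept, the income coefficient, and the point values are hypothetical numbers picked only so the contrast is easy to see; they are not from any real scorecard or fitted model.

```python
import math

# Hypothetical values chosen only to illustrate the boundary effect.
INTERCEPT = -3.0
BETA_PER_1000_DOLLARS = 0.04  # assumed log-odds change per $1,000 of income
INCOME_BOUNDARY = 75_000
POINTS_ABOVE, POINTS_BELOW = 20, 10


def probability(income):
    """Predicted probability with income entered as a continuous variable."""
    linear_predictor = INTERCEPT + BETA_PER_1000_DOLLARS * income / 1000
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def income_points(income):
    """Scorecard-style points: a step change at the boundary."""
    return POINTS_ABOVE if income > INCOME_BOUNDARY else POINTS_BELOW


for income in (74_999, 75_001):
    print(income, round(probability(income), 4), income_points(income))
```

The two predicted probabilities differ only far out in the decimal places, while the points jump by the full difference between the two bands.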

--
Paige Miller
ammarhm
Lapis Lazuli | Level 10
I see your point. So you prefer using the predictions from the logistic model and entering continuous variables as is? How do you save the model for future use for prediction?
PaigeMiller
Diamond | Level 26

PROC LOGISTIC offers the STORE statement and the CODE statement, both of which save the results of the logistic regression so that they can be applied to new data. Either method requires SAS, whereas a point-based scoring model could be applied to new data entirely without SAS and would be relatively easy to program in any language. But really, a logistic regression model that scores new data from coefficients you provide could also be programmed without SAS. I believe that SAS Viya and SAS Enterprise Miner can also create an API, which is a method of scoring new data entirely without SAS. SAS can also create PMML, which allows predictive models to be shared between different software packages.
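
As a sketch of the "programmed without SAS" route, the Python below assumes the parameter estimates have been written out by hand or by some export step into a small JSON document. The JSON layout, the variable names, and the numbers are all invented here; this is not a format that SAS is known to produce.

```python
import json
import math

# Hypothetical saved model: layout and values invented for illustration.
MODEL_JSON = """
{
  "intercept": -5.0,
  "coefficients": {"age": 0.04, "weight": 0.01, "genetic_risk": 1.2}
}
"""


def load_model(text):
    """Parse the saved intercept and coefficients."""
    model = json.loads(text)
    return model["intercept"], model["coefficients"]


def score(records, intercept, coefficients):
    """Return the predicted probability for each new record."""
    probabilities = []
    for record in records:
        linear_predictor = intercept + sum(
            beta * record[name] for name, beta in coefficients.items()
        )
        probabilities.append(1.0 / (1.0 + math.exp(-linear_predictor)))
    return probabilities


intercept, coefficients = load_model(MODEL_JSON)
new_cases = [
    {"age": 50, "weight": 80, "genetic_risk": 1},
    {"age": 30, "weight": 60, "genetic_risk": 0},
]
for probability in score(new_cases, intercept, coefficients):
    print(round(probability, 3))
```

The new records must use the same variables, units, and coding as the data the model was fitted on, otherwise the probabilities are meaningless.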

--
Paige Miller
ammarhm
Lapis Lazuli | Level 10

Thanks,

It is disappointing that there is no simple way to export the model from SAS. Other software packages and programming platforms have such capabilities. Building a model is great, but it is more important to put it into use and allow others to benefit from it. Maybe something to consider for the wish list?
