Hi everyone,
We frequently use logistic regression or log binomial regression to fit models with binary outcome variables.
I am interested in building a scoring system based on a logistic regression model fitted with a number of independent variables (some continuous, some categorical).
From my reading, you assign weights based on the odds ratios of the independent predictors.
But there are many other aspects to consider: model fitting, model performance (pseudo R-squared?), and how best to split continuous variables into categorical ones, to mention a few.
Ultimately, I am trying to use a model based on observed data to give predictions on outcomes of external data.
Could anyone please suggest some good reading on the topic, or maybe share sample code or a macro showing how to do this?
Thanks.
Based on your response, my initial answer was correct. Logistic regression itself will generate a probability of developing lung disease. Assuming you have the same variables for new individuals that you used to create the model, you could then 'score' new people and determine their probability of developing lung disease.
Different ways of scoring data after your model is developed are covered here:
https://blogs.sas.com/content/iml/2014/02/19/scoring-a-regression-model-in-sas.html
An alternative approach is a decision tree, as it gives you splits that read more like human-readable rules.
If you have access to the HP procedures, check out this post; I'd actually recommend working through it if you can.
https://blogs.sas.com/content/sgf/2020/08/27/build-a-decision-tree-in-sas/
@ammarhm wrote:
I see your point. So you prefer using the predictions from the logistic model and entering continuous variables as is? How do you save the model for future use for prediction?
Good Luck!
Calling @Rick_SAS
Here is my paper about WOE grouping.
The attachment is the famous German credit card dataset used in my paper.
"Get Better Weight of Evidence for Scorecards Using a Genetic Algorithm"
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1808-2018.pdf
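For readers new to the term, here is a minimal sketch of how a weight of evidence (WOE) value is usually computed per bin, using one common convention: the natural log of a bin's share of "goods" over its share of "bads". The counts, the bin labels, and the function name `weight_of_evidence` are made up for illustration and are not taken from the paper, and the sketch assumes every bin contains at least one good and one bad.

```python
import math

def weight_of_evidence(bins):
    """bins maps a bin label to (n_good, n_bad); returns {label: WOE}.

    Assumes every bin has at least one good and one bad observation;
    empty cells would need smoothing or merging first.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    for label, (good, bad) in bins.items():
        pct_good = good / total_good  # share of all goods in this bin
        pct_bad = bad / total_bad     # share of all bads in this bin
        woe[label] = math.log(pct_good / pct_bad)
    return woe

# Made-up counts for a hypothetical "age" variable split into three bins.
example_bins = {"<30": (100, 50), "30-50": (300, 40), ">50": (200, 10)}
woe_by_bin = weight_of_evidence(example_bins)
```

With this convention, a negative WOE marks a bin that holds relatively more bads than goods, and a positive WOE the reverse.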
"Intelligent Credit Scoring" by Naeem Siddiqi
Also, SAS has a Credit Scoring product (that might not be the exact name) written by Naeem Siddiqi based upon his book, so you wouldn't have to write your own code to do this.
From my reading, you assign weights based on the odds ratios of the independent predictors.
I'm not sure what that means. The model gives you parameter estimates that you then use to create the prediction.
You're looking for a tutorial on logistic regression in SAS?
My suggestion would be to first work through the basic examples (the first 3 or 4) in the Logistic Regression documentation to make sure your code is working and you get the same results.
Then search lexjansen.com for papers on logistic regression.
Last but not least, the first SAS Statistics e-course is free and covers logistic regression (I think, don't hold me to that but it should list it in the table of contents).
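On the "weights from odds ratios" versus "parameter estimates" point above: the two carry the same information, because each odds ratio is the exponential of the corresponding parameter estimate. A minimal sketch, with invented predictor names and odds ratios:

```python
import math

# Hypothetical odds ratios, as they might appear in an odds ratio table.
odds_ratios = {"smoker": 2.0, "age_per_10_years": 1.5}

# Each parameter estimate is the natural log of the corresponding odds ratio.
betas = {name: math.log(oratio) for name, oratio in odds_ratios.items()}

# On the log-odds scale the effects add; on the odds scale they multiply.
combined_log_odds = betas["smoker"] + betas["age_per_10_years"]
combined_odds_ratio = math.exp(combined_log_odds)
```

So a score built by adding weights is working on the log-odds scale, which is the scale the model's linear predictor uses.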
@Reeza wrote:
From my reading, you assign weights based on the odds ratios of the independent predictors.
I'm not sure what that means. The model gives you parameter estimates that you then use to create the prediction.
You're looking for a tutorial on logistic regression in SAS?
Well, that's not my impression from reading the original post. I think the OP is talking about Credit Scoring, which involves logistic regression (or other modeling, like Genetic Algorithm, Partial Least Squares, etc.), but then another methodology is applied to the model to obtain the scoring model. I think the OP is asking for help with the Credit Scoring part, not the logistic regression part.
Let's let @ammarhm clarify the scope of his question.
That's certainly the same as what Credit Scoring does. As I am not involved in medical research, I am not aware whether this type of scoring is done in that field.
But ... coming from a traditional statistics background, I always felt somewhat uneasy about the credit scoring model, as it seems to create artificial boundaries. For example, if your income is above $75K you get so many points, but if it is under $75K you get fewer points, and so a person whose income is $74,999 is treated noticeably differently than someone whose income is $75,001, rather than these two people being treated virtually identically. The same would hold true in a medical application, where instead of income some other variable that might make sense in a medical study (e.g. weight) has a boundary. I still think that using the logistic regression (assuming it fits reasonably well) would be better than these artificial boundaries, and so the person who earns $74,999 and the person who earns $75,001 would have almost equal predicted probabilities.
I think the credit scoring model, with artificial boundaries, is accepted because it is easier for most people to understand than a logistic regression prediction, and thus I see a benefit. Nevertheless, I once asked Naeem Siddiqi whether he was uncomfortable with these artificial boundaries, and he said that Credit Scoring was all about setting boundaries, and I didn't press him on this point. I can see a bank (like the one I work at) deciding that it is not comfortable lending money to people whose incomes are less than X thousand dollars a year; that makes perfect sense, using business judgment to set a boundary. But data analysis creates a different kind of boundary (the kind I discussed above), where two people with almost identical credit (in terms of income) are treated differently. I don't see a reason for data analysis to create this boundary. And I don't see a convincing reason to create such boundaries in the medical field, but like I said, I have no experience with data analysis in the medical field.
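To put numbers on the $74,999 versus $75,001 example, here is a rough sketch comparing a cutoff-based points rule with a one-variable logistic model. The point values, the intercept, and the slope are all invented for illustration; only the $75K cutoff comes from the discussion above.

```python
import math

def points_with_cutoff(income, cutoff=75_000, high=30, low=10):
    """Scorecard-style rule: a fixed number of points on each side of the cutoff."""
    return high if income >= cutoff else low

def logistic_probability(income, intercept=-3.0, beta_per_1000=0.04):
    """Predicted probability from a one-variable logistic model (made-up estimates)."""
    eta = intercept + beta_per_1000 * income / 1000.0
    return 1.0 / (1.0 + math.exp(-eta))

just_below, just_above = 74_999, 75_001

# The points rule jumps at the boundary; the logistic prediction barely moves.
point_gap = points_with_cutoff(just_above) - points_with_cutoff(just_below)
prob_gap = logistic_probability(just_above) - logistic_probability(just_below)
```

Under these assumed values the two incomes differ by the full 20 points on the scorecard, while their predicted probabilities differ by a tiny fraction of a percentage point.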
PROC LOGISTIC offers the STORE statement and the CODE statement, both of which save the results of the logistic regression so that they can be applied to new data. Either method requires SAS to use, whereas a scoring model could be applied to new data entirely without SAS and would be relatively easy to program in any language. But really, a logistic regression model that scores new data based on coefficients you provide could also be programmed without SAS. I believe that SAS Viya and SAS Enterprise Miner can also create an API, which is a method of scoring new data entirely without SAS. And SAS can also create PMML, which can share predictive models between different software.
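As a rough illustration of scoring new data from coefficients outside SAS, here is a minimal Python sketch. The parameter estimates, variable names, and the function `predict_probability` are hypothetical; in practice you would copy the estimates from your own model output. The sketch assumes categorical predictors were fitted with reference coding (for example PARAM=REF on the CLASS statement); with a different parameterization the lookup values would need to be adjusted to match.

```python
import math

# Hypothetical parameter estimates copied from a fitted model's output.
# Categorical predictors are assumed to use reference (dummy) coding.
PARAMS = {
    "intercept": -4.0,
    "age": 0.05,                        # continuous, entered as-is (per year)
    "smoker": {"yes": 1.2, "no": 0.0},  # "no" is the reference level
}

def predict_probability(record, params=PARAMS):
    """Return the predicted event probability for one new record."""
    eta = params["intercept"]
    for name, estimate in params.items():
        if name == "intercept":
            continue
        if isinstance(estimate, dict):
            eta += estimate[record[name]]   # categorical: look up the level's estimate
        else:
            eta += estimate * record[name]  # continuous: estimate times value
    return 1.0 / (1.0 + math.exp(-eta))

new_people = [{"age": 40, "smoker": "no"}, {"age": 40, "smoker": "yes"}]
probabilities = [predict_probability(person) for person in new_people]
```

The same few lines translate to almost any language, which is the sense in which the fitted coefficients themselves are the portable part of the model.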
Thanks,
It is disappointing that there is no simple way to export the model from SAS. Other software packages and programming platforms have such capabilities. Building a model is great, but it is more important to put it into use and allow others to benefit from it. Maybe something to consider for the wish list?