Today's question is about interpreting. Please see the following table:
This table was created by Proc Logistic. Model is to predict i_50505_Z. There are around 100 independent variables (not shown).
The 'Probability' column is related to the odds of arriving at i_50505_Z = 1 versus i_50505_Z = 0. The table is sorted with the highest Probability at top.
I find this table to be fascinating, if I'm interpreting it correctly. It 'scores' every single observation in the entire dataset.
Looking across the independent variables, IFF a particular variable is found to be 'significant' (from other not shown tables), and the Probability is shown to be high, the value for that independent variable observation is the best to choose for arriving at the desired dependent variable target.
Example:
You're targeting i_50505_Z to be in the top 10% (i.e., i_50505_Z = 1). Probability is 0.922 for a particular observation. Independent variable X1 (p<.001) for that observation is (say) 4.7. Then X1=4.7 is a pretty darn good guess for arriving at your objective.
In other words, given a circumstance where you see X1=4.7 you are highly likely to find high values of i_50505_Z.
Discovering this is the precise purpose of the statistical analysis. If the program can't do it, ask for a refund.
Please share your thoughts and clarifications.
Nicholas Kormanik
p.s. -- a side issue. Notice that the i_50505_Z column contains a value of 0.39. Such a value is certainly not among the top. Really curious how that got to be included.
@NKormanik wrote:
Looking across the independent variables, IFF a particular variable is found to be 'significant' (from other not shown tables), and the Probability is shown to be high, the value for that independent variable observation is the best to choose for arriving at the desired dependent variable target.
I think this is way off the mark. Probabilities shown refer to how the model "scores" (or predicts) a single observation, using all independent variables. It doesn't tell you anything about what variables are most predictive or what variables are most significant (which is not the same as most predictive), because each independent variable has an effect on the predictions of ALL observations. Probabilities tell you nothing about variables.
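To make that concrete, here is a minimal Python sketch (not SAS output) of how a logistic model turns ALL the predictor values of an observation into one probability. The intercept, the coefficients, and the X2/X3 values are made up for illustration; only X1 = 4.7 comes from the example in the question.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Predicted P(event) for one observation: logistic function of the full linear predictor."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted model with three predictors (X1, X2, X3).
intercept = -2.0
coefs = [0.5, 1.2, -0.8]

# Two observations share X1 = 4.7 but differ in X2 and X3.
obs_a = [4.7, 1.5, 0.2]
obs_b = [4.7, -1.0, 2.0]

p_a = predicted_probability(intercept, coefs, obs_a)
p_b = predicted_probability(intercept, coefs, obs_b)
print(round(p_a, 3), round(p_b, 3))
```

With these assumed numbers, the same X1 = 4.7 lands in one observation with a high predicted probability and in another with a low one, so a high Probability on a row does not single out that row's X1 value as "the best to choose".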
@NKormanik wrote:
There are around 100 independent variables (not shown).
It is generally not a good idea to put 100 independent variables into a Logistic regression model. This produces problems caused by multi-collinearity between the independent variables (in other words, the independent variables are correlated with each other). As a result, the regression coefficients can have huge variances (meaning the model can be quite unstable) and can even have the wrong sign. There's plenty of reading on the internet about multi-collinearity.
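As a rough illustration of the variance problem, the Python sketch below computes a variance inflation factor (VIF) for the simplest case of a model with exactly two predictors, where VIF = 1 / (1 - r^2) and r is their Pearson correlation. The data values are invented; with 100 predictors you would instead regress each predictor on all the others.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF when the model contains only these two predictors."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

# Invented data: x2 is nearly a copy of x1, x3 is only weakly related to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0]

vif_collinear = vif_two_predictors(x1, x2)
vif_unrelated = vif_two_predictors(x1, x3)
print(vif_collinear, vif_unrelated)
```

A VIF near 1 means the coefficient's variance is barely inflated; a very large VIF, as for the x1/x2 pair here, is the kind of situation where coefficients become unstable.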
Thanks @PaigeMiller. Well, then, given the information in the table at top, particularly Probability, how might one use this information?
The challenge is: We seek a high Y outcome, we have Xi - Xn, how can the table above help us achieve our objective?
(note: Multicollinearity is supposed to be less of a problem with Logistic Regression, than it is with Linear Regression.)
@NKormanik wrote:
Well, then, given the information in the table at top, particularly Probability, how might one use this information?
The challenge is: We seek a high Y outcome, we have Xi - Xn, how can the table above help us achieve our objective?
How can we answer these questions? We don't know what your objective is. What does "high Y outcome" mean? Is it highest predicted value?
(note: Multicollinearity is supposed to be less of a problem with Logistic Regression, than it is with Linear Regression.)
I disagree. I have seen logistic regression report signs on the coefficients that are opposite what a univariate regression would show.
Hate it when the procedure outputs disagree.
I'm certain my datasets would violate every caution and assumption OLS requires.
Quite likely Logistic will end up a bust as well.
That is, unless you and others can shed some light.....
@NKormanik wrote:
Hate it when the procedure outputs disagree.
I'm certain my datasets would violate every caution and assumption OLS requires.
Quite likely Logistic will end up a bust as well.
That is, unless you and others can shed some light.....
Unknown what you mean by any of the above.
I don't know what you mean by "procedure outputs disagree". Be specific.
I don't know what assumptions are violated. Be specific.
Logistic regression models the log odds of the event. You haven't really stated the goal of this modeling effort. Do you want to determine which variables are important? Do you want to determine which observations have high (or low) predicted probabilities? Both? Neither? Something else? How do you want to use this logistic regression?
You want us to "shed some light", but you have not first clearly stated your goal or what it means. Without that, I can't do it.
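On the log-odds point, a small Python sketch of how probability, odds, and log odds relate may help. The 0.922 is the Probability quoted in the original question; the function names are just illustrative.

```python
import math

def prob_to_odds(p):
    """Odds of the event: p / (1 - p)."""
    return p / (1.0 - p)

def prob_to_logit(p):
    """Log odds (logit), the scale on which the model is linear."""
    return math.log(p / (1.0 - p))

def logit_to_prob(eta):
    """Inverse of the logit: turns log odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

p = 0.922                      # Probability from the question's example row
odds = prob_to_odds(p)         # roughly 11.8 to 1 in favor of i_50505_Z = 1
logit = prob_to_logit(p)
p_back = logit_to_prob(logit)  # round trip recovers the probability
print(odds, logit, p_back)
```

The Probability column is therefore just the model's log odds for that row converted back to the 0-1 scale; it is a statement about the whole row, not about any one variable.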
I concur with @PaigeMiller
I would suggest using a random forest if you want to make those types of statements; logistic regression doesn't really provide that type of interpretation easily.
@Reeza, well, that there are other statistical tools available that can help solve the problem is hugely encouraging. Absolutely for sure.
Presently, however, I'm attempting to gain something from Proc Logistic. Like, anything.
When it becomes totally apparent that Proc Logistic is a bust, then I'll move on.
@Reeza wrote:
make rules such as : if X2, X3, X4 are high this person is likely to be in the high performing group.
Precisely. That's what I've been trying to do. Stepwise regression is supposed to eliminate the variables that are not meaningful.
@NKormanik wrote:
@Reeza wrote:
make rules such as : if X2, X3, X4 are high this person is likely to be in the high performing group.
Precisely. That's what I've been trying to do. Stepwise regression is supposed to eliminate the variables that are not meaningful.
Then you need to look at the parameter estimates and the odds ratio, not the output you're currently examining.
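As a sketch of how to read those two pieces of output together: an odds ratio is the exponential of the parameter estimate, and it multiplies the odds for each one-unit increase in that variable with the others held fixed. The estimates below are hypothetical, not taken from the model in this thread.

```python
import math

# Hypothetical parameter estimates (log odds per one-unit increase).
estimates = {"X2": 0.9, "X3": 0.05, "X4": -0.4}

# Odds ratio for each variable: exp(estimate).
odds_ratios = {name: math.exp(b) for name, b in estimates.items()}

def shift_probability(p, odds_ratio):
    """New probability after multiplying the current odds by an odds ratio."""
    new_odds = (p / (1.0 - p)) * odds_ratio
    return new_odds / (1.0 + new_odds)

# Effect of a one-unit increase in X2, starting from two different baselines.
from_low = shift_probability(0.10, odds_ratios["X2"])
from_high = shift_probability(0.90, odds_ratios["X2"])

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
print(round(from_low, 3), round(from_high, 3))
```

An odds ratio above 1 (X2 here) means higher values push toward the event and one below 1 (X4) pushes away from it. How much the probability itself moves depends on where the observation starts, which again depends on all the other variables.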
This is a post-factual table that scores the observations according to the specified model. Every observation is treated the same way: each one is run through the same set of rules prescribed by the model. The observation with 92% correctness of prediction had to satisfy the criteria of many of the 100 predictors, not just X1. As for the seemingly out-of-place value 0.39: the i_50505_Z values do not have to be in order of magnitude, because the table is sorted by Probability, and the probability for each observation depends on all of its predictor values.
@pink_poodle wrote:
The observation with 92% correctness of prediction had to satisfy the criteria of many of the 100 predictors, not just X1.
I'm interpreting that single observation to be golden -- like a nugget of solid gold among the rock rubble.
True, not only X1 to take note of, but all the 'significant' variables Proc Logistic has come up with. Say: X1, X7, X33, X49, X81.
Those all make up the Super Team. Take extra special note of 'em.
When those babies line up, you're pretty safe to bet big.