Re: Interpreting 'Predicted' in Proc Logistic -- What's your take?

Posted 03-10-2022 06:50 PM
(1158 views)

Today's question is about interpreting. Please see the following table:

This table was created by Proc Logistic. Model is to predict i_50505_Z. There are around 100 independent variables (not shown).

The 'Probability' has to do with 'Odds Ratio' -- the odds of arriving at i_50505_Z = 1, versus i_50505_Z = 0. The column is sorted with highest Probability at top.

I find this table to be fascinating, if I'm interpreting it correctly. It 'scores' every single observation in the entire dataset.

**Looking across the independent variables, IFF a particular variable is found to be 'significant' (from other not shown tables), and the Probability is shown to be high, the value for that independent variable observation is the best to choose for arriving at the desired dependent variable target.**

Example:

You're targeting i_50505_Z to be in the top 10% (i.e., i_50505_Z = 1). Probability is 0.922 for a particular observation. Independent variable X1 (p<.001) for that observation is (say) 4.7. Then X1=4.7 is a pretty darn good guess for arriving at your objective.

In other words, given a circumstance where you see X1=4.7 you are highly likely to find high values of i_50505_Z.

Discovering this is the precise purpose of the statistical analysis. If the program can't do it, ask for a refund.

Please share your thoughts and clarifications.

Nicholas Kormanik

p.s. -- a side issue. Notice that the i_50505_Z column contains a value of 0.39. Such a value is certainly not among the top. Really curious how that got included.

12 REPLIES

Looking across the independent variables, IFF a particular variable is found to be 'significant' (from other not shown tables), and the Probability is shown to be high, the value for that independent variable observation is the best to choose for arriving at the desired dependent variable target.

I think this is way off the mark. Probabilities shown refer to how the model "scores" (or predicts) a single observation, using all independent variables. It doesn't tell you anything about what variables are most predictive or what variables are most significant (which is not the same as most predictive), because each independent variable has an effect on the predictions of ALL observations. Probabilities tell you nothing about variables.
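A minimal sketch of the point above, under stated assumptions: the coefficients and the two observations are made up for illustration and are not taken from the model in this thread. The only thing assumed about the model is the standard logistic form, in which every predictor feeds the same linear predictor. Two observations that share X1 = 4.7 can land at opposite ends of the probability scale.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for three predictors (X1, X2, X3).
intercept = -2.0
coefs = [0.5, 1.2, -0.8]

# Two hypothetical observations that share X1 = 4.7 but differ elsewhere.
obs_a = [4.7, 2.0, 0.5]
obs_b = [4.7, -1.0, 2.0]

p_a = predicted_probability(intercept, coefs, obs_a)
p_b = predicted_probability(intercept, coefs, obs_b)
print(f"obs_a: {p_a:.3f}, obs_b: {p_b:.3f}")
```

Because the probability depends on the whole row, a high score for one observation says nothing by itself about whether X1 = 4.7 is a "good" value.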

There are around 100 independent variables (not shown).

It's generally not a good idea to put 100 independent variables into a logistic regression model. Doing so invites multicollinearity (in other words, the independent variables are correlated with each other), which can give the regression coefficients huge variances (meaning the model can be quite unstable) and can even give them the wrong sign. There's plenty of reading on the internet about multicollinearity.
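One rough way to see how strongly two predictors overlap is the variance inflation factor (VIF). The sketch below uses simulated data, not the poster's, and the two-predictor shortcut VIF = 1 / (1 - r^2). VIF is defined for linear regression; applying it to the predictors of a logistic model is a common screening habit rather than an exact result, so treat it as a diagnostic only.

```python
import random

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)  # fixed seed so the simulation is repeatable
x1 = [rng.gauss(0, 1) for _ in range(1000)]
# x2 is x1 plus a little noise, i.e. nearly the same variable.
x2 = [v + rng.gauss(0, 0.1) for v in x1]

r = pearson_r(x1, x2)
vif = 1.0 / (1.0 - r ** 2)  # with only two predictors, R^2 equals r^2
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

A frequently quoted rule of thumb treats a VIF above about 10 as a warning sign; with around 100 predictors, several are likely to cross it.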

--

Paige Miller

Thanks @PaigeMiller. Well, then, given the information in the table at top, particularly Probability, how might one use this information?

The challenge is: We seek a high Y outcome, we have Xi - Xn, how can the table above help us achieve our objective?

(note: Multicollinearity is supposed to be less of a problem with Logistic Regression, than it is with Linear Regression.)

@NKormanik wrote:

Well, then, given the information in the table at top, particularly Probability, how might one use this information?

The challenge is: We seek a high Y outcome, we have Xi - Xn, how can the table above help us achieve our objective?

How can we answer these questions? We don't know what your objective is. What does "high Y outcome" mean? Is it highest predicted value?

(note: Multicollinearity is supposed to be less of a problem with Logistic Regression, than it is with Linear Regression.)

I disagree. I have seen logistic regression report signs on the coefficients that are opposite what a univariate regression would show.

--

Paige Miller

Hate it when the procedure outputs disagree.

I'm certain my datasets would violate every caution and assumption OLS requires.

Quite likely Logistic will end up a bust as well.

That is, unless you and others can shed some light.....

@NKormanik wrote:

Hate it when the procedure outputs disagree.

I'm certain my datasets would violate every caution and assumption OLS requires.

Quite likely Logistic will end up a bust as well.

That is, unless you and others can shed some light.....

Unknown what you mean by any of the above.

I don't know what you mean by "procedure outputs disagree". Be specific.

I don't know what assumptions are violated. Be specific.

Logistic regression models the log odds of the outcome. You haven't really stated the goal of this modeling effort. Do you want to determine which variables are important? Do you want to determine which observations have high (or low) predicted probabilities? Both? Neither? Something else? How do you want to use this logistic regression?

But you want us to "shed some light" without first clearly stating your goal and what it means. I can't do that.
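For reference, the standard logistic model written out. The notation is introduced here and is not from the thread: $p$ is the probability that the response equals 1, and $x_1, \dots, x_k$ are the predictors.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
$$

The odds ratio for a one-unit increase in $x_j$, with the other predictors held fixed, is $e^{\beta_j}$. The predicted probability for an observation is therefore a property of the whole row, while an odds ratio is a property of a single variable.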

--

Paige Miller

I concur with @PaigeMiller

I would suggest using a random forest if you want to make those types of statements, logistic regression doesn't really provide that type of interpretation easily.

@Reeza, well, that there are other statistical tools available that can help solve the problem is hugely encouraging. Absolutely for sure.

Presently, however, I'm attempting to gain **something** from Proc Logistic. Like, **anything**.

When it becomes totally apparent that Proc Logistic is a bust, then I'll move on.

IMO it isn't common to focus on the scoring output very much. Instead, you typically look at the parameter estimates to see which ones affect your outcome the most - typically using odds ratio plots.

https://blogs.sas.com/content/iml/2015/07/29/or-plots-log-scale.html

I'd probably recommend starting there, and then pruning your model by removing variables that are not significant or don't seem to have much of an impact (significance and effect size are different things). Then you can make rules such as : if X2, X3, X4 are high this person is likely to be in the high performing group.
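A small sketch of the difference between significance and effect size when reading parameter estimates. The coefficients and standard errors are hypothetical, and `odds_ratio_with_ci` is a helper written for this example, not a SAS or library function. It assumes a Wald-style 95% interval, exp(beta ± 1.96 × SE).

```python
import math

def odds_ratio_with_ci(beta, std_err, z=1.96):
    """Return (odds ratio, lower, upper) for a one-unit increase (Wald interval)."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )

# Hypothetical estimates: name -> (coefficient, standard error)
estimates = {
    "X2": (0.90, 0.10),   # large effect, precisely estimated
    "X3": (0.02, 0.005),  # interval excludes 1, but the effect is tiny
    "X4": (0.80, 0.60),   # large estimate, too noisy to separate from no effect
}

for name, (beta, se) in estimates.items():
    odds_ratio, lower, upper = odds_ratio_with_ci(beta, se)
    excludes_one = lower > 1.0 or upper < 1.0
    print(f"{name}: OR={odds_ratio:.2f} CI=({lower:.2f}, {upper:.2f}) "
          f"excludes 1: {excludes_one}")
```

In this made-up example X3 would count as "significant" yet barely moves the odds, while X4 has a large point estimate that the data cannot pin down. Only X2 is both.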

@Reeza wrote:

make rules such as : if X2, X3, X4 are high this person is likely to be in the high performing group.

Precisely. That's what I've been trying to do. Stepwise regression is supposed to eliminate the variables that are not meaningful.

@NKormanik wrote:

@Reeza wrote:

make rules such as : if X2, X3, X4 are high this person is likely to be in the high performing group.

Precisely. That's what I've been trying to do. Stepwise regression is supposed to eliminate the variables that are not meaningful.

Then you need to look at **the parameter estimates and the odds ratio**, not the output you're currently examining.

@pink_poodle wrote:

The observation with 92% correctness of prediction had to satisfy the criteria of many of the 100 predictors, not just X1.

I'm interpreting that single observation to be golden -- like a nugget of solid gold among the rock rubble.

True, not only X1 to take note of, but all the 'significant' variables Proc Logistic has come up with. Say: X1, X7, X33, X49, X81.

Those all make up the **Super Team**. Take extra special note of 'em.

When those babies line up, you're pretty safe to bet big.
