Interpreting PROC Logistic Output - Understanding ML and Coefficients

Posted 03-07-2012 11:49 AM
(5051 views)

This is more of a conceptual question than a procedural question, so I apologize if this is not the appropriate place. My understanding of regression is somewhat limited, so I'm hoping that some discussion here can give me a clearer understanding of the PROC Logistic output and ultimately help me pick the ideal model combination I'm looking for.

I am in the process of running logistic regression on a series of combinations to best predict an event (event = 1, no event = 0). Each combination can consider a variety of predictors (somewhere between 3 and 6), each of which has been binned based on previous analysis and expert judgment, resulting in a regression model that considers several polytomous categorical variables.

My understanding is that in the Maximum Likelihood Estimates section of the output, the estimate is essentially the coefficient, which reflects how the variable (or in this case, each bin) is related to the null hypothesis. I was under the impression that this null hypothesis was essentially the likelihood of an event occurring across all observations, ignoring any variables/bins (e.g., if I have 1,000 observations, 100 of which are events, the null hypothesis would state that the probability of an event occurring is 10%). The regression would then base the coefficients for each variable and bin on this percentage.

It appears that SAS does things differently, though. Instead of using the entire population to define the control bin, SAS defaults to using the first bin I have defined (which is apparent in the output, where this bin does not show up and the Odds Ratio Estimates show the comparison of each bin vs. this control bin). In many cases, this bin is made up of the missing values, and its size and share of the overall population vary greatly depending on which variable is being analyzed.

My question is two-fold. First, why is the first bin I define used as the control/null hypothesis? Intuitively, it would make more sense to me to have this defined as the entire population rather than comparing against one of the bins that will ultimately be used in the model anyway. Maybe my understanding of the null hypothesis and the value on which coefficients are based is incorrect, so any clarification there would help. Second, assuming that the control bin does have to be assigned and cannot be the entire population, is there a better approach to account for the variation in size and proportion of the missing-value buckets? Instead of using this as the control bin, what is a better grouping to use?

Thanks in advance!

1 ACCEPTED SOLUTION


I think the following feature was introduced in version 9.2. You can specify the type of parameterization that you want for your CLASS variables and choose a specific level as a reference. Check

or look for the same topic in version 9.3 if that is what you have access to.

PG
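
As a minimal sketch of what the reply above describes, the CLASS statement below asks for reference-cell coding and names the reference level for each binned predictor. The dataset `work.events`, the variables `event`, `age_bin`, and `income_bin`, and the level labels are all hypothetical placeholders, not names from the original post.

```sas
/* Hypothetical data: work.events, with a 1/0 response EVENT and two
   binned predictors. Replace names and level labels with your own. */
proc logistic data=work.events;
   /* PARAM=REF requests reference-cell coding for all CLASS variables;
      REF='...' picks the baseline level (matched on the formatted value). */
   class age_bin (ref='30-44') income_bin (ref='Middle') / param=ref;
   /* EVENT='1' makes the procedure model the probability that event = 1. */
   model event(event='1') = age_bin income_bin;
run;
```

With `param=ref`, each reported estimate is the log odds of that bin relative to the chosen reference bin, so picking a large, well-understood bin rather than the missing-value bin usually gives more stable and more interpretable comparisons. If no options are given, PROC LOGISTIC uses effect coding (`param=effect`) with the last ordered level as the reference, so it is worth checking the Class Level Information table to see which coding was actually applied.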



To your 'why' question ("First, why is the first bin defined used as the control/null hypothesis. Intuitively, it would make more sense to me to have this defined as the entire population rather than comparing against one of the bins that is defined to ultimately be used in the model anyway."): different disciplines use different approaches. SAS gives you control over which one to use.

In medicine, for instance, there is often a natural interpretation for the "reference cell" parameterization.

Doc Muhlbaier

Duke
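
To make the difference between the parameterizations concrete, here is a short worked sketch. It assumes the simplest case of a single categorical predictor with levels $j = 1, \dots, k$ and event probabilities $p_j$; with several predictors the same interpretation holds conditionally on the other variables. The symbols $\beta_0$, $\beta_j$, and $r$ are notation introduced here, not taken from the SAS output.

$$
\operatorname{logit}(p_j) = \ln\frac{p_j}{1 - p_j} = \beta_0 + \beta_j
$$

Reference-cell coding fixes $\beta_r = 0$ for a chosen reference level $r$, so

$$
\beta_0 = \operatorname{logit}(p_r), \qquad \beta_j = \operatorname{logit}(p_j) - \operatorname{logit}(p_r), \qquad e^{\beta_j} = \frac{p_j / (1 - p_j)}{p_r / (1 - p_r)}
$$

Effect coding instead imposes $\sum_{j=1}^{k} \beta_j = 0$, so

$$
\beta_0 = \frac{1}{k} \sum_{j=1}^{k} \operatorname{logit}(p_j), \qquad \beta_j = \operatorname{logit}(p_j) - \beta_0
$$

For comparison, the intercept-only model from the example in the question (100 events in 1,000 observations) has

$$
\beta_0 = \ln\frac{0.1}{0.9} \approx -2.197
$$

Under effect coding each estimate is a deviation from the unweighted average of the level logits, which is close in spirit to comparing against "the entire population" but is not the same as the overall event rate unless the bins happen to be balanced.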
