AvocadoRivalry
Calcite | Level 5

This is more of a conceptual question than a procedural question, so I apologize if this is not the appropriate place.  My understanding of regression is somewhat limited, so I'm hoping that some discussion here can give me a clearer understanding of the PROC Logistic output and ultimately help me pick the ideal model combination I'm looking for.

I am running logistic regression on a series of predictor combinations to best predict an event (event = 1, no event = 0). Each combination considers between 3 and 6 predictors, each of which has been binned based on previous analysis and expert judgment, so each regression model includes several polytomous categorical variables.

My understanding is that, in the Maximum Likelihood Estimates section of the output, the estimate is essentially the coefficient that reflects how the variable (or in this case, each bin) is related to the null hypothesis. I was under the impression that this null hypothesis was essentially the likelihood of an event occurring across all observations, ignoring any variables/bins (e.g., if I have 1,000 observations, 100 of which are events, the null hypothesis would state that the probability of an event occurring is 10%). The regression would then express the coefficient for each variable and bin relative to this percentage.

It appears that SAS does things differently, though. Instead of using the entire population to define the control bin, SAS defaults to using the first bin I have defined (this is apparent in the output, where this bin does not show up and the Odds Ratio Estimates show the comparison of each bin vs. this control bin). In many cases, this bin is made up of the missing values, and its size and share of the overall population vary greatly depending on which variable is being analyzed.

My question is two-fold. First, why is the first bin defined used as the control/null hypothesis? Intuitively, it would make more sense to me to define this as the entire population rather than comparing against one of the bins that will ultimately be used in the model anyway. Maybe my understanding of the null hypothesis, and of the value on which the coefficients are based, is incorrect, so any clarification there would help. Second, assuming that a control bin does have to be assigned and cannot be the entire population, is there a better approach that accounts for the variation in size and proportion of the missing-value buckets? Instead of using the missing-value bin as the control bin, what would be a better grouping to use?

Thanks in advance!

1 ACCEPTED SOLUTION

PGStats
Opal | Level 21

I think the following feature was introduced in version 9.2. You can specify the type of parameterization that you want for your CLASS variables and choose a specific level as a reference. Check

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introcom_a00...

or look for the same topic in version 9.3 if that is what you have access to.
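To make the effect of the parameterization choice concrete, here is a minimal sketch in plain Python (not SAS) of what the two most common choices, usually called reference coding and effect coding, report for a single binned predictor. It relies on the fact that a model with one categorical predictor is saturated, so the maximum likelihood coefficients can be written directly from each bin's observed log-odds. The bin names and counts are made up for illustration (they total 100 events in 1,000 observations, as in the question), and the function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

# Hypothetical bins: (events, observations). Illustrative counts only.
bins = {"Missing": (10, 200), "Low": (30, 500), "High": (60, 300)}

# Observed log-odds of the event within each bin.
logits = {name: logit(e / n) for name, (e, n) in bins.items()}

def reference_coding(logits, ref):
    """Intercept = log-odds of the reference bin.
    Each coefficient = that bin's log-odds minus the reference bin's."""
    intercept = logits[ref]
    coefs = {k: v - intercept for k, v in logits.items() if k != ref}
    return intercept, coefs

def effect_coding(logits):
    """Intercept = unweighted mean of the bin log-odds.
    Each coefficient = that bin's deviation from the mean.
    (Software typically prints all but one; the last is minus the sum of the others.)"""
    intercept = sum(logits.values()) / len(logits)
    coefs = {k: v - intercept for k, v in logits.items()}
    return intercept, coefs

ref_intercept, ref_coefs = reference_coding(logits, ref="Low")
eff_intercept, eff_coefs = effect_coding(logits)

# Odds ratios versus the chosen reference bin.
odds_ratios = {k: math.exp(b) for k, b in ref_coefs.items()}
print(odds_ratios)
```

Both parameterizations reproduce the same fitted probability for every bin; only the baseline that the coefficients are measured against changes. In this sketch neither baseline is the pooled 10% event rate: reference coding measures against one chosen bin, and effect coding measures against the unweighted average of the bin log-odds. Choosing a large, well-understood bin as the reference (rather than the missing-value bin) generally makes the odds ratios easier to interpret.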

PG



2 REPLIES
Doc_Duke
Rhodochrosite | Level 12

To your 'why' question ("why is the first bin defined used as the control/null hypothesis"): different disciplines use different approaches. SAS gives you the control to choose the one you prefer.

In medicine, for instance, there is often a natural interpretation for the "reference cell" parameterization.
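One way to write out that interpretation, for a single binned predictor with K levels and level 1 taken as the reference (the symbols are notation chosen for this sketch, not taken from the SAS output):

```latex
\operatorname{logit} P(Y = 1 \mid \text{bin } k) = \beta_0 + \beta_k,
\qquad \beta_1 = 0,
\qquad
e^{\beta_k} = \frac{\operatorname{odds}(\text{bin } k)}{\operatorname{odds}(\text{bin } 1)},
\quad k = 2, \dots, K .
```

Here the intercept is the log-odds of the event in the reference group (for example, an untreated or control group), and each exponentiated coefficient is the odds ratio of that bin against the reference group.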

  

Doc Muhlbaier

Duke
