03-07-2012 11:49 AM
This is more of a conceptual question than a procedural question, so I apologize if this is not the appropriate place. My understanding of regression is somewhat limited, so I'm hoping that some discussion here can give me a clearer understanding of the PROC Logistic output and ultimately help me pick the ideal model combination I'm looking for.
I am in the process of running logistic regression on a series of combinations to best predict an event (event = 1, no event = 0). Each combination can consider a variety of predictors (somewhere between 3 and 6), each of which has been binned based on previous analysis and expert judgment, resulting in a regression model that considers several polytomous categorical variables. My understanding is that in the Maximum Likelihood Estimates section of the output, the estimate is essentially the coefficient, which reflects how the variable (or in this case, each bin) is related to the null hypothesis. I was under the impression that this null hypothesis was essentially the likelihood of an event occurring across all observations, ignoring any variables/bins (e.g., if I have 1,000 observations, 100 of which are events, the null hypothesis would state that the probability of an event occurring is 10%). The regression would then base the coefficients for each variable and bin on this percentage.
It appears that SAS does things differently, though. Instead of using the entire population to define the control bin, SAS defaults to using the first bin I have defined (which is apparent in the output, where this bin does not show up and the Odds Ratio Estimates show the comparison of each bin vs. this control bin). In many cases, this bin is made up of the missing values, and it varies greatly in size and in proportion to the overall population depending on which variable is being analyzed.
My question is two-fold. First, why is the first bin defined used as the control/null hypothesis? Intuitively, it would make more sense to me to have this defined as the entire population rather than comparing against one of the bins that is ultimately going to be used in the model anyway. Maybe my understanding of the null hypothesis and the value on which coefficients are based is incorrect, so any clarification there would help. Second, assuming that the control bin does have to be assigned and cannot be the entire population, is there a better approach to account for the variation in size and proportion of the missing-value buckets? Instead of using this as the control bin, what is a better grouping to use?
Thanks in advance!
03-07-2012 12:19 PM
I think the following feature was introduced in version 9.2. You can specify the type of parameterization that you want for your CLASS variables and choose a specific level as a reference. Check
or look for the same topic in version 9.3 if that is what you have access to.
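To make the idea of a reference level concrete, here is a minimal sketch in Python (not SAS) of what reference parameterization means numerically. It assumes a model with a single binned predictor, in which case the fit is saturated and the maximum likelihood estimates can be read straight off the per-bin counts. The bin names and counts are made up for illustration; they are chosen only so that the totals match the 1,000 observations / 100 events example in the question.

```python
import math

# Hypothetical counts per bin: (events, non-events). Illustration only.
counts = {
    "missing": (10, 190),
    "low": (30, 270),
    "high": (60, 440),
}

def log_odds(events, non_events):
    """Log-odds of the event within one bin."""
    return math.log(events / non_events)

def reference_coding(counts, ref):
    """Intercept and coefficients when `ref` is the reference bin.

    With a single categorical predictor, the intercept is the log-odds of
    the reference bin, and each coefficient is the log odds ratio of that
    bin versus the reference bin.
    """
    intercept = log_odds(*counts[ref])
    coefs = {b: log_odds(*c) - intercept for b, c in counts.items() if b != ref}
    return intercept, coefs

def fitted_probability(intercept, coefs, bin_name):
    """Predicted event probability for a bin (the reference bin has coefficient 0)."""
    eta = intercept + coefs.get(bin_name, 0.0)
    return 1.0 / (1.0 + math.exp(-eta))

# Reference = the "missing" bin: odds ratios compare each bin to "missing".
int_missing, coefs_missing = reference_coding(counts, ref="missing")
odds_ratios = {b: math.exp(v) for b, v in coefs_missing.items()}

# Reference = the "low" bin: the coefficients change, the predictions do not.
int_low, coefs_low = reference_coding(counts, ref="low")

for b in counts:
    p1 = fitted_probability(int_missing, coefs_missing, b)
    p2 = fitted_probability(int_low, coefs_low, b)
    print(b, round(p1, 4), round(p2, 4))
```

Under these assumptions, switching the reference bin only changes what each coefficient is compared against; the predicted probability for every bin stays the same. With several predictors in the model the coefficients are adjusted for one another and no longer have this closed form, but the interpretation of the reference level carries over.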
03-07-2012 09:19 PM
To your 'why' question ("First, why is the first bin defined used as the control/null hypothesis. Intuitively, it would make more sense to me to have this defined as the entire population rather than comparing against one of the bins that is defined to ultimately be used in the model anyway."): different disciplines use different approaches. SAS gives you the control to choose the one you prefer.
In medicine, for instance, there is often a natural interpretation to using the "reference cell" parameterization.
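As a point of comparison with the reference cell approach, the sketch below (Python, not SAS) shows effect (deviation-from-mean) coding for the same kind of single-predictor model. It is an illustration under stated assumptions, not SAS output: the bin names and counts are hypothetical, and the closed-form estimates hold only because a model with one categorical predictor is saturated. It also shows that the effect-coding baseline is the unweighted average of the bin log-odds, which is generally not the same as the overall event rate of the whole population.

```python
import math

# Hypothetical counts per bin: (events, non-events). Illustration only.
counts = {
    "missing": (10, 190),
    "low": (30, 270),
    "high": (60, 440),
}

def effect_coding(counts):
    """Intercept and per-bin deviations under effect coding.

    With a single categorical predictor, the intercept is the unweighted
    mean of the bin log-odds and each bin's effect is its deviation from
    that mean. The deviations sum to zero, so software typically prints
    all but one of them; the last is minus the sum of the others.
    """
    logits = {b: math.log(e / n) for b, (e, n) in counts.items()}
    intercept = sum(logits.values()) / len(logits)
    deviations = {b: v - intercept for b, v in logits.items()}
    return intercept, deviations

intercept, deviations = effect_coding(counts)

# Log-odds of the event in the pooled population (100 events, 900 non-events).
total_events = sum(e for e, _ in counts.values())
total_non_events = sum(n for _, n in counts.values())
overall_log_odds = math.log(total_events / total_non_events)

print("effect-coding intercept:", round(intercept, 4))
print("overall log-odds:       ", round(overall_log_odds, 4))
for b, d in deviations.items():
    print(b, round(d, 4))
```

The two baselines differ here because the bins are unequal in size: the pooled rate weights each bin by its number of observations, while the effect-coding intercept treats every bin equally.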