I see that interpreting variable coefficients in logistic regression is problematic. No simple matter, by far.
I'm wondering if it's advisable to hold off on including the intercept, since its interpretation is also an issue.
Any thoughts greatly appreciated.
Nicholas Kormanik
There are only rare cases in modeling where leaving the intercept out is a good idea. Generally, the advice is to include the intercept, since the model will fit better, and to leave it out only with rock-solid justification.
An example:
You put a certain amount of liquid soap in a dish of water, then agitate the water and measure the suds created. If you were to run a regression of amount of suds against amount of liquid soap (keeping the agitation constant), you might at first think no intercept is needed, since zero soap produces zero suds. WRONG! In the region of the data, the fitted line is probably not sloping toward the origin; it has a different slope and does not pass through the origin if you project it backwards. This fit in the region of the data is better than a fit with no intercept, and it shouldn't matter that extrapolating back to zero soap doesn't give zero suds; extrapolation shouldn't force a fit. Furthermore, if you really want a good-fitting line through the origin, it probably shouldn't be linear, and it may not fit well elsewhere.
Also, please note: there is a difference between EMPIRICAL modeling, which all regression is, and which strives to fit the data well; and first-principles modeling, based on scientific or other knowledge. In my opinion, it is very difficult to combine EMPIRICAL modeling and first-principles modeling and achieve a model that fits well under both concepts. Even in the soap suds example, it's hard to achieve both goals. Logistic regression, like all regression, only tries to fit the existing data well; it has no other goal.
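To make the soap-suds point concrete, here is a minimal sketch in PROC REG, assuming a hypothetical data set SUDS with variables SOAP and SUDS, all measured well away from soap = 0:

proc reg data=suds;
   with_intercept: model suds = soap;          /* fit within the region of the data */
   no_intercept:   model suds = soap / noint;  /* line forced through the origin */
run;

One caution when comparing the two fits: the R-square reported for the NOINT model is computed from the uncorrected total sum of squares, so it is not comparable to the R-square from the model with an intercept; compare root MSE or residual plots instead.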
So nicely explained, @PaigeMiller. Really appreciate it.
If you are fitting only a single factor and using GLM parameterization, removing the intercept will give a value for each level of the factor in the solution vector. If there is a second (or more) factor, this 'trick' doesn't help a bit. As @PaigeMiller says, you are far better off including the intercept. You can use LSMEANS to get response-level values; SAS does all of the necessary combining of parameters to get the LSMEANS.
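A minimal sketch of both approaches in PROC LOGISTIC, assuming a hypothetical data set MYDATA with a binary response Y and a single CLASS factor TRT:

/* The no-intercept 'trick': one parameter per level of trt */
proc logistic data=mydata;
   class trt / param=glm;
   model y(event='1') = trt / noint;
run;

/* The preferred route: keep the intercept and request LSMEANS */
proc logistic data=mydata;
   class trt / param=glm;
   model y(event='1') = trt;
   lsmeans trt / ilink;   /* ILINK also reports level-specific predicted probabilities */
run;

The second form generalizes to models with several factors, which is exactly where the no-intercept trick stops working.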
SteveDenham
@SteveDenham wrote:
If you are fitting only a single factor and using GLM parameterization, removing the intercept will give values for each level of the factor in the solution vector.
I assume this refers to class variables, in which case I agree. The original post did not indicate whether the x-variables are class or continuous.