03-19-2013 03:27 PM
Hello everyone --
This seems like a very basic question, but I have not been able to find a simple, straightforward answer to it anywhere. I would appreciate a clear response.
I am developing a model using PROC LOGISTIC which has one binary response variable, and four predictive variables: two continuous and two categorical (really binary). I suspect there may be an interaction between the continuous effect "vehicle speed" and the categorical effect "vehicle size". My hypothesis is that "vehicle speed" may have a different effect on accident outcome depending on the size of the vehicle involved. To test this, I initially used stepwise selection on a set of potential factors and interactions, including a nested effect of Speed(Type). This nested effect was selected as significant.
However, I began to wonder what would happen if I used an interaction instead of nesting, i.e. Speed*Type. My results are intended for use by people without any background in statistics, so using the interaction might be easier for me to explain in layman's terms. I ran that instead and stepwise analysis preferred Speed*Type over Speed(Type). Interestingly, the parameter estimates for the two options are slightly different but the rest of the model is basically identical, with no change in the AUC value or the results of the H-L test.
What I would like to know is this: When is it appropriate to use a nested variable vs. using a normal interaction term? Which would be best in this case? What accounts for the different parameter estimates for Speed*Type vs. Speed(Type)?
Thanks for your assistance!
03-20-2013 08:56 AM
In almost all SAS parameterizations, there is no difference at all in the design matrix between nested and crossed variables. See, for example, "Parameterization of PROC GLM Models" in The GLM Procedure documentation for how the design matrix is set up. I would guess that the slight difference in parameter estimates has to do with data order and the sweep operator when inverting the design matrix.
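One way to check this on your own data is to have PROC LOGISTIC write out the design matrix for each specification and look at the columns side by side. This is only a minimal sketch: the data set name accidents and the variable names Outcome, Speed, and Type are hypothetical, and it assumes GLM parameterization for the CLASS variable, which is not the PROC LOGISTIC default (the default is effect coding).

```sas
/* Hypothetical names: accidents, Outcome, Speed, Type.            */
/* OUTDESIGNONLY writes the design matrix without fitting a model. */

/* Nested specification */
proc logistic data=accidents outdesign=design_nested outdesignonly;
  class Type / param=glm;          /* assumed; default is param=effect */
  model Outcome(event='1') = Type Speed(Type);
run;

/* Crossed specification without a Speed main effect */
proc logistic data=accidents outdesign=design_crossed outdesignonly;
  class Type / param=glm;
  model Outcome(event='1') = Type Speed*Type;
run;

/* Print the first few rows of each design matrix for comparison */
proc print data=design_nested(obs=10);
run;

proc print data=design_crossed(obs=10);
run;
```

If the two printed matrices contain the same columns, any remaining difference in the estimates is numerical rather than structural. Under a different CLASS parameterization the columns may not match, so it is worth checking with the coding you actually use.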
Meanwhile, back at the stepwise selection part, I would strongly advise reading through the many topics that have addressed this in the forums, as well as Frank Harrell's text Regression Modeling Strategies, and Flom and Cassell's NESUG paper on what is wrong with stepwise as a model building tool, especially for predictive models.
03-23-2013 02:46 PM
You did not specify your MODEL statement for PROC LOGISTIC.
If you specify one effect as nested within another, this implies that the first effect does NOT have an independent ("main") effect on the dependent variable:
model y = Type Speed(Type);
or, equivalently,
model y = Type Speed*Type;
since SAS parameterizes nested effects as crossed or interaction effects: Speed(Type) --> Speed*Type.
Because Speed is nested within Type, Speed is assumed to have no independent effect on Y separate from that through its association with Type.
However, if you specify one effect as interacting with another effect, this implies that both effects have independent ("main") effects on the dependent variable separate from their interaction effect on the dependent variable:
model y = Type Speed Speed*Type;
If the interaction effect, Speed*Type, is statistically significant, you should keep the main effects of Type and Speed even if either or both of these main effects are not statistically significant (the so-called "hierarchy principle").
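To see how the parameter estimates from the two specifications relate, it may help to write both out for a binary Type. This is a sketch under an assumed 0/1 indicator coding for Type (T = 1 for one size class, T = 0 for the reference class), with s standing for Speed; the symbols are my own notation, not SAS output labels.

```latex
\begin{align*}
\text{nested:}\quad  \operatorname{logit}(p) &= \alpha + \beta\,T + \gamma_0\,s\,(1-T) + \gamma_1\,s\,T \\
\text{crossed:}\quad \operatorname{logit}(p) &= \alpha + \beta\,T + \delta\,s + \theta\,s\,T \\
\gamma_0 &= \delta, \qquad \gamma_1 = \delta + \theta
\end{align*}
```

Under this coding, the nested form reports one Speed slope per Type level, while the crossed form reports the slope in the reference level plus the difference between the two slopes. The individual estimates therefore differ even when the fitted probabilities are the same. With other codings (for example, effect coding) the correspondence between the coefficients changes.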
I agree with Steve Denham's advice about stepwise variable selection in regression.