02-12-2012 09:26 AM
Take a look at: http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf
And, for more reading along the same lines, search the web for "stepwise" and "Cassell".
02-16-2012 07:34 PM
In stepwise regression, decisions about which variables to include are based on slight differences in their semipartial correlations. This leads to the danger of overfitting or underfitting, and the selected model may conflict with the theoretical importance of a predictor.
It sounds intuitively appealing to have a procedure that automatically chooses the predictors (regressors) for you, but samples are not perfect. The statistical procedures assume that our sample data are perfect (no measurement error, no omitted variables, and so on), so the statistical significance they report under that assumption will be biased (wrong). We must reason and use our brains to choose which regressors to look at (ideally with some theory to base the decision on).
Also, which regressors a stepwise procedure includes depends on which regressors are in the dataset. So if we have no theory about which regressors to look at and simply use a stepwise procedure to select them, different people will reach different conclusions depending on the regressors they happen to have in their data.
I tend to think a stepwise procedure would be the best way to choose regressors only IF we had datasets with all possible variables in the world (billions and billions of them) and the computing power to go through those variables at multiples of lightning speed, which is not possible now or in our lifetime.
02-17-2012 08:16 AM
Regarding the last statement: Stepwise regression would still yield biased results. It is a matter of sampling from the population. You cannot get around it.
Now what you could do, given the abilities specified, is measure all possible variables on every individual in the population, and fit that by regression. And watch collinearity kill the interpretation.
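To make the collinearity point concrete, here is a minimal sketch in Python with NumPy (not SAS, and not from the thread). It computes variance inflation factors for made-up data in which two predictors are nearly copies of each other. The function name `variance_inflation_factors`, the sample size, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (after centering)."""
    Xc = X - X.mean(axis=0)
    vifs = []
    for j in range(Xc.shape[1]):
        target = Xc[:, j]
        others = np.delete(Xc, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1.0 - (resid @ resid) / (target @ target)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical data: x2 is x1 plus a little noise, x3 is unrelated.
rng = np.random.default_rng(0)
n = 500
x1 = rng.standard_normal(n)
x2 = x1 + 0.1 * rng.standard_normal(n)
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(X)
print("VIFs for x1, x2, x3:", np.round(vifs, 1))
```

A large VIF for x1 and x2 means their individual coefficients are estimated very imprecisely, even though the two together may predict well. That is the sense in which collinearity undermines interpretation.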
In my opinion, and I stress that this is only an opinion, regression is just not quite the right tool for data exploration. It is a great tool for finding the degree of relationship for pre-specified variables.
In these days of big data, and in the days to come of even bigger data, I wonder if the whole branch of statistics that falls under "linear models" like regression, ANOVA, GLMMs, etc. will be considered the equivalent of steam power.
02-21-2012 05:51 PM
Thanks for referencing my paper, Art.
VS, as Steve pointed out, Stepwise is a bad method even if you have all the data and computing time in the world. The p values are too low, the standard errors are too small, the parameters are biased away from 0... it's not good.
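A small simulation can illustrate the point about p values. This is a minimal sketch in Python with NumPy, not SAS and not taken from the paper. It performs only the first step of forward selection on pure noise, and it uses a normal approximation for the p value. The function names, sample size, and number of candidate predictors are assumptions made for illustration.

```python
import math
import numpy as np

def naive_p_value(t_stat):
    # Two-sided p value from a normal approximation to the t distribution.
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

def best_single_predictor(X, y):
    """First step of forward selection: pick the column most correlated
    with y and return its index and the usual t statistic for its slope."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    j = int(np.argmax(np.abs(r)))
    t = r[j] * math.sqrt((n - 2) / (1.0 - r[j] ** 2))
    return j, t

def false_positive_rates(n_sims=500, n=100, p=50, seed=0):
    rng = np.random.default_rng(seed)
    selected_hits = 0
    prespecified_hits = 0
    for _ in range(n_sims):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # y is unrelated to every column of X
        # Predictor chosen by looking at the data, then tested naively.
        _, t_sel = best_single_predictor(X, y)
        selected_hits += naive_p_value(t_sel) < 0.05
        # Predictor fixed in advance (column 0), tested the same way.
        _, t_pre = best_single_predictor(X[:, :1], y)
        prespecified_hits += naive_p_value(t_pre) < 0.05
    return selected_hits / n_sims, prespecified_hits / n_sims

sel_rate, pre_rate = false_positive_rates()
print("share 'significant' after selection:", sel_rate)
print("share 'significant' when pre-specified:", pre_rate)
```

Every predictor here is noise, so any "significant" result is a false positive. The pre-specified predictor should be flagged at roughly the nominal 5% rate, while the selected one is flagged far more often, because the naive p value ignores that the variable was picked for having the largest correlation.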
If you insist on an automatic method, the lasso or LAR (least angle regression) is better; both are available in PROC GLMSELECT.
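For readers unfamiliar with the lasso, the sketch below shows the idea in Python with NumPy. It is a plain coordinate descent written for illustration, not the algorithm or interface of PROC GLMSELECT. The data, the penalty value `lam`, and the function names are all assumptions. It minimizes (1/(2n))·||y − Xb||² + lam·||b||₁ with y centered and the columns of X standardized.

```python
import numpy as np

def soft_threshold(z, lam):
    # Shrink z toward zero by lam; values within lam of zero become exactly 0.
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent.
    Assumes y is centered and each column of X has mean 0 and mean square 1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()  # residual y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j.
            rho = X[:, j] @ resid / n + beta[j]
            new_beta = soft_threshold(rho, lam)
            resid += X[:, j] * (beta[j] - new_beta)
            beta[j] = new_beta
    return beta

# Hypothetical data: 20 candidate predictors, only the first two matter.
rng = np.random.default_rng(1)
n, p = 200, 20
X_raw = rng.standard_normal((n, p))
y_raw = 2.0 * X_raw[:, 0] - 1.5 * X_raw[:, 1] + rng.standard_normal(n)

X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_ctr = y_raw - y_raw.mean()

beta_hat = lasso_coordinate_descent(X_std, y_ctr, lam=0.3)
print("nonzero coefficients at columns:", np.flatnonzero(beta_hat))
print("estimates for the two real predictors:", np.round(beta_hat[:2], 2))
```

The penalty shrinks every coefficient toward zero and sets weak ones exactly to zero, so selection and shrinkage happen together instead of selection followed by an unpenalized refit. In practice the penalty would be chosen by cross-validation or an information criterion, not fixed by hand as it is here.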
02-21-2012 06:04 PM
Hi Peter! Nice to see you here! There have been a number of interesting questions raised on the Discussion Forums over the past year that could have benefitted from your expertise. I'm sure there will be many more to come.