MikeTurner
Calcite | Level 5

What is the problem with this method? Can anyone explain it in detail? Thanks.

5 REPLIES
art297
Opal | Level 21

Take a look at: http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf

And, for more similar reading, just look up stepwise and cassell in any web search.

VX_Xc
Calcite | Level 5

In stepwise regression, the decisions about which variables to include are based on slight differences in their semi-partial correlations. This in turn leads to the danger of overfitting or underfitting, and the selected model may conflict with the theoretical importance of a predictor.

It sounds intuitively appealing to have a procedure that automatically chooses the predictors (regressors) for you, but samples are not perfect. The statistical procedures assume that our sample data are perfect (no measurement error, no omitted variables, and so on), so the statistical significance obtained under that assumption will be biased (wrong). We must reason and use our brains to choose which regressors to look at (ideally we would want some theory to base our decision on).

Also, which regressors a stepwise procedure selects depends on which regressors we have in our dataset. So if we have no theory about which regressors should be looked at and just use a stepwise procedure to select them, different people will reach different conclusions depending on the regressors they happen to have in their data.

I tend to think of the stepwise procedure as the best way to choose regressors IF we had datasets with all possible variables in the world (billions and billions of them) and the computing power to go through those variables at many times the speed of light. That is not possible now or in our lifetime.

SteveDenham
Jade | Level 19

Regarding the last statement: Stepwise regression would still yield biased results.  It is a matter of sampling from the population. You cannot get around it.

Now what you could do, given the abilities specified, is measure all possible variables on every individual in the population, and fit that by regression.  And watch collinearity kill the interpretation.

In my opinion, and I stress that this is only an opinion, regression is just not quite the right tool for data exploration.  It is a great tool for finding the degree of relationship for pre-specified variables.

In these days of big data, and in the days to come of even bigger data, I wonder if the whole branch of statistics that falls under "linear models" like regression, ANOVA, GLMMs, etc. will be considered the equivalent of steam power.

Steve Denham

plf515
Lapis Lazuli | Level 10

Thanks for referencing my paper, Art.

VS, as Steve pointed out, stepwise is a bad method even if you have all the data and computing time in the world. The p-values are too low, the standard errors are too small, and the parameter estimates are biased away from 0. It's not good.
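
A small simulation makes the p-value problem concrete. The sketch below is in Python rather than SAS and uses only the standard library. The function names, sample size, and number of candidates are assumptions chosen for illustration, not anything from the paper. It mimics only the first step of forward selection: every candidate predictor is pure noise, the one most correlated with the response is picked, and that winner is then tested at the nominal 5% level as if it had been specified in advance.

```python
import math
import random
import statistics

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_step_false_positive_rate(n=50, k=20, trials=400, seed=1):
    """Share of pure-noise datasets in which the best of k candidate
    predictors passes a nominal two-sided 5% test."""
    rng = random.Random(seed)
    # Approximate 5% critical value for |r| under the null, using
    # Fisher's z transform: atanh(r) is roughly N(0, 1/(n - 3)).
    r_crit = math.tanh(1.96 / math.sqrt(n - 3))
    hits = 0
    for _ in range(trials):
        y = [rng.gauss(0, 1) for _ in range(n)]
        # Largest absolute correlation among k unrelated noise predictors.
        best = max(
            abs(pearson_r([rng.gauss(0, 1) for _ in range(n)], y))
            for _ in range(k)
        )
        hits += best > r_crit
    return hits / trials

rate = first_step_false_positive_rate()
print(f"best of 20 noise predictors flagged at nominal 5%: {rate:.0%}")
```

Because the 20 candidates are independent noise, the chance that at least one clears a 5% test is about 1 - 0.95^20, roughly 64%, so the reported rate should land far above the nominal 5%. The selected correlation is also, by construction, the one furthest from its true value of 0.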

If you insist on an automatic method, Lasso or LAR is better; both are available in PROC GLMSELECT.
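
To give a feel for what the lasso does differently, here is a minimal Python sketch (standard library only) of lasso by coordinate descent on standardized predictors. It is not how PROC GLMSELECT is implemented, and every name, the penalty value `lam=0.3`, and the simulated data are assumptions made for illustration. The point is that the penalty shrinks every coefficient toward 0 and sets weak ones exactly to 0, instead of making keep-or-drop decisions from p-values.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward 0 by lam; values within lam of 0 become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    n = len(col)
    m = sum(col) / n
    s = (sum((v - m) ** 2 for v in col) / n) ** 0.5
    return [(v - m) / s for v in col]

def lasso_cd(cols, y, lam, sweeps=200):
    """Minimize (1/(2n)) * RSS + lam * sum(|beta_j|) by cyclic coordinate
    descent. Assumes every column in cols is already standardized."""
    n = len(y)
    ybar = sum(y) / n
    resid = [v - ybar for v in y]  # residuals with all coefficients at 0
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, x in enumerate(cols):
            # Correlation of x_j with the partial residual (x_j added back).
            rho = sum(xi * ri for xi, ri in zip(x, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                resid = [ri - delta * xi for xi, ri in zip(x, resid)]
                beta[j] = new
    return beta

# Simulated data: only the first of six predictors truly matters.
rng = random.Random(7)
n, p = 100, 6
cols = [standardize([rng.gauss(0, 1) for _ in range(n)]) for _ in range(p)]
y = [2.0 * cols[0][i] + rng.gauss(0, 1) for i in range(n)]

beta = lasso_cd(cols, y, lam=0.3)
print([round(b, 2) for b in beta])
```

In practice the penalty would be chosen by something like cross-validation rather than fixed by hand; the fixed value here only keeps the sketch short.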

art297
Opal | Level 21

Hi Peter!  Nice to see you here!  There have been a number of interesting questions raised on the Discussion Forums over the past year that could have benefitted from your expertise.  I'm sure there will be many more to come.
