06-03-2014 06:06 PM
This is my first time using PROC GLMSELECT with the LASSO option. However, the procedure ends very quickly, always after 2 steps. I changed the STOP= option but had no luck. And the result is really bad: R^2 is below 0.3. I don't understand why it just stops.
I have more than 200 IVs and only 1 DV (50 records).
Thanks for your input.
06-04-2014 08:58 AM
Any messages in the log, or in the output? It may be that your 200 IVs are highly correlated, so only two steps are needed to find an optimal set. However, it is hard to tell without more information.
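One way to collect that information is to ask the procedure for step-by-step details. This is only a sketch: the data set name work.mydata, the response y, and the predictors x1-x200 are placeholders I made up, so substitute your own names.

```sas
/* Hypothetical names: work.mydata, response y, predictors x1-x200 */
ods graphics on;

proc glmselect data=work.mydata plots=(criterionpanel coefficients);
   model y = x1-x200 / selection=lasso(stop=sbc)
                       details=all;   /* report each step, including fit statistics */
run;
```

The criterion panel and the coefficient progression plot should show how the fit criteria and the coefficients change from step to step, which makes it easier to see why selection stops where it does.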
06-04-2014 10:11 AM
If I specify selection=lasso(stop=ADJRSQ); then SAS stops after 2 steps and shows:
Selection stopped at a local maximum of the AdjRSq criterion.
If I specify selection=lasso(stop=SBC); then SAS stops after 2 steps and shows:
Selection stopped at a local minimum of the SBC criterion.
I only get 2 variables. The AdjRSq is pretty low in either test unless I specify steps=20. With the STEPS= option, the AdjRSq increases. However, the purpose of using lasso is to avoid overfitting. I looked at the selected variables, and I believe STEPS= is giving me an overfitted result.
Looking at the correlations between those variables, I don't believe all of them are strongly correlated.
Thanks for your help
06-04-2014 01:22 PM
This got me thinking a little bit. I used the example in the SAS/STAT 13.1 documentation, with changes. First, I ran:
proc glmselect data=sashelp.Leutrain plots=coefficients;
model y = x1-x7129/
This stopped after four steps. Then I ran:
proc glmselect data=sashelp.Leutrain /*valdata=sashelp.Leutest*/
model y = x1-x7129/
And this went out the full 20 steps, with the optimal value at step 20. OK, so what happened when I did not include the steps= option? Well, the adjRsq criterion actually went down with the inclusion of the fifth predictor, so the procedure stopped at that first local maximum, with an adjusted Rsq of 0.6132. I think this is what is happening with your data. I can get all sorts of answers from this dataset, depending on the combination of options.
My personal preference would be to minimize PRESS, rather than maximizing adjusted Rsquare or minimizing information criteria, especially if I were trying to build a predictive model.
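To see that dip rather than just stopping at it, something along these lines might help. It is a sketch, not the exact code I ran above: it forces 20 steps with STEPS= and then uses CHOOSE= to pick the step with the best adjusted R-square along the whole path, printing a few criteria at each step.

```sas
/* Sketch only: the option combination here is my assumption, not the runs above */
ods graphics on;

proc glmselect data=sashelp.Leutrain plots=(criterionpanel coefficients);
   model y = x1-x7129 / selection=lasso(steps=20 choose=adjrsq)
                        stats=(adjrsq sbc press);  /* show these criteria at each step */
run;
```

With STEPS= the stopping criterion no longer ends selection early, so a local maximum at an early step shows up as a bump in the criterion panel instead of being the end of the path.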
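A rough sketch of what I mean, for your data. The names work.mydata, y, and x1-x200 are placeholders, and the step cap of 40 and the seed are arbitrary values I picked to stay well below your 50 records. The second step swaps PRESS for 5-fold cross validation as an alternative.

```sas
/* Hypothetical names: work.mydata, response y, predictors x1-x200 */

/* Run the lasso path without an early stop, then choose the step with the smallest PRESS */
proc glmselect data=work.mydata;
   model y = x1-x200 / selection=lasso(stop=none choose=press maxstep=40);
run;

/* Alternative: choose the step by 5-fold cross validation (seed value is arbitrary) */
proc glmselect data=work.mydata seed=12345;
   model y = x1-x200 / selection=lasso(stop=none choose=cv maxstep=40)
                       cvmethod=random(5);
run;
```

Either way, the idea is to let the path run past the first local optimum and then pick the model by a prediction-oriented criterion, instead of fixing the number of steps by hand.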