neilxu
Calcite | Level 5

This is my first time using PROC GLMSELECT with the LASSO option. However, the procedure ends very quickly, always after 2 steps. I changed the STOP= option, but no luck. The result is also poor: R^2 is below 0.3. I don't understand why it stops so early.

I have more than 200 independent variables (IVs) and only 1 dependent variable (DV), with 50 records.


Thanks for your input.

SteveDenham
Jade | Level 19

Any messages in the log, or in the output?  It may be that your 200 IV are highly correlated, and so only two steps are needed to find an optimal set.  However, it is hard to tell without more information.
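One way to gather that information is to have the procedure print what it did at each step. A minimal sketch, assuming a hypothetical data set work.mydata with response y and predictors x1-x200 (placeholder names, not taken from the original post):

```sas
ods graphics on;

/* work.mydata, y, and x1-x200 are hypothetical names; substitute your own */
proc glmselect data=work.mydata plots=(criteria coefficients);
   model y = x1-x200 / selection=lasso(stop=sbc)
                       details=steps;   /* print entry and fit details for each step */
run;
```

The step details and the criterion panel should make it easier to see which criterion turned around at step 2 and by how much.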

Steve Denham

neilxu
Calcite | Level 5

If I specify selection=lasso(stop=ADJRSQ); then SAS stops after 2 steps and shows:

Selection stopped at a local maximum of the AdjRSq criterion.

If I specify selection=lasso(stop=SBC); then SAS stops after 2 steps and shows:

Selection stopped at a local minimum of the SBC criterion.

I only get 2 variables. The AdjRSq is pretty low in either test unless I specify steps=20. With the STEPS= option, the AdjRSq increases. However, the purpose of using LASSO is to avoid overfitting. Looking at the selected variables, I believe STEPS= is giving me an overfitted result.

Looking at the correlations between those variables, I don't believe all of them are strongly correlated.

Thanks for your help

SteveDenham
Jade | Level 19

This got me thinking a little bit.  I used the example in the SAS/STAT 13.1 documentation, with changes.  First, I ran:

 

proc glmselect data=sashelp.Leutrain plots=coefficients;
   model y = x1-x7129 / selection=LASSO(choose=adjrsq);
run;

This stopped after four steps.  Then I ran:

proc glmselect data=sashelp.Leutrain /*valdata=sashelp.Leutest*/
               plots=coefficients;
   model y = x1-x7129 / selection=LASSO(choose=adjrsq steps=20);
run;

And this went out the full 20 steps, with the optimal value at step 20.  OK, what happened when I did not include the steps= option? Well, the adjRsq criterion actually went down with the inclusion of the fifth predictor, and thus, the procedure stops, with an adjusted Rsq of 0.6132.  I think this is what is happening with your data.  I can get all sorts of answers from this dataset, based on a combination of options.
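To look past the first local optimum without fixing the number of steps in advance, one option is to turn off the stopping rule and let CHOOSE= pick the best model along the path. A sketch on the same data; the MAXSTEP=30 cap is an arbitrary assumption to keep the run short, not a recommended value:

```sas
ods graphics on;

proc glmselect data=sashelp.Leutrain plots=criteria;
   model y = x1-x7129 /
         selection=LASSO(stop=none        /* do not stop at the first local optimum */
                         maxstep=30       /* arbitrary cap, chosen for illustration */
                         choose=adjrsq);  /* select the step with the best AdjRSq   */
run;
```

The criterion panel should then show whether AdjRSq recovers after its first dip or simply keeps climbing as more predictors enter, which would point toward overfitting rather than a better model.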

My personal preference would be to minimize PRESS, rather than maximizing adjusted R-square or minimizing an information criterion, especially if I were trying to build a predictive model.
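A sketch of that preference on the same data, assuming CHOOSE=PRESS is accepted with SELECTION=LASSO in your SAS/STAT release (worth checking against the documentation for your version). STEPS=20 is carried over from the run above. The second call swaps in cross validation as a related alternative; the 5 folds and the seed value are arbitrary choices for illustration:

```sas
/* Choose the model along a 20-step LASSO path that minimizes PRESS */
proc glmselect data=sashelp.Leutrain;
   model y = x1-x7129 / selection=LASSO(choose=press steps=20);
run;

/* Alternative: choose by 5-fold cross validation; seed fixes the random folds */
proc glmselect data=sashelp.Leutrain seed=12345;
   model y = x1-x7129 / selection=LASSO(choose=cv steps=20)
                        cvmethod=random(5);
run;
```

Both criteria score each candidate model on observations held out of the fit, so they tend to penalize an overfitted path more directly than adjusted R-square does.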

Steve Denham
