DomUk
Fluorite | Level 6

Hi there,

I have two questions regarding the application of Lasso, using the cross validation method (cvex).

In my case, I run pooled cross-sectional rolling regressions, using 5 years of data, to forecast Earnings with 12 variables.

To check whether all variables are relevant, I use LASSO selection for every single regression of the rolling window (27 regressions in total, 1992-2018). For example, to estimate the coefficients for the year 2011, the following code is used:

proc glmselect data=mylib.earning plots=all seed=123;
where 2007<= Year <=2011;
model Earning(t+1)= x1(t)+...+ x12(t)
/selection=LASSO (stop=none choose=cvex);
run;
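
Written out as a loop over all 27 windows, the call might look like the sketch below. The macro name rolling_lasso, the response name earning_lead (standing in for Earning(t+1)) and the regressor names x1-x12 are placeholders, not the actual variable names.

```sas
/* Sketch only: one LASSO fit per 5-year window ending in &yr.        */
/* earning_lead and x1-x12 are placeholder names (assumptions).       */
%macro rolling_lasso(first=1992, last=2018);
  %do yr = &first %to &last;
    %let start = %eval(&yr - 4);   /* first year of the 5-year window */

    proc glmselect data=mylib.earning(where=(&start <= Year <= &yr))
                   seed=123;
      model earning_lead = x1-x12
            / selection=lasso(stop=none choose=cvex);
      /* keep the estimates of each window in their own data set */
      ods output ParameterEstimates=pe_&yr;
    run;
  %end;
%mend rolling_lasso;

%rolling_lasso(first=1992, last=2018)
```

Each window then leaves a data set pe_1992 ... pe_2018 that can be compared with the corresponding OLS fit.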

 

1) For most of the 27 regressions, all variables stay in the model, but sometimes the coefficients differ from those of the OLS regression. Shouldn't the coefficients equal the OLS solution, because there is no penalty in that case?

 

2) Very often, some coefficients in the LASSO output are zero, but the variables are not excluded from the model. They are still listed in the "Parameter Estimates" table, just with an estimate of zero. Why are these variables not excluded from the model?

 

Thank you very much for an answer.

Best regards

Dom

8 REPLIES
STAT_Kathleen
SAS Employee

Dom,

 

Could you provide the data set so we can take a closer look at your application and the results you reported?

 

When all the variables enter the model, I would expect the LASSO output to reduce to OLS, because t, the constraint parameter in the LASSO formulation (see the GLMSELECT documentation), can then be thought of as infinite.
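
For reference, the constrained form of the LASSO that this argument relies on can be written as below. The symbols (y for the response, X for the design matrix, m for the number of regressors) are notation chosen here for illustration.

```latex
\hat{\beta}(t) \;=\; \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad \sum_{j=1}^{m} \lvert \beta_j \rvert \le t
```

Once t is at least as large as the sum of the absolute OLS coefficients, the constraint is no longer binding, so the minimizer is the ordinary least squares solution itself (assuming the OLS solution is unique).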

 

I attached code that simulates a data set where all the variables enter the model and the LASSO estimates reduce to the OLS estimates.
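
The attachment is not reproduced in this thread. A minimal sketch of such a simulation, with hypothetical data set and variable names (work.sim, y, x1-x12) and made-up true coefficients, might look like this:

```sas
/* Hypothetical simulation: 12 regressors, all with nonzero true effects */
data work.sim;
  call streaminit(123);
  array x{12} x1-x12;
  do i = 1 to 1000;
    y = 0;
    do j = 1 to 12;
      x{j} = rand('normal');
      y = y + j * x{j};        /* assumed true coefficient of x_j is j */
    end;
    y = y + rand('normal');    /* add noise */
    output;
  end;
  drop i j;
run;

/* LASSO path run to its end; with no CHOOSE= option the model at the */
/* final step is the one reported                                      */
proc glmselect data=work.sim;
  model y = x1-x12 / selection=lasso(stop=none);
run;

/* Ordinary least squares fit for comparison */
proc reg data=work.sim;
  model y = x1-x12;
run;
quit;
```

If all twelve effects enter, the parameter estimates from the two procedures should agree up to numerical precision.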

 

 

DomUk
Fluorite | Level 6

Thank you very much for the answer.

Unfortunately, I am not allowed to share the data set for data protection reasons.

Most of the time, when SAS tells me that all variables should stay in the model, the result is identical to OLS (PROC REG), so everything is fine. Only in some cases is this not so, and I noticed that in these cases variables with a coefficient of zero have stayed in the model. So SAS tells me I should keep the variable (it is not excluded from the "Parameter Estimates" output), but its estimate is zero. This is very confusing to me.

 

DomUk
Fluorite | Level 6

Here is the output and my code. Maybe this is helpful.

Thank you!

SteveDenham
Jade | Level 19

This note could be extremely useful:

https://support.sas.com/kb/60/240.html 

 

It shows what to do with the zero values obtained for parameters under LASSO.

 

SteveDenham

DomUk
Fluorite | Level 6

Thank you for the answer, but which part do you mean specifically? I cannot find any information there that fits my problem.

STAT_Kathleen
SAS Employee

I have been able to replicate the behavior of zero coefficient included in the final model using LASSO selection.

In my statistical opinion, if the zero coefficients appear in the intermediate steps, it would be quite reasonable and okay. However, there should be no zero coefficients in the final selected model.

 

I have informed our developers of this particular behavior and we are currently researching this particular issue. I will update you once I have more specific details. 

 

DomUk
Fluorite | Level 6
Thank you very much!
DomUk
Fluorite | Level 6

Hi Kathleen,

I have one more question:

I would like to use LASSO to exclude unimportant variables.

Concretely, I would like to fit the regression on 5 years of data (2000-2004) and validate it on the year 2005. My data set covers the years 1980-2020, so do you have an idea how I could handle this? I tried to save all the data from 2005 in a new data set, but it doesn't work. I think the starting point is something like this:

proc glmselect data=mylib.dataset plots=all seed=123 valdata= ??? ;
where 2000 <= year <= 2004 ;
model y= x1........x100
/selection= lasso (stop=none choose=validate);
ods output parameterestimates= check_lasso_parms;
run;
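
One way this could be set up, sketched under assumptions: the training and validation years are first written to two separate data sets (the names work.train and work.valid are made up here), and the validation set is then passed through VALDATA=. The regressor list x1-x100 assumes consecutively numbered variable names.

```sas
/* Split the full data set into a training part (2000-2004) and a */
/* validation part (2005); all other years are dropped             */
data work.train work.valid;
  set mylib.dataset;
  if 2000 <= year <= 2004 then output work.train;
  else if year = 2005 then output work.valid;
run;

/* Fit the LASSO path on the training years and choose the step  */
/* with the smallest error on the validation year                */
proc glmselect data=work.train valdata=work.valid plots=all seed=123;
  model y = x1-x100 / selection=lasso(stop=none choose=validate);
  ods output ParameterEstimates=check_lasso_parms;
run;
```

The WHERE statement from the draft above is left out here because the year filter is already applied in the DATA step.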

 

Thanks a lot for an answer
