DomUk
Fluorite | Level 6

Hi all,

I would like to use LASSO to exclude unimportant variables.

I would like to fit the regression on 5 years of data (2000-2004) and validate it on the year 2005. My dataset contains the years 1980-2020, so does anyone have an idea how I could handle this? I tried to save all data from 2005 in a new dataset, but it doesn't work. I think the starting point is something like this:

proc glmselect data=mylib.dataset plots=all seed=123 valdata= ??? ;
 where 2000 <= year <= 2004;
 model y = x1-x100
       / selection=lasso(stop=none choose=validate);
 ods output parameterestimates=check_lasso_parms;
run;

 

Thanks a lot for an answer

5 REPLIES
sbxkoenk
SAS Super FREQ

How come you cannot save your 2005 data in a separate dataset?

 

All you should do is this:

 

data want;
 set have;
 where year(your_date_var)=2005;
run;

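Take care: there is a substantial difference between validation data (VALDATA=) and test data (TESTDATA=). Validation data are used during model selection (for example with CHOOSE=VALIDATE), whereas test data play no role in selection and are only used to assess the selected model.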
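
As a rough, untested sketch of how the two options sit side by side: the data set names work.train, work.valid and work.holdout are hypothetical, and y and x1-x100 are just the placeholders from the question.

```sas
/* Hypothetical data sets: work.train, work.valid, work.holdout */
proc glmselect data=work.train
               valdata=work.valid      /* used during selection (here via CHOOSE=VALIDATE) */
               testdata=work.holdout;  /* not used for selection, only for assessment */
  model y = x1-x100
        / selection=lasso(stop=none choose=validate);
run;
```

If you only have one hold-out year, you have to decide which of the two roles it should play.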

 

Also for more info on LASSO, I advise this paper:

SAS Global Forum 2020
Paper SAS4287-2020
A Survey of Methods in Variable Selection and Penalized Regression
Yingwei Wang, SAS Institute Inc.

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4287-2020.pdf

 

 

Cheers,

Koen

DomUk
Fluorite | Level 6
Sorry, maybe I was not clear enough. I can save the data in a separate dataset; that is not the problem. The problem is that it does not work in the LASSO code: the VALDATA= step does not work.
sbxkoenk
SAS Super FREQ

Hello,

I would be astonished if VALDATA= does not work when used appropriately.

In cases like this, it's always best to include the LOG.

Please include the LOG by using the 'Insert Code' icon (</>) above your entry; that way the LOG does not lose its structure and formatting.

Thanks,

Koen

 

DomUk
Fluorite | Level 6
Thanks for your answer. I decided to work with cross validation, but I am still interested in what the problem is.
The log is the following:

2 Data test;
3 set mylibf1.endversion;
4 where houyear=2005;
5 run;

NOTE: There were 1859 observations read from the data set MYLIBF1.ENDVERSION.
WHERE houyear=2005;
NOTE: The data set WORK.TEST has 1859 observations and 29 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.04 seconds


6 proc glmselect data=mylibf1.endversion plots=all seed=123 valdata=test;
NOTE: Writing HTML Body file: sashtml.htm
7 where 2000 <= Houyear <= 2004;
8 model F1_Earn_s_t= BV_s_t negE negEE_s_t c_sales_s_t c_cogs_s_t c_oe_s_t c_int_s_t c_tax_s_t
8 ! c_other_s_t del_ar_s_t del_inv_s_t del_ap_s_t depr_s_t amort_s_t oth_acc_s_t
9 /selection= lasso (stop=none choose=validate) ;
10 ods output parameterestimates= test;
11 run;

ERROR: Selection aborted as there are no suitable observations for validation.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: Output 'parameterestimates' was not created. Make sure that the output object name,
label, or path is spelled correctly. Also, verify that the appropriate procedure options
are used to produce the requested output object. For example, verify that the NOPRINT
option is not used.
NOTE: There were 7513 observations read from the data set MYLIBF1.ENDVERSION.
WHERE (Houyear>=2000 and Houyear<=2004);
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 0.99 seconds
cpu time 0.25 seconds


12 Data test;
13 set mylibf1.endversion;
14 where houyear=2005;
15 run;

ERROR: You cannot open WORK.TEST.DATA for output access with member-level control because
WORK.TEST.DATA is in use by you in resource environment ViewTable Window.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds


16 Data test;
17 set mylibf1.endversion;
18 where houyear=2005;
19 run;

NOTE: There were 1859 observations read from the data set MYLIBF1.ENDVERSION.
WHERE houyear=2005;
NOTE: The data set WORK.TEST has 1859 observations and 29 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds


20 proc glmselect data=mylibf1.endversion plots=all seed=123 valdata=test;
21 where 2000 <= Houyear <= 2004;
22 model F1_Earn_s_t= BV_s_t negE negEE_s_t c_sales_s_t c_cogs_s_t c_oe_s_t c_int_s_t c_tax_s_t
22 ! c_other_s_t del_ar_s_t del_inv_s_t del_ap_s_t depr_s_t amort_s_t oth_acc_s_t
23 /selection= lasso (stop=none choose=validate) ;
24 ods output parameterestimates= test_2;
25 run;

ERROR: Selection aborted as there are no suitable observations for validation.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: Output 'parameterestimates' was not created. Make sure that the output object name,
label, or path is spelled correctly. Also, verify that the appropriate procedure options
are used to produce the requested output object. For example, verify that the NOPRINT
option is not used.
NOTE: There were 7513 observations read from the data set MYLIBF1.ENDVERSION.
WHERE (Houyear>=2000 and Houyear<=2004);
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 0.18 seconds
cpu time 0.06 seconds

sbxkoenk
SAS Super FREQ

Hello,

 

I haven't tested it (I leave that up to you 😉 ), but my guess is that the WHERE statement applies to all incoming data sets, including the VALDATA= data set. If so, no observations qualify for validation anymore (the validation data contain only 2005, which the WHERE clause excludes), and that is a problem with CHOOSE=VALIDATE, of course.
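
If that guess is right, one possible way around it is to drop the WHERE statement and restrict only the training data with a WHERE= data set option, which applies only to the data set it is attached to. This is an untested sketch: the data set name valid2005 is made up here, while the library, variable names and the output name check_lasso_parms are taken from the posts above.

```sas
/* Validation data: 2005 only (valid2005 is a hypothetical name) */
data valid2005;
  set mylibf1.endversion;
  where houyear = 2005;
run;

/* Training data: the WHERE= data set option filters DATA= only,  */
/* so the VALDATA= data set keeps its 2005 observations.           */
proc glmselect data=mylibf1.endversion(where=(2000 <= houyear <= 2004))
               valdata=valid2005 plots=all seed=123;
  model F1_Earn_s_t = BV_s_t negE negEE_s_t c_sales_s_t c_cogs_s_t c_oe_s_t
                      c_int_s_t c_tax_s_t c_other_s_t del_ar_s_t del_inv_s_t
                      del_ap_s_t depr_s_t amort_s_t oth_acc_s_t
        / selection=lasso(stop=none choose=validate);
  ods output parameterestimates=check_lasso_parms;
run;
```

It is also safer to give the ODS OUTPUT data set a name different from the validation data set. In the first attempt in the log both were called test.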

 

Cheers,

Koen

 
