Babinetos
Calcite | Level 5

I have a dataset with a very large number of variables, and I am trying to use PROC GLMSELECT with the LASSO option to select the most important ones. However, I am having trouble specifying the model: I do not want to type out the names of all 1000 variables. Is there a more compact way to write the model, something like the program below?

PROC GLMSELECT DATA=PCData PLOTS =COEFFICIENTS;
MODEL y=(All Vars in Dataset)/ SELECTION=LASSO;
RUN;

 I am using SAS 9.4. 

 

Thanks

 

1 ACCEPTED SOLUTION

PeterClemmensen
Tourmaline | Level 20

@OskarE has written an article on this very topic: "Automatic modeling with thousands/millions of inputs but only a few lines of code!".

 

You can use the PROC CONTENTS approach from that post, or retrieve the variable names from dictionary.columns like this:

 

proc sql noprint;
	/* All columns whose name does not contain SALARY (this drops both Salary and logSalary) */
	select name into :GLMVars separated by ' ' from dictionary.columns
	where libname="SASHELP" and memname="BASEBALL" and upcase(name) not contains "SALARY";
	/* The character columns among them, to be listed in the CLASS statement */
	select name into :ClassVars separated by ' ' from dictionary.columns
	where libname="SASHELP" and memname="BASEBALL" and upcase(type)="CHAR" and upcase(name) not contains "SALARY";
quit;

%put &GLMVars.;
%put &ClassVars.;

proc glmselect data=sashelp.baseball;
	class &ClassVars.;
	model salary= &GLMVars. / selection=lasso;
run;
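
Adapted to the data set in the question, the same idea might look like the sketch below. It assumes that PCData lives in the WORK library, that the response is named y, and that only numeric predictors are wanted; the macro variable name XVars is illustrative. None of these details are confirmed in the thread, so adjust them to your data.

```sas
/* Sketch only: assumes WORK.PCData with a numeric response named y. */
proc sql noprint;
   select name into :XVars separated by ' '
   from dictionary.columns
   where libname = "WORK"            /* library names are stored in upper case */
     and memname = "PCDATA"          /* member names are stored in upper case  */
     and type = "num"                /* keep numeric columns only              */
     and upcase(name) ne "Y";        /* exact match, so a name such as YEAR is kept */
quit;

%put &=XVars;                        /* write the generated list to the log */

proc glmselect data=PCData plots=coefficients;
   model y = &XVars / selection=lasso;
run;
```

The exact comparison on the name is used here instead of CONTAINS so that predictors whose names merely include the letter Y are not dropped by accident.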

4 REPLIES
Babinetos
Calcite | Level 5
Thanks for the Link as well! It was very useful!
Rick_SAS
SAS Super FREQ

The usual way to do this is to use the _NUMERIC_ keyword, which means "use all numeric variables":

model y = _NUMERIC_ / selection=lasso;

 

Unfortunately, this won't work here because _NUMERIC_ includes the response variable (Y), and of course that variable explains all the variation so the procedure will select Y and stop!

 

If all the variables begin with the same prefix (such as X1, X2, X3, ...), you can use a colon as a wildcard:

model y = x: / selection=lasso;
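
As a small illustration of the prefix wildcard, here is a sketch on simulated data. The data set name Sim, the variables x1-x20, the seed, and the coefficients are all made up for the example.

```sas
/* Sketch with simulated data; Sim, x1-x20 and the seed are assumptions. */
data Sim;
   call streaminit(1);
   array x[20];                       /* creates numeric variables x1-x20 */
   do i = 1 to 200;
      do j = 1 to dim(x);
         x[j] = rand("Normal");
      end;
      y = 2*x1 - 3*x5 + rand("Normal");   /* only x1 and x5 carry signal */
      output;
   end;
   drop i j;
run;

proc glmselect data=Sim;
   model y = x: / selection=lasso;   /* x: expands to every variable whose name starts with x */
run;
```

Because the names here are consecutively numbered, the numbered range list x1-x20 would select the same variables.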

 

Otherwise, I suggest you select all numeric variables that are NOT the response into a macro variable. You can use PROC CONTENTS to get the variables and PROC SQL to create the macro variable, as follows:

 

proc contents data=sashelp.cars(drop=_CHARACTER_ mpg_city) /* drop character variables and the response (Y) */
     out=varnames(keep = varnum name) noprint;
run;

proc sql noprint;
   select name into :XVars separated by ' '
   from varnames;
quit; 

%put &=XVars;

proc glmselect data=sashelp.cars;
model mpg_city= &XVars / selection=lasso;
run;
Babinetos
Calcite | Level 5
Thanks for the answer! I liked the wildcard option as well!
