PROC GLMSELECT Model option when dataset has a large number of variables

I have a dataset with a very large number of variables, and I am trying to use PROC GLMSELECT with the LASSO option to select the most important ones. However, I am having trouble specifying the model. I do not want to type out the names of all 1000 variables. Is there a more compact way to write the model, something like the program below?

PROC GLMSELECT DATA=PCData PLOTS =COEFFICIENTS;
MODEL y=(All Vars in Dataset)/ SELECTION=LASSO;
RUN;

I am using SAS 9.4.

Thanks

 


Accepted Solutions
Solution
‎02-12-2018 09:08 PM

Re: PROC GLMSELECT Model option when dataset has a large number of variables

Posted in reply to Babinetos

@OskarE has written an article on this very topic: "Automatic modeling with thousands/millions of inputs but only a few lines of code!"

You can use the PROC CONTENTS approach from that article, or retrieve the variable names from dictionary.columns like this:

 

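/* Build space-separated variable lists from the DICTIONARY.COLUMNS metadata. */
/* LIBNAME and MEMNAME are stored in upper case, so compare against upper-case literals. */
proc sql noprint;
	/* Candidate effects: every column whose name does not contain SALARY */
	select name into :GLMVars separated by ' ' from dictionary.columns
	where libname="SASHELP" and memname="BASEBALL" and upcase(name) not contains "SALARY";
	/* Character columns also have to be listed in the CLASS statement */
	select name into :ClassVars separated by ' ' from dictionary.columns
	where libname="SASHELP" and memname="BASEBALL" and upcase(type)="CHAR" and upcase(name) not contains "SALARY";
quit;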

%put &GLMVars.;
%put &ClassVars.;

proc glmselect data=sashelp.baseball;
	class &ClassVars.;
	model salary= &GLMVars. / selection=lasso;
run;
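
One thing to watch in the filter above: `not contains "SALARY"` drops every column whose name includes that string (here both Salary and logSalary), so on another dataset it could also drop a genuine predictor that happens to share part of the response's name. It also keeps identifier-like character columns such as Name, which then become CLASS effects with one level per player. The variant below is only a sketch: it excludes columns by exact name, and the exclusion list is an assumption chosen for SASHELP.BASEBALL that you would replace with your own.

```sas
/* Sketch: exclude columns by exact name instead of by substring.        */
/* The names in the exclusion lists are assumptions for SASHELP.BASEBALL. */
proc sql noprint;
	/* Candidate effects: everything except the response, its log, and the ID column */
	select name into :GLMVars separated by ' ' from dictionary.columns
	where libname="SASHELP" and memname="BASEBALL"
	  and upcase(name) not in ("SALARY", "LOGSALARY", "NAME");
	/* Character columns among those candidates go on the CLASS statement */
	select name into :ClassVars separated by ' ' from dictionary.columns
	where libname="SASHELP" and memname="BASEBALL" and upcase(type)="CHAR"
	  and upcase(name) ne "NAME";
quit;

/* Write both lists to the log as NAME=value before using them */
%put &=GLMVars;
%put &=ClassVars;

proc glmselect data=sashelp.baseball;
	class &ClassVars;
	model salary = &GLMVars / selection=lasso;
run;
```

Checking the two %put lines in the log before running the procedure is a cheap way to confirm that the response is not among the effects.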



All Replies

Re: PROC GLMSELECT Model option when dataset has a large number of variables

Thanks for the link as well! It was very useful!

Re: PROC GLMSELECT Model option when dataset has a large number of variables

Posted in reply to Babinetos

The usual way to do this is to use the _NUMERIC_ keyword, which means "use all numeric variables":

model y = _NUMERIC_ / selection=lasso;

 

Unfortunately, this won't work here because _NUMERIC_ includes the response variable (Y), and of course that variable explains all the variation, so the procedure will select Y and stop!

If all the variables begin with the same prefix (such as X1, X2, X3, ...), you can use a colon as a wildcard:

model y = x: / selection=lasso;
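
To try the wildcard without touching real data, here is a small self-contained sketch. The dataset name work.sim, the 20 predictors, the seed, and the coefficients are all made up for illustration. It also shows the numbered range list x1-x20, which is another standard SAS variable-list shorthand when the names share a prefix and a numeric suffix.

```sas
/* Simulated data: the names x1-x20 and y are assumptions for this sketch */
data work.sim;
   call streaminit(2018);              /* fix the seed so reruns give the same data */
   array x[20];                        /* creates numeric variables x1-x20 */
   do i = 1 to 200;
      do j = 1 to dim(x);
         x[j] = rand("Normal");
      end;
      y = 2*x[1] - 3*x[5] + rand("Normal");   /* only x1 and x5 drive y */
      output;
   end;
   drop i j;
run;

/* Prefix list: every variable whose name starts with x */
proc glmselect data=work.sim;
   model y = x: / selection=lasso;
run;

/* Numbered range list: x1, x2, ..., x20 */
proc glmselect data=work.sim;
   model y = x1-x20 / selection=lasso;
run;
```

Because the response is named y, neither list can pick it up, which is exactly the problem that _NUMERIC_ runs into.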

 

Otherwise, I suggest you select all numeric variables that are NOT the response into a macro variable. You can use PROC CONTENTS to get the variables and PROC SQL to create the macro variable, as follows:

 

proc contents data=sashelp.cars(drop=_CHARACTER_ mpg_city) /* Y */
     out=varnames(keep = varnum name) noprint;
run;

proc sql noprint;
   select name into :XVars separated by ' '
   from varnames;
quit; 

%put &=XVars;

proc glmselect data=sashelp.cars;
model mpg_city= &XVars / selection=lasso;
run;
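
With something like 1000 variables it can be worth checking the generated list before handing it to the procedure. The sketch below repeats the PROC CONTENTS step so that it runs on its own, then counts the names and measures the length of the list. The macro variable nX is a name introduced here for illustration. As far as I know, a macro variable can hold at most 65,534 characters, so a very long list of long names could in principle be truncated; treat that figure as something to confirm for your release.

```sas
/* Collect the numeric variables other than the response (mpg_city) */
proc contents data=sashelp.cars(drop=_CHARACTER_ mpg_city)
     out=varnames(keep = varnum name) noprint;
run;

proc sql noprint;
   select name into :XVars separated by ' '
   from varnames
   order by varnum;                 /* keep the data set's column order */
quit;

/* nX is a hypothetical helper: number of blank-separated names in the list */
%let nX = %sysfunc(countw(&XVars, %str( )));
%put NOTE: &nX predictors selected, list is %length(&XVars) characters long.;
```

If the count in the log does not match the number of predictors you expect, fix the selection step before running PROC GLMSELECT.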

Re: PROC GLMSELECT Model option when dataset has a large number of variables

Thanks for the answer! I liked the wildcard option as well!