Hi,
I have a dataset with 3 independent variables (var1 - var3) and one dependent variable (var0). I need to run a regression for every possible combination of the 3 independent variables.
data have;
input var0 var1 var2 var3;
Datalines;
0.17 3.84 15.60 17.15
0.13 3.72 1.90 17.46
0.18 8.44 22.80 12.37
0.14 6.29 5.60 8.73
;
run;
The regression should be run for each of these combinations:
var1
var2
var3
var1 var2
var2 var3
and so on.
Thanks.
OK. Is this what you want?
data have;
input var0 var1 var2 var3;
Datalines;
0.17 3.84 15.60 17.15
0.13 3.72 1.90 17.46
0.18 8.44 22.80 12.37
0.14 6.29 5.60 8.73
;
run;
%macro reg_all_comb(dsn=, y=, x=);
   /* number of candidate regressors */
   %let n=%sysfunc(countw(&x.,%str( )));
   %put &=n;

   /* one row per subset of the regressors; GRAYCODE walks through all subsets */
   data all_comb;
      length comb $ 200;
      array x{&n.};
      k=-1;   /* a negative k makes the first call return the empty subset */
      do i=1 to 2**&n.;
         rc=graycode(k,of x{*});
         do j=1 to &n.;
            if x{j}=1 then comb=catx(' ',comb,scan("&x.",j,' '));
         end;
         output;
         call missing(comb);
      end;
   run;

   /* generate one PROC REG step with a labeled MODEL statement per non-empty subset */
   data _null_;
      set all_comb(where=(comb is not missing)) end=last;
      if _n_=1 then do;
         call execute("ods select none;");
         call execute("ods output FitStatistics=FitStatistics ParameterEstimates=ParameterEstimates;");
         call execute("proc reg data=&dsn.;");
      end;
      call execute(catt(compress(comb),": model &y.=",comb,";"));
      if last then call execute('quit; ods select all;');
   run;
%mend;
%reg_all_comb(dsn=have, y=var0, x=var1 var2 var3)
The SELECTION= option performs variable selection.
var1|var2|var3 is shorthand for the full factorial model:
var1 var2 var3 var1*var2 var1*var3 var2*var3 var1*var2*var3
So don't use @2 there: it limits each effect to at most 2 variables, so the term var1*var2*var3 would be removed.
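To make the effect of @2 concrete, here is a minimal sketch. It assumes the `have` data set from the question and uses PROC GLMSELECT purely for illustration, because PROC REG does not accept the bar notation; SELECTION=NONE just fits the model as specified. With only 4 observations the fits themselves are not meaningful; the point is which effects end up in the model.

```sas
/* Illustrative only: compare the effects generated by the two specifications. */

/* Full factorial: var1 var2 var3 var1*var2 var1*var3 var2*var3 var1*var2*var3 */
proc glmselect data=have;
   model var0 = var1|var2|var3 / selection=none;
run;

/* With @2, no effect involves more than 2 variables, so var1*var2*var3 is dropped */
proc glmselect data=have;
   model var0 = var1|var2|var3 @2 / selection=none;
run;
```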
Is this what you want?
data have;
input var0 var1 var2 var3;
Datalines;
0.17 3.84 15.60 17.15
0.13 3.72 1.90 17.46
0.18 8.44 22.80 12.37
0.14 6.29 5.60 8.73
;
run;
%macro reg_all_comb(dsn=, y=, x=);
   %let n=%sysfunc(countw(&x.,%str( )));
   %put &=n;

   /* one row per subset of the regressors (the first row is the empty subset) */
   data all_comb;
      length comb $ 200;
      array x{&n.};
      k=-1;
      do i=1 to 2**&n.;
         rc=graycode(k,of x{*});
         do j=1 to &n.;
            if x{j}=1 then comb=catx(' ',comb,scan("&x.",j,' '));
         end;
         output;
         call missing(comb);
      end;
   run;

   /* one PROC REG step with one MODEL statement per non-empty subset */
   data _null_;
      set all_comb(where=(comb is not missing)) end=last;
      if _n_=1 then call execute("proc reg data=&dsn.;");
      call execute(catt(" model &y.=",comb,";"));
      if last then call execute('quit;');
   run;
%mend;
%reg_all_comb(dsn=have, y=var0, x=var1 var2 var3)
Hi, thank you for your response, but I have a few questions about the code. You used the GRAYCODE function. Is that different from using the ALLCOMB function to get the combinations?
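As a rough sketch of the difference, going by the documented behavior of the two routines: GRAYCODE steps through all 2**n subsets of n items, of any size, using a 0/1 indicator array, while ALLCOMB steps through the combinations of one fixed size k. So all-subsets regression needs a single GRAYCODE loop, or one ALLCOMB loop per subset size. The array names `x` and `v` below are just illustrative.

```sas
/* GRAYCODE: x[j]=1 means item j is in the current subset; k is the subset size */
data _null_;
   array x[3];
   k = -1;                          /* negative k: start from the empty subset */
   do i = 1 to 2**dim(x);
      rc = graycode(k, of x[*]);
      put 'graycode: ' i= k= x[*];
   end;
run;

/* ALLCOMB: after each call the first k array elements hold the combination */
data _null_;
   array v[3] $8 ('var1' 'var2' 'var3');
   k = 2;
   do j = 1 to comb(dim(v), k);
      call allcomb(j, k, of v[*]);
      put 'allcomb: ' j= v1-v2;
   end;
run;
```

The first step writes 8 lines to the log (including the empty subset), the second writes the 3 pairs.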
If you use SELECTION=RSQUARE and STOP=3 on the MODEL statement in PROC REG, that should give you all possible regressions with up to 3 variables in the model. Try the code below. This code can be extremely resource intensive when many variables are listed on the MODEL statement. If you have 10 variables, then the code will yield C(10,1) + C(10,2) + C(10,3) = 10 + 45 + 120 = 175 regressions. If you run with 100 variables, then that is 100 + 4,950 + 161,700 = 166,750 models. If you run with 1000 variables, then that's ~167 million models.
data test;
   call streaminit(51436);
   array x{10} x1-x10;
   do i=1 to 100;
      y=rand("normal");
      do j=1 to 10;
         x{j}=rand("normal");
      end;
      output;
   end;
run;

proc reg data=test outest=stats;
   model y=x1-x10 / selection=rsquare stop=3;
run;

proc print data=stats;
run;
The OUTEST= data set will contain the _RMSE_ and _RSQ_ variables along with the intercept and slopes for each model. If you change to SELECTION=ADJRSQ and add the ADJRSQ option to the MODEL statement, then the OUTEST= data set will also contain the adjusted R-square. If other statistics are needed, like Mallows' Cp, there are options on the MODEL statement to include them in the OUTEST= data set.
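Before running this on a wide data set, it may help to check how many models STOP=3 implies. The sketch below simply sums C(n,1) + C(n,2) + C(n,3) with the COMB function; the variable names `nvars` and `nmodels` are illustrative.

```sas
/* Count the models with 1, 2, or 3 regressors for several numbers of candidates */
data _null_;
   do nvars = 10, 100, 1000;
      nmodels = 0;
      do k = 1 to 3;                      /* STOP=3: at most 3 regressors */
         nmodels = nmodels + comb(nvars, k);
      end;
      put nvars= nmodels= comma15.;       /* written to the SAS log */
   end;
run;
```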
Greetings from @Rick_SAS, who wrote a blog post about the SWEEP operator.
You can also do it with the sweep operator:
data test;
call streaminit(51436);
array x{10} x1-x10;
do i=1 to 100;
y=rand("normal");
do j=1 to 10;
x{j}=rand("normal");
end;
output;
end;
run;
ods graphics on;
%let numsim=10;
proc iml;
xVarNames = "X1":"X&numSim"; /* names of explanatory variables */
varNames = xVarNames || "y" ; /* name of all data variables */
use test; read all var varNames into M [colname=varyplus];
close;
M = j(nrow(M), 1, 1) || M; /* add intercept column */
varyplus="intercept" || varnames;
mattrib m c=varyplus;
tss=(m[, {"y"}]-(m[, {"y"}] [:])) [##];
model_vars="x1":"x10";
vars=10;
max_cross=3;
ncomb=0;
do t=1 to max_cross;
ncomb=ncomb + comb(vars, t);
end;
results=t(1:ncomb) || j(ncomb, max_cross + 2, .);
model_info=j(ncomb, max_cross, "        "); /* 8 blanks, so names like x10 are not truncated */
cnt=0;
do i=1 to max_cross;
idx=allcomb(vars, i)+1;
idx=j(nrow(idx),1,1)||idx;
do u=1 to nrow(idx);
S1 = sweep(M`*M, idx[u,]);
rss=((t(S1[idx[u,], ncol(m)])#m[,idx[u,]]) [,+] - m[, {"y"}]) [##] ;
rsq=1-rss/tss;
cnt=cnt+1;
results[cnt,2:i+2]=S1[idx[u,],nrow(s1)]`;
results[cnt, ncol(results)]=rsq;
model_info[cnt,1:ncol(idx[u,])-1]=model_vars[idx[u,2:ncol(idx)]-1]`;
end;
end;
call symputx("cross", max_cross);
call sortndx(rr, results, ncol(results));
results=results[rr,];
model_info=model_info[rr,];
names={"obs" "intercept"} || ("est1":"est&cross.") || {"_rsq_" "rank"} ;
names2="model_var1":"model_var&cross.";
create parameter_estimate from results [colname=names];
append from results;
close;
create model_var from model_info [colname=names2];
append from model_info;
close;
quit;
data final_result;
merge parameter_estimate model_var;
run;