Dear friends, SAS communities,

I have searched for this question but have not found an answer. I noticed that best subset selection in SAS is unusually fast, and it seems impossible to scan all the combinations in such a short time. When I used R for best subset selection (with the 'leaps' package), it took 3 hours, so I trusted that it does scan all 2^p combinations; I have p = 50, which gives over a billion models. SAS needed only 1 second. In fact, the output from SAS is the same as the result of stepwise selection in R.

(Remark: the comment by FreelanceReinhard is right. I think R did not search all the combinations either: 2^50 gives over 10^15 combinations.)

So my question is: does SAS actually use stepwise selection for "best subset" selection when the number of features exceeds a certain number?

The code:

/* SAS code for best subset selection */
proc reg data = mydata4 plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=cp best =5 vif stb;
run;
quit;
# R code for stepwise selection, which gives the same results as the SAS code above:
library(MASS)  # provides stepAIC()
fit_allvars <- lm(Share_Temporary ~ ., data = mydata4)
step <- stepAIC(fit_allvars, direction = "both")

Moreover, if I run stepwise selection in SAS, it gives a shorter list of variables, but all of them are contained in the best subset selection result from SAS.

/* SAS code for best subset selection */
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=cp best = 3 stb;
run;
quit;
/* SAS code for stepwise selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=stepwise;
run;
quit;

A comparison of the SAS results from the code above, with the shared variables highlighted: all the variables in the stepwise selection result are included in the "Best Subset" selection result.

I have discussed this in more detail on my blog, but this seems to be the case and I have found no further explanation for it. If anyone could share some insights, I would really appreciate it. It has really puzzled me.

Thank you for your time and advice,
Best Regards,
Yang

P.S. I have attached the dataset. There are 49 predictors starting at column C (A, B are ID, intercept). The dependent variable is the last column. The SAS code below can be applied directly (after changing the directory). Dimension: 598 * 51

proc import datafile="D:\....... \mydata4.csv" out= mydata dbms=csv replace;
run;
/*proc print data = mydata (obs = 10);*/
/*run;*/
proc corr data = mydata noprob;
run;
/* SAS code for best subset selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=cp best = 3 stb;
run;
quit;
/* SAS code for stepwise selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=stepwise;
run;
quit;
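
For reference, the size of the search space discussed in the post can be worked out roughly as follows. This is only a back-of-the-envelope sketch: the rate of 10^6 model fits per second is a hypothetical assumption chosen for illustration, not a measured figure for SAS or R.

```latex
% Number of candidate subsets for p = 50 predictors
2^{50} = 1\,125\,899\,906\,842\,624 \approx 1.13 \times 10^{15}

% Time for a full enumeration at an ASSUMED rate of 10^6 fits per second
\frac{1.13 \times 10^{15}\ \text{models}}{10^{6}\ \text{models/s}}
  \approx 1.13 \times 10^{9}\ \text{s}
  \approx \frac{1.13 \times 10^{9}\ \text{s}}{3.16 \times 10^{7}\ \text{s/year}}
  \approx 36\ \text{years}
```

Under that assumption, neither a 1-second run nor a 3-hour run could have fitted every subset one by one, which is consistent with the remark in the post that R did not scan all combinations either.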