liuyangnyu
Fluorite | Level 6

Dear friends, SAS communities, 

 

I have searched for this question but haven't found an answer. I noticed that best subset selection in SAS is unusually fast, too fast to have scanned all the combinations. When I used R to do best subset selection (with the 'leaps' package), it took 3 hours (so I trusted that it scans all 2^p combinations; I have p = 50, which gives over a billion models). SAS took only 1 second. In fact, the output from SAS is the same as the result of stepwise selection in R.

(Remark: the comment by FreelanceReinhard is right; I think R did not search all the combinations either. 2^50 gives over 10^15 combinations.)

 

So my question is: does SAS actually use stepwise selection for "best subset" selection when the number of features exceeds a certain number?

The code:

/* SAS code for best subset selection */
proc reg data = mydata4 plot = none;
model Share_Temporary = CC10_Household_Size  -- JJ1_Electricity_Availableyes /selection=cp best =5 vif stb;
run;
quit;

# R code for stepwise selection, which gives the same results as the SAS code above:
library(MASS)  # provides stepAIC()
fit_allvars <- lm(Share_Temporary ~ ., data = mydata4)
step <- stepAIC(fit_allvars, direction = "both")


 

Moreover, if I run stepwise selection in SAS, it gives a shorter list of variables, but all of them are contained in the best subset selection result from SAS.

/* SAS code for best subset selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size  -- JJ1_Electricity_Availableyes /selection=cp best = 3 stb;
run;
quit;

/* SAS code for stepwise selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size  -- JJ1_Electricity_Availableyes /selection=stepwise;
run;
quit;

A comparison of the SAS results from the code above, with the variables common to both highlighted. All the variables in the stepwise selection result are included in the "Best Subset" selection result:

[Image: comparison 2.JPG]

 

I have discussed this in more detail on my blog, but this seems to be the case and I have found no further explanation for it.

If anyone could share some insights, I would really appreciate it. It has really puzzled me.

 

Thank you for your time and advice,

Best Regards,

Yang 

 

P.S. I have attached the dataset. There are 49 predictors starting at column C (A, B are ID, intercept). The dependent variable is the last column. The SAS code below can be applied directly (after changing the directory).

Dimension: 598 * 51

 

proc import datafile="D:\....... \mydata4.csv"
out= mydata dbms=csv replace;
run;

/*proc print data = mydata (obs = 10);*/
/*run;*/

proc corr data = mydata noprob;
run;


/* SAS code for best subset selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=cp best = 3 stb;
run;
quit;

/* SAS code for stepwise selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=stepwise;
run;
quit;

1 ACCEPTED SOLUTION

Accepted Solutions
liuyangnyu
Fluorite | Level 6

I asked SAS support and got a great reply in a day from Kathleen. The answer is below:

 

With SELECTION=RSQUARE, ADJRSQ, or CP and n = number of regressors >= 11, by default REG will only DISPLAY the best n subset models for each number of regressors: the best n one-variable models, the best n two-variable models, etc. These can be computed (using the Furnival and Wilson algorithm) without examining every possible model of every possible size, so this is typically much faster than if all models of each size need to be displayed.

 

By default, when you run PROC REG with SELECTION=CP and the STOP=10 option, and you have 20 regressors in the model, PROC REG will display at most 20 models for each of the 1-variable models, 2-variable models, 3-variable models, and so on through the 10-variable models. In other words, the maximum number of models displayed will be equal to the number of predictor variables in the MODEL statement (if the number of predictors listed in the MODEL statement is greater than 10).

To obtain more models than the number displayed by default, you will need to add the BEST= option to the MODEL statement.  For example, if you have 20 predictors in your MODEL statement, but you want to see up to 35 models in each of the possible subsets, then your PROC REG step would need to look something like:

---------------------
proc reg data=test;
  model y = x1-x20 / selection=rsquare stop=10 best=35;
run;
quit;
----------------------

 

I hope the above information is helpful.

 

Kathleen Kiernan

Senior Principal Technical Support Statistician
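
The claim in the reply above, that the best models of each size can be found without examining every possible model, can be illustrated with a small branch-and-bound search. This is only a sketch in Python on synthetic data, not the Furnival and Wilson algorithm itself and not what PROC REG runs internally. It relies on one fact: removing a regressor never lowers the residual sum of squares (RSS), so a model's RSS is a lower bound for all of its sub-models. The function names (rss, best_exhaustive, best_branch_and_bound) and the data are hypothetical.

```python
import random
from itertools import combinations

def rss(X, y, cols):
    """Residual sum of squares for OLS of y on an intercept plus columns `cols`."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    m = len(Z[0])
    # Augmented normal equations [Z'Z | Z'y], solved by Gaussian elimination.
    M = [[sum(z[a] * z[b] for z in Z) for b in range(m)]
         + [sum(z[a] * yi for z, yi in zip(Z, y))] for a in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    beta = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(M[r][c] * beta[c] for c in range(r + 1, m))
        beta[r] = (M[r][m] - tail) / M[r][r]
    return sum((yi - sum(b * v for b, v in zip(beta, z))) ** 2
               for z, yi in zip(Z, y))

def best_exhaustive(X, y, p):
    """Fit every non-empty subset; return best (rss, cols) per size and fit count."""
    best, fits = {}, 0
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            fits += 1
            r = rss(X, y, cols)
            if k not in best or r < best[k][0]:
                best[k] = (r, cols)
    return best, fits

def best_branch_and_bound(X, y, p):
    """Same answer as best_exhaustive, but skips subtrees that cannot improve."""
    best = {k: (float("inf"), None) for k in range(1, p + 1)}
    fits = 0

    def visit(cols, start):
        nonlocal fits
        fits += 1
        r = rss(X, y, cols)
        size = len(cols)
        if r < best[size][0]:
            best[size] = (r, cols)
        # Sub-models drop variables at positions >= start, so they have sizes
        # max(start, 1) .. size - 1, and each has RSS >= r.
        if all(best[k][0] <= r for k in range(max(start, 1), size)):
            return  # nothing below this node can beat the current bests
        for i in reversed(range(start, size)):
            visit(cols[:i] + cols[i + 1:], i)

    visit(tuple(range(p)), 0)
    return best, fits

# Synthetic data (an assumption for illustration): column 0 dominates y.
rng = random.Random(0)
p, n = 8, 40
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [10 * row[0] + row[1] + rng.gauss(0, 0.1) for row in X]

exh, n_exh = best_exhaustive(X, y, p)
bnb, n_bnb = best_branch_and_bound(X, y, p)
print("same best subsets:", all(exh[k][1] == bnb[k][1] for k in exh))
print("fits: exhaustive =", n_exh, " branch and bound =", n_bnb)
```

Both searches return the same best subset for every size, but the pruned search needs fewer fits because every model without the dominant regressor is discarded after a single fit. How much is saved depends on the data and on the order of the variables.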


8 REPLIES 8
Reeza
Super User

Can you upload your data so we can replicate your results?

PaigeMiller
Diamond | Level 26

I'm thinking the Best Subset selection uses this algorithm:


R. R. Hocking & R. N. Leslie. "Selection of the Best Subset in Regression Analysis", Technometrics, Vol 9, 1967, pp 531-540

https://amstat.tandfonline.com/doi/abs/10.1080/00401706.1967.10490502#.WzqKrdVKhhE

--
Paige Miller
liuyangnyu
Fluorite | Level 6

Thank you, Miller, for your advice! I have no question about the algorithm; I simply think neither SAS nor R actually scans all the possible combinations. I appreciate your help.

liuyangnyu
Fluorite | Level 6

Thank you! I have attached the dataset. 

There are 49 predictors starting at column C (A, B are ID, intercept). The dependent variable is the last column ('Share_Temporary'). The SAS code below can be applied directly (after changing the directory).

Dimension: 598 * 51

The dataset is based on a survey in an African slum; the goal is to predict the share of temporary structures from the other variables. The dataset has been preprocessed, so there are lots of dummy variables.

FreelanceReinh
Jade | Level 19

@liuyangnyu wrote:

When I use R to do best subset selection (use 'leaps' package), it took 3 hours (thus I trust it does scan all the 2^p combination, I have p = 50, which gives over a billion models).


Just a remark: 2^50>1.1*10^15. Three hours are 10800 seconds. 1.1*10^15/10800>10^11. Do you still trust your computer is able to scan >100 billion regression models per second?
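
The arithmetic in this remark is easy to check. A minimal Python sketch, assuming the run time was exactly three hours:

```python
# Number of subsets of 50 predictors versus the time available.
models = 2 ** 50
seconds = 3 * 60 * 60            # three hours
rate = models / seconds          # fits per second needed to visit every model

print(f"{models:.3e} models, {rate:.3e} models per second")
```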

liuyangnyu
Fluorite | Level 6

Dear friend, 

 

You made a great point. Obviously, R did not scan all of them either. It might be mission impossible (but the results from R are still different from best subset). What puzzles me is whether SAS would really scan all the combinations when there are fewer features.

 

 

To be precise, since I set the program to search among 49 predictors but set the maximum subset size to 25, there are C(49,25) + C(49,24) + ... + C(49,0) = 3.447e+14 models to check. That is still too many to be feasible.
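
The count above can be reproduced with a short Python sketch (math.comb requires Python 3.8 or later):

```python
from math import comb

# Subsets of 49 predictors with at most 25 members, including the empty model.
total = sum(comb(49, k) for k in range(26))

print(f"{total:.4e}")   # about 3.447e+14, matching the figure in the post
```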

 

 

Thank you very much,

Yang 

sld
Rhodochrosite | Level 12

I am not a statistical-algorithm expert, but I know that there are clever "shortcuts" to some algorithmic tasks, possibly this one. I would ask SAS Technical Support. They are there for you, use them (one huge advantage over R packages).

 

