I think SAS’s Best Subset selection in proc reg is...

2 weeks ago - last edited 2 weeks ago

Dear friends, SAS communities,

I have searched for this question but haven't found an answer. I noticed that best subset selection in SAS is unusually fast; it seems impossible to scan all the combinations in such a short time. When I use R to do best subset selection (using the 'leaps' package), it takes 3 hours (so I trusted that it scans all 2^p combinations; I have p = 50, which gives over a billion models). SAS takes only 1 second. In fact, the output from SAS is the same as the result of stepwise selection in R.

(Remark: the comment by *FreelanceReinhard* is right. I think R did not search all the combinations either: 2^50 gives over 10^15 combinations.)

So my question is: does SAS actually use stepwise selection for "best subset" selection when the number of features exceeds some threshold?

The code:

```
/* SAS code for best subset selection */
proc reg data = mydata4 plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes / selection=cp best=5 vif stb;
run;
quit;
```

R code for stepwise selection, which gives the **same results as the SAS code above**:

```
library(MASS)  # provides stepAIC()
fit_allvars <- lm(Share_Temporary ~ ., data = mydata4)
step <- stepAIC(fit_allvars, direction = "both")
```

Moreover, if I do stepwise in SAS, it will give a shorter list of variables, but all of them are contained in the selection result of best subset from SAS.

```
/* SAS code for best subset selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=cp best = 3 stb;
run;
quit;
/* SAS code for stepwise selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=stepwise;
run;
quit;
```

Comparing the SAS results from the code above (with the shared variables highlighted), all the variables in the stepwise selection result are also included in the "Best Subset" selection result.

I have discussed this in more detail on my blog. This seems to be the case, but I have not been able to find any further explanation of it.

If anyone could share some insights, I would really appreciate it. This has really puzzled me.

Thank you for your time and advice,

Best Regards,

Yang

P.S. I have attached the dataset. There are 49 predictors starting at column C (A, B are ID, intercept). The dependent variable is the last column. The SAS code below can be applied directly (after changing the directory).

Dimension: 598 * 51

```
proc import datafile="D:\....... \mydata4.csv"
out= mydata dbms=csv replace;
run;

/*proc print data = mydata (obs = 10);*/
/*run;*/

proc corr data = mydata noprob;
run;

/* SAS code for best subset selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=cp best = 3 stb;
run;
quit;

/* SAS code for stepwise selection*/
proc reg data = mydata plot = none;
model Share_Temporary = CC10_Household_Size -- JJ1_Electricity_Availableyes /selection=stepwise;
run;
quit;
```

Accepted Solutions

Solution

2 weeks ago

Posted in reply to liuyangnyu

2 weeks ago

I asked SAS support and got a great reply in a day from Kathleen. The answer is below:

With SELECTION=RSQUARE, ADJRSQ, or CP and n = number of regressors >= 11, by default REG will only DISPLAY the best n subset models for each number of regressors: the **best n one-variable models, best n two-variable models, etc. These can be computed (using the Furnival and Wilson algorithm) without examining every possible model of every possible size, so this is typically much faster than if all models of each size needed to be displayed.**

By default, when you run PROC REG with SELECTION=CP and the STOP=10 option, and you have 20 regressors in the model, PROC REG will display at most 20 models for each of the 1-variable models, 2-variable models, 3-variable models, ... through 10-variable models. In other words, the maximum number of models displayed will be equal to the number of predictor variables in the MODEL statement (if the number of predictors listed in the MODEL statement is 11 or more).

To obtain more models than the number displayed by default, you will need to add the BEST= option to the MODEL statement. For example, if you have 20 predictors in your MODEL statement, but you want to see up to 35 models in each of the possible subsets, then your PROC REG step would need to look something like:

```
proc reg data=test;
model y = x1-x20 / selection=rsquare stop=10 best=35;
run;
quit;
```

I hope the above information is helpful.

Kathleen Kiernan

Senior Principal Technical Support Statistician
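
As a reading aid, the display rule described in the reply can be restated as a small Python function. This is only a sketch of the rule as worded above for the SELECTION=RSQUARE example, not PROC REG's implementation. The function name `max_models_shown` is hypothetical, and the behaviour for fewer than 11 regressors (every subset listed) is an assumption.

```python
from math import comb

def max_models_shown(p, size, best=None):
    """Upper bound on the models listed for one subset size.

    Assumptions, restating the reply above:
      - BEST=n given: at most n models per size.
      - BEST= omitted and p >= 11: at most p models per size.
      - BEST= omitted and p < 11: every subset of that size (assumed).
    """
    available = comb(p, size)  # number of subsets of this size that exist
    if best is not None:
        cap = best
    elif p >= 11:
        cap = p
    else:
        cap = available
    return min(cap, available)

# The 20-regressor example from the reply, sizes 1 through 10 (STOP=10).
for size in range(1, 11):
    default = max_models_shown(20, size)
    raised = max_models_shown(20, size, best=35)
    print(size, default, raised)
```

With 20 regressors there are only 20 one-variable models, so BEST=35 cannot show more than 20 at size 1; from size 2 upward the cap of 35 applies.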

All Replies

Posted in reply to liuyangnyu

2 weeks ago

Can you upload your data so we can replicate your results?

Posted in reply to Reeza

2 weeks ago

I'm thinking the Best Subset selection uses this algorithm:

R. R. Hocking & R. N. Leslie. "Selection of the Best Subset in Regression Analysis", Technometrics, Vol 9, 1967, pp 531-540

https://amstat.tandfonline.com/doi/abs/10.1080/00401706.1967.10490502#.WzqKrdVKhhE

--

Paige Miller

Posted in reply to PaigeMiller

2 weeks ago

Thank you, Miller, for your advice! I have no questions about the algorithm; I simply think that neither SAS nor R actually scans all the possible combinations. I appreciate your help.

Posted in reply to Reeza

2 weeks ago

Thank you! I have attached the dataset.

There are 49 predictors starting at column C (A, B are ID, intercept). The dependent variable is the last column ('Share_Temporary'). The SAS code below can be applied directly (after changing the directory).

Dimension: 598 * 51

The dataset is based on a survey in an African slum; the goal is to predict the share of temporary structures from the other variables. The dataset has been preprocessed, so it contains many dummy variables.

Posted in reply to liuyangnyu

2 weeks ago

@liuyangnyu wrote:

When I use R to do best subset selection (use 'leaps' package), it took 3 hours (thus I trust it does scan all the 2^p combination, I have p = 50, which gives over a billion models).

Just a remark: 2^50>1.1*10^15. Three hours are 10800 seconds. 1.1*10^15/10800>10^11. Do you still trust your computer is able to scan >100 billion regression models *per second*?
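
The arithmetic in this remark can be checked with a few lines of Python. This is a minimal sketch: the variable names are mine, and the three-hour runtime is simply the figure quoted above.

```python
# Each of p predictors is either in or out of the model, giving 2^p candidates.
p = 50
n_models = 2 ** p

# Runtime quoted in the thread: three hours, expressed in seconds.
seconds = 3 * 60 * 60

# Models per second that an exhaustive search would have to sustain.
rate = n_models / seconds

print(f"2^{p} = {n_models:,} models")
print(f"required rate = {rate:.3e} models per second")
```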

Posted in reply to FreelanceReinhard

2 weeks ago

Dear friend,

You made a great point. Obviously, R did not scan all of them either; that might be mission impossible (although the results from R are still different from best-subset). But what puzzles me is whether SAS would really scan all the combinations when there are fewer features...

To be precise, since I set the program to search among 49 predictors but set the maximum subset size to 25, there are C(49,25) + C(49,24) + ... + C(49,0) = 3.447e+14 models to check. That is still far too many to check exhaustively.
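
The count above can be reproduced with Python's standard library. This is a minimal sketch; the names `p` and `max_size` are mine and stand for the 49 predictors and the maximum subset size of 25 mentioned in the post.

```python
from math import comb

p = 49          # predictors in the dataset
max_size = 25   # largest subset size considered

# Count the subsets of every size from 0 up to max_size.
n_models = sum(comb(p, k) for k in range(max_size + 1))

print(f"{n_models:,} models ({n_models:.3e})")
```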

Thank you very much,

Yang

Posted in reply to liuyangnyu

2 weeks ago - last edited 2 weeks ago

I am not a statistical-algorithm expert, but I know that there are clever "shortcuts" for some algorithmic tasks, possibly including this one. I would ask SAS Technical Support. They are there for you; use them (one huge advantage over R packages).
