Kvothe_irl
Calcite | Level 5

First off, my apologies if this is a frequently asked or poorly worded question. I have only recently started using SAS, so I'm still trying to grasp the basics. That being said, these are my problems:

 

Short Version:

In PROC REG, can you define a group of variables and only generate regressions using EXACTLY one variable from the group?

In PROC REG, can you define a group of variables and only generate regressions using AT MOST one variable from the group?

In PROC REG, can you define a group of variables and only generate regressions that use either ALL or NONE of the group?

 

Long Version:

My data set is sales data for a fictional retail store. My dependent variable is sales, and the data set began with hundreds of variables about each store location, including several describing consumer income near the store, several describing X trait (and Y trait, and Z trait, etc) of local households, and a multi-characteristic dummy variable that describes the geographical region of the store (NE_dummy, SE_dummy, W_dummy, etc). These variables have been pared down to 20-25 that either correlate significantly with sales or need to be included from a theory/logical perspective.

 

My current SAS code essentially boils down to this: 

 

PROC REG;
        model sales = var1 .... var20 / selection=CP start=6 stop=10 best=100;
run;

 

Since I am essentially hoping to generate a demand function, theory dictates that I include consumer income. However, income has a fairly weak correlation with sales, so just throwing all regressors into PROC REG and generating the 100 best models by Mallows' Cp ("blindly" generating regressions) yields no models with any income variable. I wish to force the model to use exactly one regressor from the group of consumer income variables. Is this possible? Or, if it's not possible for the group case, can you force SAS to only generate regressions that use a given variable (i.e., sales = avgincome_5miles + {any combination of other variables})?

 

Variables about consumer trait X (and Y, Z, etc) are defined based on radius around the given store - 1 radial mile, 5 radial miles, 10 radial miles. "Blindly" generating regressions often yields two or three from a single group. I want my model to include at most one regressor from X, at most one regressor from Y, and at most 1 regressor from Z. Is there any way to do this?

 

For the multi-characteristic regional dummy (NE_dummy, SE_dummy, W_dummy, etc), "blindly" generating regressions will yield models with only one or two of the dummies in the group. I want the model to either include ALL of them (excluding the base group, SW_dummy, of course) or NONE of them. Is there any way to do this?

 

Thank you for reading through my question.

2 REPLIES
data_null__
Jade | Level 19

If you have class variables I think you should be looking at GLMSELECT.

 

The GLMSELECT procedure compares most closely to REG and GLM. The REG procedure supports a variety of model-selection methods but does not support a CLASS statement. The GLM procedure supports a CLASS statement but does not include effect selection methods. The GLMSELECT procedure fills this gap. GLMSELECT focuses on the standard independently and identically distributed general linear model for univariate responses and offers great flexibility for and insight into the model selection algorithm. GLMSELECT provides results (displayed tables, output data sets, and macro variables) that make it easy to take the selected model and explore it in more detail in a subsequent procedure such as REG or GLM.
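
As a rough sketch of what that could look like for the regional dummies (all names here are assumptions, not taken from the original data): suppose the separate dummies are replaced by a single character variable, hypothetically called region, in a data set hypothetically called work.stores, with var1-var17 standing in for the remaining candidate regressors. Listing region on the CLASS statement makes it one multi-degree-of-freedom effect, so by default it enters or leaves the model as a whole (unless the SPLIT option is specified on the CLASS statement).

```sas
/* Sketch only. Hypothetical names: work.stores, region, var1-var17.   */
/* region is assumed to be ONE categorical variable (NE, SE, W, SW...) */
/* rather than a set of separate 0/1 dummy variables.                  */
proc glmselect data=work.stores;
    class region;                       /* treated as a single effect  */
    model sales = region var1-var17
          / selection=stepwise(select=cp);  /* add/drop effects by Cp  */
run;
```

Note that, as far as I know, GLMSELECT uses sequential methods (forward, backward, stepwise, LASSO and so on) rather than the all-subsets search that SELECTION=CP performs in PROC REG, so this is not a drop-in replacement for the BEST=100 listing.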

PaigeMiller
Diamond | Level 26

The INCLUDE= option of the MODEL statement in PROC REG does this.

 

INCLUDE=4 forces the modeling to use the first 4 variables listed on the right-hand side of the equal sign in the MODEL statement. However, if consumer income is not really a good predictor, I would think you might want to avoid using it, because it could possibly lead to poorer predictions.

 

Edit: "... because it could possibly lead to lower adjusted R-squared values". In other words, you are probably not improving the model and in some sense adding it in can be harmful.
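
For the "exactly one income variable" case, one possible approach (a sketch, not something PROC REG does in a single step) is to run the selection once per income variable, listing that variable first with INCLUDE=1 and leaving the other income variables out of the MODEL statement. The data set name work.stores, the income variable names other than avgincome_5miles, and the var1-var17 list are hypothetical placeholders.

```sas
/* Sketch only. Hypothetical names: work.stores, avgincome_1mile,      */
/* avgincome_10miles, var1-var17 (the non-income candidate regressors).*/
%macro cp_with_income(incvar);
    title "Cp selection with &incvar forced into every model";
    proc reg data=work.stores;
        /* INCLUDE=1 forces the first regressor listed (&incvar) into  */
        /* every model; the other income variables are simply omitted. */
        model sales = &incvar var1-var17
              / selection=cp include=1 start=6 stop=10 best=100;
    run;
    quit;
%mend cp_with_income;

/* One run per income variable; compare the Cp listings across runs.   */
%cp_with_income(avgincome_1mile)
%cp_with_income(avgincome_5miles)
%cp_with_income(avgincome_10miles)
```

Each run then contains exactly one income variable by construction, and the best models from the separate runs can be compared by Cp afterwards.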

--
Paige Miller
