PROC REG does not support categorical predictors directly. You have to recode them into a series of 0-1 variables and use those in the model. A two-level categorical variable (like gender) becomes a single 0-1 variable, which is then treated as continuous. A three-level categorical variable becomes two 0-1 variables, and in general a variable with k levels becomes k-1 of them.
This is analogous to the reference cell coding that PROC GLM can use for categorical variables. Where it falls down is variable selection: the selection tools in REG treat each 0-1 variable as a separate predictor, so you can end up with only part of a categorical variable (some of its levels but not others) in the model.
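As a rough sketch of the manual recoding described above, the following uses SASHELP.CARS purely as a stand-in dataset. The choice of Origin as the categorical predictor, USA as the reference level, and the dummy variable names are all assumptions made for illustration, not part of the original question.

```sas
/* Illustration only: SASHELP.CARS stands in for the poster's data. */
data cars_dummy;
    set sashelp.cars;
    /* Origin has three levels (Asia, Europe, USA), so two 0-1 variables */
    /* are needed. USA is the reference level: both dummies equal 0.     */
    origin_asia   = (Origin = 'Asia');
    origin_europe = (Origin = 'Europe');
run;

proc reg data=cars_dummy;
    /* The dummies enter the model as ordinary numeric regressors. */
    model MSRP = Horsepower origin_asia origin_europe;
run;
quit;
```

Each dummy coefficient is the estimated difference from the reference level (USA here), holding Horsepower fixed.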
@Paige agrees:
I would use PROC GLM instead of PROC REG. Your predictor variables that are categories (gender, zip) are placed in the CLASS statement.
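A minimal sketch of the PROC GLM approach, again assuming SASHELP.CARS as a stand-in dataset with Origin as the categorical predictor:

```sas
/* Illustration only: dataset and variable names are assumptions. */
proc glm data=sashelp.cars;
    class Origin;                                /* GLM builds the 0-1 columns itself */
    model MSRP = Horsepower Origin / solution;   /* SOLUTION prints parameter estimates */
run;
quit;
```

No DATA step recoding is needed; the procedure generates the design columns for Origin internally.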
Also consider the GLMSELECT procedure. It fills the gap by allowing variable selection with CLASS variables. It also produces output that allows further analyses with REG and/or GLM. GLMSELECT treats a class variable as a single multi-degree-of-freedom test for inclusion/exclusion. Either all levels are in or all levels are out; it's not a piecemeal process.
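A minimal sketch of selection with CLASS variables in GLMSELECT. The dataset (SASHELP.CARS), the candidate predictors, and the choice of SELECTION=STEPWISE are assumptions made for illustration only:

```sas
/* Illustration only: dataset, predictors and selection method are assumptions. */
proc glmselect data=sashelp.cars;
    class Origin Type;      /* each class variable is one candidate effect */
    model MSRP = Horsepower Weight Origin Type
          / selection=stepwise;
run;
```

With the CLASS statement written this way, Origin and Type are each added or dropped as a whole effect rather than one level at a time.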
@Paige points out:
One thing that I feel needs to be pointed out here is that, despite the introduction of PROC GLMSELECT by SAS, many statisticians feel that STEPWISE (including forward and backward) model selection procedures are dangerous and misleading, and advise against using them. (Yes, I know there are other selection procedures in GLMSELECT, such as LAR and LASSO, which I have no knowledge of.)
Editor's note: this response consolidates several of the helpful replies in this thread. Read through the entire topic to see the conversation.