07-04-2011 01:59 PM
I am using PROC GENMOD to run a logistic regression on a data set. There are many explanatory variables (>25), most of which are nominal with multiple levels.
It seems that I have to put all the variables into the model and manually exclude one at a time until only significant variables remain. Given how many variables there are, this is not a clever way to proceed. I searched and could not find any automatic selection method in PROC GENMOD.
So I am posting my problem here. Any good ideas?
07-05-2011 05:17 AM
Does PROC GENMOD really have no automatic variable selection method? It will be cumbersome to run the model and manually delete insignificant terms.
07-05-2011 10:19 AM
There is no automatic variable selection in GENMOD. You should know that such methods are controversial in statistics, and many argue strongly against automatic methods. At best, you should use the methods in an exploratory sense, to help you understand your data.
Since you have binary data, you can use PROC LOGISTIC instead of GENMOD. LOGISTIC does have variable selection methods. Check out the SELECTION= option on the MODEL statement.
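As a rough illustration of the SELECTION= option, here is a minimal sketch. The data set name (mydata), the 0/1 response y, and the predictor names are hypothetical placeholders, and the 0.05 entry and stay thresholds are arbitrary choices, not recommendations.

```sas
/* Sketch only: mydata, y, region, occupation, age and income are hypothetical names */
proc logistic data=mydata;
   /* nominal predictors go on the CLASS statement */
   class region occupation / param=ref;
   /* stepwise selection; SLENTRY= and SLSTAY= set the entry and stay significance levels */
   model y(event='1') = region occupation age income
         / selection=stepwise slentry=0.05 slstay=0.05;
run;
```

SELECTION= also accepts FORWARD, BACKWARD and SCORE. Whichever method is used, the result is best read as exploratory.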
07-05-2011 11:55 AM
Thanks for your idea, Ivm.
My data do not allow me to use PROC LOGISTIC, as most of the nominal variables have more than 5 levels. So it seemed better to use PROC GENMOD, which creates the dummy variables automatically.
Anyway, I can do the manual stepwise deletion of variables.
07-05-2011 12:28 PM
Why do you say that you cannot use PROC LOGISTIC? The LOGISTIC procedure has a CLASS statement. The purpose of the CLASS statement is to expand categorical variables so that the design matrix has dummy variables representing all of the levels of the predictor variables. It is only through the mechanisms of the CLASS statement that PROC GENMOD is able to expand a nominal predictor variable into a set of dummy variables.
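A minimal sketch of the CLASS statement in PROC LOGISTIC, assuming a hypothetical data set mydata with a 0/1 response y and two nominal predictors. One difference worth knowing: PROC LOGISTIC uses effect coding by default, so PARAM=GLM is requested here to get the same dummy coding that GENMOD uses by default.

```sas
/* Sketch only: mydata, y, region and occupation are hypothetical names */
proc logistic data=mydata;
   /* each CLASS variable is expanded into one design column per level */
   class region occupation / param=glm;
   model y(event='1') = region occupation;
run;
```

No manual dummy variables are needed, regardless of how many levels each predictor has.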
By the way, I would be careful about application of stepwise selection methods when you have categorical predictor variables. If categorical predictor variable A has 5 levels, then your stepwise selection may keep a couple of levels of A as important predictors and remove the other levels of A as unimportant predictors. This can lead to some real model confusion. I don't know if the implementation of stepwise selection methods in PROC LOGISTIC operates this way (selecting one column at a time from the design matrix). But that is a typical implementation of stepwise selection. It is rare to find implementation of stepwise selection methods which test all levels of a categorical predictor variable for simultaneous inclusion/exclusion from the model.
There are other statistical issues with stepwise selection methods. They typically produce incorrect models. As lvm has stated, stepwise selection should only be used for exploratory analysis. Models suggested by stepwise selection methods should be confirmed in a separate investigation.
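One possible way to approximate a "separate investigation" when no new data are available is a holdout split: select on one half, then refit the suggested model on the other half. This is only a sketch of that idea, not the only reading of the advice. The data set and variable names are hypothetical, and the terms in the second model stand in for whatever the selection step happens to retain.

```sas
/* Sketch only: all data set and variable names are hypothetical */
/* OUTALL keeps every row and adds a 0/1 variable named Selected */
proc surveyselect data=mydata out=split samprate=0.5 outall seed=20110705;
run;

/* exploratory half: let stepwise selection suggest a model */
proc logistic data=split(where=(selected=1));
   class region occupation / param=ref;
   model y(event='1') = region occupation age income / selection=stepwise;
run;

/* confirmation half: refit the suggested terms with no selection */
proc logistic data=split(where=(selected=0));
   class region / param=ref;
   model y(event='1') = region age;
run;
```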
07-06-2011 12:54 PM
Thanks for correcting me. I am using the book: Logistic Regression Using SAS: Theory and Application, by Paul D. Allison.
Following your suggestion, I checked and found that much of the book's content is out of date. For example, it says that dummy variables must be created manually for PROC LOGISTIC, and that multiplicative terms (i.e. interactions) cannot be specified in the MODEL statement. As new SAS versions have been released, many procedures have been updated.
The book was published in 1999.
12-05-2014 05:09 PM
I have exactly the same problem. My data has about 70 variables that are to be used as predictors in the logistic regression (all nominal with multiple levels), and I started by running a Pearson chi-square test between each of them and the binary outcome. Then I picked only the significant ones for my logistic regression, which still leaves too many (27 of them). I have tried running the model using PROC GENMOD, but it is not converging, I suppose because there are too many predictors. I thought of using PROC LOGISTIC, but the problem is that PROC LOGISTIC does not allow you to specify the reference category in the CLASS statement, and I want particular categories as references. The advice I got from a friend is to run Spearman rank correlations between the predictors and then drop one of each pair of highly correlated variables. I think this approach is not that bad, and I suggest you try it.
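A note on the reference-category point: the CLASS statement in PROC LOGISTIC does accept a REF= option for each variable, so this need not rule the procedure out. A minimal sketch, assuming hypothetical variables region and occupation whose levels include 'West' and 'Clerical':

```sas
/* Sketch only: mydata, y, the variable names and the level values are hypothetical */
proc logistic data=mydata;
   /* REF= names the reference level of each CLASS variable (matched on the formatted value) */
   class region(ref='West') occupation(ref='Clerical') / param=ref;
   model y(event='1') = region occupation;
run;
```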
12-08-2014 10:00 AM
You can use the HPGENSELECT procedure. It works much like GENMOD, except that it can also run some selection algorithms.
For example, you can write:
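PROC HPGENSELECT data=MYDATA;
   class classvariables1-classvariables20;   /* treat the predictors as categorical */
   model y=classvariables1-classvariables20 / dist=binary link=logit;
   selection method=stepwise; run;   /* FORWARD and BACKWARD are also available */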
It requires SAS 9.4.