07-14-2013 08:56 PM
Best-subset instead of stepwise question.
Hello, I have classes of individuals grouped together from cluster analysis. I want to use discriminant analysis to determine group membership of new individuals based on a set of predictors. Normally, I use PROC STEPDISC to find a subset of predictors that go into the discriminant analysis, something like:
proc stepdisc data=training sle=0.05 singular=0.1;
However, recent literature indicates that stepwise selection is not as good as evaluating all possible subsets of predictors. Is there a procedure, or some other approach, that can do this? I have looked at the PHREG, REG, and LOGISTIC procedures, but they all seem to be based on numerical data rather than classes. Have I missed something, or should I just convert the group data from text to numerical?
Thanks in advance.
07-14-2013 10:48 PM
Best variable subset selection isn't available in PROC STEPDISC. If you have only two groups, or if you want to explore group differences two groups at a time, you can perform best variable subset selection in PROC LOGISTIC:
title "Discriminating groups A and B";
proc logistic data=training(where=(group in ("A", "B")));
model group(event="B") = VAR1 -- VAR25 / selection=score best=3 stop=5;
run;
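With more than two groups, the "two groups at a time" approach means repeating the step above for each pair. A minimal sketch of how that could be wrapped in a macro follows. The macro name `pairwise` and the group labels C and D are assumptions for illustration; the data set, the `group` variable, and the predictors VAR1 through VAR25 are taken from the example above.

```sas
/* Hypothetical helper: best-subset selection for one pair of groups. */
/* g1 and g2 are group labels as they appear in the GROUP variable.   */
%macro pairwise(g1, g2);
   title "Discriminating groups &g1 and &g2";
   proc logistic data=training(where=(group in ("&g1", "&g2")));
      /* Model the probability of membership in the second group */
      model group(event="&g2") = VAR1 -- VAR25
            / selection=score best=3 stop=5;
   run;
%mend pairwise;

/* One call per pair of interest; the labels here are placeholders */
%pairwise(A, B)
%pairwise(A, C)
%pairwise(A, D)
%pairwise(B, C)
%pairwise(B, D)
%pairwise(C, D)

title;   /* clear the title afterwards */
```

Each call prints the three best models of each size up to five predictors for that pair, so the subsets that separate one pair well can be compared against those that separate another.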
07-15-2013 07:29 AM
Hi PG, and thanks for the response. I actually have 4 groups (sometimes more). It looks like I can just use:
proc logistic data=training;
model group = VAR1 -- VAR25 / selection=score best=3 stop=5;
run;
This is very helpful. However, is there a way to compare the output models for overfitting? For example, are four predictors really better than three?
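One possible way to look at this, sketched here under assumptions rather than offered as a definitive answer, is to run each candidate subset through PROC DISCRIM with the CROSSVALIDATE option and compare the leave-one-out error rates. The two subsets below (VAR1 to VAR3, and VAR1 to VAR4) are placeholders; substitute the subsets that the selection step actually reported.

```sas
/* Candidate 1: a three-predictor subset (placeholder variables) */
title "Three predictors: cross-validated error rates";
proc discrim data=training method=normal pool=yes crossvalidate;
   class group;           /* cluster membership from the cluster analysis */
   var VAR1 VAR2 VAR3;
run;

/* Candidate 2: the same subset plus one more predictor */
title "Four predictors: cross-validated error rates";
proc discrim data=training method=normal pool=yes crossvalidate;
   class group;
   var VAR1 VAR2 VAR3 VAR4;
run;

title;   /* clear the title afterwards */
```

If the fourth predictor does not lower the cross-validated error rate noticeably, the smaller model is probably the safer choice. Because the subsets were chosen on the same data, these error rates are likely to be somewhat optimistic; a separate holdout sample would give a more honest comparison.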