05-14-2013 03:41 PM
I have around 400 independent variables and 1 dependent variable (categorical). I am trying to find out which independent variables really help me predict the dependent variable. So far, I have used logistic regression to identify the top n variables, but the problem is that, since I have missing values, logistic regression ignores every account that has a missing value. The only way I know to solve that problem and still use logistic regression is to replace the missing values with imputed values, but I haven't tried that approach.
I was wondering if there is any other approach you would recommend for finding the top variables that avoids using imputed values?
I appreciate your time and help!
05-14-2013 04:41 PM
Theoretically you could just use stepwise regression; however, there are "issues" with this method, as many have pointed out. If you are not comfortable with stepwise, please see the associated paper. It gives a "unique" alternative to the common stepwise method.
05-14-2013 05:00 PM
Are your independent variables categorical or continuous?
If they're categorical, one option is to code missing as its own level so that those records are still included.
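To make the "code missing as its own state" idea concrete, here is a minimal sketch in plain Python. The column name `region`, the label `MISSING`, and the helper functions are all hypothetical, made up for illustration; in SAS you would do the equivalent recode in a data step before modelling.

```python
# Hypothetical toy column: None marks a missing value.
region = ["north", None, "south", "north", None]

MISSING_LEVEL = "MISSING"  # assumed label; any value not already in use works


def code_missing(values, missing_label=MISSING_LEVEL):
    """Replace None with an explicit category so no record is dropped."""
    return [missing_label if v is None else v for v in values]


def one_hot(values):
    """Turn a categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


coded = code_missing(region)
levels, dummies = one_hot(coded)

print(levels)
for row in dummies:
    print(row)
```

All five records survive, and "missing" becomes just another level the model can attach an effect to. If you feed the indicators into a regression, you would normally drop one level as the reference category.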
If you have access to E-Miner, then a tree/CART model is useful.
There's also a principal components or factor procedure that can help, but I can't recall the name off the top of my head.
05-14-2013 05:38 PM
My independent variables are a combination of both, categorical and continuous.
I am definitely considering a decision tree for selecting the attributes, since I am getting some good results with that. But is that the right approach for variable selection?
I will take a look at Proc GLMSelect and see if it helps.
05-14-2013 05:47 PM
Okay so this might not be a helpful answer, but it's the best one I can give.
You are asking for the "right" approach to something that does not have a discernible "right" approach. The theory behind variable selection is massive and very diverse, and each method has its own benefits and drawbacks.
In fact, one could teach just the subject of variable selection for a two- to three-semester statistics class without repeating oneself; it is that diverse. Some people use decision trees, others use stepwise (I don't like this method), others use LASSO, others use simple correlation matrices, etc. There are literally dozens, if not hundreds, of ways of doing this.
For example, here is an approach using RapidMiner that discusses both correlation matrices and decision matrices: "http://www.simafore.com/blog/bid/81836/2-ways-to-select-predictors-for-regression-models-using-Rapid...".
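As a rough illustration of the correlation-matrix idea (this is not taken from the linked post), here is a sketch in plain Python that ranks predictors by absolute Pearson correlation with the target. It assumes the categorical target is binary and coded 0/1, and it skips missing values pair by pair instead of dropping whole records. The variable names and numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over the pairs where x is not missing (None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(columns, target):
    """Sort predictor names by |correlation with the target|, strongest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: binary target and three predictors, None = missing.
target = [0, 0, 0, 1, 1, 1]
columns = {
    "x_strong": [1.0, 2.0, None, 7.0, 8.0, 9.0],
    "x_weak": [5.0, 1.0, 4.0, None, 5.0, 2.0],
    "x_constant": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
}

ranking = rank_by_correlation(columns, target)
for name, score in ranking:
    print(name, round(score, 3))
```

Keep in mind this only looks at one variable at a time and only at linear association, so it will miss interactions, and each score is based on a slightly different set of records.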
Another good paper to start with is Isabelle Guyon, André Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, 3(Mar):1157-1182, 2003.
If all of your variables are numeric (even the categorical ones, once they are coded numerically), then a LOT of statistics people recommend the LASSO/LARS methodology. If you search the web for this methodology, I'm sure you will find numerous papers that can help further your understanding!
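To show what LASSO does mechanically, here is a small sketch in plain Python using cyclic coordinate descent with soft-thresholding. Several things here are my assumptions and not part of the question: it uses squared-error loss (a logistic LASSO would be the better match for a categorical target), it needs complete numeric data (so it does not solve the missing-value problem by itself), and the data, the penalty `lam = 0.1`, and all function names are made up.

```python
import math


def standardize(column):
    """Centre a column to mean 0 and scale it to (population) variance 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso(columns, y, lam, n_sweeps=200):
    """Minimise (1/2n)*||y - mean(y) - X b||^2 + lam*||b||_1.

    Assumes every column is already standardized, so each coordinate
    update reduces to a single soft-threshold step.
    """
    n = len(y)
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # all coefficients start at 0
    coefs = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual once column j's
            # own current contribution has been added back.
            rho = sum(x * (r + x * coefs[j]) for x, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, lam)
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(col, residual)]
                coefs[j] = new_coef
    return coefs


# Hypothetical data: x1 tracks the 0/1 outcome, x2 just alternates.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

coefs = lasso([standardize(x1), standardize(x2)], y, lam=0.1)
print(coefs)
```

The variables whose coefficients are shrunk exactly to zero are the ones LASSO drops; raising `lam` drops more of them. In practice you would pick the penalty by cross-validation and use a proper implementation instead of a hand-rolled loop like this.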
I hope I helped.