Pritish
Quartz | Level 8

Hi,

I have around 400 independent variables and 1 dependent variable (categorical). I am trying to find out which independent variables really help me predict the dependent variable. So far, I have used logistic regression to identify the top n variables, but because I have missing values, logistic regression drops every account that has a missing value on any predictor. The only way I know to get around that and still use logistic regression is to replace the missing values with imputed values, but I haven't tried that approach yet.
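By imputed values I mean roughly the following, although I have not tried it; have, target, and x1-x400 are placeholder names for my data set, binary target, and predictors:

/* Sketch only: fill in missing predictor values, then fit the logistic
   model on the completed data. All names below are placeholders.        */
proc mi data=have out=have_imputed nimpute=5 seed=20150101;
   var x1-x400;                        /* numeric predictors to impute   */
run;

proc logistic data=have_imputed;
   by _Imputation_;                    /* one fit per completed data set */
   model target(event='1') = x1-x400;
run;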

I was wondering if there is any other approach you would recommend for finding the top variables that avoids using imputed values?

I appreciate your time and help!


4 REPLIES
Anotherdream
Quartz | Level 8

Theoretically you could just use stepwise regression, although there are "issues" with that method, as many have pointed out. If you are not comfortable with stepwise, please see the paper linked below; it gives a "unique" alternative to the common stepwise method (a rough PROC LOGISTIC sketch of plain stepwise follows the link).

http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf
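If you do try plain stepwise, the call is roughly the following; have, target, c1, c2, and x1-x400 are placeholder names, and the entry/stay significance levels are defaults you would want to tune:

/* Sketch: stepwise selection in PROC LOGISTIC (placeholder names). */
proc logistic data=have;
   class c1 c2;                        /* any categorical predictors      */
   model target(event='1') = c1 c2 x1-x400
         / selection=stepwise slentry=0.05 slstay=0.05;
run;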

Enjoy.

Reeza
Super User

Are your independent variables categorical or continuous?

If they're categorical, one option is to code missing as its own level so those observations are kept in the model.
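Roughly like this; have and the variable names are placeholders, and the character variable needs to be long enough to hold the new level:

/* Sketch: treat missing as an explicit level (placeholder names). */
data have_recode;
   length char_cat $ 12;               /* room for the 'MISSING' level     */
   set have;
   if missing(char_cat) then char_cat = 'MISSING';  /* character predictor */
   if missing(num_cat)  then num_cat  = -1;         /* numerically coded   */
run;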

If you have access to E-Miner, then a tree/CART model is useful.

There's also a principal components or factor procedure that can help, but I can't recall the name off the top of my head.
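If it's PROC VARCLUS (variable clustering) I'm thinking of, the call would look roughly like this, with placeholder names and numeric predictors assumed; PROC PRINCOMP is the plain principal components one:

/* Sketch: cluster the predictors, then keep one representative per cluster. */
proc varclus data=have maxeigen=0.7 short;
   var x1-x400;                        /* numeric predictors (placeholders)  */
run;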

Pritish
Quartz | Level 8

My independent variables are a combination of both categorical and continuous.

I am definitely considering a decision tree for selecting the attributes, since I am getting some good results with it. But is that the right approach for variable selection?
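For reference, outside of E-Miner I understand a comparable tree can be sketched with PROC HPSPLIT, mainly for its variable importance table; the names below are placeholders and I have not run this version:

/* Sketch: classification tree with variable importance (placeholder names). */
proc hpsplit data=have seed=12345;
   class target c1 c2;                 /* target plus categorical predictors */
   model target = c1 c2 x1-x400;       /* x1-x400 are continuous predictors  */
   grow entropy;                       /* splitting criterion                */
   prune costcomplexity;               /* cost-complexity pruning            */
run;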

I will take a look at Proc GLMSelect and see if it helps.
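From what I have read, a GLMSELECT call would look roughly like this; it fits a least-squares model, so with my categorical target it would only be a rough screening step, and all names are placeholders:

/* Sketch: LASSO selection in PROC GLMSELECT, chosen by cross validation. */
proc glmselect data=have plots=coefficients;
   class c1 c2;                        /* categorical predictors            */
   model target = c1 c2 x1-x400
         / selection=lasso(choose=cv) cvmethod=random(10);
run;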

Anotherdream
Quartz | Level 8

Okay so this might not be a helpful answer, but it's the best one I can give.

You are asking for the "right" approach to something that does not have a discernible "right" approach. The theory behind variable selection is massive and very diverse, and each method has its own benefits and drawbacks.

In fact, one could teach just the subject of variable selection as a two- or three-semester statistics class without repeating themselves; it is that diverse. Some people use decision trees, others use stepwise (I don't like this method), others use LASSO, others use simple correlation matrices, and so on. There are literally dozens if not hundreds of ways of doing this.

For example, here is a write-up using RapidMiner that talks about both correlation matrices and decision trees: "http://www.simafore.com/blog/bid/81836/2-ways-to-select-predictors-for-regression-models-using-Rapid...".

Another good paper to start with is: Isabelle Guyon and André Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, 3(Mar):1157-1182, 2003.

If all of your variables are numeric (even the categorical ones, coded numerically), then a lot of statisticians recommend the LASSO/LARS methodology. If you search the web for it, I'm sure you will find numerous papers that can further your understanding.
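In SAS, one route to LASSO with a binary target is PROC HPGENSELECT; a rough sketch, assuming a 0/1 target and that the categorical predictors have already been dummy-coded to numeric (all names are placeholders):

/* Sketch: LASSO selection for a binary (logit) model, placeholder names. */
proc hpgenselect data=have;
   model target(event='1') = x1-x400 / dist=binary link=logit;
   selection method=lasso;
run;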

I hope I helped.


