Hello All,
I have a dataset with over 500k observations and about 500 variables. About 20 variables are categorical, nominal, or datetime, and the rest are numeric. I want to build a model using this dataset. My dependent variable is binary (1 or 0), but there are too many independent variables. I want to reduce the number of independent variables to about 30. Someone suggested I use a random forest and a variable importance plot to find the 30 most important variables. I have never used a random forest before, but I have a basic understanding of the theory behind it.
Edit: Someone also suggested using a chi-square test for feature selection.
Could you please show me an efficient way to find the 30 most important variables? I am using SAS Enterprise 7.1.
Any help would be great.