Using SAS to find the most important variables out of a large number of variables

de95 · Posted 07-31-2019 11:54 AM

Hello All,

I have a dataset with over 500k observations and about 500 variables. About 20 variables are categorical, nominal, or datetime, and the rest are numeric. I want to build a model using this dataset. I have a dependent variable that is binary (1 or 0), but there are too many independent variables. I want to reduce the number of independent variables to about 30. Someone suggested I use a random forest and an importance plot to find the 30 most important variables. I have never used random forest before, but I have a basic understanding of the theory behind it.

edit: Also someone suggested Chi-square for feature selection.

Could you please show me an efficient way to find the 30 most important variables? I am using SAS Enterprise 7.1.

Any help would be great.

Ksharp · Posted 08-01-2019 07:48 AM

I would recommend using PROC HPGENSELECT

or

PROC PLS with the MISSING=EM option (which handles missing values better).

If your variables have many missing values, try PROC PLS. (HPGENSELECT would drop the observations that have missing values.)
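
A rough sketch of what those two suggestions might look like is below. The dataset name (work.train), the response name (target), and the variable lists (cat1-cat10, num1-num480) are hypothetical placeholders, and the selection settings (stepwise, CHOOSE=SBC, MAXEFFECTS=30, NFAC=5) are assumptions for illustration, not something recommended in this thread. Check the option names against the documentation for your installed SAS/STAT release before relying on them.

```sas
/* Hypothetical names: work.train, target, cat1-cat10, num1-num480. */

/* Option 1: PROC HPGENSELECT with stepwise selection for a binary target.
   Observations with a missing value in any model variable are dropped.
   MAXEFFECTS=30 is an assumed cap to keep roughly 30 effects. */
proc hpgenselect data=work.train;
   class cat1-cat10;
   model target(event='1') = cat1-cat10 num1-num480 / dist=binary;
   selection method=stepwise(choose=sbc maxeffects=30);
run;

/* Option 2: PROC PLS with MISSING=EM, which imputes missing values
   instead of dropping the observations. PLS treats the 0/1 target
   as a numeric response. NFAC=5 is an arbitrary starting point;
   PLOTS=VIP requests the variable importance plot. */
ods graphics on;
proc pls data=work.train missing=em nfac=5 plots=vip;
   class cat1-cat10;
   model target = cat1-cat10 num1-num480;
run;
```

With the first approach, the selected effects appear in the selection summary; with the second, the variables with the highest importance values in the VIP plot would be the candidates to keep.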
