de95
Calcite | Level 5

Hello All,

 

I have a dataset with over 500k observations and about 500 variables. About 20 of the variables are categorical, nominal, or datetime, and the rest are numeric. I want to build a model using this dataset. My dependent variable is binary (1 or 0), but there are too many independent variables, and I want to reduce them to about 30. Someone suggested using a random forest and a variable importance plot to find the 30 most important variables. I have never used random forests before, but I have a basic understanding of the theory behind them.

 

Edit: Someone also suggested a chi-square test for feature selection.

 

Could you please show me an efficient way to find the 30 most important variables? I am using SAS Enterprise 7.1.

 

Any help would be great.

1 REPLY
Ksharp
Super User

I would recommend using PROC HPGENSELECT.
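As a rough illustration of the HPGENSELECT route, here is a minimal sketch, not a tested solution. The dataset name `work.have`, the response `target`, the categorical inputs `cat1`-`cat3`, and the numeric inputs `x1`-`x480` are all hypothetical placeholders. Forward selection and the cap of 30 effects are my assumptions, chosen to match the goal of about 30 variables.

```sas
/* Sketch only: work.have, target, cat1-cat3 and x1-x480 are hypothetical names. */
proc hpgenselect data=work.have;
   /* Declare the categorical predictors as classification effects */
   class cat1 cat2 cat3;

   /* Binary response; model the probability that target = 1 */
   model target(event='1') = cat1 cat2 cat3 x1-x480 / dist=binary;

   /* Forward selection, stopping once at most 30 effects are in the model;
      the final model is the one with the best SBC along the selection path */
   selection method=forward(choose=sbc maxeffects=30) details=all;
run;
```

The selection summary in the output lists the effects in the order they entered, which can serve as a rough ranking. Observations with a missing value in any candidate variable are excluded, so it is worth checking how many of the 500k rows are actually used.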

 

or

 

PROC PLS with the MISSING=EM option (which handles missing values better).

 

If your variables have many missing values, try PROC PLS. (HPGENSELECT would drop the observations with missing values.)
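A minimal sketch of the PROC PLS route with MISSING=EM might look like the following. The names `work.have`, `target`, `cat1`-`cat3`, and `x1`-`x480` are hypothetical, and the choice of five factors is an arbitrary assumption that would normally be tuned (for example with cross validation). Note that PROC PLS treats the 0/1 response as a numeric variable, so this is a screening step rather than a final classifier.

```sas
/* Sketch only: work.have, target, cat1-cat3 and x1-x480 are hypothetical names. */
ods graphics on;

/* MISSING=EM imputes missing values with an EM algorithm instead of
   dropping incomplete observations; NFAC=5 is an arbitrary choice here */
proc pls data=work.have missing=em nfac=5 plots=vip;
   class cat1 cat2 cat3;
   model target = cat1 cat2 cat3 x1-x480;
run;
```

The VIP (variable importance for projection) plot ranks the predictors by their contribution to the extracted factors. One possible follow-up is to keep the roughly 30 predictors with the highest VIP values and refit a logistic model on those.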
