de95
Calcite | Level 5

Hello All,

 

I have a dataset with over 500k observations and about 500 variables. About 20 variables are categorical, nominal, or datetime, and the rest are numeric. I want to build a model using this dataset. My dependent variable is binary (1 or 0), but there are too many independent variables, and I want to reduce them to about 30. Someone suggested I use a random forest and an importance plot to find the 30 most important variables. I have never used a random forest before, but I have a basic understanding of the theory behind it.

 

Edit: Someone also suggested a chi-square test for feature selection.

 

Could you please show me an efficient way to find the 30 most important variables? I am using SAS Enterprise 7.1.

 

Any help would be great.

1 REPLY 1
Ksharp
Super User

I would recommend using PROC HPGENSELECT
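As a rough illustration of what an HPGENSELECT run for this problem might look like, here is a minimal sketch. The data set name WORK.MYDATA, the response TARGET, and the predictor names CAT1-CAT3 and X1-X100 are all hypothetical placeholders, and the choice of forward selection with the SBC criterion is an assumption, not something stated in the thread. MAXEFFECTS= is used here to cap the model at roughly 30 effects; check the documentation for your release to confirm how the intercept is counted.

```sas
/* Sketch only: WORK.MYDATA, TARGET, CAT1-CAT3 and X1-X100 are hypothetical names */
proc hpgenselect data=work.mydata;
   class cat1 cat2 cat3;                        /* categorical predictors */
   model target(event='1') = cat1 cat2 cat3 x1-x100
         / dist=binary;                         /* binary (1/0) response */
   /* forward selection, final model chosen by SBC,
      with the number of effects capped at about 30 */
   selection method=forward(choose=sbc maxeffects=30);
run;
```

The selection summary in the output lists the effects in the order they entered, which gives a ranked shortlist to carry into the final model.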

 

or

 

PROC PLS with the MISSING=EM option (which handles missing values better).

 

If your variables have many missing values, try PROC PLS. (HPGENSELECT would drop observations with missing values.)
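For the PROC PLS route, a minimal sketch might look like the following. Again, WORK.MYDATA, TARGET, CAT1-CAT3, and X1-X100 are hypothetical names, and NFAC=5 is an arbitrary assumption for the number of extracted factors. Note that PROC PLS treats the 1/0 response as a numeric variable, so this is a screening step rather than a final logistic model. The VIP plot request asks for variable importance profiles, which can be used to rank the predictors.

```sas
/* Sketch only: WORK.MYDATA, TARGET, CAT1-CAT3 and X1-X100 are hypothetical names */
ods graphics on;

proc pls data=work.mydata
         missing=em          /* impute missing values with the EM algorithm */
         nfac=5              /* number of factors: an arbitrary assumption */
         plots=(vip);        /* variable importance plot */
   class cat1 cat2 cat3;     /* categorical predictors */
   model target = cat1 cat2 cat3 x1-x100;
run;
```

With around 500 predictors the plot will be crowded, so it may be easier to fit the model on groups of variables or to read the importance values from the procedure's output rather than from the graph.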
