de95
Calcite | Level 5

Hello All,

 

I have a dataset with over 500k observations and about 500 variables. About 20 variables are categorical, nominal, or datetime, and the rest are numeric. I want to build a model using this dataset. I have a dependent variable that is binary (1 or 0), but there are too many independent variables. I want to reduce the number of independent variables to about 30. Someone suggested I use a random forest and an importance plot to find the 30 most important variables. I have never used random forest before, but I have a basic understanding of the theory behind it.

 

Edit: Someone also suggested a chi-square test for feature selection.

 

Could you please show me an efficient way to find the 30 most important variables? I am using SAS Enterprise Guide 7.1.

 

Any help would be great.

1 REPLY 1
Ksharp
Super User

I would recommend using PROC HPGENSELECT.
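A minimal sketch of what that could look like for a binary target. The dataset name (work.train), the target name (target), and the variable names (num1-num480, cat1-cat20) are all hypothetical placeholders, and forward selection capped at 30 effects is only one possible choice of selection method:

```sas
/* Hypothetical names: work.train, target, num1-num480, cat1-cat20. */
/* Replace them with your own dataset and variable names.           */
proc hpgenselect data=work.train;
   /* Declare the categorical predictors */
   class cat1-cat20;
   /* Binary response: model the probability that target = 1 */
   model target(event='1') = num1-num480 cat1-cat20 / dist=binary;
   /* Forward selection; stops when no candidate meets the entry */
   /* criterion or when the model reaches 30 effects             */
   selection method=forward(maxeffects=30);
run;
```

The selection summary in the output lists the effects in the order they entered the model, which gives a rough ranking of the chosen variables.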

 

or

 

PROC PLS with the MISSING=EM option (which handles missing values better).

 

If your variables have many missing values, try PROC PLS. (HPGENSELECT would drop the observations with missing values.)
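A rough sketch of the PROC PLS route, again with hypothetical names (work.train, target, num1-num480, cat1-cat20). The number of factors (NFAC=5) is an arbitrary assumption that you would want to tune, and note that PLS treats the 0/1 target as a numeric response:

```sas
/* Hypothetical names: work.train, target, num1-num480, cat1-cat20. */
ods graphics on;

/* MISSING=EM imputes missing values with the EM algorithm        */
/* instead of dropping the observation.                           */
/* NFAC=5 is an assumed number of factors, not a recommendation.  */
/* PLOTS=VIP requests the variable importance plot.               */
proc pls data=work.train method=pls missing=em nfac=5 plots=vip;
   class cat1-cat20;
   model target = num1-num480 cat1-cat20;
run;
```

The variable importance (VIP) plot ranks the predictors by their contribution to the fitted factors, so you could keep the 30 with the highest VIP values.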

