NKormanik
Barite | Level 11


1000 possible predictive variables; one target variable -- PROFIT.

 

Looking for a way to whittle down the possible predictive variables.

 

Ideally I'd like to end up with one single best.  Or a handful.

 

Last year someone suggested using HPSplit. The results obtained seemed inconclusive. Many, many parameters to guess at in that procedure, and I might have gotten most guesses wrong.

 

Lately I've been coming across lots of mention of XGBoost -- a new whiz kid on the algorithms block.

 

Wondering if you all would recommend using that?

 

Since XGBoost is not natively included in SAS 9.4, however, trying a built-in procedure would be preferred.

 

Any thoughts greatly appreciated.

 

Nicholas Kormanik

 

Reeza
Super User
You've already explored the "statistical basics" such as Principal Components and Variable Clustering?
Partial Least Squares Regression?

AFAIK XGBoost is a predictive algorithm (and has had good results IME, along with LightGBM), not a variable selection methodology.
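For anyone wanting to try the variable clustering route, here is a minimal sketch. The dataset name have, the predictor names x1-x1000, and the MAXEIGEN= cutoff of 0.7 are all placeholders chosen for illustration, not anything taken from the original question.

```sas
/* Sketch only: dataset HAVE and variables X1-X1000 are hypothetical names. */
proc varclus data=have maxeigen=0.7 short;
   var x1-x1000;   /* predictors only; leave PROFIT out of the clustering */
run;
```

A common follow-up is to keep one representative per cluster (often the variable with the lowest 1-R**2 ratio in the printed output) and model PROFIT on that reduced set.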
Ksharp
Super User
As Reeza said, XGBoost and decision trees are predictive models, and I don't think they suit your data, because your dependent variable PROFIT is continuous, not categorical.
Try PROC PLS or PROC HPGENSELECT:

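* Capture the data behind the VIP plot (requires ODS Graphics);
ods graphics on;
ods output VariableImportancePlot=VariableImportancePlot;
* Optional cross-validation for the PROC PLS statement: cv=split cvtest(seed=12345);
proc pls data=sashelp.class missing=em nfac=3 plot=(ParmProfiles VIP) details;
class sex;
model age=weight height sex;
output out=x predicted=p;
run;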


proc hpgenselect data=have ;
class birth_province sex shop_province ;
model profit = ..............
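The MODEL statement above is cut off in the post. As a rough sketch of what a complete call might look like, assuming the remaining predictors are named x1-x1000 (a hypothetical naming) and that stepwise selection judged by SBC is acceptable (my assumption, not something stated in the post):

```sas
/* Sketch only: HAVE and X1-X1000 are hypothetical; the CLASS variables are copied from the post. */
proc hpgenselect data=have;
   class birth_province sex shop_province;
   model profit = birth_province sex shop_province x1-x1000 / distribution=normal;
   selection method=stepwise(choose=sbc);   /* keep the model with the best SBC */
run;
```

The selection summary in the output lists the effects that entered or left at each step, which gives a shortlist of candidate predictors rather than a single definitive answer.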
NKormanik
Barite | Level 11

I greatly appreciate all your suggestions and insights.

 

Seemingly a straightforward problem.  Yet, if the proper tool for the job is not known, one can be stumped forever.

 

 

Ksharp
Super User
So learn statistics and probability theory to know the proper tool for the job.
@Rick_SAS's blog is a good place to start, as is the SAS documentation.
NKormanik
Barite | Level 11
Agree, @Ksharp. @Rick_SAS's blog posts are very informative. The rub is always that the problem at hand is a bit different from the example given. Or there's an option that who-the-hell knows what to do with.

If the world were just cookie-cutter clear....
PaigeMiller
Diamond | Level 26

Strongly recommend PROC PLS, which does not require you to "whittle down" the number of predictor variables. In this paper, the author takes 1000 predictor variables, many of which are highly correlated with one another, and creates a useful predictive model without the variable selection step. Please note: the syntax for PROC PLS has changed since that paper was written.

--
Paige Miller
PaigeMiller
Diamond | Level 26

Adding to the above ... PLS is surprisingly robust against multi-collinearity among the X variables, which enables the author to skip the variable selection step. And PLS deserves more attention and more use; although there are probably a thousand published papers now where PLS has been used successfully, it is not widely known amongst data practitioners, and it should be widely known!
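As an illustration of fitting PLS to all predictors at once, here is a minimal sketch. The dataset name have and the predictor names x1-x1000 are hypothetical, and the 10-way split and seed are arbitrary choices.

```sas
/* Sketch only: HAVE and X1-X1000 are hypothetical names. */
proc pls data=have method=pls cv=split(10) cvtest(seed=12345);
   model profit = x1-x1000;               /* no variable selection step beforehand */
   output out=scored predicted=p_profit;  /* predicted PROFIT for each observation */
run;
```

With CV= and CVTEST specified, the procedure uses cross-validation to choose the number of extracted factors instead of relying on a hand-picked NFAC= value.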

--
Paige Miller
SteveDenham
Jade | Level 19

You might want to listen in on @sasmlp's webinar (see the announcement in this community) where high-dimensional variable selection in SAS will be covered (likely this will emphasize HPGENSELECT).

 

SteveDenham
