1000 possible predictive variables; one target variable -- PROFIT.
Looking for a way to whittle down the possible predictive variables.
Ideally I'd like to end up with one single best. Or a handful.
Last year someone suggested using HPSplit. The results obtained seemed inconclusive. Many, many parameters to guess at in that procedure, and I might have gotten most guesses wrong.
Lately I've been coming across lots of mention of XGBoost -- a new whiz kid on the algorithms block.
Wondering if you all would recommend using that?
Since XGBoost is not natively included in SAS 9.4, however, trying a built-in procedure would be preferred.
Any thoughts greatly appreciated.
Nicholas Kormanik
I greatly appreciate all your suggestions and insights.
Seemingly a straightforward problem. Yet if you don't know the proper tool for the job, you can be stumped forever.
Strongly recommend PROC PLS, which does not require you to "whittle down" the number of predictor variables. In this paper, the author takes 1000 predictor variables, many of which are highly correlated with one another, and creates a useful predictive model without the variable selection step. Please note: the syntax for PROC PLS has changed since that paper was written.
Adding to the above ... PLS is surprisingly robust against multicollinearity among the X variables, which is what enables the author to skip the variable selection step. And PLS deserves more attention and more use. Although there are probably a thousand published papers by now in which PLS has been used successfully, it is not widely known amongst data practitioners, and it should be!
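For anyone who wants to try this, a minimal PROC PLS sketch might look like the following. The data set name work.train and the predictor names x1-x1000 are placeholders I made up, not anything from the original post; PROFIT is the target from the question. The cross-validation settings and the seed are arbitrary choices, so treat this as a starting point rather than a recommended configuration.

```sas
ods graphics on;

/* Hypothetical names: work.train and x1-x1000 are placeholders. */
proc pls data=work.train
         method=pls
         cv=split(10)          /* split-sample cross validation, every 10th obs per group */
         cvtest(seed=12345)    /* randomization test to choose the number of factors */
         plots=(vip);          /* variable importance for projection (VIP) plot */
   model profit = x1-x1000;    /* all predictors go in; no pre-selection step */
   output out=work.pls_out predicted=profit_hat;
run;
```

The VIP plot gives a rough ranking of the predictors, which may help if a short list is still wanted afterwards. A common rule of thumb treats VIP values below about 0.8 as small, but that cutoff is a convention, not a test. With 1000 predictors the plot will be crowded.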
You might want to listen in on @sasmlp's webinar (see the announcement in this community), where high-dimensional variable selection in SAS will be covered (likely with an emphasis on HPGENSELECT).
SteveDenham