omerzeybek
Obsidian | Level 7

I have about 120 different attributes in my modelling data, so I would like to reduce my variable set.

But I am not sure whether I have to use two different Variable Selection nodes in my flow, one bypassing class variables and one bypassing interval variables, and then connect both of them to my modelling node.

I did so, but I end up with very few variables for my final model.

Is there a problem with the way I am using the Variable Selection node?

ACCEPTED SOLUTION
DougWielenga
SAS Employee

There are many ways to identify important variables including multiple options in the Variable Selection node depending on the measurement level of your target variable.   If the variables that have been identified are not performing well, there could be many possible reasons contributing to the problem such as...

... limited information in the predictor variables

... poorly conditioned input variables (perhaps a transformation of the variables would perform better)

... mismatch between the selection method and the modeling method (e.g. it does not necessarily make sense to use a regression based linear variable selection technique when passing variables to a non-linear modeling algorithm like a Tree or Neural Network)

... lack of sufficient target signal (e.g. if you are modeling a rare event, it is possible that variables are being missed due to the criteria you are using for selecting them in which case oversampling and/or considering decision weights/priors might be of help)

... lack of model flexibility (e.g. using a regression without considering the possibility of higher order terms/interactions and/or considering more flexible modeling strategies)

 

In general, I strongly advocate using several different variable selection strategies including using multiple Variable Selection nodes with different settings and Decision Tree nodes to create a superset of possibly useful input variables.  Depending on the model, further selection might be possible.  Note that Decision Trees automatically select variables, Regression approaches optionally can use selection methods, and Random Forest models build Trees from subsets of variables as well as subsets of observations.   Making sure you have not overly restricted the input variables but have considered possibly helpful binning and/or numeric transformations and are using sufficiently flexible modeling methods should help you to obtain the best possible predictions based on your data.  
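
The superset idea can be sketched outside Enterprise Miner in a few lines of Python. This is only an illustration under stated assumptions: the strategy names and variable names below are hypothetical, and in practice each set would be the list of inputs that one selection node left un-rejected.

```python
from typing import Dict, Set


def build_superset(selections: Dict[str, Set[str]]) -> Set[str]:
    """Union of the variables kept by every selection strategy."""
    superset: Set[str] = set()
    for kept in selections.values():
        superset |= kept
    return superset


def vote_counts(selections: Dict[str, Set[str]]) -> Dict[str, int]:
    """How many strategies kept each variable (a rough robustness signal)."""
    counts: Dict[str, int] = {}
    for kept in selections.values():
        for name in kept:
            counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical outputs of three selection runs (names are made up).
selections = {
    "vs_r_square": {"age", "income", "tenure"},
    "vs_chi_square": {"income", "region"},
    "decision_tree": {"tenure", "balance", "income"},
}

superset = build_superset(selections)
counts = vote_counts(selections)
```

The union keeps any variable that at least one strategy found useful, which is the point of a superset; the counts are an optional extra for seeing which inputs several strategies agree on.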

M_Maldonado
Barite | Level 11

I am not sure the metadata of each of those nodes would get passed to your modeling node the way you intended. A quick way to check: click the Variables ellipsis for the model node and confirm that the roles of your variables are what you expected.

Reading the documentation, it does not seem to me that you would need to pass a few variables at a time to get more variables selected. Variable Selection does a distribution analysis and runs a stepwise regression to keep the most important variables. You are fine passing all variables at once.

If you would like to try other methods for variable selection, simply connect any of the nodes below before your modeling node. I am pretty sure all of them have their variable selection option set to Yes by default.

  • Tree and tree-ensemble nodes have variable selection turned on by default. Try a Decision Tree node, HPTree node, Gradient Boosting, or HPForest.
  • Partial Least Squares and Survival nodes have variable selection options.
  • Interaction terms. The Variable Selection and Regression nodes have options to test interactions. Set Use Interactions to Yes on the Variable Selection node. Set Two-Factor Interactions to Yes in the Regression node.
  • Gini or Information Value variable importance from the Interactive Grouping Node (licensed with Credit Scoring for SAS Enterprise Miner).

Not sure what technique will work best for you. I guess it depends on the data. I use mostly Information Value or tree-based variable importance, but that is just my preference.
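
For readers unfamiliar with Information Value, here is a minimal Python sketch of the usual formula for a binary target: the sum over levels of (share of non-events minus share of events) times the log of their ratio. It assumes the input is categorical or already binned. The `smoothing` parameter is an assumption added here to avoid division by zero in empty cells; it is not necessarily how the Interactive Grouping node handles them.

```python
import math
from collections import defaultdict


def information_value(values, targets, smoothing=0.5):
    """Information Value of a categorical (or pre-binned) input vs. a 0/1 target."""
    events = defaultdict(float)      # target == 1 counts per level
    non_events = defaultdict(float)  # target == 0 counts per level
    for value, target in zip(values, targets):
        if target == 1:
            events[value] += 1
        else:
            non_events[value] += 1

    levels = set(events) | set(non_events)
    # Smoothed totals so the per-level shares still sum to 1.
    total_events = sum(events.values()) + smoothing * len(levels)
    total_non_events = sum(non_events.values()) + smoothing * len(levels)

    iv = 0.0
    for level in levels:
        p_event = (events[level] + smoothing) / total_events
        p_non_event = (non_events[level] + smoothing) / total_non_events
        iv += (p_non_event - p_event) * math.log(p_non_event / p_event)
    return iv


# Hypothetical toy data: "informative" separates the target, "noise" does not.
targets = [1, 1, 0, 0, 0, 1]
informative = ["a", "a", "a", "b", "b", "b"]
noise = ["a", "b", "a", "b", "a", "b"]

iv_informative = information_value(informative, targets)
iv_noise = information_value(noise, [0, 0, 1, 1, 0, 0][:0] or [1, 1, 0, 0, 1, 1])
```

Each term in the sum is non-negative, so a larger value means the levels of the input separate events from non-events more strongly. Ranking inputs by this value and keeping the top ones is one way to mimic an importance-based selection.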

I hope it helps!

Miguel

Ksharp
Super User

Also check Cronbach's coefficient alpha, which PROC CORR can compute with its ALPHA option.
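
As a rough illustration of what that statistic measures, the standard formula is alpha = k/(k-1) * (1 - sum of item variances / variance of the row totals), where k is the number of variables. The Python sketch below is an assumption-laden stand-in, not PROC CORR output: it uses sample variances and made-up data. A high alpha for a group of inputs suggests they carry largely overlapping information, which may make some of them candidates for removal.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of columns (one equal-length list per variable)."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two variables")
    # Sum of the individual (sample) variances of each variable.
    item_variance_sum = sum(variance(column) for column in items)
    # Variance of the per-observation totals across all variables.
    totals = [sum(row) for row in zip(*items)]
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: three identical columns, i.e. fully redundant inputs.
columns = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
alpha = cronbach_alpha(columns)
```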

Xia Keshan
