I have about 120 different attributes in my modelling data, so I would like to reduce my variable set.
But I am not sure whether I have to use two different Variable Selection nodes in my flow, one bypassing class variables and one bypassing interval variables, and then connect both of them to my modelling node.
I did so, but I end up with very few variables for my final model.
Is there a problem with the way I am using the Variable Selection node?
There are many ways to identify important variables including multiple options in the Variable Selection node depending on the measurement level of your target variable. If the variables that have been identified are not performing well, there could be many possible reasons contributing to the problem such as...
... limited information in the predictor variables
... poorly conditioned input variables (perhaps a transformation of the variables would perform better)
... mismatch between the selection method and the modeling method (e.g. it does not necessarily make sense to use a regression based linear variable selection technique when passing variables to a non-linear modeling algorithm like a Tree or Neural Network)
... lack of sufficient target signal (e.g. if you are modeling a rare event, it is possible that variables are being missed due to the criteria you are using for selecting them in which case oversampling and/or considering decision weights/priors might be of help)
... lack of model flexibility (e.g. using a regression without considering the possibility of higher order terms/interactions and/or considering more flexible modeling strategies)
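To illustrate the mismatch point in the list above, here is a small Python sketch (not SAS code, and not what the Variable Selection node does internally). It assumes a made-up input `x` and a purely quadratic target `y`, and compares a linear screen (squared Pearson correlation) with a crude tree-like screen (variance explained by bin means). The function names and the choice of five bins are illustrative assumptions.

```python
from statistics import mean

def pearson_r2(x, y):
    """Squared Pearson correlation: what a purely linear screen sees."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def binned_r2(x, y, n_bins=5):
    """Share of target variance explained by the means of equal-size bins of x."""
    pairs = sorted(zip(x, y))            # order observations by x
    size = len(pairs) // n_bins
    my = mean(y)
    total = sum((b - my) ** 2 for b in y)
    within = 0.0
    for i in range(n_bins):
        lo = i * size
        hi = len(pairs) if i == n_bins - 1 else lo + size  # last bin takes the remainder
        ys = [b for _, b in pairs[lo:hi]]
        m = mean(ys)
        within += sum((b - m) ** 2 for b in ys)
    return 1.0 - within / total

# Made-up data: x is symmetric around zero and y depends on x only through x squared.
x = [i / 10 for i in range(-50, 51)]
y = [v * v for v in x]

linear_score = pearson_r2(x, y)
binned_score = binned_r2(x, y)
print(round(linear_score, 4), round(binned_score, 4))
```

The linear score is essentially zero even though `y` is fully determined by `x`, while the binned score is high. A linear selection step would drop this input before a Tree or Neural Network ever had the chance to use it.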
In general, I strongly advocate using several different variable selection strategies including using multiple Variable Selection nodes with different settings and Decision Tree nodes to create a superset of possibly useful input variables. Depending on the model, further selection might be possible. Note that Decision Trees automatically select variables, Regression approaches optionally can use selection methods, and Random Forest models build Trees from subsets of variables as well as subsets of observations. Making sure you have not overly restricted the input variables but have considered possibly helpful binning and/or numeric transformations and are using sufficiently flexible modeling methods should help you to obtain the best possible predictions based on your data.
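The superset idea can be sketched outside Enterprise Miner as a simple union of the variable lists kept by each selection run. The Python below is only an illustration: the run names and variable names are hypothetical, and in practice the lists would come from the results of each node.

```python
from collections import Counter

# Hypothetical results of three selection runs (all names are made up).
selected = {
    "varsel_rsquare": {"age", "income", "tenure"},
    "varsel_chisquare": {"age", "region", "channel"},
    "decision_tree": {"income", "region", "balance"},
}

# Count how many runs kept each variable.
votes = Counter(name for chosen in selected.values() for name in chosen)

# Superset: anything kept by at least one run goes forward to modelling.
superset = sorted(votes)

# Variables kept by two or more runs, useful as a sanity check on agreement.
agreed = sorted(name for name, count in votes.items() if count >= 2)

print(superset)
print(agreed)
```

Passing the superset to the modelling node keeps inputs that only one method found useful, and the model's own selection can then prune further.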
I am not sure the metadata of each of those nodes would get passed to your modeling node the way you intended. A quick way to check: click the Variables ellipsis for the model node and confirm that the roles of your variables are what you expected.
Reading the doc, it does not seem to me that you would need to pass a few variables at a time to get more variables selected. Variable Selection does a distribution analysis and runs a stepwise regression to keep the most important variables. You are fine passing all variables at once.
If you would like to try other methods for variable selection, simply connect any of the below before your modeling node. I am pretty sure all of them have the variable selection option set to Yes by default.
Not sure what technique will work best for you. I guess it depends on the data. I use mostly Information Value or tree-based variable importance, but that is just my preference.
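For readers who have not met Information Value, here is a rough Python sketch of the usual formula for a binary target: sum over levels of (share of non-events minus share of events) times the log of their ratio. It is not what any SAS node runs internally. The function name, the sample data, and the `smooth` constant that avoids taking the log of zero for empty cells are all assumptions made for the example.

```python
import math
from collections import defaultdict

def information_value(levels, target, smooth=0.5):
    """Information Value of a class input against a binary (0/1) target.

    `smooth` is an assumed additive constant so empty cells do not give log(0).
    """
    counts = defaultdict(lambda: [0, 0])      # level -> [non-events, events]
    for level, t in zip(levels, target):
        counts[level][1 if t else 0] += 1
    k = len(counts)
    n_non = sum(c[0] for c in counts.values()) + smooth * k
    n_evt = sum(c[1] for c in counts.values()) + smooth * k
    iv = 0.0
    for non, evt in counts.values():
        p_non = (non + smooth) / n_non        # share of non-events in this level
        p_evt = (evt + smooth) / n_evt        # share of events in this level
        iv += (p_non - p_evt) * math.log(p_non / p_evt)
    return iv

# Made-up data: one input that separates the target and one that does not.
target      = [1, 1, 1, 0, 0, 0, 1, 0]
informative = ["a", "a", "a", "b", "b", "b", "a", "b"]
noise       = ["a", "b", "a", "b", "a", "b", "b", "a"]

iv_informative = information_value(informative, target)
iv_noise = information_value(noise, target)
print(round(iv_informative, 3), round(iv_noise, 3))
```

Interval inputs would need to be binned first. Ranking inputs by this value and keeping those above a cutoff is one common way to use it, and the cutoff itself is a judgment call.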
I hope it helps!
Miguel
Also check Cronbach's coefficient alpha in PROC CORR.
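As a reminder of what that statistic measures, here is a minimal Python sketch of the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the row totals). It is not PROC CORR output, and the function name and scores are made up for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: list of columns, each holding the scores of one variable for the same rows."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]        # per-row sum across the variables
    item_var = sum(variance(col) for col in items)    # sum of sample variances
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Made-up scores for three variables measured on four observations.
scores = [
    [2, 4, 3, 5],
    [3, 5, 3, 4],
    [2, 5, 4, 5],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```

A high alpha suggests the variables move together and may carry overlapping information, which can help when deciding which of a group of related inputs to keep.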
Xia Keshan