- what is the optimal way to use variable selection ...

06-02-2014 09:53 PM

I have about 120 different attributes in my modelling data, so I would like to reduce my variable set.

But I am not sure whether I have to use two different Variable Selection nodes in my flow, one bypassing class variables and one bypassing interval variables, and then connect both of them to my modelling node.

I did so, but I am ending up with very few variables for my final model.

Is there a problem with the way I am using the Variable Selection node?

Accepted Solutions

Solution

4 weeks ago

There are many ways to identify important variables, including multiple options in the Variable Selection node that depend on the measurement level of your target variable. If the variables that have been identified are not performing well, many possible reasons could be contributing to the problem, such as...

... limited information in the predictor variables

... poorly conditioned input variables (perhaps a transformation of the variables would perform better)

... mismatch between the selection method and the modeling method (e.g. it does not necessarily make sense to use a regression-based linear variable selection technique when passing variables to a non-linear modeling algorithm like a Tree or Neural Network)

... lack of sufficient target signal (e.g. if you are modeling a rare event, it is possible that variables are being missed due to the criteria you are using for selecting them in which case oversampling and/or considering decision weights/priors might be of help)

... lack of model flexibility (e.g. using a regression without considering the possibility of higher order terms/interactions and/or considering more flexible modeling strategies)
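The mismatch between a linear selection criterion and a non-linear model can be illustrated with a small sketch. This is not how the Variable Selection node works internally; it is a plain-Python illustration on synthetic data, and the helper names `pearson` and `binned_r2` are made up for this example. The input is related to the target only through a squared term, so a linear criterion sees almost nothing while a binned (tree-like) criterion sees a strong signal.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def binned_r2(xs, ys, bins=5):
    """Share of target variance explained by the mean of y within equal-count bins of x."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    size = len(xs) // bins
    grand = mean(ys)
    total = sum((y - grand) ** 2 for y in ys)
    between = 0.0
    for b in range(bins):
        # The last bin absorbs any remainder observations
        idx = order[b * size:] if b == bins - 1 else order[b * size:(b + 1) * size]
        bin_mean = mean(ys[i] for i in idx)
        between += len(idx) * (bin_mean - grand) ** 2
    return between / total

# Synthetic data (an assumption for illustration): y depends on x only through x squared
random.seed(0)
x = [random.uniform(-1, 1) for _ in range(2000)]
y = [xi ** 2 + random.gauss(0, 0.05) for xi in x]

linear_r2 = pearson(x, y) ** 2   # what a purely linear criterion would see
tree_like_r2 = binned_r2(x, y)   # what a binned, tree-like criterion would see
print(f"linear R^2: {linear_r2:.3f}, binned R^2: {tree_like_r2:.3f}")
```

A variable like this would likely be dropped by a linear screen even though a Tree or Neural Network could use it, which is one reason to pair the selection method with the model type.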

In general, I strongly advocate using several different variable selection strategies, including multiple Variable Selection nodes with different settings and Decision Tree nodes, to create a superset of possibly useful input variables. Depending on the model, further selection might be possible. Note that Decision Trees automatically select variables, Regression approaches can optionally use selection methods, and Random Forest models build Trees from subsets of variables as well as subsets of observations. Make sure you have not overly restricted the input variables, have considered possibly helpful binning and/or numeric transformations, and are using sufficiently flexible modeling methods; this should help you obtain the best possible predictions from your data.
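The superset idea can be sketched outside Enterprise Miner as a simple union with vote counts. This is only an illustration: the strategy labels and variable names below are hypothetical, and in practice the lists would come from the results of each selection node.

```python
from collections import Counter

def build_superset(selections):
    """Union the variables kept by each strategy and count how many strategies kept each."""
    votes = Counter()
    for kept in selections.values():
        votes.update(set(kept))  # set() guards against a name listed twice by one strategy
    # Most frequently selected first; ties broken alphabetically for stable output
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical results from three selection strategies
selections = {
    "varsel_r2": ["income", "age", "tenure"],
    "varsel_chi2": ["age", "region"],
    "decision_tree": ["income", "age", "num_products"],
}

superset = build_superset(selections)
for name, n_strategies in superset:
    print(name, n_strategies)
```

Keeping the vote count alongside the union makes it easy to start with every candidate and later prune variables that only one strategy picked, if the model turns out to need fewer inputs.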

All Replies

06-02-2014 11:15 PM

I am not sure the metadata from each of those nodes gets passed to your modeling node the way you intended. A quick way to check: click the Variables ellipsis for the model node and confirm that the roles of your variables are what you expected.

Reading the doc, it does not seem to me that you need to pass a few variables at a time to get more variables selected. Variable Selection does distribution analysis and runs a stepwise regression to keep the most important variables. You are fine passing all variables at once.

If you would like to try other methods for variable selection, simply connect any of the nodes below before your modeling node. I am pretty sure all of them have the variable selection option set to Yes by default.

- Tree or tree ensemble nodes have variable selection turned on by default. Try a Decision Tree node, HPTree node, Gradient Boosting, or HPForest.
- Partial Least Squares and Survival nodes have variable selection options.
- Interaction terms. The Variable Selection and Regression nodes have options to test interactions. Set Use Interactions to Yes on the VS node. Set Two-Factor Interactions to Yes in the Regression node.
- Gini or Information Value variable importance from the Interactive Grouping Node (licensed with Credit Scoring for SAS Enterprise Miner).

Not sure what technique will work best for you. I guess it depends on the data. I use mostly Information Value or tree-based variable importance, but that is just my preference.
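For readers unfamiliar with Information Value, here is a minimal sketch of the usual weight-of-evidence formulation for a binary target. It is not the Interactive Grouping node's implementation; the function name, the 0.5 floor for empty cells, and the example counts are all assumptions made for illustration.

```python
import math

def information_value(groups):
    """groups: list of (n_events, n_non_events), one pair per bin of a grouped variable."""
    total_events = sum(e for e, _ in groups)
    total_non = sum(n for _, n in groups)
    iv = 0.0
    for events, non_events in groups:
        # A small floor avoids log(0) for empty cells; this is an assumption of the sketch
        p_event = max(events, 0.5) / total_events
        p_non = max(non_events, 0.5) / total_non
        woe = math.log(p_non / p_event)  # weight of evidence for this bin
        iv += (p_non - p_event) * woe
    return iv

# Hypothetical counts: the event rate changes a lot across bins of the first
# variable and hardly at all across bins of the second
iv_strong = information_value([(10, 190), (30, 170), (60, 140)])
iv_weak = information_value([(33, 167), (34, 166), (33, 167)])
print(f"strong: {iv_strong:.3f}, weak: {iv_weak:.4f}")
```

Variables can then be ranked by IV, with the cutoff for keeping a variable left as a modeling judgment.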

I hope it helps!

Miguel

06-03-2014 08:53 AM

Also check PROC CORR, which can compute Cronbach's coefficient alpha (the ALPHA option).

Xia Keshan
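As a rough illustration of what coefficient alpha measures, here is a plain-Python sketch of the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the row totals). It is not PROC CORR itself; the function name and the data are made up. One possible reading of the suggestion above is that a high alpha across a group of inputs means they carry largely overlapping information.

```python
from statistics import variance

def cronbach_alpha(columns):
    """columns: list of equal-length numeric lists, one list per variable."""
    k = len(columns)
    item_var = sum(variance(col) for col in columns)   # sum of per-variable sample variances
    totals = [sum(row) for row in zip(*columns)]       # per-observation sum across variables
    return k / (k - 1) * (1 - item_var / variance(totals))

# Hypothetical data: three variables that move closely together
cols = [
    [2, 4, 4, 5, 7],
    [3, 5, 4, 6, 8],
    [1, 3, 5, 5, 6],
]
alpha = cronbach_alpha(cols)
print(f"alpha = {alpha:.3f}")
```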
