03-18-2014 03:59 PM

What is the best procedure to use if I want to:

1) Find the best variables to use in a model, out of 30 candidates, and

2) Examine the best breaks or cutoffs once I find those variables?

For example, a score may be the best predictor of default (the dependent variable), segmented at 300, 500, 650, etc. Thanks

03-18-2014 04:17 PM

Hi,

I am assuming that you've already identified the variables that will be used as predictors. PROC VARCLUS can identify the variables that load heavily on their cluster components and explain most of the variation. That way you can carry only some of the variables into further analysis, possibly far fewer than 30. In the second phase, use k-means clustering to find the best cutoffs.
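
The two phases described above could be sketched roughly as follows. This is only an illustration: the dataset `work.loans`, the predictors `x1-x30`, the variable `score`, the `MAXEIGEN=` threshold, and the choice of four clusters are all assumptions, not values from this thread.

```sas
/* Phase 1: group correlated numeric predictors into clusters.       */
/* MAXEIGEN=0.7 is an assumed threshold; a cluster keeps splitting   */
/* while its second eigenvalue exceeds it.                           */
proc varclus data=work.loans maxeigen=0.7 short;
   var x1-x30;              /* hypothetical numeric predictors */
run;

/* Phase 2: k-means on a single retained variable to suggest breaks. */
/* MAXCLUSTERS=4 is an assumed number of segments.                   */
proc fastclus data=work.loans maxclusters=4 out=work.score_segments;
   var score;               /* hypothetical variable to segment */
run;

/* The min and max of score within each cluster give candidate cutoffs. */
proc means data=work.score_segments n min max;
   class cluster;           /* CLUSTER is written by PROC FASTCLUS */
   var score;
run;
```

Note that neither step uses the dependent variable, so the clusters and cutoffs reflect the structure of the predictors only, not their relationship to default.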

Thanks,

03-18-2014 04:29 PM

Thank you for the response, this is very helpful. What do you mean by "loading heavily"?

03-18-2014 04:45 PM

This is a data reduction concept: we try to reduce the dimensionality of the data. PROC VARCLUS applies principal components to identify groups of variables that are highly correlated within their own cluster but only weakly correlated with the other clusters. A loading is the correlation between a variable and a principal component. Please refer to the following link for further details.
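
To make the idea of a loading concrete, one could compute a few principal components and then correlate each original variable with them. This is a minimal sketch using PROC PRINCOMP rather than PROC VARCLUS itself; the dataset `work.loans` and the variables `x1-x30` are hypothetical.

```sas
/* Write the first two principal component scores (Prin1, Prin2) */
/* to an output dataset alongside the original variables.        */
proc princomp data=work.loans out=work.pc_scores n=2 noprint;
   var x1-x30;              /* hypothetical numeric predictors */
run;

/* Correlate every original variable with each component.        */
/* Variables with large absolute correlations "load heavily".    */
proc corr data=work.pc_scores nosimple;
   var x1-x30;
   with Prin1 Prin2;
run;
```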

03-19-2014 12:51 PM

I was hoping to find a procedure that finds the variables that are most significant for the dependent variable. For example, if I have 20 variables and 1 dependent variable, I want to know which of the 20 variables are best at predicting the dependent variable.

03-19-2014 01:42 PM

Why? You have all 20 measures. Do you mean which variables are most closely correlated with the predicted value? Then you need to consider the role of moderating and mediating variables. Or do you mean which single variable is the best predictor? If so, again I ask, why? If you have all variables available, then to not use them is just, well, ignoring what you do have. Or do you mean which variable (or variables) are the most economical predictors, in the sense of future data? By economical, I mean those that lead to accurate predicted values for the least cost of measurement. I think you are concerned about building a predictive model. If so, subject matter expertise should enter as well as statistical considerations. Parsimony for the sake of parsimony alone will always lead to poor predictive models, just as overcomplexity can.

Use the methods outlined by @stat@sas above to get started. If you feel some sort of compulsion to try variable selection methods, look at LAR and LASSO methods in GLMSELECT. DO NOT USE STEPWISE, FORWARD, BACKWARD OR ALL POSSIBLE SUBSETS REGRESSION. These have been shown to produce biased results that lead to poor predictive models. Google "Flom Cassell" for more info, or read Frank Harrell's book on regression methods.
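
A LASSO run in PROC GLMSELECT might look roughly like the sketch below. The dataset `work.loans`, the numeric response `y`, the predictors `x1-x20`, the seed, and the five-fold cross validation are assumptions chosen for illustration only.

```sas
/* LASSO selection, with the final model chosen by 5-fold cross     */
/* validation. STOP=NONE lets the full LASSO path be computed       */
/* before CHOOSE=CV picks the step with the best CV error.          */
proc glmselect data=work.loans seed=12345 plots=coefficients;
   model y = x1-x20         /* hypothetical response and predictors */
         / selection=lasso(choose=cv stop=none)
           cvmethod=random(5);
run;
```

The coefficient plot shows the order in which variables enter as the penalty is relaxed, which is often more informative than the final selected set alone.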

Steve Denham

03-19-2014 01:54 PM

Great advice, thanks.

To answer your questions: I have 20 variables as predictors (for example time-on-books, FICO score, utilization, location, product, etc.) and 1 response variable (bad or not bad, as in defaulted loans). A business unit has asked me to create a chart of the response variable segmented by the top 3 predictors. For example, separate the bads and goods by Location, Product, and FICO. It has to be the 3 most significant predictors, similar to a decision tree.
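
Since the request resembles a decision tree, one possible starting point is to fit a shallow tree and read off both the variables it splits on and the cutoffs it picks. This is a sketch under several assumptions: the dataset `work.loans`, the variable names `bad`, `fico`, `time_on_books`, `utilization`, `location`, and `product`, the depth limit of 3, and a SAS/STAT release in which PROC HPSPLIT supports the CLASS and MODEL statements.

```sas
/* Shallow classification tree for the bad / not-bad response.      */
/* MAXDEPTH=3 is an assumed limit that keeps the tree chartable.    */
proc hpsplit data=work.loans maxdepth=3;
   class bad location product;      /* categorical response and inputs */
   model bad = fico time_on_books utilization location product;
   grow entropy;                    /* splitting criterion */
   prune costcomplexity;            /* cost-complexity pruning */
run;
```

The splits on a continuous input such as `fico` act as data-driven cutoffs, and the variables used near the top of the tree are natural candidates for the three segmentation variables, subject to a sanity check from the business side.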