03-18-2014 03:59 PM
What is the best procedure to use if I want to:
1) find the best variables to use in a model out of 30, and
2) examine the best breaks or cutoffs once I find those variables?
For example, a score may be the best predictor of default (the dependent variable), segmented at 300, 500, 650, etc. Thanks
03-18-2014 04:17 PM
I am assuming that you've identified the variables which will be used as predictors. PROC VARCLUS can identify variables which are loading heavily and explaining most of the variation; that way you may select only a subset of the 30 variables for further analysis. In a second phase, use k-means clustering to find the best cut-offs.
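If it helps, the first phase above could be sketched roughly like this (the data set name WORK.LOANS, the variable names VAR1-VAR30, and the MAXEIGEN value are placeholders, not from the thread):

```sas
/* Sketch only: cluster the 30 candidate predictors so that each
   cluster holds variables that are highly correlated with each other. */
proc varclus data=work.loans maxeigen=0.7 short;
   var var1-var30;   /* the 30 candidate predictors */
run;

/* From each cluster, a common choice is to keep the variable with the
   lowest 1 - R**2 ratio as that cluster's representative, which
   reduces the predictor set well below 30. */
```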
03-18-2014 04:45 PM
This is a data reduction concept: we try to reduce the dimensionality of the data. PROC VARCLUS applies principal components analysis to identify groups of variables which are highly correlated within their clusters but least correlated with other groups. A loading is the correlation between a variable and the principal component. Please refer to the following link for further details.
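For the second phase (the k-means cut-off idea mentioned earlier), a one-dimensional clustering of the score itself can suggest break points. A hedged sketch, with WORK.LOANS, SCORE, and the cluster count as placeholder choices:

```sas
/* Sketch: cluster a single score variable into 4 groups with k-means;
   the boundaries between adjacent cluster means become candidate
   cut-offs (e.g. values in the spirit of 300 / 500 / 650 above). */
proc fastclus data=work.loans maxclusters=4 out=work.scored;
   var score;   /* one-dimensional clustering of the score */
run;
```

The OUT= data set carries a CLUSTER variable, so a quick PROC MEANS of SCORE by CLUSTER shows where the segments actually break.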
03-19-2014 12:51 PM
I was hoping to find a procedure that identifies the variables most significantly related to the dependent variable. For example, if I have 20 variables and 1 dependent variable, I want to know which of the 20 variables are best at predicting the dependent variable.
03-19-2014 01:42 PM
Why? You have all 20 measures. Do you mean which variables are most closely correlated with the predicted value? Then you need to consider the role of moderating and mediating variables. Or do you mean which single variable is the best predictor? If so, again I ask, why? If you have all the variables available, then not using them is just, well, ignoring what you do have. Or do you mean which variable (or variables) are the most economical predictors, in the sense of future data? By economical, I mean those that lead to accurate predicted values for the least cost of measurement.

I think you are concerned with building a predictive model. If so, subject-matter expertise should enter as well as statistical considerations. Parsimony for the sake of parsimony alone will lead to poor predictive models, just as overcomplexity can.
Use the methods outlined by @stat@sas above to get started. If you feel some sort of compulsion to try variable selection methods, look at the LAR and LASSO methods in PROC GLMSELECT. DO NOT USE STEPWISE, FORWARD, BACKWARD, OR ALL-POSSIBLE-SUBSETS REGRESSION. These have been shown to produce biased results that lead to poor predictive models. Google "Flom Cassell" for more info, or read Frank Harrell's book on regression methods (Regression Modeling Strategies).
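A minimal sketch of the LASSO route in PROC GLMSELECT; the data set WORK.LOANS, response BAD, and predictors X1-X20 stand in for the poster's actual names:

```sas
/* Sketch: LASSO selection, choosing the final model by cross
   validation rather than stopping heuristics. */
proc glmselect data=work.loans plots=coefficients;
   model bad = x1-x20 / selection=lasso(choose=cv stop=none);
run;
```

Note that GLMSELECT fits least-squares models; with a binary bad/good response, a logistic formulation (e.g. LASSO in PROC HPGENSELECT) may be the more natural fit.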
03-19-2014 01:54 PM
Great advice.. thanks..
To answer your questions: I have 20 variables as predictors (for example time-on-books, FICO score, utilization, location, product, etc.) and 1 response variable (bad or not bad, as in defaulted loans). A business unit has asked me to create a chart of the response variable segmented by the top 3 predictors. For example, separate the bads/goods by Location, Product, and FICO. It has to be the 3 most significant predictors, similar to a decision tree.
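Since the request is explicitly tree-like, one option is to fit an actual decision tree and read the top splitting variables and their split points directly. A sketch, assuming the placeholder data set WORK.LOANS and the variable names from the post:

```sas
/* Sketch: a shallow tree on the bad/good response. The variables used
   in the first few splits are natural candidates for the "top 3"
   segmenting predictors, and the split points give the cut-offs. */
proc hpsplit data=work.loans maxdepth=3;
   class bad location product;
   model bad = fico location product time_on_books utilization;
run;
```

The printed tree then doubles as the segmentation chart the business unit asked for: each leaf is a Location/Product/FICO segment with its own bad rate.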