Re: Predictive Modeling Using Logistic Regression
With regard to using Variable Clustering as a way of dealing with input redundancy (page 3.40 of course text):
Q1. Does it make sense to include binary variables (including dummy variables created from categorical inputs) when running Variable Clustering? Should I also include missing-value indicators? My concern is that those variables may distort the way Proc Varclus defines the clusters. Moreover, how should we interpret the results if, for instance, the dummy variables from a single categorical input are spread across different clusters? I feel that categorical/binary variables, by their very nature, are better screened for relevance, using methods such as the Chi-Square test or Variable Importance from a Decision Tree.
Q2. Is it not too restrictive to select only 1 variable from each cluster? If so, how would I select 2 variables from each cluster: would it make sense to pick the ones with the lowest and the highest "1-R2" ratio?
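
To make Q2 concrete, here is a minimal sketch of the selection rule I have in mind. It assumes the usual definition of the ratio, (1 - R2 with own cluster) / (1 - R2 with next closest cluster). The R2 values, variable names, and helper functions below are hypothetical and made up for illustration; they do not come from an actual Proc Varclus run.

```python
# Hypothetical values for illustration only (not real Proc Varclus output).
# Each tuple: (variable, cluster, R2 with own cluster, R2 with next closest cluster)
rows = [
    ("x1", 1, 0.90, 0.10),
    ("x2", 1, 0.80, 0.40),
    ("x3", 1, 0.60, 0.05),
    ("x4", 2, 0.85, 0.20),
    ("x5", 2, 0.70, 0.30),
]


def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R2 own cluster) / (1 - R2 next closest cluster); smaller is better."""
    return (1.0 - r2_own) / (1.0 - r2_next)


def rank_by_cluster(rows):
    """Return {cluster: [variable names sorted by ascending 1-R2 ratio]}."""
    ranked = {}
    for name, cluster, r2_own, r2_next in rows:
        ratio = one_minus_r2_ratio(r2_own, r2_next)
        ranked.setdefault(cluster, []).append((ratio, name))
    return {c: [name for _, name in sorted(pairs)] for c, pairs in ranked.items()}


ranking = rank_by_cluster(rows)

# Usual rule: one representative per cluster, the variable with the lowest ratio.
lowest = {c: names[0] for c, names in ranking.items()}

# The rule asked about in Q2: lowest and highest ratio within each cluster.
lowest_and_highest = {c: (names[0], names[-1]) for c, names in ranking.items()}

print(lowest)
print(lowest_and_highest)
```

My hesitation with the second rule is that, under this definition, the highest-ratio variable is the one that is least well explained by its own cluster or closest to a neighbouring one, so I am not sure whether it adds information or just noise.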