Solved: Clarification on Variable Clustering

pvareschi · Posted 06-08-2020 04:04 AM

Re: Predictive Modeling Using Logistic Regression

With regard to using Variable Clustering as a way of dealing with input redundancy (page 3.40 of course text):

Q1. Does it make sense to include binary variables (including dummy variables derived from categorical inputs) when running Variable Clustering? Should I include missing indicators? My concern is that these variables may distort the way PROC VARCLUS defines the clusters. Moreover, how should we interpret the results if, for instance, the dummy variables from a single categorical input are spread across different clusters? I feel that categorical/binary variables, by their very nature, are better screened for relevance, using methods such as Chi-Square or Variable Importance from a Decision Tree.

Q2. Is it not too restrictive to select only one variable from each cluster? If so, how would I select two variables from each cluster: would it make sense to pick those with the lowest and highest "1-R2" ratio?

sasmlp · Posted 06-10-2020 12:00 PM

PROC VARCLUS is used in this course for dimension reduction, specifically to reduce the number of redundant variables. When selecting variables from each cluster, we recommend using the R-square with its own cluster, the R-square with the next closest cluster, and the 1 - R-square ratio. The inclusion of the binary variables might compromise the inferences, but I do not think they will bias the R-square statistics to any great extent. We recommend that you include the missing indicator variables. The demonstration shows that binary variables can be highly correlated, so if you have many binary variables, we recommend that you reduce the redundancy. Including all the binary variables, especially when they are highly correlated, in the subset selection methods in PROC LOGISTIC can be problematic.
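As a rough illustration of how those three statistics fit together, the sketch below computes the 1 - R-square ratio as (1 - R-square with own cluster) / (1 - R-square with next closest cluster) and keeps the variable with the lowest ratio in each cluster. It is written in Python rather than SAS only so that it is self-contained; the variable names and R-square values are made up for the example, and `pick_representatives` is a hypothetical helper, not part of PROC VARCLUS.

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R2 with own cluster) / (1 - R2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)


def pick_representatives(rows):
    """Return {cluster: (variable, ratio)} for the lowest-ratio variable.

    rows: iterable of (cluster, variable, r2_own, r2_next) tuples, such as
    values copied by hand from the R-squared table of a clustering run.
    """
    best = {}
    for cluster, name, r2_own, r2_next in rows:
        ratio = one_minus_r2_ratio(r2_own, r2_next)
        # A low ratio means the variable is strongly tied to its own
        # cluster and only weakly tied to the nearest other cluster.
        if cluster not in best or ratio < best[cluster][1]:
            best[cluster] = (name, ratio)
    return best


# Illustrative values only (not taken from the course data).
rows = [
    (1, "x1", 0.90, 0.10),
    (1, "x2", 0.60, 0.30),
    (2, "mi_x3", 0.80, 0.20),  # hypothetical missing indicator
    (2, "x4", 0.85, 0.05),
]
reps = pick_representatives(rows)
for cluster, (name, ratio) in sorted(reps.items()):
    print(cluster, name, round(ratio, 3))
```

Subject-matter knowledge can still override the lowest ratio, for example when another variable in the same cluster is easier to interpret or cheaper to collect.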

Choosing more than one variable in a cluster is fine if the variables are not highly correlated. Including highly correlated variables can cause problems when you are eliminating irrelevant variables in PROC LOGISTIC. I recommend examining the R-square statistics in PROC VARCLUS to make that determination.
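One way to picture that check is to compare a candidate second variable against the one already chosen and keep it only if the two are not too strongly related. The sketch below uses the squared pairwise Pearson correlation as a stand-in for the R-square statistics that PROC VARCLUS reports. The 0.5 cutoff, the function names, and the data are all assumptions made for the example, not recommendations from the course.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def can_add_second(chosen, candidate, max_r2=0.5):
    """True if the candidate's squared correlation with the chosen
    variable is below max_r2 (an arbitrary, illustrative cutoff)."""
    return pearson_r(chosen, candidate) ** 2 < max_r2


# Made-up data: y nearly duplicates x, z is only weakly related to x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10.5]
z = [3, 1, 5, 2, 4]

print(can_add_second(x, y))  # y is redundant with x
print(can_add_second(x, z))  # z adds mostly new information
```

Whatever cutoff is used, the point is the same as in the reply above: a second variable from a cluster is only worth keeping if it is not highly correlated with the first.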

