<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Clarification on Variable Clustering in SAS Academy for Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Clarification-on-Variable-Clustering/m-p/654276#M884</link>
    <description>&lt;P&gt;Re: Predictive Modeling Using Logistic Regression&lt;/P&gt;
&lt;P&gt;With regard to using Variable Clustering as a way of dealing with input redundancy (page 3.40 of course text):&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Q1.&lt;/STRONG&gt; Does it make sense to include binary variables (including dummy variables derived from categorical variables) when running Variable Clustering, and should I include missing indicators? My concern is that those variables may negatively affect the way PROC VARCLUS defines the clusters. Moreover, how should we interpret the results if, for instance, the dummy variables from a categorical input are spread across different clusters? I feel that categorical/binary variables, by their very nature, are better screened based on relevance, using methods such as Chi-Square or Variable Importance from a Decision Tree.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Q2.&lt;/STRONG&gt; Is it not too restrictive to select only 1 variable from each cluster? If so, how would I select 2 variables from each cluster: would it make sense to pick those with the lowest and highest "1-R2" ratio?&lt;/P&gt;</description>
    <pubDate>Mon, 08 Jun 2020 08:04:25 GMT</pubDate>
    <dc:creator>pvareschi</dc:creator>
    <dc:date>2020-06-08T08:04:25Z</dc:date>
    <item>
      <title>Clarification on Variable Clustering</title>
      <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Clarification-on-Variable-Clustering/m-p/654276#M884</link>
      <description>&lt;P&gt;Re: Predictive Modeling Using Logistic Regression&lt;/P&gt;
&lt;P&gt;With regard to using Variable Clustering as a way of dealing with input redundancy (page 3.40 of course text):&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Q1.&lt;/STRONG&gt; Does it make sense to include binary variables (including dummy variables derived from categorical variables) when running Variable Clustering, and should I include missing indicators? My concern is that those variables may negatively affect the way PROC VARCLUS defines the clusters. Moreover, how should we interpret the results if, for instance, the dummy variables from a categorical input are spread across different clusters? I feel that categorical/binary variables, by their very nature, are better screened based on relevance, using methods such as Chi-Square or Variable Importance from a Decision Tree.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Q2.&lt;/STRONG&gt; Is it not too restrictive to select only 1 variable from each cluster? If so, how would I select 2 variables from each cluster: would it make sense to pick those with the lowest and highest "1-R2" ratio?&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jun 2020 08:04:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Clarification-on-Variable-Clustering/m-p/654276#M884</guid>
      <dc:creator>pvareschi</dc:creator>
      <dc:date>2020-06-08T08:04:25Z</dc:date>
    </item>
    <item>
      <title>Re: Clarification on Variable Clustering</title>
      <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Clarification-on-Variable-Clustering/m-p/656460#M899</link>
      <description>&lt;P&gt;PROC VARCLUS is used in this course for dimension reduction, specifically to reduce the number of redundant variables. When choosing which variables to keep, we recommend using the R-square with its own cluster, the R-square with the next closest cluster, and the 1 - R-square ratio. The inclusion of the binary variables might compromise the inferences, but I do not think they will bias the R-square statistics to any great extent. We recommend that you include the missing indicator variables. The demonstration shows that binary variables can be highly correlated, so if you have many binary variables, we recommend that you reduce the redundancy. Including all the binary variables in the subset selection methods in PROC LOGISTIC can be problematic, especially when they are highly correlated.&lt;/P&gt;
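&lt;P&gt;As a rough illustration of the 1 - R-square ratio mentioned above, the sketch below computes (1 - R-square with own cluster) / (1 - R-square with next closest cluster) and keeps the variable with the smallest ratio in each cluster. The variable names, cluster numbers, and R-square values are hypothetical and are not taken from the course data, and the helper functions are illustrative only; in practice the statistics would come from the PROC VARCLUS output.&lt;/P&gt;

```python
# Hypothetical values for illustration only; not taken from the course data.
# Each row: (variable, cluster, R2 with own cluster, R2 with next closest cluster)
rows = [
    ("x1", 1, 0.90, 0.10),
    ("x2", 1, 0.75, 0.30),
    ("x3", 2, 0.80, 0.20),
    ("x4", 2, 0.60, 0.05),
]


def one_minus_r2_ratio(r2_own, r2_next):
    """Return (1 - R2 with own cluster) / (1 - R2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)


def pick_representatives(rows):
    """For each cluster, keep the variable with the smallest 1 - R2 ratio."""
    clusters = sorted({cluster for _, cluster, _, _ in rows})
    chosen = {}
    for c in clusters:
        members = [r for r in rows if r[1] == c]
        best = min(members, key=lambda r: one_minus_r2_ratio(r[2], r[3]))
        chosen[c] = best[0]
    return chosen


if __name__ == "__main__":
    for name, cluster, r2_own, r2_next in rows:
        ratio = one_minus_r2_ratio(r2_own, r2_next)
        print(name, cluster, round(ratio, 3))
    print(pick_representatives(rows))
```

&lt;P&gt;A small ratio means that a variable is strongly related to its own cluster and only weakly related to the next closest one, which is why the sketch selects the minimum within each cluster.&lt;/P&gt;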
&lt;P&gt;Choosing more than one variable in a cluster is fine if the variables are not highly correlated. Including highly correlated variables can cause problems when you are eliminating irrelevant variables in PROC LOGISTIC. I recommend examining the R-square statistics in PROC VARCLUS to make that determination.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2020 16:00:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Clarification-on-Variable-Clustering/m-p/656460#M899</guid>
      <dc:creator>sasmlp</dc:creator>
      <dc:date>2020-06-10T16:00:19Z</dc:date>
    </item>
  </channel>
</rss>

