<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Clustering with Too Many Variables in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Clustering-with-Too-Many-Variables/m-p/235633#M3361</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have 230 variables and 15,000 observations in my dataset. 30 of the variables are categorical. My goal is to find meaningful clusters in this population using the SAS EM Clustering node.&lt;/P&gt;&lt;P&gt;These are the steps that I apply before clustering:&lt;/P&gt;&lt;P&gt;- Outlier elimination&lt;BR /&gt;- Missing value imputation&lt;BR /&gt;- Encoding categorical variables (by creating binary dummy variables)&lt;/P&gt;&lt;P&gt;I have 4 questions:&lt;/P&gt;&lt;P&gt;1. Do you recommend any other analyses in order to obtain better results?&lt;BR /&gt;2. Is incorporating the categorical variables in clustering by binarizing them the best way to use them?&lt;BR /&gt;3. From what I have read, I have too many variables for clustering, so as a next step I need to reduce the number of input variables.&lt;BR /&gt;I tried applying 'Principal Components' and 'Variable Clustering' before 'Clustering'. I ended up with 2 different clusters, but I'm having trouble interpreting them.&lt;BR /&gt;When I check the output of the 'Segment Profile' node, I see the distributions of either the variable clusters or the principal components. How can I know which components are related to which variables?&lt;/P&gt;&lt;P&gt;4. How do I assess the results of clustering?&lt;BR /&gt;Thanks in advance&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;Gorkem&lt;/P&gt;</description>
    <pubDate>Fri, 20 Nov 2015 10:05:45 GMT</pubDate>
    <dc:creator>gorkemkilic</dc:creator>
    <dc:date>2015-11-20T10:05:45Z</dc:date>
    <item>
      <title>Clustering with Too Many Variables</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Clustering-with-Too-Many-Variables/m-p/235633#M3361</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have 230 variables and 15,000 observations in my dataset. 30 of the variables are categorical. My goal is to find meaningful clusters in this population using the SAS EM Clustering node.&lt;/P&gt;&lt;P&gt;These are the steps that I apply before clustering:&lt;/P&gt;&lt;P&gt;- Outlier elimination&lt;BR /&gt;- Missing value imputation&lt;BR /&gt;- Encoding categorical variables (by creating binary dummy variables)&lt;/P&gt;&lt;P&gt;I have 4 questions:&lt;/P&gt;&lt;P&gt;1. Do you recommend any other analyses in order to obtain better results?&lt;BR /&gt;2. Is incorporating the categorical variables in clustering by binarizing them the best way to use them?&lt;BR /&gt;3. From what I have read, I have too many variables for clustering, so as a next step I need to reduce the number of input variables.&lt;BR /&gt;I tried applying 'Principal Components' and 'Variable Clustering' before 'Clustering'. I ended up with 2 different clusters, but I'm having trouble interpreting them.&lt;BR /&gt;When I check the output of the 'Segment Profile' node, I see the distributions of either the variable clusters or the principal components. How can I know which components are related to which variables?&lt;/P&gt;&lt;P&gt;4. How do I assess the results of clustering?&lt;BR /&gt;Thanks in advance&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;Gorkem&lt;/P&gt;</description>
      <pubDate>Fri, 20 Nov 2015 10:05:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Clustering-with-Too-Many-Variables/m-p/235633#M3361</guid>
      <dc:creator>gorkemkilic</dc:creator>
      <dc:date>2015-11-20T10:05:45Z</dc:date>
    </item>
    <item>
      <title>Re: Clustering with Too Many Variables</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Clustering-with-Too-Many-Variables/m-p/387040#M5743</link>
      <description>&lt;P style="margin: 0in 0in 8pt;"&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;You're already doing some useful preprocessing, handling missing values and taking care of collinearity. Clustering is as much art as science, so there are many different pre- and post-processing tools that can be useful.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="margin: 0in 0in 8pt;"&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;First, I'd check that you've got variables that provide useful information about the segments you are interested in. Since you don't have a target variable in clustering, relevance of the inputs is determined based on your domain knowledge. Eliminate any that don't clearly have anything to do with your desired segments. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="margin: 0in 0in 8pt;"&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;It sounds like you will want to interpret the clusters, which sends you down one of two different paths to handle collinearity. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="margin: 0in 0in 8pt;"&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;One path is the way you went, by performing PCA, and using the PCs as input to the cluster analysis. Then, when you use the segment profile node, set the PC variables to not be used, but set the original input variables to be used instead. This will enable you to make sense of the clusters in terms of the original variables, even though the PCs were used for deriving clusters. &lt;/FONT&gt;&lt;/P&gt;
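&lt;P&gt;To illustrate the profiling idea outside Enterprise Miner, here is a minimal Python sketch, not the Segment Profile node itself. It assumes the cluster labels were already derived from the PCs, and it describes each segment by how far its mean on every original variable sits from the overall mean. The function name profile_segments, the variable names, and the toy data are all hypothetical.&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import defaultdict
from statistics import mean


def profile_segments(rows, labels):
    """Return {segment: {variable: segment mean minus overall mean}}."""
    variables = list(rows[0].keys())
    # Overall mean of each original variable across all observations.
    overall = {v: mean(r[v] for r in rows) for v in variables}
    # Group the original-variable rows by their assigned segment.
    by_segment = defaultdict(list)
    for row, label in zip(rows, labels):
        by_segment[label].append(row)
    return {
        seg: {v: mean(r[v] for r in members) - overall[v] for v in variables}
        for seg, members in by_segment.items()
    }


# Toy data (invented): original inputs, not PCs, plus the segment labels
# that a clustering step is assumed to have produced from the PCs.
rows = [
    {"age": 25, "spend": 100.0},
    {"age": 35, "spend": 300.0},
    {"age": 55, "spend": 900.0},
    {"age": 65, "spend": 700.0},
]
labels = [1, 1, 2, 2]
profile = profile_segments(rows, labels)
```
&lt;/PRE&gt;
&lt;P&gt;In this toy example, segment 1 comes out younger and lower-spending than average and segment 2 the opposite, which is the kind of reading the original variables allow and the PCs do not.&lt;/P&gt;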
&lt;P style="margin: 0in 0in 8pt;"&gt;&lt;FONT color="#000000" face="Calibri" size="3"&gt;The other path you can take is to select exemplar variables from the variable clustering, instead of using variable cluster scores. When you do this, the cluster analysis is based on a reduced number of input variables, which are still somewhat correlated. &lt;/FONT&gt;&lt;/P&gt;
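&lt;P&gt;As a rough sketch of exemplar selection, assuming the criterion is the 1-R^2 ratio that variable clustering output reports (lowest ratio within each variable cluster wins), something like the following Python would do it. The function name pick_exemplars and all the numbers are invented for illustration, not taken from this dataset.&lt;/P&gt;
&lt;PRE&gt;
```python
def pick_exemplars(stats):
    """Pick, per variable cluster, the variable with the lowest 1-R^2 ratio.

    ratio = (1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)
    """
    best = {}
    for name, cluster, r2_own, r2_next in stats:
        # Assumes r2_next is below 1; a value of exactly 1 would divide by zero.
        ratio = (1.0 - r2_own) / (1.0 - r2_next)
        current = best.get(cluster)
        if current is None or current[1] > ratio:
            best[cluster] = (name, ratio)
    return {cluster: name for cluster, (name, ratio) in best.items()}


# Invented numbers: (variable, cluster, R^2 own cluster, R^2 next closest).
stats = [
    ("income", "CLUS1", 0.90, 0.20),
    ("spend", "CLUS1", 0.80, 0.10),
    ("age", "CLUS2", 0.70, 0.30),
    ("tenure", "CLUS2", 0.85, 0.25),
]
exemplars = pick_exemplars(stats)
```
&lt;/PRE&gt;
&lt;P&gt;A low ratio means the variable is strongly tied to its own cluster and only weakly tied to the nearest other one, so it stands in for that cluster with little overlap.&lt;/P&gt;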
&lt;P&gt;I hope this helps!&lt;/P&gt;
&lt;P&gt;Cat&lt;/P&gt;</description>
      <pubDate>Thu, 10 Aug 2017 14:51:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Clustering-with-Too-Many-Variables/m-p/387040#M5743</guid>
      <dc:creator>CatTruxillo</dc:creator>
      <dc:date>2017-08-10T14:51:32Z</dc:date>
    </item>
  </channel>
</rss>

