<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic HPSPLIT Grow Statement for Imbalanced Data in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/HPSPLIT-Grow-Statement-for-Imbalanced-Data/m-p/308109#M4625</link>
    <description>&lt;P&gt;I am using the HPSPLIT procedure to fit a classification tree. In the Grow statement, I have used "entropy." However, I recently learned that this criterion may be sensitive to imbalanced data. One of my outcome groups is almost twice the size of the other. Does anyone have suggestions for which of the other Grow options (&lt;SPAN&gt;CHAID, CHISQUARE, FASTCHAID, and GINI) may be less sensitive to imbalanced data? Thank you!&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 30 Oct 2016 01:22:21 GMT</pubDate>
    <dc:creator>smb11</dc:creator>
    <dc:date>2016-10-30T01:22:21Z</dc:date>
    <item>
      <title>HPSPLIT Grow Statement for Imbalanced Data</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/HPSPLIT-Grow-Statement-for-Imbalanced-Data/m-p/308109#M4625</link>
      <description>&lt;P&gt;I am using the HPSPLIT procedure to fit a classification tree. In the Grow statement, I have used "entropy." However, I recently learned that this criterion may be sensitive to imbalanced data. One of my outcome groups is almost twice the size of the other. Does anyone have suggestions for which of the other Grow options (&lt;SPAN&gt;CHAID, CHISQUARE, FASTCHAID, and GINI) may be less sensitive to imbalanced data? Thank you!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 30 Oct 2016 01:22:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/HPSPLIT-Grow-Statement-for-Imbalanced-Data/m-p/308109#M4625</guid>
      <dc:creator>smb11</dc:creator>
      <dc:date>2016-10-30T01:22:21Z</dc:date>
    </item>
    <item>
      <title>Re: HPSPLIT Grow Statement for Imbalanced Data</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/HPSPLIT-Grow-Statement-for-Imbalanced-Data/m-p/308373#M4633</link>
      <description>&lt;P&gt;Both Entropy and Gini can be sensitive to unbalanced data, because node purity is computed from the proportion of observations in the node at each response level. This is usually a bigger problem in rare-event modeling. One outcome group being twice the size of the other is unlikely to be a large issue.&lt;/P&gt;
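&lt;P&gt;To make the point about proportions concrete, here is a minimal Python sketch (not SAS, and not how HPSPLIT itself is implemented) that computes the textbook entropy and Gini impurity for two root-node class mixes. The 2:1 and 99:1 mixes and the function names are assumptions chosen only for illustration.&lt;/P&gt;

```python
import math


def entropy(proportions):
    """Shannon entropy (base 2) of a node's class proportions."""
    # Zero proportions contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in proportions if p)


def gini(proportions):
    """Gini impurity of a node's class proportions."""
    return 1.0 - sum(p * p for p in proportions)


# Hypothetical root-node class mixes, chosen only for illustration.
two_to_one = [2 / 3, 1 / 3]  # one outcome group twice the size of the other
rare_event = [0.99, 0.01]    # rare-event setting

for name, props in [("2:1", two_to_one), ("99:1", rare_event)]:
    print(name, round(entropy(props), 3), round(gini(props), 3))
```

&lt;P&gt;With a 2:1 mix, the root impurity is still close to the two-class maximum (1 bit for entropy, 0.5 for Gini), so splits have plenty of impurity to reduce. With a 99:1 mix, the root is already nearly pure by both measures, which is why rare-event data is the harder case.&lt;/P&gt;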
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Additionally, CHAID and FastCHAID should both be less sensitive than Entropy and Gini to outcome groups of unequal size. That said, if the imbalance is too large, it might be better practice to oversample the smaller outcome group beforehand.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have the time and the setup, I would recommend building several decision trees using different criteria and then using validation data to choose the best tree.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Oct 2016 19:57:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/HPSPLIT-Grow-Statement-for-Imbalanced-Data/m-p/308373#M4633</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2016-10-31T19:57:16Z</dc:date>
    </item>
  </channel>
</rss>

