<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic EMINER Decision Tree Analysis when only a SMALL proportion of dataset has the target variable in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/EMINER-Decision-Tree-Analysis-when-only-a-SMALL-proportion-of/m-p/791243#M9041</link>
    <description>&lt;P&gt;I have datasets with 1 million observations and a mixture of variable types (e.g. categorical, interval, etc.). Some datasets work great with decision trees - that is, those where a larger proportion of the data has the target variable "true".&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For example, my target variable is binary - 1 for true and 0 for false.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In some cases, as few as 0.2% of cases have the target as true. When running DTs for these datasets, EMiner will not attempt to prune.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;How do I get around this issue? I want to be able to find the things that split the whole dataset. If I sample 10,000 observations, where 10% have the true target variable and 90% don't, I will find a split, but it will be biased toward my unrepresentative 10,000 sample. For example, I want to be able to say that 100% of people in my 1m have the target variable true if they are blonde and have size 3 feet, etc.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is it simply not possible to use decision trees when you have such a small proportion of data that have the target variable?&lt;/P&gt;</description>
    <pubDate>Thu, 20 Jan 2022 19:17:36 GMT</pubDate>
    <dc:creator>EC27556</dc:creator>
    <dc:date>2022-01-20T19:17:36Z</dc:date>
    <item>
      <title>EMINER Decision Tree Analysis when only a SMALL proportion of dataset has the target variable</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/EMINER-Decision-Tree-Analysis-when-only-a-SMALL-proportion-of/m-p/791243#M9041</link>
      <description>&lt;P&gt;I have datasets with 1 million observations and a mixture of variable types (e.g. categorical, interval, etc.). Some datasets work great with decision trees - that is, those where a larger proportion of the data has the target variable "true".&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For example, my target variable is binary - 1 for true and 0 for false.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In some cases, as few as 0.2% of cases have the target as true. When running DTs for these datasets, EMiner will not attempt to prune.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;How do I get around this issue? I want to be able to find the things that split the whole dataset. If I sample 10,000 observations, where 10% have the true target variable and 90% don't, I will find a split, but it will be biased toward my unrepresentative 10,000 sample. For example, I want to be able to say that 100% of people in my 1m have the target variable true if they are blonde and have size 3 feet, etc.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is it simply not possible to use decision trees when you have such a small proportion of data that have the target variable?&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jan 2022 19:17:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/EMINER-Decision-Tree-Analysis-when-only-a-SMALL-proportion-of/m-p/791243#M9041</guid>
      <dc:creator>EC27556</dc:creator>
      <dc:date>2022-01-20T19:17:36Z</dc:date>
    </item>
    <item>
      <title>Re: EMINER Decision Tree Analysis when only a SMALL proportion of dataset has the target variable</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/EMINER-Decision-Tree-Analysis-when-only-a-SMALL-proportion-of/m-p/791654#M9042</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope all your observations have the target variable, but not all your observations have the &lt;U&gt;&lt;STRONG&gt;target event&lt;/STRONG&gt;&lt;/U&gt;.&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;(Single) Decision trees might not be the best choice for modelling rare events.&lt;BR /&gt;But it can be done.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You need to oversample the rare event or under-sample the non-event, and then you need to use the Enterprise Miner Target Profiler so that the algorithm knows about the difference between the sample priors and the real priors.&lt;/P&gt;
&lt;P&gt;The priors are used, for example, to adjust the posterior probabilities so that they reflect the real priors.&lt;/P&gt;
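&lt;P&gt;To make the prior adjustment concrete, here is a minimal Python sketch of the standard Bayes correction for a binary target. It is only an illustration, not Enterprise Miner code, and it assumes (without checking the EM internals) that this is the kind of rescaling the priors drive. The function name adjust_posterior and the figures are hypothetical, taken from the question: a sample with 10% events drawn from data where 0.2% are events.&lt;/P&gt;
&lt;PRE&gt;
```python
def adjust_posterior(p_sample, sample_prior, true_prior):
    """Rescale an event probability estimated on an oversampled
    training set so that it reflects the population (true) priors.

    Standard Bayes prior correction for a binary target; this is an
    illustrative assumption, not Enterprise Miner's documented code.
    """
    # Weight each class by the ratio of its true prior to its sample prior.
    event = p_sample * true_prior / sample_prior
    non_event = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    # Renormalise so the two adjusted probabilities sum to 1.
    return event / (event + non_event)


# Hypothetical figures: 10% events in the sample, 0.2% in the full data.
root = adjust_posterior(0.10, sample_prior=0.10, true_prior=0.002)
leaf = adjust_posterior(0.50, sample_prior=0.10, true_prior=0.002)
print(round(root, 4), round(leaf, 4))
```
&lt;/PRE&gt;
&lt;P&gt;A node whose sample event rate equals the sample prior maps back to the real prior, while a leaf that looks like 50% events in the oversampled data corresponds to a much smaller probability in the full population.&lt;/P&gt;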
&lt;P&gt;See here :&lt;BR /&gt;SAS® Enterprise Miner™ 15.2 Reference Help&lt;BR /&gt;Enterprise Miner Target Profiler&lt;BR /&gt;&lt;A href="https://go.documentation.sas.com/doc/en/emref/15.2/n0z1mtvsscypjqn1ediv223jq5iy.htm" target="_blank"&gt;https://go.documentation.sas.com/doc/en/emref/15.2/n0z1mtvsscypjqn1ediv223jq5iy.htm&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Good luck,&lt;/P&gt;
&lt;P&gt;Koen&lt;/P&gt;</description>
      <pubDate>Sat, 22 Jan 2022 18:23:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/EMINER-Decision-Tree-Analysis-when-only-a-SMALL-proportion-of/m-p/791654#M9042</guid>
      <dc:creator>sbxkoenk</dc:creator>
      <dc:date>2022-01-22T18:23:22Z</dc:date>
    </item>
    <item>
      <title>Re: EMINER Decision Tree Analysis when only a SMALL proportion of dataset has the target variable</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/EMINER-Decision-Tree-Analysis-when-only-a-SMALL-proportion-of/m-p/792227#M9045</link>
      <description>&lt;P&gt;Ok, thanks, so in order of nodes it would be - data source - sample - target profiler - tree?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;And how would the resulting tree look then?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Say I had 1m observations in total and 10k had the event true (1 in 100).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I sampled so that I had 90k where the event wasn't true (instead of 990k) and 10k where the event was true, how would the tree look? Would the first node of the tree show 1=1% or 10%? Obviously I would like it to show 1%, as that is the event proportion for the whole population.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jan 2022 17:08:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/EMINER-Decision-Tree-Analysis-when-only-a-SMALL-proportion-of/m-p/792227#M9045</guid>
      <dc:creator>EC27556</dc:creator>
      <dc:date>2022-01-25T17:08:48Z</dc:date>
    </item>
  </channel>
</rss>

