<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Gradient Boosting is performing worse than random - Help please in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135806#M1247</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thank you, Wendy. I'm using inverse priors in the decision matrix, so would the misclassification rate of, let's say, a decision tree take into account that the data is sampled? Here's the situation driving my question: in situations where I deal with rare events (the event occurs in 5% of the data), I'll sometimes get a misclassification rate of, let's say, 15% on validation data. I then try oversampling (with inverse priors, of course), increasing the event proportion from 5% to 10%, 20%, or 30%, etc., and I end up getting misclassification rates higher than the original 15%. Is there a way to compare across different subsampling proportions? SAS's training material usually suggests oversampling for rare events, but I've been getting worse results when I do this.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Fri, 14 Mar 2014 17:46:59 GMT</pubDate>
    <dc:creator>Analyze_this</dc:creator>
    <dc:date>2014-03-14T17:46:59Z</dc:date>
    <item>
      <title>Gradient Boosting is performing worse than random - Help please</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135803#M1244</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello SASers,&lt;/P&gt;&lt;P&gt;I am working on a project with a binary target. The target distribution is 13.6% (event) vs. 86.4% (non-event). The decision tree, regression, and gradient boosting models all score around a 19% misclassification rate on the validation data. I have two questions, but first here are some details of my process flow:&lt;/P&gt;&lt;P&gt;I tried using inverse priors with the models' assessment statistic set to decision, but switched to misclassification after I realized the models performed marginally better under this setting.&lt;/P&gt;&lt;P&gt;The Data Partition node is set to 70% (train) and 30% (validation).&lt;/P&gt;&lt;P&gt;I tried oversampling the event cases to 33% of the data, but the misclassification rate rose to 20%.&lt;/P&gt;&lt;P&gt;First question: if I oversample, does the 20% misclassification rate take into account that I oversampled (i.e., the oversampled 20% misclassification is worse than the non-oversampled 19%)? Or is the oversampled 20% misclassification better than the non-oversampled 19%, because the event made up 33% of the oversampled observations and 20% is clearly an improvement?&lt;/P&gt;&lt;P&gt;Second question: do y'all have any suggestions about what is causing the models to perform worse than random, and how I might fix the problem?&lt;/P&gt;&lt;P&gt;Thank y'all so much for your time.&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;RWB&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 06 Mar 2014 20:58:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135803#M1244</guid>
      <dc:creator>Analyze_this</dc:creator>
      <dc:date>2014-03-06T20:58:54Z</dc:date>
    </item>
    <item>
      <title>Re: Gradient Boosting is performing worse than random - Help please</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135804#M1245</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Oops, I made a rookie mistake. I calculated the distribution from the histograms produced when exploring the variable, and I forgot to change my settings from (Top, Default) to (Random, Max). In actuality, the target distribution is around 30% (event) vs. 70% (non-event), so the models are adding to our predictive power.&lt;/P&gt;&lt;P&gt;I'm still curious about the first question I asked above. I'll restate it:&lt;/P&gt;&lt;P&gt;First question: if I oversample, does the 20% misclassification rate take into account that I oversampled (i.e., the oversampled 20% misclassification is worse than the non-oversampled 19%)? Or is the oversampled 20% misclassification better than the non-oversampled 19%, because the event made up 33% of the oversampled observations and 20% is clearly an improvement?&lt;/P&gt;&lt;P&gt;If y'all could help me solve this one, that would be great.&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 07 Mar 2014 14:56:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135804#M1245</guid>
      <dc:creator>Analyze_this</dc:creator>
      <dc:date>2014-03-07T14:56:15Z</dc:date>
    </item>
    <item>
      <title>Re: Gradient Boosting is performing worse than random - Help please</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135805#M1246</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;No, oversampling is not being accounted for unless you adjust your prior probabilities and/or decision matrix, either in the Input Data node or a Decisions node after you have sampled.&amp;nbsp; The "Detecting Rare Classes" section under Analytics &amp;gt; Predictive Modeling in the Enterprise Miner Reference Help provides the best practices for handling rare events.&lt;/P&gt;&lt;P&gt;Hope that helps,&lt;/P&gt;&lt;P&gt;Wendy Czika&lt;/P&gt;&lt;P&gt;SAS Enterprise Miner R&amp;amp;D&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 14 Mar 2014 16:26:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135805#M1246</guid>
      <dc:creator>WendyCzika</dc:creator>
      <dc:date>2014-03-14T16:26:59Z</dc:date>
    </item>
    <item>
      <title>Re: Gradient Boosting is performing worse than random - Help please</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135806#M1247</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thank you, Wendy. I'm using inverse priors in the decision matrix, so would the misclassification rate of, let's say, a decision tree take into account that the data is sampled? Here's the situation driving my question: in situations where I deal with rare events (the event occurs in 5% of the data), I'll sometimes get a misclassification rate of, let's say, 15% on validation data. I then try oversampling (with inverse priors, of course), increasing the event proportion from 5% to 10%, 20%, or 30%, etc., and I end up getting misclassification rates higher than the original 15%. Is there a way to compare across different subsampling proportions? SAS's training material usually suggests oversampling for rare events, but I've been getting worse results when I do this.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 14 Mar 2014 17:46:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135806#M1247</guid>
      <dc:creator>Analyze_this</dc:creator>
      <dc:date>2014-03-14T17:46:59Z</dc:date>
    </item>
    <item>
      <title>Re: Gradient Boosting is performing worse than random - Help please</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135807#M1248</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I'm unclear about what exactly you are doing when you say oversampling with inverse priors. If you are using the Sample node to sample a higher proportion of rare events, then you would need a Decisions node following it to adjust the prior probabilities. When using the same prior probabilities, it is valid to compare models built with different event proportions from oversampling. The "Prior Probabilities" section in the same part of the EM Reference Help that I mentioned above explains this better than I can!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 14 Mar 2014 19:27:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Gradient-Boosting-is-performing-worse-than-random-Help-please/m-p/135807#M1248</guid>
      <dc:creator>WendyCzika</dc:creator>
      <dc:date>2014-03-14T19:27:31Z</dc:date>
    </item>
  </channel>
</rss>