<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Effect of oversampling on fitted model in SAS Academy for Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Effect-of-oversampling-on-fitted-model/m-p/631795#M561</link>
    <description>&lt;P&gt;The lesson "What Is Separate Sampling?" under "Lesson 7: Model Assessment Using SAS Enterprise Miner" (module 3: Predictive Modeler using SAS Enterprise Miner) seems to imply that oversampling is done for efficiency reasons only, with minimal impact on the resulting model.&lt;/P&gt;&lt;P&gt;However, based on some tests I have done using the dataset "INQ2005", using oversampling instead of the full population can have a significant impact on the final model. In particular, a comparison of two Decision Tree models (full dataset with primary proportion = 3.15% vs oversampling with primary proportion = 50%) has shown the following results:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Different optimal subtrees are selected (based on Average Squared Error, ASE): the model based on the full dataset results in a subtree with 26 leaves vs 16 for the model based on oversampling.&lt;/LI&gt;&lt;LI&gt;Some of the splitting variables selected are different; this is also confirmed by differences in the list of variables reported under "Variable Importance".&lt;/LI&gt;&lt;LI&gt;More importantly, the performance (as measured by ASE) of the model based on the full sample shows a marked divergence between the training and validation datasets (pointing to overfitting), whereas the oversampled model behaves more consistently across the two.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Overall, my take on this is that oversampling is not just a matter of making the whole process more efficient: it also leads to "better models", in the sense that a balanced sample (i.e. a 50/50 split between primary and secondary outcomes) seems to help the model give equal importance to positive and negative cases, resulting in more stable models (i.e. with less overfitting).&lt;/P&gt;&lt;P&gt;I would appreciate hearing other opinions on the above.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Fri, 13 Mar 2020 07:30:25 GMT</pubDate>
    <dc:creator>pvareschi</dc:creator>
    <dc:date>2020-03-13T07:30:25Z</dc:date>
    <item>
      <title>Effect of oversampling on fitted model</title>
      <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Effect-of-oversampling-on-fitted-model/m-p/631795#M561</link>
      <description>&lt;P&gt;The lesson "What Is Separate Sampling?" under "Lesson 7: Model Assessment Using SAS Enterprise Miner" (module 3: Predictive Modeler using SAS Enterprise Miner) seems to imply that oversampling is done for efficiency reasons only, with minimal impact on the resulting model.&lt;/P&gt;&lt;P&gt;However, based on some tests I have done using the dataset "INQ2005", using oversampling instead of the full population can have a significant impact on the final model. In particular, a comparison of two Decision Tree models (full dataset with primary proportion = 3.15% vs oversampling with primary proportion = 50%) has shown the following results:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Different optimal subtrees are selected (based on Average Squared Error, ASE): the model based on the full dataset results in a subtree with 26 leaves vs 16 for the model based on oversampling.&lt;/LI&gt;&lt;LI&gt;Some of the splitting variables selected are different; this is also confirmed by differences in the list of variables reported under "Variable Importance".&lt;/LI&gt;&lt;LI&gt;More importantly, the performance (as measured by ASE) of the model based on the full sample shows a marked divergence between the training and validation datasets (pointing to overfitting), whereas the oversampled model behaves more consistently across the two.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Overall, my take on this is that oversampling is not just a matter of making the whole process more efficient: it also leads to "better models", in the sense that a balanced sample (i.e. a 50/50 split between primary and secondary outcomes) seems to help the model give equal importance to positive and negative cases, resulting in more stable models (i.e. with less overfitting).&lt;/P&gt;&lt;P&gt;I would appreciate hearing other opinions on the above.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 13 Mar 2020 07:30:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Effect-of-oversampling-on-fitted-model/m-p/631795#M561</guid>
      <dc:creator>pvareschi</dc:creator>
      <dc:date>2020-03-13T07:30:25Z</dc:date>
    </item>
    <item>
      <title>Re: Effect of oversampling on fitted model</title>
      <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Effect-of-oversampling-on-fitted-model/m-p/653985#M872</link>
      <description>I agree with your final conclusion. In building predictive models, we are after the champion model that can score new data more accurately. (This goal is different from using PROC SURVEYLOGISTIC for a population survey model or PROC LOGISTIC for inferential statistics.)&lt;BR /&gt;Performing separate sampling or oversampling is similar to using an equal number of replicates in clinical studies. Therefore, my recommendation is to try a balanced oversample when your target variable is a rare event.</description>
      <pubDate>Sun, 07 Jun 2020 02:45:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/Effect-of-oversampling-on-fitted-model/m-p/653985#M872</guid>
      <dc:creator>gcjfernandez</dc:creator>
      <dc:date>2020-06-07T02:45:51Z</dc:date>
    </item>
  </channel>
</rss>

