<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Change to Oversampling seed creates different results. in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Change-to-Oversampling-seed-creates-different-results/m-p/43095#M245</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am using Enterprise Miner 7.1 to create a response model for a direct response marketing campaign.&amp;nbsp; My sample data consists of about 69,000 records with a response rate of 0.8%. I am oversampling to 40% response and 60% non-response. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am trying different transformation techniques and modeling techniques and using the Model Comparison node to choose a model.&amp;nbsp; Out of curiosity, I changed the seed in the Sample node that creates my oversample.&amp;nbsp; When I did this, I saw changes in which models were selected and which variables were selected in those models. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What is causing this, and should I be concerned about it?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I would greatly appreciate any thoughts.&amp;nbsp; &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you in advance.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Mon, 26 Mar 2012 16:59:44 GMT</pubDate>
    <dc:creator>mstell</dc:creator>
    <dc:date>2012-03-26T16:59:44Z</dc:date>
    <item>
      <title>Change to Oversampling seed creates different results.</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Change-to-Oversampling-seed-creates-different-results/m-p/43095#M245</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am using Enterprise Miner 7.1 to create a response model for a direct response marketing campaign.&amp;nbsp; My sample data consists of about 69,000 records with a response rate of 0.8%. I am oversampling to 40% response and 60% non-response. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am trying different transformation techniques and modeling techniques and using the Model Comparison node to choose a model.&amp;nbsp; Out of curiosity, I changed the seed in the Sample node that creates my oversample.&amp;nbsp; When I did this, I saw changes in which models were selected and which variables were selected in those models. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What is causing this, and should I be concerned about it?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I would greatly appreciate any thoughts.&amp;nbsp; &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you in advance.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 26 Mar 2012 16:59:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Change-to-Oversampling-seed-creates-different-results/m-p/43095#M245</guid>
      <dc:creator>mstell</dc:creator>
      <dc:date>2012-03-26T16:59:44Z</dc:date>
    </item>
    <item>
      <title>Re: Change to Oversampling seed creates different results.</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Change-to-Oversampling-seed-creates-different-results/m-p/387940#M5801</link>
      <description>&lt;P&gt;If you have 69,000 records with a 0.8% response rate, that amounts to only 552 events. &amp;nbsp;Assuming you kept all of your events and undersampled your non-events so that the 552 events represent 40% of your sample, you have only 1,380 total observations in your training data set. &amp;nbsp;If you do any partitioning, that number drops even further. &amp;nbsp;&lt;/P&gt;
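&lt;P&gt;The arithmetic behind these counts can be checked with a short Python sketch. It assumes, as described above, that every event is kept and that non-events are undersampled to reach a 40/60 mix; the variable names are illustrative only.&lt;/P&gt;
&lt;PRE&gt;
```python
# Sample-size arithmetic using the figures quoted in this thread.
total_records = 69_000
response_rate = 0.008   # 0.8% of records are responders (events)
event_share = 0.40      # events should make up 40% of the oversample

events = round(total_records * response_rate)      # all events are kept
sample_size = round(events / event_share)          # total rows in the oversample
sampled_non_events = sample_size - events          # non-events that survive sampling
available_non_events = total_records - events      # non-events in the full data
non_event_fraction = sampled_non_events / available_non_events

print(f"events: {events}")
print(f"oversample size: {sample_size}")
print(f"non-events kept: {sampled_non_events} of {available_non_events}")
print(f"share of non-events kept: {non_event_fraction:.2%}")
```
&lt;/PRE&gt;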
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There are several issues to consider in this scenario:&lt;/P&gt;
&lt;P&gt;1. You have a limited number of events, likely too few to split the raw data into training and validation sets, so I would recommend using the cross-validation methods available in your modeling nodes.&lt;/P&gt;
&lt;P&gt;2. &amp;nbsp;Your sample contains only 828 of the roughly 68,448 available non-events (about 1.2%), which is a small fraction, so it is possible (even likely) that the nature of your sampled non-events varies quite a bit as you change the seed.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;3. &amp;nbsp;If you have any missing values, your sample is even smaller unless you impute the missing values and/or use a method (e.g. Decision Tree) which does not rely on complete observations.&lt;/P&gt;
&lt;P&gt;4. &amp;nbsp;If you have variables that are highly related to one another (be it linearly or otherwise), you can see very different models from slightly different samples of the input data. &amp;nbsp;Decision Trees are highly unstable and can look dramatically different even though the underlying predictions might be similar. &amp;nbsp;&lt;/P&gt;
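&lt;P&gt;To get a feel for the second point, the following Python sketch draws 828 non-events from 68,448 under two different seeds and counts how many rows the two draws share. It uses simple random sampling as a stand-in for the Sample node, which is an assumption, and the seed values are arbitrary.&lt;/P&gt;
&lt;PRE&gt;
```python
import random

AVAILABLE_NON_EVENTS = 68_448   # 69,000 records minus 552 events
SAMPLED_NON_EVENTS = 828        # non-events kept for a 40/60 mix

def draw_non_events(seed):
    """Return the set of non-event row indices kept for a given seed."""
    rng = random.Random(seed)
    return set(rng.sample(range(AVAILABLE_NON_EVENTS), SAMPLED_NON_EVENTS))

first = draw_non_events(seed=12345)
second = draw_non_events(seed=54321)

# Rows that appear in both samples; everything else differs between seeds.
shared = len(first.intersection(second))
print(f"non-events shared by both seeds: {shared} of {SAMPLED_NON_EVENTS}")
```
&lt;/PRE&gt;
&lt;P&gt;With so small a sampling fraction, two seeds will typically share only a handful of non-events, so each run is effectively modeling the same events against an almost entirely different set of non-events.&lt;/P&gt;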
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You have several things that you might try to do:&lt;/P&gt;
&lt;P&gt;1. &amp;nbsp;Use the cross-validation options (they differ from node to node)&lt;/P&gt;
&lt;P&gt;2. &amp;nbsp;Take a larger proportion of non-events (if so, set up a target profile using the Decisions... capability in the Input Data Source node and use the Default with Inverse Prior Weights... option)&lt;/P&gt;
&lt;P&gt;3. &amp;nbsp;Try using the Memory-Based Reasoning node which uses one model to isolate easily classified observations and then fits a model to the remaining observations. &amp;nbsp;In this way, you are likely to avoid oversampling and can use your entire data set.&lt;/P&gt;
&lt;P&gt;4. &amp;nbsp;Fit a forest using the HP Forest node which will take samples of observations and variables and fit separate models which can then be combined into a final model. &amp;nbsp;&lt;/P&gt;
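&lt;P&gt;Whichever option you choose, predictions from a model trained on oversampled data sit on the oversampled scale. As a rough illustration only, the Python sketch below applies the standard prior-correction formula to map a predicted probability back to the population scale. This is a generic technique and is not claimed to be what the Inverse Prior Weights option does internally; the function and constant names are hypothetical.&lt;/P&gt;
&lt;PRE&gt;
```python
POPULATION_EVENT_RATE = 0.008   # 0.8% response rate in the full data
SAMPLE_EVENT_RATE = 0.40        # 40% events after oversampling

def adjust_to_population(p_sample):
    """Rescale a probability from the oversampled scale to the population scale."""
    # Weight each class by (population prior / sample proportion), then renormalize.
    event_term = p_sample * POPULATION_EVENT_RATE / SAMPLE_EVENT_RATE
    non_event_term = (1 - p_sample) * (1 - POPULATION_EVENT_RATE) / (1 - SAMPLE_EVENT_RATE)
    return event_term / (event_term + non_event_term)

for score in (0.10, 0.40, 0.90):
    adjusted = adjust_to_population(score)
    print(f"{score:.2f} on the sample scale maps to {adjusted:.4f}")
```
&lt;/PRE&gt;
&lt;P&gt;As a sanity check, a score equal to the sample event rate (0.40) maps back to the population event rate (0.008).&lt;/P&gt;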
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You don't have a lot of observations, so depending on how many variables you have, you might find that one or more of the methods described above gives you more stable results.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;I hope this helps!&lt;/P&gt;
&lt;P&gt;Doug&lt;/P&gt;</description>
      <pubDate>Mon, 14 Aug 2017 19:47:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Change-to-Oversampling-seed-creates-different-results/m-p/387940#M5801</guid>
      <dc:creator>DougWielenga</dc:creator>
      <dc:date>2017-08-14T19:47:49Z</dc:date>
    </item>
  </channel>
</rss>

