<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Oversample: what about non-event sample? in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661732#M31588</link>
    <description>&lt;P&gt;&lt;EM&gt;&lt;FONT size="4"&gt;Thanks for your suggestion&lt;/FONT&gt;.&lt;/EM&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Arial',sans-serif;"&gt;2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;FONT face="arial,helvetica,sans-serif" size="4"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Arial',sans-serif;"&gt;Stratification on predictors is tricky. At the variable-selection stage of modeling there may be 50 or more predictors. If you bin each of them into 10 bins, full cross-stratification gives 10^50 cells, so this 1% of the data would need at least 10^50 observations, which is an astronomical number. &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial, sans-serif"&gt;&lt;SPAN style="font-size: 13.3333px;"&gt;&lt;I&gt;So it is more practical to compare the distributions in the 1% and 98% samples and to resample if any of the predictors does not pass the test.&lt;/I&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 17 Jun 2020 13:35:56 GMT</pubDate>
    <dc:creator>SPR</dc:creator>
    <dc:date>2020-06-17T13:35:56Z</dc:date>
    <item>
      <title>Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661718#M31586</link>
      <description>&lt;P&gt;Suppose we have a case with a low event rate, say 1%. In 1:1 oversampling we select all events (1% of the data) and an equal number of non-events (another 1% of the data). This raises two important questions:&lt;/P&gt;
&lt;P&gt;How can we guarantee that this 1% of non-events is a representative sample of the remaining 98% of non-events?&lt;/P&gt;
&lt;P&gt;What procedures can be used to test that the distributions of the predictors are the same in the 1% and 98% non-event samples?&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jun 2020 12:34:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661718#M31586</guid>
      <dc:creator>SPR</dc:creator>
      <dc:date>2020-06-17T12:34:21Z</dc:date>
    </item>
    <item>
      <title>Re: Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661727#M31587</link>
      <description>&lt;P&gt;This is what I usually do when developing models within the financial sector (mainly scorecards to predict risk of default):&lt;/P&gt;
&lt;P&gt;1. Use random sampling to select the 1% of non-events; in most cases that gives you a representative sample.&lt;/P&gt;
&lt;P&gt;2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...&lt;/P&gt;
&lt;P&gt;3. I usually do the sampling with PROC SURVEYSELECT.&lt;/P&gt;
&lt;P&gt;4. I usually check the 1% sample against the entire "non-events" population; one simple approach is to compare the distributions of key variables, chosen with a similar rationale as in step (2). Differences can be assessed with either a Chi-square test or Information Value.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jun 2020 13:02:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661727#M31587</guid>
      <dc:creator>pvareschi</dc:creator>
      <dc:date>2020-06-17T13:02:29Z</dc:date>
    </item>
    <item>
      <title>Re: Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661732#M31588</link>
      <description>&lt;P&gt;&lt;EM&gt;&lt;FONT size="4"&gt;Thanks for your suggestion&lt;/FONT&gt;.&lt;/EM&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Arial',sans-serif;"&gt;2. As an alternative, you may want to stratify on important segmentation variables/factors. The choice of which variables to use depends on the context of your problem/analysis; for example: region, gender, age group, customer type...&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;FONT face="arial,helvetica,sans-serif" size="4"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Arial',sans-serif;"&gt;Stratification on predictors is tricky. At the variable-selection stage of modeling there may be 50 or more predictors. If you bin each of them into 10 bins, full cross-stratification gives 10^50 cells, so this 1% of the data would need at least 10^50 observations, which is an astronomical number. &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;
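&lt;P&gt;The arithmetic above can be checked with a short Python sketch. The 50 predictors and 10 bins are the figures assumed in the paragraph; the population of 10 million rows is a hypothetical size chosen only for illustration.&lt;/P&gt;

```python
# Number of cells when every predictor is binned and all bins are crossed.
n_predictors = 50   # assumed number of candidate predictors
n_bins = 10         # assumed number of bins per predictor

n_cells = n_bins ** n_predictors
print(f"cells in the full cross-classification: {n_cells:.1e}")

# Hypothetical population of 10 million rows, of which 1% is sampled.
population_rows = 10_000_000
sample_rows = population_rows // 100

# Largest number of fully crossed predictors for which the sample
# could still place at least one row in every cell.
max_predictors = 0
while sample_rows >= n_bins ** (max_predictors + 1):
    max_predictors += 1

print("sample rows:", sample_rows)
print("fully crossed predictors the sample could cover:", max_predictors)
```

&lt;P&gt;With these hypothetical sizes the sample could cover at most five fully crossed 10-bin predictors, far short of 50.&lt;/P&gt;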
&lt;P&gt;&lt;FONT face="Arial, sans-serif"&gt;&lt;SPAN style="font-size: 13.3333px;"&gt;&lt;I&gt;So it is more practical to compare the distributions in the 1% and 98% samples and to resample if any of the predictors does not pass the test.&lt;/I&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jun 2020 13:35:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661732#M31588</guid>
      <dc:creator>SPR</dc:creator>
      <dc:date>2020-06-17T13:35:56Z</dc:date>
    </item>
    <item>
      <title>Re: Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661734#M31589</link>
      <description>&lt;P&gt;Of course, a high number of predictors would make things very difficult to manage. Just to give you an idea, I have worked on projects with between 500 and 1000 predictors!&lt;/P&gt;
&lt;P&gt;In those situations, I still find it useful to apply stratified sampling with only 3 to 5 key variables (i.e. inputs that are known/expected to be very important, not only statistically but also from a business/problem-context point of view).&lt;/P&gt;
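&lt;P&gt;As a rough illustration of stratifying on a few key variables, here is a minimal Python sketch (in SAS this would normally be done with PROC SURVEYSELECT, as mentioned earlier in the thread). The variables region and customer_type, the synthetic data, and the helper stratified_sample are all hypothetical and exist only for this example.&lt;/P&gt;

```python
import random
from collections import defaultdict

def stratified_sample(rows, keys, fraction, seed=0):
    """Draw the same fraction from every stratum defined by `keys`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in keys)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so rare levels are not lost.
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

# Synthetic non-event population with two hypothetical key variables.
rng = random.Random(42)
non_events = [
    {"region": rng.choice(["north", "south", "east"]),
     "customer_type": rng.choice(["retail", "business"])}
    for _ in range(60_000)
]

sample = stratified_sample(non_events, ["region", "customer_type"], fraction=0.01)
print(len(sample), "rows sampled from", len(non_events))
```

&lt;P&gt;Because every stratum is sampled at the same rate, the joint distribution of the key variables in the sample matches the population up to rounding.&lt;/P&gt;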
&lt;P&gt;Something I forgot to mention in my first reply: before running statistical tests, I would recommend comparing the distributions visually, using simple histograms/bar charts. If the charts look similar, that may well be enough (i.e. you need to worry only when the distributions are clearly different).&lt;/P&gt;
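&lt;P&gt;A minimal Python sketch of that check for one categorical predictor: it prints the population and sample shares side by side with a crude text bar chart, then computes a Pearson chi-square goodness-of-fit statistic. The data, the levels A/B/C, and the helper names are hypothetical; the 5% critical value 5.991 applies only to this three-level example (2 degrees of freedom).&lt;/P&gt;

```python
import random
from collections import Counter

def level_shares(values):
    """Share of each category level in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {level: counts[level] / total for level in sorted(counts)}

def chi_square_gof(sample_values, population_values):
    """Pearson goodness-of-fit statistic of the sample against population shares."""
    expected_share = level_shares(population_values)
    observed = Counter(sample_values)
    n = len(sample_values)
    return sum(
        (observed[level] - n * share) ** 2 / (n * share)
        for level, share in expected_share.items()
    )

# Synthetic, skewed categorical predictor with hypothetical levels.
rng = random.Random(7)
population = rng.choices(["A", "B", "C"], weights=[70, 25, 5], k=100_000)
sample = rng.sample(population, 1_000)

pop_shares = level_shares(population)
smp_shares = level_shares(sample)
for level in pop_shares:
    share = smp_shares.get(level, 0.0)
    # Text bar chart: one '#' per two percentage points of the sample share.
    print(f"{level}  population {pop_shares[level]:6.1%}  sample {share:6.1%}  "
          + "#" * round(share * 50))

stat = chi_square_gof(sample, population)
# 5.991 is the 5% critical value of the chi-square distribution with 2 df.
print("chi-square statistic:", round(stat, 3), "resample:", stat > 5.991)
```

&lt;P&gt;With many predictors, some tests will fail by chance alone, so a flagged predictor is a prompt to look at the chart rather than proof of a bad sample.&lt;/P&gt;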
&lt;P&gt;Last thing: issues with sampling are most likely to affect predictors with skewed distributions, especially categorical ones. Therefore, before doing the sampling, make a note of categorical inputs with rare levels. Possible workarounds are:&lt;/P&gt;
&lt;P&gt;1. Collapse levels before sampling&lt;/P&gt;
&lt;P&gt;2. Increase the % sampled: instead of using a 50-50 split, you may want to take all 1% of events and 2% or 3% of non-events (i.e. 1:2 or 1:3 sampling ratio)&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jun 2020 13:45:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/661734#M31589</guid>
      <dc:creator>pvareschi</dc:creator>
      <dc:date>2020-06-17T13:45:52Z</dc:date>
    </item>
    <item>
      <title>Re: Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663146#M31602</link>
      <description>&lt;P&gt;1) Since the 1% is a simple random sample from the 98%, I would expect the 1% to represent the 98%.&lt;BR /&gt;But I would like to sample more data from the 98%, like 5% (i.e. good:bad = 4:1 or 3:1).&lt;BR /&gt;I remember there is a paper that tries to use cluster analysis to draw a better 1% sample to represent the 98%.&lt;/P&gt;
&lt;P&gt;2) That is why you need the PProb= option of the MODEL statement to adjust the predicted probabilities.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Jun 2020 12:51:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663146#M31602</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2020-06-18T12:51:46Z</dc:date>
    </item>
    <item>
      <title>Re: Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663150#M31603</link>
      <description>&lt;P&gt;Thanks for your information.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;I remembered there is a paper trying to use Cluster Analysis to sample better 1% to represent 98%.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Could you please give a reference to this paper?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 18 Jun 2020 13:02:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663150#M31603</guid>
      <dc:creator>SPR</dc:creator>
      <dc:date>2020-06-18T13:02:07Z</dc:date>
    </item>
    <item>
      <title>Re: Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663154#M31605</link>
      <description>&lt;P&gt;Sorry, I lost it.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I remember correctly, it simply draws the 1% sample from the centers of the clusters.&lt;/P&gt;
&lt;P&gt;Or&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13633"&gt;@StatDave&lt;/a&gt;&amp;nbsp; &amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13684"&gt;@Rick_SAS&lt;/a&gt;&amp;nbsp; may have some clue.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Jun 2020 13:21:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663154#M31605</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2020-06-18T13:21:13Z</dc:date>
    </item>
    <item>
      <title>Re: Oversample: what about non-event sample?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663158#M31606</link>
      <description>Do you mean clustering observations or variables? &lt;BR /&gt;I can imagine the following approach: cluster the predictors and select the best predictor (minimum 1-R^2) from each cluster, which could dramatically reduce the number of potential predictors. Then create a 1% sample of non-events stratified by those best predictors.</description>
      <pubDate>Thu, 18 Jun 2020 13:29:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Oversample-what-about-non-event-sample/m-p/663158#M31606</guid>
      <dc:creator>SPR</dc:creator>
      <dc:date>2020-06-18T13:29:22Z</dc:date>
    </item>
  </channel>
</rss>

