<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478770#M7188</link>
    <description>&lt;P&gt;In my opinion, you'll need to stratify on some additional variables rather than take a simple random sample.&lt;/P&gt;
&lt;P&gt;Judges in particular: because different judges are biased in certain directions, you'll want each judge to be equally represented in the three data sets to balance that out.&lt;/P&gt;</description>
    <pubDate>Tue, 17 Jul 2018 19:06:27 GMT</pubDate>
    <dc:creator>Reeza</dc:creator>
    <dc:date>2018-07-17T19:06:27Z</dc:date>
    <item>
      <title>DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478768#M7187</link>
      <description>&lt;P&gt;Good day&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I need to develop a logistic regression model with 12 variables and 1 target variable. The target variable is the court hearing outcome: a guilty or not guilty verdict.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What is the best partitioning ratio for splitting a data set of 1600 records into test and validation parts?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What is the best method to handle an imbalanced target variable, with a split of "guilty verdict" = 1380 and "not guilty verdict" = 240?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jul 2018 19:02:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478768#M7187</guid>
      <dc:creator>Sachin51</dc:creator>
      <dc:date>2018-07-17T19:02:12Z</dc:date>
    </item>
    <item>
      <title>Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478770#M7188</link>
      <description>&lt;P&gt;In my opinion, you'll need to stratify on some additional variables rather than take a simple random sample.&lt;/P&gt;
&lt;P&gt;Judges in particular: because different judges are biased in certain directions, you'll want each judge to be equally represented in the three data sets to balance that out.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jul 2018 19:06:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478770#M7188</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2018-07-17T19:06:27Z</dc:date>
    </item>
    <item>
      <title>Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478772#M7189</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But what ratio should I use? 50/50 or 55/45 or 60/40?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jul 2018 19:08:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478772#M7189</guid>
      <dc:creator>Sachin51</dc:creator>
      <dc:date>2018-07-17T19:08:48Z</dc:date>
    </item>
    <item>
      <title>Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478780#M7190</link>
      <description>&lt;P&gt;Take a look at some of your breakdowns and see what's feasible. It also depends on how many variables you're using. For example, if you have 30 variables, you need roughly 25 observations per variable to get a good estimate, so the minimum size for any data set should be 750. That is too big for your data set, so you'd have to reduce the number of variables or get more observations.&lt;/P&gt;
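&lt;P&gt;As a rough illustration of the rule of thumb above, the sketch below computes the minimum data set size for a given number of variables. The function name min_rows is hypothetical, and the default of 25 observations per variable is only the informal figure quoted in this post, not a formal requirement.&lt;/P&gt;
&lt;PRE&gt;
```python
def min_rows(n_variables, per_variable=25):
    """Minimum rows per data set under the informal 25-per-variable rule."""
    # Assumption: the rule is a simple product of the two numbers.
    return n_variables * per_variable


# 30 variables, as in the example above
print(min_rows(30))
# 12 variables, as in the original question
print(min_rows(12))
```
&lt;/PRE&gt;
&lt;P&gt;Comparing these figures with the size of each partition is a quick way to see whether a given split leaves enough observations in the smallest data set.&lt;/P&gt;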
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;AFAIK, there isn't a hard and fast rule for splitting the data, though 60/20/20 is what I've seen a lot these days. Don't forget the test data set, and make sure to use it only for testing. Even if you're doing cross-validation (CV), I would still recommend a neutral test data set.&lt;/P&gt;
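&lt;P&gt;A 60/20/20 split that keeps the class proportions the same in every partition could be sketched as below. This is plain Python rather than SAS, and the function name stratified_split, the record layout, and the seed are all assumptions made for illustration. The records are placeholders that only mirror the class counts from the original question.&lt;/P&gt;
&lt;PRE&gt;
```python
import random
from collections import defaultdict


def stratified_split(records, key, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split records into train/validation/test, one stratum at a time."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    train, valid, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_valid = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        valid.extend(group[n_train:n_train + n_valid])
        # Whatever is left in the stratum goes to the test partition.
        test.extend(group[n_train + n_valid:])
    return train, valid, test


# Hypothetical placeholder records with the class counts from the question
records = [{"id": i, "verdict": "guilty"} for i in range(1380)]
records += [{"id": 1380 + i, "verdict": "not guilty"} for i in range(240)]

train, valid, test = stratified_split(records, key=lambda r: r["verdict"])
print(len(train), len(valid), len(test))
```
&lt;/PRE&gt;
&lt;P&gt;To stratify on judges as well, the key could return a tuple such as (judge, verdict), assuming each record carries a judge field, which is hypothetical here.&lt;/P&gt;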
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/221558"&gt;@Sachin51&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Thanks.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But what ratio should I use? 50/50 or 55/45 or 60/40?&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Tue, 17 Jul 2018 19:19:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478780#M7190</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2018-07-17T19:19:30Z</dc:date>
    </item>
    <item>
      <title>Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478787#M7191</link>
      <description>&lt;P&gt;I only need to split the data into train and validation data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I only have 12 variables.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jul 2018 19:23:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/478787#M7191</guid>
      <dc:creator>Sachin51</dc:creator>
      <dc:date>2018-07-17T19:23:41Z</dc:date>
    </item>
    <item>
      <title>Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/479850#M7197</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;
&lt;P&gt;The not guilty class (240 records) is the minority, at around 15% of the data; the rest are guilty.&lt;/P&gt;
&lt;P&gt;I suggest you use stratified sampling first and then partition the data set accordingly (55:20:25, test:valid:train).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. Random under-sampling:&lt;/P&gt;
&lt;P&gt;Take around 23% of the guilty (majority) observations without replacement and merge them with all of the not guilty observations.&lt;/P&gt;
&lt;P&gt;Thus: 320 guilty + 240 not guilty = 560 total observations, and the not guilty rate becomes about 42.9%.&lt;/P&gt;
&lt;P&gt;This is not very satisfactory on its own, but you can use it as a benchmark.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. Random over-sampling:&lt;/P&gt;
&lt;P&gt;Sample the minority class with replacement.&lt;/P&gt;
&lt;P&gt;One approach is to double the not guilty instances, i.e. duplicate those records; the total data set will then have a not guilty rate of about 26%. But beware: this could cause over-fitting.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;3. Use SMOTE (Synthetic Minority Over-sampling Technique).&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jul 2018 11:06:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/479850#M7197</guid>
      <dc:creator>sachinkalra</dc:creator>
      <dc:date>2018-07-20T11:06:31Z</dc:date>
    </item>
    <item>
      <title>Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/480053#M7198</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;From the original 1600 records, I need to select an 80% sample and then split it into train and validation data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So what would be the best splitting ratio?&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jul 2018 19:48:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/480053#M7198</guid>
      <dc:creator>Sachin51</dc:creator>
      <dc:date>2018-07-20T19:48:18Z</dc:date>
    </item>
    <item>
      <title>Re: DATA PARTITION RATIO AND BINARY TARGET VARIABLE</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/480402#M7204</link>
      <description>&lt;P&gt;To take a sample containing 80% of the observations, if you want both target classes to be represented equally, I believe you should try what I mentioned above. Otherwise, you can start with the following Train:Validation:Test ratios:&lt;/P&gt;&lt;P&gt;1. 40:30:30 (SAS default)&lt;/P&gt;&lt;P&gt;2. 45:25:30&lt;/P&gt;&lt;P&gt;3. 50:25:25&lt;/P&gt;&lt;P&gt;4. 50:30:20&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Try these and see which split gives you a better generalized model.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Jul 2018 12:42:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/DATA-PARTITION-RATIO-AND-BINARY-TARGET-VARIABLE/m-p/480402#M7204</guid>
      <dc:creator>sachinkalra</dc:creator>
      <dc:date>2018-07-23T12:42:20Z</dc:date>
    </item>
  </channel>
</rss>

