<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How do i split my dataset into 70% training , 30% testing ? in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131868#M1145</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Dear all,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;I have a dataset in &lt;STRONG&gt;csv&lt;/STRONG&gt; format. I am looking for a way/tool to &lt;STRONG&gt;randomly&lt;/STRONG&gt; divide it into &lt;STRONG&gt;70%&lt;/STRONG&gt; for training and &lt;STRONG&gt;30%&lt;/STRONG&gt; for testing, so that both subsets are random samples from the same distribution. I adopt 70%/30% because it seems to be a common rule of thumb.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Any suggestions, methods, or guides? Or should I use EG or EM?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Thank you. &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;&lt;BR /&gt;Regards,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;YL&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Mon, 11 Mar 2013 02:27:05 GMT</pubDate>
    <dc:creator>cody_q</dc:creator>
    <dc:date>2013-03-11T02:27:05Z</dc:date>
    <item>
      <title>How do i split my dataset into 70% training , 30% testing ?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131868#M1145</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Dear all,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;I have a dataset in &lt;STRONG&gt;csv&lt;/STRONG&gt; format. I am looking for a way/tool to &lt;STRONG&gt;randomly&lt;/STRONG&gt; divide it into &lt;STRONG&gt;70%&lt;/STRONG&gt; for training and &lt;STRONG&gt;30%&lt;/STRONG&gt; for testing, so that both subsets are random samples from the same distribution. I adopt 70%/30% because it seems to be a common rule of thumb.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Any suggestions, methods, or guides? Or should I use EG or EM?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Thank you. &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;&lt;BR /&gt;Regards,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;YL&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 11 Mar 2013 02:27:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131868#M1145</guid>
      <dc:creator>cody_q</dc:creator>
      <dc:date>2013-03-11T02:27:05Z</dc:date>
    </item>
    <item>
      <title>Re: How do i split my dataset into 70% training , 30% testing ?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131869#M1146</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Simply add&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG style="font-size: 12pt; font-family: calibri, verdana, arial, sans-serif;"&gt;if ranuni(12345) &amp;lt; 0.7 then set = "TRAINING";&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG style="font-size: 12pt; font-family: calibri, verdana, arial, sans-serif;"&gt;else set = "TESTING";&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;to create a new variable as you read your dataset.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;PG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 11 Mar 2013 02:53:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131869#M1146</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2013-03-11T02:53:12Z</dc:date>
    </item>
    <item>
      <title>Re: How do i split my dataset into 70% training , 30% testing ?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131870#M1147</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi PGStats,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;How could I use the above code to create a new variable?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 11 Mar 2013 10:20:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131870#M1147</guid>
      <dc:creator>cody_q</dc:creator>
      <dc:date>2013-03-11T10:20:17Z</dc:date>
    </item>
    <item>
      <title>Re: How do i split my dataset into 70% training , 30% testing ?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131871#M1148</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Those statements would be added to a DATA step to create a new character variable called &lt;STRONG&gt;set&lt;/STRONG&gt; that takes the value &lt;STRONG&gt;TRAINING&lt;/STRONG&gt; at random for roughly 70% of observations and the value &lt;STRONG&gt;TESTING&lt;/STRONG&gt; otherwise.&lt;BR /&gt; &lt;/P&gt;&lt;P&gt;PG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 11 Mar 2013 14:01:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131871#M1148</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2013-03-11T14:01:22Z</dc:date>
    </item>
    <item>
      <title>Re: How do i split my dataset into 70% training , 30% testing ?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131872#M1149</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Well, if you have EM, then splitting the data into Training and Testing is trivial.&amp;nbsp; It is a default feature when creating your SAS data in EM.&amp;nbsp; You can also use a Data Partition node.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 09 Apr 2013 15:22:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131872#M1149</guid>
      <dc:creator>jaredp</dc:creator>
      <dc:date>2013-04-09T15:22:06Z</dc:date>
    </item>
    <item>
      <title>Re: How do i split my dataset into 70% training , 30% testing ?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131873#M1150</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If you're really interested in splitting a csv file into two csv files, there is no need to create a SAS data set along the way.&amp;nbsp; Here's one approach:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;filename csvfile 'path to existing csv file';&lt;/P&gt;&lt;P&gt;filename train 'path to a training subset';&lt;/P&gt;&lt;P&gt;filename test 'path to a testing subset';&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data _null_;&lt;/P&gt;&lt;P&gt;&amp;nbsp; infile csvfile;&lt;/P&gt;&lt;P&gt;&amp;nbsp; input @;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if ranuni(12345) &amp;lt; 0.7 then file train;&lt;/P&gt;&lt;P&gt;&amp;nbsp; else file test;&lt;/P&gt;&lt;P&gt;&amp;nbsp; put _infile_;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The drawback is that the split will be approximately 70/30, not exact, and a header row, if the file has one, is assigned to one of the two files like any other line.&amp;nbsp; If you really want to create a SAS data set from the csv file first, there are many alternatives including PROC SURVEYSELECT.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Good luck.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 09 Apr 2013 16:02:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-do-i-split-my-dataset-into-70-training-30-testing/m-p/131873#M1150</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2013-04-09T16:02:01Z</dc:date>
    </item>
  </channel>
</rss>

