<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to split data into representative subsets in SAS Academy for Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/How-to-split-data-into-representative-subsets/m-p/445642#M89</link>
    <description>&lt;P&gt;Hi Luhan,&lt;/P&gt;
&lt;P&gt;I forwarded your post to my colleagues. Several replies stated that you can use the SAMPRATE in PROC SURVEYSELECT. Another suggested&amp;nbsp;if you have access to High Performance procedures to try&amp;nbsp;PROC HPSAMPLE.&amp;nbsp;And the final suggestion was this blog:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://blogs.sas.com/content/sastraining/2017/01/23/stratified-random-sample-whats-efficient/" target="_blank"&gt;https://blogs.sas.com/content/sastraining/2017/01/23/stratified-random-sample-whats-efficient/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;I hope this helps,&lt;/P&gt;
&lt;P&gt;theresa&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 14 Mar 2018 21:21:45 GMT</pubDate>
    <dc:creator>TheresaStemler</dc:creator>
    <dc:date>2018-03-14T21:21:45Z</dc:date>
    <item>
      <title>How to split data into representative subsets</title>
      <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/How-to-split-data-into-representative-subsets/m-p/444543#M88</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;my goal is to create training and testing subsets that are representative of the original data set.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Situation&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example, the original data &lt;FONT color="#3366FF"&gt;&lt;EM&gt;have&lt;/EM&gt; &lt;/FONT&gt;is made up by 80% of data from the US (region="US") and by 20% of data from Asia (region="Asia").&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have(drop=i);
	do i=1 to 10;
		if i&amp;gt;8 then region = "Asia";
		       else region = "US";
		output;
	end;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;I now want to randomly split &lt;FONT color="#3366FF"&gt;&lt;EM&gt;have&lt;/EM&gt; &lt;/FONT&gt;in 2 subsets each also being made up by 80% of data from the US and by 20% of data from Asia.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Solution so far&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Apparently, there is a comprehensive SAS procedure called SURVEYSELECT to handle this task. This is the best solution I was able to come up with to get the job done (splitting &lt;FONT color="#3366FF"&gt;&lt;EM&gt;have&lt;/EM&gt; &lt;/FONT&gt;into &lt;FONT color="#3366FF"&gt;&lt;EM&gt;want1&lt;/EM&gt; &lt;/FONT&gt;and &lt;FONT color="#3366FF"&gt;&lt;EM&gt;want2&lt;/EM&gt;&lt;/FONT&gt;; to keep things simple, a splitting ratio of 50% was applied):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* BEFORE surveyselect... */

/* (1) we need number of obs */
%let dsid  = %sysfunc(open(have));
%let nobs  = %sysfunc(attrn(&amp;amp;dsid, nobs));
%let close = %sysfunc(close(&amp;amp;dsid.));

/* (2) we need to sort the data */
proc sql noprint;
	create view haveV as select * from have
	order by region;
quit;


proc surveyselect noprint
	data     = haveV
	out      = have2
	outall
	sampsize = %sysevalf(&amp;amp;nobs.*.5)
	seed     = 100
;
	strata region / alloc = prop;
run;


/* AFTER surveyselect... */

/* ...we need to split the data set ourselves */
data want1 want2;
	set have2;
	if Selected then output want1;
	            else output want2;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Question&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there a better approach (or a better way to use SURVEYSELECT) to get the splitting job done?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Considering that this splitting job is a frequently occuring task that can be accomplished with a few lines of code in other languages, there should be a better solution in SAS that doesn't suffer from the following short comings of the solution I found:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;I couldn't see how to pass a splitting ratio to SURVEYSELECT (samprate doesn't work here). The procedure rather needs the absolute number of obs for a sample. So, in order to apply a splitting ratio of x% (here: 50%), I first have to determine the number of obs in the original data set and compute the number of obs for the subset (here: &amp;amp;nobs.*.5).&lt;/LI&gt;&lt;LI&gt;To ensure that SURVEYSELECT works with any data set, one would actually need to temporarily rename the variables in the original data set (haven't done this here). For example, one of my data sets already has a variable called "SELECTED", so in this case SURVEYSELECT would issue a WARNING and couldn't compute the column "SELECTED".&amp;nbsp; This is a frequent problem when I'm working with SAS and I've already written a macro that temporarily renames all variables of a data set to "___var1", "___var2" ... "___varN", but still..&lt;/LI&gt;&lt;LI&gt;After using SURVEYSELECT, (a) a copy of the original data set (which can be quite large) is produced (instead of just directly creating the 2 subsets) and (b) another whole pass through the newly created data is required to create the subsets.&lt;/LI&gt;&lt;LI&gt;The input data has to be sorted. That means, if it's (technically) not possible to use a view (like I did here), one has to either sort the original data set or (temporarily) create a copy of it.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thank you very much!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;BR&lt;/P&gt;&lt;P&gt;Lu&lt;/P&gt;</description>
      <pubDate>Sun, 11 Mar 2018 13:04:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/How-to-split-data-into-representative-subsets/m-p/444543#M88</guid>
      <dc:creator>Luhan</dc:creator>
      <dc:date>2018-03-11T13:04:11Z</dc:date>
    </item>
    <item>
      <title>Re: How to split data into representative subsets</title>
      <link>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/How-to-split-data-into-representative-subsets/m-p/445642#M89</link>
      <description>&lt;P&gt;Hi Luhan,&lt;/P&gt;
&lt;P&gt;I forwarded your post to my colleagues. Several replies stated that you can use the SAMPRATE in PROC SURVEYSELECT. Another suggested&amp;nbsp;if you have access to High Performance procedures to try&amp;nbsp;PROC HPSAMPLE.&amp;nbsp;And the final suggestion was this blog:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://blogs.sas.com/content/sastraining/2017/01/23/stratified-random-sample-whats-efficient/" target="_blank"&gt;https://blogs.sas.com/content/sastraining/2017/01/23/stratified-random-sample-whats-efficient/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;I hope this helps,&lt;/P&gt;
&lt;P&gt;theresa&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Mar 2018 21:21:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Academy-for-Data-Science/How-to-split-data-into-representative-subsets/m-p/445642#M89</guid>
      <dc:creator>TheresaStemler</dc:creator>
      <dc:date>2018-03-14T21:21:45Z</dc:date>
    </item>
  </channel>
</rss>

