<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Random sampling with Proc Survey Select in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Random-sampling-with-Proc-Survey-Select/m-p/417540#M102570</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I previously generated two stratified random samples, Train and Validation data sets, using a fixed seed value (=1234).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, on a subsequent run of the same code, I am unable to replicate the same data-sets exactly.&lt;/P&gt;
&lt;P&gt;I find that on this second run, some records from the original Train set are now in Validation and vice versa. The number of rows in each Train and Validation pair stays constant, though (12,475 and 3,098 rows each time).&lt;/P&gt;
&lt;P&gt;Could this be due to some weird sort order in the input data being different? I can't think of anything else.&lt;/P&gt;
&lt;P&gt;Here is roughly the code I used:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=overall_new;
by A B C D; /* A,B,C,D are strata variables */
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc surveyselect data=overall_new
out=sorted_intime
noprint
seed=1234
method=srs
samprate=80
outall;
strata A B C D; 
run;
/*Train data - 12475 rows*/
data local.intime_TR0516;
set sorted_intime;
if selected =1;
run;
/*Validation data - 3098 rows*/
data local.intime_VAL0516;
set sorted_intime;
if selected =0;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 30 Nov 2017 20:09:44 GMT</pubDate>
    <dc:creator>nstdt</dc:creator>
    <dc:date>2017-11-30T20:09:44Z</dc:date>
    <item>
      <title>Random sampling with Proc Survey Select</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Random-sampling-with-Proc-Survey-Select/m-p/417540#M102570</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I previously generated two stratified random samples, Train and Validation data sets, using a fixed seed value (=1234).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, on a subsequent run of the same code, I am unable to replicate the same data-sets exactly.&lt;/P&gt;
&lt;P&gt;I find that on this second run, some records from the original Train set are now in Validation and vice versa. The number of rows in each Train and Validation pair stays constant, though (12,475 and 3,098 rows each time).&lt;/P&gt;
&lt;P&gt;Could this be due to some weird sort order in the input data being different? I can't think of anything else.&lt;/P&gt;
&lt;P&gt;Here is roughly the code I used:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=overall_new;
by A B C D; /* A,B,C,D are strata variables */
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc surveyselect data=overall_new
out=sorted_intime
noprint
seed=1234
method=srs
samprate=80
outall;
strata A B C D; 
run;
/*Train data - 12475 rows*/
data local.intime_TR0516;
set sorted_intime;
if selected =1;
run;
/*Validation data - 3098 rows*/
data local.intime_VAL0516;
set sorted_intime;
if selected =0;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2017 20:09:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Random-sampling-with-Proc-Survey-Select/m-p/417540#M102570</guid>
      <dc:creator>nstdt</dc:creator>
      <dc:date>2017-11-30T20:09:44Z</dc:date>
    </item>
    <item>
      <title>Re: Random sampling with Proc Survey Select</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Random-sampling-with-Proc-Survey-Select/m-p/417545#M102573</link>
      <description>&lt;P&gt;If the base data set was manipulated in a way that changed its row order, then sorting by the strata variables could well leave the records in a different order, because sorting by the strata variables alone does not fix the order of records within a stratum. With the same seed, the selection should pick the same row positions within each stratum, so if the order differs, even slightly, different records occupy those positions and the resulting selection is different.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;See this brief example:&lt;/P&gt;
&lt;PRE&gt;data example;
   input strata value;
datalines;
1  1
2 2
1  4
2 6
1  5
2 18
1  34
2 0
;
run;

proc sort data=example;
  by strata;
run;
proc print data=example;
  title "Order after first sort";
run;  title;

proc sort data=example;
  by descending value;
run;

proc sort data=example;
  by strata;
run;

proc print data=example;
  title "Order after other sorts";
run;  title;&lt;/PRE&gt;
&lt;P&gt;As an exercise for the interested reader: select 2 records per stratum from the data set in each of the two orders and compare the selections.&lt;/P&gt;
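&lt;P&gt;A minimal sketch of that comparison, plus one possible remedy, might look like the following. The data set names order_a, temp_b, order_b, picked_a, picked_b and order_fixed are made up for illustration, and the remedy assumes the extra sort key (here value) is unique within each stratum, which happens to hold for this toy data but may not for yours.&lt;/P&gt;
&lt;PRE&gt;/* Sketch only: all output data set names below are hypothetical. */
data example;
   input strata value;
datalines;
1 1
2 2
1 4
2 6
1 5
2 18
1 34
2 0
;
run;

/* Order A: sort by strata only, so the order within a stratum
   depends on the order the records arrived in. */
proc sort data=example out=order_a;
   by strata;
run;

/* Order B: rearrange the records first, then sort by strata only. */
proc sort data=example out=temp_b;
   by descending value;
run;

proc sort data=temp_b out=order_b;
   by strata;
run;

/* Same seed and method, 2 records per stratum, from each order. */
proc surveyselect data=order_a out=picked_a noprint
                  seed=1234 method=srs sampsize=2;
   strata strata;
run;

proc surveyselect data=order_b out=picked_b noprint
                  seed=1234 method=srs sampsize=2;
   strata strata;
run;

/* Compare the two selections observation by observation. */
proc compare base=picked_a compare=picked_b;
run;

/* Possible remedy (assumption: value is unique within each stratum):
   add it as a tie-breaker so the sorted order is fully determined
   before sampling. */
proc sort data=example out=order_fixed;
   by strata value;
run;&lt;/PRE&gt;
&lt;P&gt;In the original question the same idea would mean adding a unique record identifier, if one exists, after A B C D in the BY statement of the PROC SORT step, while leaving the STRATA statement unchanged.&lt;/P&gt;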
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2017 20:17:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Random-sampling-with-Proc-Survey-Select/m-p/417545#M102573</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2017-11-30T20:17:57Z</dc:date>
    </item>
  </channel>
</rss>

