
Random sampling with Proc Survey Select

Contributor
Posts: 46

Hi,

I previously generated stratified random Train and Validation data sets using a fixed seed value (1234).

However, on a subsequent run of the same code, I am unable to replicate the same data-sets exactly.

On this second run, some records that were originally in the Train set are now in Validation, and vice versa. The number of rows in each set stays constant, though (each time it's 12,475 rows in Train and 3,098 in Validation).

Could this be due to some weird sort order in the input data being different? I can't think of anything else.

Here is roughly the code I used:

/* Sort by the strata variables A, B, C, D */
proc sort data=overall_new;
  by A B C D;
run;

/* Stratified 80% simple random sample; OUTALL keeps every row and flags it in SELECTED */
proc surveyselect data=overall_new
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  samprate=80
                  outall;
  strata A B C D;
run;

/* Train data - 12475 rows */
data local.intime_TR0516;
  set sorted_intime;
  if selected = 1;
run;

/* Validation data - 3098 rows */
data local.intime_VAL0516;
  set sorted_intime;
  if selected = 0;
run;


Accepted Solutions
Solution
‎12-06-2017 06:39 AM
Super User
Posts: 13,058

Re: Random sampling with Proc Survey Select

If the base data set was manipulated in a way that changed the order of its records, then sorting by the strata variables could well leave the records within each stratum in a different order than before, because the sort only fixes the order of the strata values themselves. With the same seed, the selection should pick the same row positions within each stratum, so when different records occupy those positions, even if the order changed only slightly, the resulting selection is different.
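One way to guard against this is sketched below. It assumes the data has a unique record key; record_id is a hypothetical name, not a variable from the original post. Adding that key to the end of the BY list fully determines the order within each stratum before sampling, so the same seed should then see the same records in the same positions on every run.

```sas
/* record_id is a hypothetical unique key; substitute whatever   */
/* uniquely identifies a row in overall_new.                      */
proc sort data=overall_new;
  by A B C D record_id;   /* strata first, then the tie-breaker  */
run;

/* Same call as in the question; only the sort above has changed. */
proc surveyselect data=overall_new
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  samprate=80
                  outall;
  strata A B C D;
run;
```

If no unique key exists, this approach does not apply as written; any set of variables that distinguishes every row within a stratum would serve the same purpose.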

 

This brief example shows how the order within each stratum can change:

data example;
   input strata value;
datalines;
1  1
2 2
1  4
2 6
1  5
2 18
1  34
2 0
;
run;

proc sort data=example;
  by strata;
run;
proc print data=example;
  title "Order after first sort";
run;  title;

proc sort data=example;
  by descending value;
run;

proc sort data=example;
  by strata;
run;

proc print data=example;
  title "Order after other sorts";
run;  title;

Exercise for the interested reader: select 2 records per stratum from the data set in each of the two orders to see the difference in selection.
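A minimal sketch of that exercise follows. It reuses the example data but writes each sort to its own data set; order1, order2, tmp, pick1 and pick2 are names introduced here for illustration, and the seed is simply the one from the question. Both PROC SURVEYSELECT calls use the same seed, method and sample size, so any difference between the two printed selections comes from the row order alone.

```sas
data example;
   input strata value;
datalines;
1  1
2 2
1  4
2 6
1  5
2 18
1  34
2 0
;
run;

/* Order 1: sort by strata only */
proc sort data=example out=order1;
  by strata;
run;

/* Order 2: reorder by value first, then sort by strata again */
proc sort data=example out=tmp;
  by descending value;
run;
proc sort data=tmp out=order2;
  by strata;
run;

/* Same seed and same request (2 records per stratum) for both orders */
proc surveyselect data=order1 out=pick1 seed=1234 method=srs n=2 noprint;
  strata strata;
run;
proc surveyselect data=order2 out=pick2 seed=1234 method=srs n=2 noprint;
  strata strata;
run;

proc print data=pick1;
  title "Selected from order 1";
run;
proc print data=pick2;
  title "Selected from order 2";
run;  title;
```

With only four records per stratum, the two selections can coincide by chance, so it is worth trying a few seeds before drawing a conclusion.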

 
