Hi,
I previously generated two stratified random sample data sets, Train and Validation, using a fixed seed value (=1234).
However, on a subsequent run of the same code, I am unable to replicate the same data-sets exactly.
I find that on this second run, some records from the original train set are now in validation and vice versa. The number of rows in the Train and Validation sets stays constant, though (each time it's 12,475 and 3,098 rows respectively).
Could this be due to some weird sort order in the input data being different? I can't think of anything else.
Here is roughly the code I used:
proc sort data=overall_new;
  by A B C D; /* A,B,C,D are strata variables */
run;

proc surveyselect data=overall_new
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  samprate=80
                  outall;
  strata A B C D;
run;

/* Train data - 12475 rows */
data local.intime_TR0516;
  set sorted_intime;
  if selected = 1;
run;

/* Validation data - 3098 rows */
data local.intime_VAL0516;
  set sorted_intime;
  if selected = 0;
run;
If the base data set was manipulated in a way that changed the order of its records, then sorting by the strata variables alone could well leave the records within each stratum in a different order than before. With the same seed, the selection should pick the same row positions within each stratum, so if different records now sit in those positions, even after only a slight reordering, the resulting selection is different.
See this brief example:
data example;
  input strata value;
  datalines;
1 1
2 2
1 4
2 6
1 5
2 18
1 34
2 0
;
run;

proc sort data=example;
  by strata;
run;

proc print data=example;
  title "Order after first sort";
run;
title;

proc sort data=example;
  by descending value;
run;

proc sort data=example;
  by strata;
run;

proc print data=example;
  title "Order after other sorts";
run;
title;
Exercise for the interested reader: select 2 records per stratum from the set in each of the two orders to see the difference in selection.
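For anyone who wants to try that exercise, here is a minimal sketch of one way it could be set up. It rebuilds the example data, keeps both orderings side by side, and draws 2 records per stratum from each with the same seed. The data set names order_a, order_b, tmp, pick_a and pick_b are made up for illustration, and whether the two selections actually differ depends on which row positions the seed happens to pick, so no particular output is claimed here.

```sas
/* Rebuild the example data in its original order */
data example;
  input strata value;
  datalines;
1 1
2 2
1 4
2 6
1 5
2 18
1 34
2 0
;
run;

/* Order A: a single sort by strata */
proc sort data=example out=order_a;
  by strata;
run;

/* Order B: sort by descending value first, then by strata */
proc sort data=example out=tmp;
  by descending value;
run;

proc sort data=tmp out=order_b;
  by strata;
run;

/* Same seed and same design for both orders: 2 records per stratum */
proc surveyselect data=order_a out=pick_a
                  seed=1234 method=srs sampsize=2 noprint;
  strata strata;
run;

proc surveyselect data=order_b out=pick_b
                  seed=1234 method=srs sampsize=2 noprint;
  strata strata;
run;

/* Print both selections to compare the values chosen in each stratum */
proc print data=pick_a;
  title "Selected from order A";
run;

proc print data=pick_b;
  title "Selected from order B";
run;
title;
```

If the data has a variable that uniquely identifies each record, adding it as the last BY variable in the PROC SORT step fixes the order within each stratum, so the same seed should then reach the same records on every run.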