Hi,
I previously generated stratified random Train and Validation data sets using a fixed seed value (1234).
However, on a subsequent run of the same code, I am unable to replicate the same data sets exactly.
On the second run, some records that were originally in the Train set are now in Validation, and vice versa. The row counts of the Train and Validation sets stay constant, though (12,475 and 3,098 rows, respectively, each time).
Could this be due to some weird sort order in the input data being different? I can't think of anything else.
Here is roughly the code I used:
proc sort data=overall_new;
by A B C D; /* A,B,C,D are strata variables */
run;
proc surveyselect data=overall_new
out=sorted_intime
noprint
seed=1234
method=srs
samprate=80
outall;
strata A B C D;
run;
/*Train data - 12475 rows*/
data local.intime_TR0516;
set sorted_intime;
if selected =1;
run;
/*Validation data - 3098 rows*/
data local.intime_VAL0516;
set sorted_intime;
if selected =0;
run;
If the base data set was manipulated in a way that changed its row order, then sorting by the strata variables alone could well leave the records within each stratum in a different order than before, because the sort only fixes the order of the BY variables. With the same seed, the procedure should pick the same row positions within each stratum, so if different records now occupy those positions, even if the change in order is slight, a different set of records is selected.
See this brief example:
data example;
   input strata value;
datalines;
1 1
2 2
1 4
2 6
1 5
2 18
1 34
2 0
;
run;

proc sort data=example;
   by strata;
run;

proc print data=example;
   title "Order after first sort";
run;
title;

proc sort data=example;
   by descending value;
run;

proc sort data=example;
   by strata;
run;

proc print data=example;
   title "Order after other sorts";
run;
title;
Exercise for the interested reader: select 2 records per stratum from the set in each of the two orders to see the difference in selection.
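For anyone who wants to try that exercise, here is a minimal sketch. It reuses the example data, keeps the two orderings in separate data sets, and runs PROC SURVEYSELECT on each with the same seed. The data set names order1, order2, sample1, sample2 and differences, and the variable source, are made up for illustration. With only four records per stratum, whether the two samples actually differ depends on which positions the seed happens to pick, so treat an empty differences data set as a coincidence of this small example rather than proof that order does not matter.

```sas
/* Same example data as above */
data example;
   input strata value;
datalines;
1 1
2 2
1 4
2 6
1 5
2 18
1 34
2 0
;
run;

/* Order 1: sort by the strata variable only */
proc sort data=example out=order1;
   by strata;
run;

/* Order 2: disturb the row order first, then sort by strata again */
proc sort data=example out=order2;
   by descending value;
run;
proc sort data=order2;
   by strata;
run;

/* Same seed, method and sample size for both runs */
proc surveyselect data=order1 out=sample1 noprint
                  seed=1234 method=srs sampsize=2;
   strata strata;
run;

proc surveyselect data=order2 out=sample2 noprint
                  seed=1234 method=srs sampsize=2;
   strata strata;
run;

/* Keep only the records that appear in one sample but not the other */
proc sort data=sample1; by strata value; run;
proc sort data=sample2; by strata value; run;

data differences;
   merge sample1(in=in1 keep=strata value)
         sample2(in=in2 keep=strata value);
   by strata value;
   if in1 ne in2;
   length source $8;
   source = ifc(in1, 'sample1', 'sample2');
run;

proc print data=differences;
   title "Records selected in only one of the two samples";
run;
title;
```

If the goal is a selection that can be reproduced later, one option is to add a variable that uniquely identifies each record to the end of the BY list in the PROC SORT step, so that the order within each stratum is fully determined before PROC SURVEYSELECT runs. In this example value happens to be unique, so "by strata value;" would do; in the original data it would have to be whatever record key is available, which I am only assuming exists.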