nstdt
Quartz | Level 8

Hi,

I previously generated two stratified random samples, a Train and a Validation data set, using a fixed seed value (seed=1234).

 

However, on a subsequent run of the same code, I am unable to replicate the same data-sets exactly.

On this second run, some records from the original Train set are now in Validation, and vice versa. The number of rows in each set stays constant, though: Train has 12,475 rows and Validation has 3,098 rows every time.

Could this be due to some weird sort order in the input data being different? I can't think of anything else.

Here is roughly the code I used:

proc sort data=overall_new;
  by A B C D; /* A, B, C, D are strata variables */
run;

proc surveyselect data=overall_new
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  samprate=80
                  outall;
  strata A B C D;
run;

/* Train data - 12475 rows */
data local.intime_TR0516;
  set sorted_intime;
  if selected = 1;
run;

/* Validation data - 3098 rows */
data local.intime_VAL0516;
  set sorted_intime;
  if selected = 0;
run;

 

Accepted Solutions
ballardw
Super User

If the base data set was manipulated in a way that changed its record order, then sorting by the strata variables could well leave the records within each stratum in a different order than before. With the same seed, the selection should pick the same row positions within each stratum, so when the data is in a different order, even slightly, different records land in those positions and the resulting selection is different.
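
One possible guard against this, as a minimal sketch: add a unique key as the last sort variable so that the order within each stratum no longer depends on how the input happened to be arranged. The variable record_id below is hypothetical; it stands for any variable (or combination of variables) that is unique for every row of overall_new, which is an assumption about the data and not something stated in the question.

```sas
/* Sketch only: record_id is a hypothetical variable assumed to be
   unique for every row of overall_new. */
proc sort data=overall_new;
  by A B C D record_id;  /* unique tie-breaker fixes the order within each stratum */
run;

/* Same sampling step as in the question; the data is still sorted
   by the strata variables A B C D, as the STRATA statement requires. */
proc surveyselect data=overall_new
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  samprate=80
                  outall;
  strata A B C D;
run;
```

With a fully determined sort order and the same seed, repeated runs should select the same records, provided the data values themselves have not changed. The order change itself is easiest to see on a small data set.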

 

See this brief example:

data example;
   input strata value;
datalines;
1  1
2 2
1  4
2 6
1  5
2 18
1  34
2 0
;
run;

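/* First order: sort by strata only */
proc sort data=example;
  by strata;
run;
proc print data=example;
  title "Order after first sort";
run;  title;

/* Rearrange the data, as an intermediate manipulation might */
proc sort data=example;
  by descending value;
run;

/* Sort by strata again: same BY variable, different record order */
proc sort data=example;
  by strata;
run;

proc print data=example;
  title "Order after other sorts";
run;  title;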

As an exercise for the interested reader: select 2 records per stratum from the data set in each of the two orders to see the difference in selection.
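
A minimal sketch of that exercise follows. It rebuilds the example data, keeps the two orders in separate data sets, and draws 2 records per stratum from each with the same seed. The data set names order1, order2, by_value, pick1 and pick2, and the seed value, are illustrative choices and not part of the original example.

```sas
data example;
  input strata value;
datalines;
1  1
2 2
1  4
2 6
1  5
2 18
1  34
2 0
;
run;

/* Order 1: sorted by strata only */
proc sort data=example out=order1;
  by strata;
run;

/* Order 2: rearranged by descending value, then sorted by strata */
proc sort data=example out=by_value;
  by descending value;
run;
proc sort data=by_value out=order2;
  by strata;
run;

/* Same seed and same request for both orders: 2 records per stratum */
proc surveyselect data=order1 out=pick1 noprint
                  seed=1234 method=srs sampsize=2;
  strata strata;
run;
proc surveyselect data=order2 out=pick2 noprint
                  seed=1234 method=srs sampsize=2;
  strata strata;
run;

title "Selected from order 1";
proc print data=pick1; run;
title "Selected from order 2";
proc print data=pick2; run;
title;
```

Comparing the two printouts shows which values were drawn from each order. Because the rows sit in different positions within each stratum, the selected values may differ even though the seed and the request are identical. With only four rows per stratum the two selections can also coincide by chance.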

 
