Solved: proc surveyselect provides same randomization

AJ_Brien · Posted 11-02-2019 02:17 PM

Hello,

I'm learning about PROC SURVEYSELECT. To understand the concept better, I'm doing an exercise in which I randomly pick 50% of the population and assign them to 'Test', then assign the remaining 50% to 'Control'. I used to use RANUNI for the randomization, but wanted to try it with SURVEYSELECT. Below is the sample code I'm running. My question: every time I run RANUNI I usually get different test and control samples, but that's not the case with SURVEYSELECT; the same accounts get selected for test and control each time. Is there a way to solve that?

Thank you!

data dat1;
  input ids $ acc $ prod vv;
  cards;
a  1 11 0
b  2 22 0
c  3 33 0
d  4 44 0
;
run;


proc surveyselect data = dat1 method = SRS 
  sampsize = 2 seed = 12345 out = hsbs1;
  id _all_;
run;

I tried changing the seed value by using RANUNI within SURVEYSELECT, but I get an error. I didn't think the seed value expected a dataset!?!

data _null_;
sasrand=ranuni(time());
run;

proc surveyselect data = dat1 method = SRS 
  sampsize = 2 seed = sasrand out = hsbs1;
  id _all_;
run;

Error in this case:

ERROR: File WORK.SASRAND.DATA does not exist.

FreelanceReinh · Posted 11-02-2019 03:20 PM

Hello @AJ_Brien,

@AJ_Brien wrote: My question: every time I run RANUNI I usually get different test and control samples, but that's not the case with SURVEYSELECT; the same accounts get selected for test and control each time. Is there a way to solve that?

You obtain repeatable results with positive seed values and varying results with nonpositive seed values. This is true for both RANUNI and PROC SURVEYSELECT.

I tried changing the seed value by using RANUNI within SURVEYSELECT, but I get an error. I didn't think the seed value expected a dataset!?!

data _null_;
sasrand=ranuni(time());
run;

proc surveyselect data = dat1 method = SRS 
  sampsize = 2 seed = sasrand out = hsbs1;
  id _all_;
run;

Error in this case:

ERROR: File WORK.SASRAND.DATA does not exist.

This technique doesn't work because the DATA step variable SASRAND ceases to exist once the DATA _NULL_ step has finished. The name SASRAND in PROC SURVEYSELECT is then assumed to be the name of a dataset that contains stratum initial seeds (cf. documentation). Yes, the SEED= option accepts either an integer constant or a dataset name. Simply use seed=0 to get varying results.
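If a computed seed is really wanted (for example, so that it can be written to the log and the sample reproduced later), the value has to outlive the DATA step. One way is a macro variable. The following is only a minimal sketch; the macro variable name SEEDVAL and the clock-based formula are my own assumptions, not something from the original code:

```sas
/* Sketch: keep a clock-based seed in a macro variable.              */
/* The name SEEDVAL and the formula below are illustrative choices.  */
data _null_;
  /* positive integer below 2**31-1, derived from the current datetime */
  seed = mod(int(datetime()*1000), 2147483646) + 1;
  call symputx('seedval', seed);   /* macro variable survives the step */
run;

%put NOTE: seed used = &seedval;   /* log it so the draw can be repeated */

proc surveyselect data=dat1 method=srs
  sampsize=2 seed=&seedval out=hsbs1;
run;
```

Because &SEEDVAL resolves to an integer constant before the procedure runs, SEED= no longer looks for a dataset. If reproducing the sample is not a concern, seed=0 remains the simpler choice.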

The ID statement is redundant here because _ALL_ is the default.

(For obtaining specified sample sizes using the RANUNI technique [as mentioned in an earlier version of your post] please see methods 2 and 3 presented in http://support.sas.com/kb/24/722.html.)

AJ_Brien · Posted 11-03-2019 02:19 PM

Thank you for your response.

Adding seed as 0 did help.

This is my current code. Based on the randomly selected values, I'm assigning records to Test and Control: everything in the PROC SURVEYSELECT output is Test, and the remaining 50% of dat2 is Control. My aim is to get different sets of test and control values every time, so I reset the dataset each time by creating dat2 from dat1. It runs well the first time; however, when I run the code a 2nd, 3rd, etc. time (from the 'rerun from here' comment to the end), a few observations seem to be marked as 'T' even though they are not part of the PROC SURVEYSELECT output. Why would that happen even when I start the sampling process from a fresh dat2 dataset?

data dat1;
 input idd $ acc $ prod vv;
 cards;
a  1 11 0
c  3 33 0
d  4 44 0
b  2 22 0
k  6 66 0
j  5 55 0
 ;
 run;

%let dsid=%sysfunc(open(dat1));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%put nobs= &nobs ;

/*loop can start here so that dat1 is not impacted and dat2 gets a reset everytime- no T/C values carried over*/
/*rerun from here*/
data dat2;
set dat1;
run;

proc surveyselect data = dat2 method = SRS 
  sampsize = %EVAL(&nobs./2)
 seed = 0 out = sample11;
  id _all_;
run;

proc sql;
alter table dat2
add TC char(3); 
update dat2
set TC = 'T' where prod in (select prod from sample11);
quit;

data dat2;
set dat2;
if TC ne 'T' then TC = 'C';
run;

PROC FREQ DATA=dat2; table tc; run;

FreelanceReinh · Posted 11-03-2019 04:26 PM

@AJ_Brien wrote:

It runs well the first time; however, when I run the code a 2nd, 3rd, etc. time (from the 'rerun from here' comment to the end), a few observations seem to be marked as 'T' even though they are not part of the PROC SURVEYSELECT output. Why would that happen even when I start the sampling process from a fresh dat2 dataset?

With the test data you posted this should not happen and it did not happen in test runs on my computer. However, it would likely happen more or less often if variable PROD was not a unique key in dataset DAT1, but had duplicate values: In this case the subquery in the PROC SQL step would typically select more observations than had been selected by PROC SURVEYSELECT (because it would add those observations which share PROD values with observations in SAMPLE11, but which are not contained in SAMPLE11).

Another possible explanation for the additional "T" records is that the DATA step copying DAT1 to DAT2 did not run for some reason.

That said, I would also suggest simplifying the code. The two steps below could essentially replace everything after the first and before the last step of your code.

proc surveyselect data = dat1
  rate = .5
  seed = 0 out = sample11;
run;

proc sql;
create table dat2 as
select *, case when prod in (select prod from sample11)
               then 'T' else 'C' end as TC
from dat1;
quit;

I used the (SAMP)RATE= option of PROC SURVEYSELECT to draw a 50% random sample (method=SRS is the default) from DAT1 directly. (After all, DAT2 in your code is only a copy of DAT1 at that point.) The PROC SQL step obviates the need for an additional DATA step to set TC='C' for the records outside the random sample.
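If PROD cannot be relied on as a unique key, a variant that avoids the join altogether is the OUTALL option, which keeps every input observation in the output and adds a 0/1 variable named Selected. A rough sketch, where the intermediate dataset name DAT2_ALL is just a placeholder of mine:

```sas
/* Sketch: OUTALL keeps all rows and adds the indicator variable Selected */
proc surveyselect data=dat1 method=srs
  rate=.5 seed=0 outall out=dat2_all;   /* DAT2_ALL is a placeholder name */
run;

data dat2;
  set dat2_all;
  length TC $1;
  TC = ifc(Selected, 'T', 'C');   /* flag comes from the procedure, no join on PROD */
  drop Selected;
run;
```

Since the T/C flag is derived from the procedure's own selection indicator, duplicate PROD values cannot pull extra records into the Test group.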

Alternatively, dataset DAT2 could even be created (without PROC SURVEYSELECT) in a single DATA step (using the RANUNI or better the RAND function; see the link in my previous post).
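A possible shape for such a single DATA step is sketched below. It uses sequential selection (each record is picked with probability "Test slots left / records left"), which yields exactly half of the records as Test. The helper variable names K, N and TOTAL are my own, and the step assumes DAT1 has no variables with those names:

```sas
/* Sketch: exact 50/50 Test/Control split in one DATA step.        */
/* K, N and TOTAL are illustrative names, assumed unused in DAT1.  */
data dat2;
  set dat1 nobs=total;
  retain k n;                  /* k = Test slots left, n = records left */
  if _n_ = 1 then do;
    k = floor(total/2);        /* with an odd count, Control gets the extra record */
    n = total;
  end;
  length TC $1;
  /* no CALL STREAMINIT: RAND then seeds itself from the system clock */
  if rand('uniform') < k/n then do;
    TC = 'T';
    k = k - 1;
  end;
  else TC = 'C';
  n = n - 1;
  drop k n;
run;
```

For repeatable assignments, a CALL STREAMINIT with a positive seed could be added at the top of the step.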

AJ_Brien · Posted 11-04-2019 12:51 PM

Yep, you're right. I changed my dataset values and ended up with duplicate values for prod, which was causing the issue. PROC SURVEYSELECT is working well now. Happy to have learnt something new in addition to the usual RANUNI. Thank you so much! 🙂
