Solved: Reproduce stratified sample with PROC SURVEYSELECT

DerekD_WF · Posted 09-07-2022 01:47 PM

I am trying to code a PROC SURVEYSELECT to create a stratified random sample that is reproduceable (same sample any time it is ran). I have the seed= option set but I think I am still missing something else required as I do not get the same results each time I run my script. I have my code below and am wondering do I need to use some variables in the CONTROL statement to sort my data to ensure the same selections within the strata each time?

I haven't done much previously with this PROC in the past other than some simple random sampling where the seed= option was enough to keep the sample the same.

/* Get frequences for var1 x var2 strata */
proc freq data=work.pop_all noprint;
    table var1*var2 / out=work.pop_strata_1 ;
run;

/* determine strate ratios for input to surveyselect */
data work.pop_strata_2 error;
	set work.pop_strata_1;
	SAMPNUM = (PERCENT * 25000) / 100;
	_NSIZE_ = ROUND(SAMPNUM,1);
 	SAMPNUM=ROUND(SAMPNUM,.01);
	IF _NSIZE_ = 0 THEN OUTPUT ERROR;
	IF _NSIZE_ = 0 THEN DELETE;
	OUTPUT work.pop_strata_2;
run;

data work.pop_strata_3;
	set work.pop_strata_2;
 	keep var1 var2 _NSIZE_;
run;

proc sort data=work.pop_all;
    by var1 var2;
run;

proc sort data=work.pop_strata_3;
    by var1 var2;
run;

/* Create stratified sample of 25,000 accounts based on the 
	strata ratios in the population */
proc surveyselect data=work.pop_all seed=8092022 outseed
	out=work.samp(drop=SelectionProb SamplingWeight)
	sampsize=work.pop_strata_3;
	strata var1 var2;
	ID account company customer;
run;

/* Review the sample sizes output */
proc freq data=work.samp;
    tables var1 * var2;
run;

Zard · Posted 09-07-2022 05:56 PM

For a stratified random sample, you would not use the CONTROL statement. You do need to sort on your STRATA vars, as you have done. I agree with the previous reply, and the best practice for generating a reproducible sample is to save the sorted input data set and use it each time.

Also, you can use the ALLOC=PROP option in the STRATA statement and your code simplifies to:

proc sort data=pop_all;
by var1 var2;
run;
proc surveyselect data=work.pop_all seed=8092022 outseed
out=work.samp(drop=SelectionProb SamplingWeight)
sampsize=25000;
strata var1 var2/alloc=prop;
ID account company customer;
run;

View solution in original post

ballardw · Posted 09-07-2022 04:02 PM

How sure are you that your work.dw_pop_all is IDENTICAL to the last time the Proc Surveyselect was run?

Or work.pop?

If there are any changes in those two I would not expect the same result.

When I have had this sort of requirement the data sets used for Proc Surveyselect are kept in a permanent library so the sets stay the same. Then I just make sure the library is available and can regenerate the sample.

Note: running the code on a different version of SAS might result in different results as well.

DerekD_WF · Posted 09-07-2022 04:18 PM

Yes, same code, same database tables, same environment.

DerekD_WF · Posted 09-07-2022 04:21 PM

Realized when I was "cleaning up" some code for posting here I didn't get the dataset names all consistent. I corrected the code.

ballardw · Posted 09-07-2022 05:08 PM

One of the reasons I asked about how sure you are that sets are identical is the use of the WORK library. That means that a separate SAS session means a different SAS library, the new location work library, and sets can differ.

Sorting by two variables does not mean that the order of every other variable would be identical. So minor differences may be present that are not obvious from the code and therefore again not identical data from different runs.

If you start with the exact same DATA set with the same options and parameters and can prove the input data set is the same and get different selected values with the same SEED then you should contact Tech support so they can determine what is going on.

DerekD_WF · Posted 09-07-2022 05:18 PM

That goes back to my original question - do I need to sort and/or use the CONTROL statement to ensure the rows within the pop_all dataset are in the same order every time? I do know that my initial population, pulled from a Teradata database into the work data set work.pop_all is the same each time. As indicated in the code shared above, I do sort on var1 var2 for the strata input dataset and the population dataset. But I do not sort on any of the other vars in the work.pop_all dataset.
Do I need to sort by additional variables to ensure same order within the dataset each time? Does the CONTROL statement in SURVEYSELECT accomplish the same thing without doing a separate PROC SORT?

ballardw · Posted 09-07-2022 05:53 PM

@DerekD_WF wrote:
That goes back to my original question - do I need to sort and/or use the CONTROL statement to ensure the rows within the pop_all dataset are in the same order every time? I do know that my initial population, pulled from a Teradata database into the work data set work.pop_all is the same each time. As indicated in the code shared above, I do sort on var1 var2 for the strata input dataset and the population dataset. But I do not sort on any of the other vars in the work.pop_all dataset.
Do I need to sort by additional variables to ensure same order within the dataset each time? Does the CONTROL statement in SURVEYSELECT accomplish the same thing without doing a separate PROC SORT?

Actually if you did not have the previous version to actually compare the result of the pull I think you may be assuming the highlighted text to be true without actually testing it.

Depending on exactly how the data is "pulled" you could end up with quite a difference of actual order from one run to the next.

If the external database does not change content (sometimes a challenge all by itself) the order of returned records may be very problematic.

I would not expect the results of the Proc freq and dependent code to be an issue unless the counts change. But that Pop_all data base needs a lot more effort to ensure the records are in the same order. Like sorting by pretty much all of the variables. Since you are already sorting the set go ahead and do that in the Proc sort instead of using CONTROL variables.I suspect, that since you are already calling Proc Sort that it would be more efficient than getting into another procedure like Surveyselect that has to have overhead for the order.

Likely as a minimum:

proc sort data=work.pop_all;
    by var1 var2 account company customer;
run;

The question would be why, if the external data base doesn't change that you would "pull" the data again. That just wastes computer clock cycles. Save the data sets and reuse if needed. In my mind, the only reason I see to pull the data is that it may have changed. Which invalidates the expectation of identical Surveyselect results.

An exercise for the interested reader: Do the pull on that data 4 or 5 times at different times over the course of an hour or so, creating different data sets each time. Then run Proc Compare on some of the pairs of data sets. Confirm that the data comes in the same order for ALL values.

Zard · Posted 09-07-2022 05:56 PM

For a stratified random sample, you would not use the CONTROL statement. You do need to sort on your STRATA vars, as you have done. I agree with the previous reply, and the best practice for generating a reproducible sample is to save the sorted input data set and use it each time.

Also, you can use the ALLOC=PROP option in the STRATA statement and your code simplifies to:

proc sort data=pop_all;
by var1 var2;
run;
proc surveyselect data=work.pop_all seed=8092022 outseed
out=work.samp(drop=SelectionProb SamplingWeight)
sampsize=25000;
strata var1 var2/alloc=prop;
ID account company customer;
run;

DerekD_WF · Posted 09-15-2022 05:52 PM

Thank you for this, I wasn't aware of that proportional allocation option, saves a few steps of calculating that manually. Also, it seems as adding the additional sorting variables is doing the trick to ensure my seed= value is keeping the sample the same each run.

Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

Re: Reproduce stratified sample with PROC SURVEYSELECT

SAS Innovate 2025: Call for Content