BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
DerekD_WF
Obsidian | Level 7

I am trying to code a PROC SURVEYSELECT to create a stratified random sample that is reproduceable (same sample any time it is ran). I have the seed= option set but I think I am still missing something else required as I do not get the same results each time I run my script. I have my code below and am wondering do I need to use some variables in the CONTROL statement to sort my data to ensure the same selections within the strata each time?

 

I haven't done much previously with this PROC in the past other than some simple random sampling where the seed= option was enough to keep the sample the same. 

 

/* Get frequences for var1 x var2 strata */
proc freq data=work.pop_all noprint;
    table var1*var2 / out=work.pop_strata_1 ;
run;

/* determine strate ratios for input to surveyselect */
data work.pop_strata_2 error;
	set work.pop_strata_1;
	SAMPNUM = (PERCENT * 25000) / 100;
	_NSIZE_ = ROUND(SAMPNUM,1);
 	SAMPNUM=ROUND(SAMPNUM,.01);
	IF _NSIZE_ = 0 THEN OUTPUT ERROR;
	IF _NSIZE_ = 0 THEN DELETE;
	OUTPUT work.pop_strata_2;
run;

data work.pop_strata_3;
	set work.pop_strata_2;
 	keep var1 var2 _NSIZE_;
run;

proc sort data=work.pop_all;
    by var1 var2;
run;

proc sort data=work.pop_strata_3;
    by var1 var2;
run;

/* Create stratified sample of 25,000 accounts based on the 
	strata ratios in the population */
proc surveyselect data=work.pop_all seed=8092022 outseed
	out=work.samp(drop=SelectionProb SamplingWeight)
	sampsize=work.pop_strata_3;
	strata var1 var2;
	ID account company customer;
run;

/* Review the sample sizes output */
proc freq data=work.samp;
    tables var1 * var2;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
Zard
SAS Employee

For a stratified random sample, you would not use the CONTROL statement. You do need to sort on your STRATA vars, as you have done. I agree with the previous reply, and the best practice for generating a reproducible sample is to save the sorted input data set and use it each time.

Also, you can use the ALLOC=PROP option in the STRATA statement and your code simplifies to:

 

proc sort data=pop_all;
  by var1 var2;
run;
proc surveyselect data=work.pop_all seed=8092022 outseed
    out=work.samp(drop=SelectionProb SamplingWeight)
    sampsize=25000;
  strata var1 var2/alloc=prop;
  ID account company customer;
run;

View solution in original post

8 REPLIES 8
ballardw
Super User

How sure are you that your work.dw_pop_all is IDENTICAL to the last time the Proc Surveyselect was run?

Or work.pop?

 

If there are any changes in those two I would not expect the same result.

 

When I have had this sort of requirement the data sets used for Proc Surveyselect are kept in a permanent library so the sets stay the same.  Then I just make sure the library is available and can regenerate the sample.

 

Note: running the code on a different version of SAS might result in different results as well.

DerekD_WF
Obsidian | Level 7
Yes, same code, same database tables, same environment.
DerekD_WF
Obsidian | Level 7
Realized when I was "cleaning up" some code for posting here I didn't get the dataset names all consistent. I corrected the code.
ballardw
Super User

One of the reasons I asked about how sure you are that sets are identical is the use of the WORK library. That means that a separate SAS session means a different SAS library, the new location work library, and sets can differ.

 

Sorting by two variables does not mean that the order of every other variable would be identical. So minor differences may be present that are not obvious from the code and therefore again not identical data from different runs.

 

If you start with the exact same DATA set with the same options and parameters and can prove the input data set is the same and get different selected values with the same SEED then you should contact Tech support so they can determine what is going on.

 

DerekD_WF
Obsidian | Level 7
That goes back to my original question - do I need to sort and/or use the CONTROL statement to ensure the rows within the pop_all dataset are in the same order every time? I do know that my initial population, pulled from a Teradata database into the work data set work.pop_all is the same each time. As indicated in the code shared above, I do sort on var1 var2 for the strata input dataset and the population dataset. But I do not sort on any of the other vars in the work.pop_all dataset.
Do I need to sort by additional variables to ensure same order within the dataset each time? Does the CONTROL statement in SURVEYSELECT accomplish the same thing without doing a separate PROC SORT?
ballardw
Super User

@DerekD_WF wrote:
That goes back to my original question - do I need to sort and/or use the CONTROL statement to ensure the rows within the pop_all dataset are in the same order every time? I do know that my initial population, pulled from a Teradata database into the work data set work.pop_all is the same each time. As indicated in the code shared above, I do sort on var1 var2 for the strata input dataset and the population dataset. But I do not sort on any of the other vars in the work.pop_all dataset.
Do I need to sort by additional variables to ensure same order within the dataset each time? Does the CONTROL statement in SURVEYSELECT accomplish the same thing without doing a separate PROC SORT?

Actually if you did not have the previous version to actually compare the result of the pull I think you may be assuming the highlighted text to be true without actually testing it.

Depending on exactly how the data is "pulled" you could end up with quite a difference of actual order from one run to the next.

If the external database does not change content (sometimes a challenge all by itself) the order of returned records may be very problematic.

I would not expect the results of the Proc freq and dependent code to be an issue unless the counts change. But that Pop_all data base needs a lot more effort to ensure the records are in the same order. Like sorting by pretty much all of the variables. Since you are already sorting the set go ahead and do that in the Proc sort instead of using CONTROL variables.I suspect, that since you are already calling Proc Sort that it would be more efficient than getting into another procedure like Surveyselect that has to have overhead for the order.

Likely as a minimum:

 

proc sort data=work.pop_all;
    by var1 var2 account company customer;
run;

 

The question would be why, if the external data base doesn't change that you would "pull" the data again. That just wastes computer clock cycles. Save the data sets and reuse if needed. In my mind, the only reason I see to pull the data is that it may have changed. Which invalidates the expectation of identical Surveyselect results.

 

 

An exercise for the interested reader: Do the pull on that data 4 or 5 times at different times over the course of an hour or so, creating different data sets each time. Then run Proc Compare on some of the pairs of data sets. Confirm that the data comes in the same order for ALL values.

 

 

Zard
SAS Employee

For a stratified random sample, you would not use the CONTROL statement. You do need to sort on your STRATA vars, as you have done. I agree with the previous reply, and the best practice for generating a reproducible sample is to save the sorted input data set and use it each time.

Also, you can use the ALLOC=PROP option in the STRATA statement and your code simplifies to:

 

proc sort data=pop_all;
  by var1 var2;
run;
proc surveyselect data=work.pop_all seed=8092022 outseed
    out=work.samp(drop=SelectionProb SamplingWeight)
    sampsize=25000;
  strata var1 var2/alloc=prop;
  ID account company customer;
run;

DerekD_WF
Obsidian | Level 7
Thank you for this, I wasn't aware of that proportional allocation option, saves a few steps of calculating that manually. Also, it seems as adding the additional sorting variables is doing the trick to ensure my seed= value is keeping the sample the same each run.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 1396 views
  • 3 likes
  • 3 in conversation