Anav0416
Calcite | Level 5

Hi, I am new to random sampling methods and would like to draw a random sample from my control data set based on the stratum percentages found in an existing case data set.

(1) Sample Data sets

  • sample data for the cases (I have 55 strata for my problem):

Stratum   count    pct

1               57      64.0
2               21      23.6 
3               11      12.4 

 

  • sample data for the controls (I have > 2 million records for my problem):
unit   strata   pct

A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
A      1        64
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
B      2        23.6
C      3        12.4
C      3        12.4
C      3        12.4
C      3        12.4
C      3        12.4
C      3        12.4
C      3        12.4
C      3        12.4

I used PROC SURVEYSELECT to draw a random sample without replacement, using the variable 'strata' as the stratum variable and the percentages found in my cases as the sampling rates. However, after a team discussion, we concluded that this selection method loses a large number of sample records.

 

Original code used to get my sample:

proc surveyselect data=ids_select method=srs seed=1953 samprate=case_strata
out=ids_matched_control;
strata strata;
run;
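
For readers who want to reproduce this call, here is a minimal sketch. It assumes that SAMPRATE= is given a data set holding one row per stratum, with the sampling rate in a variable named _RATE_ and the same stratum variable as the input data. The contents of ids_select and case_strata below are made up from the sample data in this post; the real data sets may look different. Rates are entered as proportions here, and the input data must be sorted by the stratum variable.

```sas
/* Hypothetical control data: 13, 11 and 8 units in strata 1, 2 and 3 */
data ids_select;
length unit $1;
do i = 1 to 13; unit = 'A'; strata = 1; output; end;
do i = 1 to 11; unit = 'B'; strata = 2; output; end;
do i = 1 to 8;  unit = 'C'; strata = 3; output; end;
drop i;
run;

/* Hypothetical rate data set: one row per stratum, rate in _RATE_ */
data case_strata;
input strata _rate_;
cards;
1 0.640
2 0.236
3 0.124
;

/* Stratified simple random sampling with stratum-specific rates */
proc surveyselect data=ids_select method=srs seed=1953
  samprate=case_strata out=ids_matched_control;
strata strata;
run;

/* Inspect how many units were drawn from each stratum */
proc freq data=ids_matched_control;
tables strata;
run;
```

With this approach the number of units per stratum is fixed by the rate and the stratum size, so small control strata contribute only a few records.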

 

Thus, a new procedure is proposed to do the random selection (see below).

Proposed new procedure:

  

  1. First, do probability sampling from among all the strata, using the relative proportions those strata represent among the case families. So in this step, we are selecting just one stratum from among the 55 that we have. If a stratum represents x% of all case families, then in this step we select that stratum with x% probability.
  2. Once a stratum is selected in (1), choose one family (without replacement) from the potential control families in that stratum. The selection of the family within the stratum should be uniformly random (an arbitrary selection from the available families in that stratum).
  3. Go back to (1) and repeat the sampling procedure, stopping only when the selection in (1) is a sufficiently large stratum AND the selection in (2) results in that entire large stratum having already been sampled (no more families available in that stratum).

Question: How do I do this? Do I need a nested DO loop to get the result? Here is my attempted code (not working):

data want;

     if _n_=1 then percent_to_select = pct* ranuni(12345);

    retain percent_to_select;

   set sample;

   if ranuni(13579) <= pct;

run;

 

Any suggestion or recommendation is appreciated. Thank you.

 

Thanks,

Siew


3 REPLIES 3
FreelanceReinh
Jade | Level 19

Hello @Anav0416 and welcome to the SAS Support Communities!

 

Try this:

/* Create test data for demonstration */

data cases;
input id $ stratum;
cards;
a 1
b 1
c 1
d 2
e 2
;

data controls;
input id $ stratum;
cards;
A 1
B 1
C 1
D 1
E 2
F 2
G 2
H 2
;

/* Determine stratum percentages in cases */

proc freq data=cases noprint;
tables stratum / out=frq_cases;
run;

/* Write stratum selection probabilities and number of strata to macro variables
   (assuming strata are numbered 1, 2, ...!) */

proc sql noprint;
select put(percent/100, best16.) into :probs separated by ', '
from frq_cases
order by stratum;

select max(stratum) into :n trimmed
from frq_cases;
quit;

%put &=probs;
%put &=n;

/* Determine stratum sizes in controls */

proc freq data=controls noprint;
tables stratum / out=frq_controls;
run;

/* Write stratum sizes and number of controls to macro variables */

proc sql noprint;
select put(count, best12.) into :counts separated by ' '
from frq_controls
order by stratum;

select put(count(*), best12.) into :ntotal trimmed
from controls;
quit;

%put &=counts;
%put &=ntotal;

/* Determine sample sizes per stratum */

data _null_;
call streaminit(27182818);
array ncontr [&n] _temporary_ (&counts);
array nselect[&n] _temporary_ (&n*0);
do _n_=1 to &ntotal;
  s=rand('table', &probs);
  if nselect[s]<ncontr[s] then nselect[s]+1;
  else leave;
end;
call symputx('smp_sizes', catx(' ', of nselect[*]));
run;

%put &=smp_sizes;

/* Perform stratified simple random sampling */

proc surveyselect data=controls
method=srs n=(&smp_sizes)
seed=31415927 out=want;
strata stratum;
run;

The DATA _NULL_ step prepares the actual sampling (done by PROC SURVEYSELECT) following the rules of your "proposed new procedure" (at least as I've interpreted them). In particular, sampling stops as soon as an attempt is made to select an item from an exhausted stratum, i.e., when the number of items already selected from stratum s, nselect[s], equals the number of initially available items, ncontr[s]. As a consequence, at least one stratum will produce a log message of the form

NOTE: The sample size equals the number of sampling units. All units are included in the sample.
NOTE: The above message was for the following stratum:
...

in the final PROC SURVEYSELECT step.

Anav0416
Calcite | Level 5

Dear 

 

 

 

FreelanceReinh
Jade | Level 19

Hi @Anav0416,

 

Thanks for the clarification (esp. about what you meant by "sufficiently large").


@Anav0416 wrote:


I don't think that the new procedure produces systematically smaller sample sizes than your previous approach. It depends on the data. Assuming that your dataset case_strata contained the rates 64.0, 23.6, 12.4 (percent) as shown in your initial post, your approach applied to my example data (i.e. with rates 60% for stratum 1 and 40% for stratum 2) would always result in 5 observations: 0.6*4=2.4, rounded up to 3, plus 0.4*4=1.6, rounded up to 2 (see documentation). With the new procedure the sample size is a random variable. Its expected value in this example is approx. 6.35 (according to my calculation involving the negative binomial distribution), hence greater than 5.
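
The two figures above can be checked with a short DATA step. This is only a sketch under the assumptions of the example (two strata with 4 controls each, selection probabilities 0.6 and 0.4); the variable names are illustrative. It uses the fact that sampling stops at the first draw that hits one stratum for the (m+1)-th time, so the number of draws t ranges from m+1 to 2m+1 and the sample contains t-1 units.

```sas
data _null_;
p = 0.6;   /* selection probability of stratum 1 */
q = 0.4;   /* selection probability of stratum 2 */
m = 4;     /* controls available in each stratum */

/* Previous approach: rate times stratum size, rounded up */
n_fixed = ceil(p*m) + ceil(q*m);

/* New procedure: draw t is the stopping draw if it is the (m+1)-th
   hit of one stratum while the other stratum has t-m-1 hits */
e_size = 0;
do t = m+1 to 2*m+1;
  prob_t = comb(t-1, m) * (p**(m+1) * q**(t-m-1) + q**(m+1) * p**(t-m-1));
  e_size = e_size + (t-1)*prob_t;   /* t-1 units are in the sample */
end;

put n_fixed= e_size= 6.4;
run;
```

The log should show the fixed size of 5 for the previous approach and an expected size of about 6.35 for the new procedure, in line with the values quoted above.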


@Anav0416 wrote:


Yes, it seems plausible to me that you'll increase the total sample size (on average) by using such a cutoff. However, the proportions of the strata in the controls will then be less similar to those in the cases. Without a cutoff the algorithm tends to produce similar proportions (of course varying because it's not deterministic). So, if moderate deviations are acceptable, you could give it a try.
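
A cutoff of this kind could be sketched as follows. This is one possible reading of the rule, not a tested solution: a draw that hits an exhausted stratum ends the sampling only if that stratum has at least CUTOFF controls, and is otherwise discarded. The macro variable CUTOFF, the example values, and the upper bound on the number of draws (a guard against an endless loop) are all assumptions made for illustration.

```sas
%let n      = 3;               /* number of strata (hypothetical) */
%let probs  = 0.1, 0.5, 0.4;   /* stratum proportions among the cases (hypothetical) */
%let counts = 2 50 40;         /* controls available per stratum (hypothetical) */
%let cutoff = 10;              /* "sufficiently large" = at least 10 controls (assumption) */

data _null_;
call streaminit(27182818);
array ncontr [&n] _temporary_ (&counts);
array nselect[&n] _temporary_ (&n*0);
done = 0;
/* the upper bound only guards against an endless loop */
do draw = 1 to 1e6 until (done);
  s = rand('table', &probs);
  if nselect[s] < ncontr[s] then nselect[s] + 1;
  else if ncontr[s] >= &cutoff then done = 1;
  /* else: small exhausted stratum, discard this draw and continue */
end;
call symputx('smp_sizes', catx(' ', of nselect[*]));
run;

%put &=smp_sizes;
```

The resulting list can be passed to PROC SURVEYSELECT through n=(&smp_sizes), as in the earlier code. Small strata will typically be sampled completely, which is why their share in the controls can exceed their share in the cases.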
