Hi, I am new to random sampling methods and would like to draw a random sample from my control data set based on the stratum percentages found in an existing case data set.
(1) Sample Data sets
- sample data for the cases (I have 55 strata for my problem):
Stratum count pct
1 57 64.0
2 21 23.6
3 11 12.4
- sample data for the controls (I have more than 2 million records for my problem):
unit | strata | pct |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
A | 1 | 64 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
B | 2 | 23.6 |
C | 3 | 12.4 |
C | 3 | 12.4 |
C | 3 | 12.4 |
C | 3 | 12.4 |
C | 3 | 12.4 |
C | 3 | 12.4 |
C | 3 | 12.4 |
C | 3 | 12.4 |
I used PROC SURVEYSELECT to draw a random sample without replacement, with the variable 'strata' as the stratification variable and the sampling rates taken from the stratum percentages found in my cases. However, after discussion, my team concluded that this method of selection loses a large number of potential sample records.
Original code used to get my sample:
proc surveyselect data=ids_select method=srs seed=1953 samprate=case_strata
out=ids_matched_control;
strata strata;
run;
Thus, a new procedure is proposed to do the random selection (see below).
Proposed new procedure:
1. First, select one stratum by probability sampling from among all the strata, using the relative proportions those strata represent among the case families. If a stratum represents x% of all case families, it is selected with probability x%. Each pass through this step selects exactly one of the 55 strata that we have.
2. Once a stratum is selected in step 1, choose one family (without replacement) from the potential control families in that stratum. The selection of the family within the stratum should be uniformly random (an arbitrary selection from the available families in that stratum).
3. Go back to step 1 and repeat the sampling procedure, stopping only when the stratum selected in step 1 is sufficiently large AND step 2 finds that this entire stratum has already been sampled (no more families available in that stratum).
Question: How do I do this? Do I need a nested DO loop to get the result? Here is my proposed code (not working):
data want;
if _n_=1 then percent_to_select = pct* ranuni(12345);
retain percent_to_select;
set sample;
if ranuni(13579) <= pct;
run;
Any suggestion or recommendation is appreciated. Thank you.
Thanks,
Siew
Hello @Anav0416 and welcome to the SAS Support Communities!
Try this:
/* Create test data for demonstration */
data cases;
input id $ stratum;
cards;
a 1
b 1
c 1
d 2
e 2
;
data controls;
input id $ stratum;
cards;
A 1
B 1
C 1
D 1
E 2
F 2
G 2
H 2
;
/* Determine stratum percentages in cases */
proc freq data=cases noprint;
tables stratum / out=frq_cases;
run;
/* Write stratum selection probabilities and number of strata to macro variables
(assuming strata are numbered 1, 2, ...!) */
proc sql noprint;
select put(percent/100, best16.) into :probs separated by ', '
from frq_cases
order by stratum;
select max(stratum) into :n trimmed
from frq_cases;
quit;
%put &=probs;
%put &=n;
/* Determine stratum sizes in controls */
proc freq data=controls noprint;
tables stratum / out=frq_controls;
run;
/* Write stratum sizes and number of controls to macro variables */
proc sql noprint;
select put(count, best12.) into :counts separated by ' '
from frq_controls
order by stratum;
select put(count(*), best12.) into :ntotal trimmed
from controls;
quit;
%put &=counts;
%put &=ntotal;
/* Determine sample sizes per stratum */
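data _null_;
  call streaminit(27182818);
  array ncontr [&n] _temporary_ (&counts); /* controls available per stratum */
  array nselect[&n] _temporary_ (&n*0);    /* controls selected so far */
  do _n_=1 to &ntotal;
    s=rand('table', &probs);               /* draw a stratum with the case proportions */
    if nselect[s]<ncontr[s] then nselect[s]+1;
    else leave;                            /* selected stratum is exhausted: stop */
  end;
  call symputx('smp_sizes', catx(' ', of nselect[*]));
run;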
%put &=smp_sizes;
/* Perform stratified simple random sampling */
proc surveyselect data=controls
method=srs n=(&smp_sizes)
seed=31415927 out=want;
strata stratum;
run;
The DATA _NULL_ step prepares the actual sampling (done by PROC SURVEYSELECT) following the rules of your "proposed new procedure" (at least as I've interpreted them). In particular, sampling stops as soon as an attempt is made to select an item from an exhausted stratum, i.e., when the number of items already selected from the s-th stratum, nselect[s], equals the number of initially available items, ncontr[s]. As a consequence, at least one stratum will produce a log message of the form
NOTE: The sample size equals the number of sampling units. All units are included in the sample.
NOTE: The above message was for the following stratum: ...
in the final PROC SURVEYSELECT step.
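To check how closely the selected controls follow the case proportions, a comparison along these lines could be run afterwards. This is only a sketch: the data set names frq_want and compare are made up for illustration, and it assumes that frq_cases and want from the code above still exist.

```sas
/* Stratum counts and percentages among the selected controls */
proc freq data=want noprint;
  tables stratum / out=frq_want;
run;

/* Put case and control percentages side by side (one row per stratum).
   Both PROC FREQ output data sets are ordered by stratum. */
data compare;
  merge frq_cases(rename=(count=n_cases percent=pct_cases))
        frq_want (rename=(count=n_want  percent=pct_want));
  by stratum;
run;

proc print data=compare noobs;
run;
```

A stratum from which no control was selected would show missing values in n_want and pct_want.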
Dear FreelanceReinhard,
Thank you for the quick response. I ran your suggested code successfully. However, the final output had fewer records (N=186,200) than the original code produced (N=219,960). The new procedure is meant to "max out" the sample size of the controls, since for most or all of the larger strata we should be selecting most of the families in those strata. My team also suggested defining "sufficiently large" strata as those with at least 0.5% relative probability, or something similar. We think that applying such a cutoff will ensure that the new sampling procedure does not terminate too early.
Question: I agree with your implementation, but I am wondering whether the new procedure actually produces fewer records than intended? Will applying a cutoff point help in this case?
Thank you again for your help; I have learned a new way of coding random sampling.
Respectfully,
Anav0416
Hi @Anav0416,
Thanks for the clarification (esp. about what you meant by "sufficiently large").
@Anav0416 wrote:
Question: I agree with your implementation, but I am wondering whether the new procedure actually produces fewer records than intended?
I don't think that the new procedure produces systematically smaller sample sizes than your previous approach. It depends on the data. Assuming that your dataset case_strata contained the rates 64.0, 23.6, 12.4 (percent) as shown in your initial post, your approach applied to my example data (i.e. with rates 60% for stratum 1 and 40% for stratum 2) would always result in 5 observations: 0.6*4=2.4, rounded up to 3, plus 0.4*4=1.6, rounded up to 2 (see documentation). With the new procedure the sample size is a random variable. Its expected value in this example is approx. 6.35 (according to my calculation involving the negative binomial distribution), hence greater than 5.
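For anyone who wants to reproduce the expected value, here is a minimal sketch of the calculation for the example data (two strata, selection probabilities 0.6 and 0.4, four controls each). The variable names are chosen just for this illustration. The idea is that sampling stops on the draw that selects some stratum for the fifth time, so the sample size is one less than the number of that draw.

```sas
/* Expected sample size of the sequential procedure for the example data */
data _null_;
  p1 = 0.6;  p2 = 0.4;   /* stratum selection probabilities (from the cases) */
  m  = 4;                /* controls available in each stratum */
  ev = 0;
  /* t = number of the draw on which one stratum is selected for the
     (m+1)-th time, i.e. the draw that hits an exhausted stratum.
     P(t) is a sum of two negative binomial terms, one per stratum. */
  do t = m+1 to 2*m+1;
    pt = comb(t-1, m) * (p1**(m+1) * p2**(t-m-1) + p2**(m+1) * p1**(t-m-1));
    ev + (t-1) * pt;     /* t-1 units were sampled before stopping */
  end;
  put ev= 8.4;
run;
```

The value written to the log should be about 6.35, in line with the figure quoted above.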
@Anav0416 wrote:
Will applying a cutoff point help in this case?
Yes, it seems plausible to me that you'll increase the total sample size (on average) by using such a cutoff. However, the proportions of the strata in the controls will then be less similar to those in the cases. Without a cutoff the algorithm tends to produce similar proportions (of course varying because it's not deterministic). So, if moderate deviations are acceptable, you could give it a try.
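If you want to try it, the DATA _NULL_ step could be modified roughly as follows. Please treat this as a sketch of one possible reading of the cutoff rule, not as a tested solution: I am assuming that hitting an exhausted stratum below the cutoff simply means "draw again", the macro variable cutoff and the array p are names I made up, and the step relies on the macro variables n, probs and counts created earlier in the thread.

```sas
%let cutoff = 0.005;  /* hypothetical threshold for "sufficiently large" */

data _null_;
  call streaminit(27182818);
  array p      [&n] _temporary_ (&probs);   /* stratum selection probabilities */
  array ncontr [&n] _temporary_ (&counts);  /* controls available per stratum */
  array nselect[&n] _temporary_ (&n*0);     /* controls selected so far */
  done = 0;
  do until (done);
    s = rand('table', &probs);
    if s > &n then continue;                /* guard in case the probabilities sum to slightly less than 1 */
    if nselect[s] < ncontr[s] then nselect[s] + 1;
    else if p[s] >= &cutoff then done = 1;  /* exhausted large stratum: stop */
    /* exhausted small stratum: ignore this draw and continue */
  end;
  call symputx('smp_sizes', catx(' ', of nselect[*]));
run;
%put &=smp_sizes;
```

Note that the loop only ends once an exhausted stratum with a probability of at least the cutoff is drawn, so at least one stratum must reach the cutoff. The resulting smp_sizes can then be passed to the final PROC SURVEYSELECT step as before.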