Hi,

I am new to random sampling methods. I would like to draw a random sample from my control data set based on the percentages of strata found in an existing case data set.

Sample data sets

Sample data for the cases (I have 55 strata in my actual problem):

Stratum  count  pct
1        57     64.0
2        21     23.6
3        11     12.4

Sample data for the controls (I have more than 2 million records in my actual problem):

unit  strata  pct
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
A     1       64
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
B     2       23.6
C     3       12.4
C     3       12.4
C     3       12.4
C     3       12.4
C     3       12.4
C     3       12.4
C     3       12.4
C     3       12.4

I used PROC SURVEYSELECT to draw a random sample without replacement, using the variable 'strata' as the stratification variable and the percentages found in my cases as the sampling rates. However, after team discussion, we believe that this method of selection resulted in the loss of a large number of sample records.

Original code used to get my sample:

    proc surveyselect data=ids_select method=srs seed=1953
                      samprate=case_strata out=ids_matched_control;
      strata strata;
    run;

Thus, a new procedure is proposed to do the random selection (see below).

Proposed new procedure:

(1) First do probability sampling from among all the strata, using the relative proportions those strata represent among the case families. So in this step, we are selecting just one stratum from among the 55 that we have. If a stratum represents x% of all case families, then in this step we select that stratum with x% probability.

(2) Once a stratum is selected in (1), choose one family (without replacement) from the potential control families in that stratum. The selection of the family within the stratum should be uniformly random (an arbitrary selection from the available families in that stratum).
(3) Go back to (1) and repeat the sampling procedure, stopping only when the selection in (1) is a sufficiently large stratum AND the selection in (2) finds that the entire large stratum has already been sampled (no more families available in that stratum).

Question: How do I do this? Do I need a nested DO loop to get the result?

Here is my proposed code (not working):

    data want;
      if _n_=1 then percent_to_select = pct * ranuni(12345);
      retain percent_to_select;
      set sample;
      if ranuni(13579) <= pct;
    run;

Any suggestions or recommendations are appreciated.

Thanks,
Siew
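
To make the three steps concrete, here is a rough, untested sketch of how they might be written in SAS. Several things in it are assumptions rather than part of the real data: a hypothetical data set case_pct with one row per stratum (numeric variables strata and pct), the control data set ids_select having the variables unit and strata, at most 100 strata, and the output names strata_info, quota, controls_rand and ids_new_control. It also simplifies the stopping rule: it stops the first time the stratum drawn in (1) has no unused families left, and ignores the "sufficiently large" condition, which is not defined precisely above.

```sas
/* Count the control families available in each case stratum. */
proc sql;
  create table strata_info as
  select c.strata, c.pct, count(k.unit) as n_avail
  from case_pct as c left join ids_select as k
    on c.strata = k.strata
  group by c.strata, c.pct
  order by c.strata;
quit;

/* Steps (1) and (3): simulate the stratum draws and record how many
   families each stratum contributes before the stopping rule fires. */
data quota(keep=strata n_take);
  array id[100]    _temporary_;  /* stratum identifiers (assumed numeric) */
  array p[100]     _temporary_;  /* case percentages                      */
  array avail[100] _temporary_;  /* controls available per stratum        */
  array take[100]  _temporary_;  /* controls selected so far              */
  call streaminit(1953);

  /* Load the per-stratum information into the arrays. */
  n = 0;
  total = 0;
  do until (last);
    set strata_info end=last;
    n = n + 1;
    id[n] = strata;
    p[n] = pct;
    avail[n] = n_avail;
    take[n] = 0;
    total = total + pct;
  end;

  done = 0;
  do until (done);
    /* Pick one stratum with probability pct / total. */
    u = rand('uniform') * total;
    cum = 0;
    do s = 1 to n;
      cum = cum + p[s];
      if u <= cum then leave;
    end;
    if s > n then s = n;  /* guard against floating-point rounding */

    /* Stop when the chosen stratum is already exhausted;
       otherwise reserve one more family from it. */
    if take[s] >= avail[s] then done = 1;
    else take[s] = take[s] + 1;
  end;

  do s = 1 to n;
    strata = id[s];
    n_take = take[s];
    output;
  end;
  stop;
run;

/* Step (2): put the controls in random order within each stratum ... */
data controls_rand;
  call streaminit(2468);
  set ids_select;
  _u = rand('uniform');
run;

proc sort data=controls_rand;
  by strata _u;
run;

/* ... and keep the first n_take families of each stratum. */
data ids_new_control(drop=_u _k n_take);
  merge controls_rand(in=inc) quota;
  by strata;
  if first.strata then _k = 0;
  _k + 1;
  if inc and _k <= n_take;
run;
```

The idea behind this layout is that drawing families one at a time without replacement from a stratum gives the same distribution as shuffling the stratum once and taking its first n_take families. That way the loop only runs over stratum draws, and the 2 million control records are read and sorted once instead of being searched inside a nested loop. Note that if a stratum with a positive pct has no controls at all, this sketch stops as soon as that stratum is drawn.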
