
Randomly selecting 5 case per cluster

Occasional Contributor ncy


Hi there, 


I am working on a case-control study. Each case has 20 controls, but I would like to draw a smaller sample of 5 controls per case. How can I select them randomly, without replacement? From what I found online, PROC SURVEYSELECT looks like a good fit, but I don't know how to specify the options to get what I want. Does anyone have any ideas?


A sample dataset would look like this:


data have;
   input n id;
   datalines;
1 12
1 13
1 14
1 15
1 16
1 17
1 18
2 35
2 40
2 56
2 57
2 58
2 59
2 60
;
run;




Here n is the case id and id is the control id. I would like to randomly select 5 controls per case.




Super User

Re: Randomly selecting 5 case per cluster

proc surveyselect data=have
   out=selected sampsize=5 outall;
   strata n;
run;

The rule is that when you want nn records per value of a variable, that variable goes on the STRATA statement in PROC SURVEYSELECT. The input data set has to be sorted by the strata variable. The SAMPSIZE= option gives the number of records wanted per stratum. If you need a different size for each stratum, list the sizes in order of the strata variable values: sampsize=(5 6 4) takes 5 from the first stratum, 6 from the second and 4 from the last.
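As a rough sketch of the sorting step and the per-stratum size list, applied to the two-case sample data from the question: the output name selected_varied, the seed value and the sizes (5 3) are placeholders I made up for illustration. METHOD=SRS (simple random sampling without replacement) is spelled out only to make the intent visible, and SEED= is there so the draw can be repeated.

```sas
/* The STRATA statement expects the input sorted by the strata variable. */
proc sort data=have;
   by n;
run;

/* Assumed names and values: selected_varied, seed=12345 and the sizes  */
/* (5 3) are illustrative only. Sizes are listed in the sorted order of */
/* n, so 5 controls are drawn for n=1 and 3 controls for n=2.           */
proc surveyselect data=have out=selected_varied
      method=srs seed=12345 sampsize=(5 3);
   strata n;
run;
```

Without OUTALL, the output data set holds only the sampled records. For the original request of 5 controls for every case, a single sampsize=5 is enough.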


I used the OUTALL option to create a data set with all of your starting records plus an added variable named Selected, which has a value of 1 for the selected records. Notice that SAS also adds a selection probability and a sampling weight. Feel free to drop them if you aren't going to use the weights for anything later.
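If only the sampled controls are wanted afterwards, one possible follow-up step is sketched below. The output name controls5 is made up, and I am assuming the added variables are named SelectionProb and SamplingWeight; check the PROC CONTENTS output for the selected data set if the names differ in your release.

```sas
/* Keep only the sampled controls from the OUTALL output.               */
/* controls5 is a hypothetical name; SelectionProb and SamplingWeight   */
/* are the variable names assumed for the probability and the weight.   */
data controls5;
   set selected;
   where Selected = 1;
   drop Selected SelectionProb SamplingWeight;
run;
```

Leaving OUTALL off the PROC SURVEYSELECT statement gives a similar result directly, since the output then contains only the selected records.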
