01-08-2013 09:59 AM
I've been reading through the documentation on this procedure and I cannot find (or perhaps identify) the appropriate syntax to accommodate a specific sampling objective. I am looking to generate a random sample without replacement from a dataset that contains claims adjudicated by different users. Because some users process more claims than others, I want the sampling methodology to give each user an equal probability of being chosen in the random sample (I do not need a specific number [or proportion] of observations from each user, as many of the documentation examples show). This will remove the bias where claims examiners who process more claims have a higher chance of being "randomly" chosen because more of their claim lines exist on the dataset.
I will provide a couple of arbitrary parameters so someone can illustrate the appropriate syntax.
Random Sample Size: 100
Variable identifying the claims examiner: User
01-08-2013 01:32 PM
If you want equal probability of being selected based on User, why not do something like:
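/* Make up some data */
data have;
do user=1 to 1000;
/* flag is assumed here to be a coin flip per user; its original definition is not shown */
flag=(ranuni(0)>0.5);
do claim=1 to 2000;
/* Make it so that some of the users have more claims than others, so that this is a little more general */
if not (flag=0 and claim>1000) then output;
end; end; run;
/* Give every record a unique random number */
/* This is the point where you can use your own "have" dataset */
data firstpass; set have; ranno=ranuni(0); run;
/* Sort now so that records are randomized within each user */
proc sort data=firstpass out=secondpass;
by user ranno;
run;
/* Select the record for each user with the lowest random value */
/* Draw a fresh number per user too: the lowest ranno tends to be smaller for users with more claims */
data thirdpass;
set secondpass;
by user ranno;
if first.user;
userno=ranuni(0);
run;
/* Now resort across all users in preparation of selecting with equal opportunity for selection */
proc sort data=thirdpass out=fourthpass;
by userno;
run;
/* Select 100 random users, with one of several records still attached */
data want; set fourthpass(obs=100); run;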
If you want all or multiple records for the selected users, you can then merge this back against the original data set.
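A rough sketch of that merge-back step, in case it helps. It assumes the 100-user sample is in a dataset called want and the original claims are in have; picked, have_sorted and all_claims are just names made up for illustration.

```sas
/* Distinct list of the sampled users (want is assumed to hold the sample) */
proc sort data=want(keep=user) out=picked nodupkey;
  by user;
run;

/* MERGE with BY needs both inputs sorted by user */
proc sort data=have out=have_sorted;
  by user;
run;

/* Keep every original claim that belongs to a sampled user */
data all_claims;
  merge have_sorted(in=in_have) picked(in=in_picked);
  by user;
  if in_have and in_picked;
run;
```

A DATA step merge is used here rather than PROC SQL only to keep the variable name user away from SQL keyword handling; either approach should be workable.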
(And I am sure there is a way to do this in PROC SURVEYSELECT, but this is pretty fast, and pretty straightforward. I think someone can come up with something better than this. Despite all the sorting and subsetting, this took about 1.7 seconds of CPU time, and a little less real time due to multithreading.)
01-08-2013 03:48 PM
Thanks Steve! Your example was very intuitive and straightforward.
For anyone else reading this, I am looking for a SAS book that focuses entirely on sampling, as I will be doing a lot more of this type of work going forward.
01-08-2013 09:20 PM
Hi! If you are going to be involved in a lot of sampling then you should definitely get familiar with SURVEYSELECT. I don't have full knowledge of the procedure but, anyway, here is how I would accomplish your sampling:
data test;
do user = 1 to 30;
do j = 1 to 1 + rand("POISSON",3);
claim = user * 100 + j;
output; end; end;
drop j;
run;
/* Select some users' claims as clusters */
proc surveyselect data=test out=temp n=10;
cluster user;
run;
/* Select one claim per selected user */
proc surveyselect data=temp out=want n=1;
strata user;
run;
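For the parameters in the original question (100 users, identifier User), the same two-step idea might be sketched as below. This is only a sketch under assumptions: the input dataset name have and the seed values are made up for illustration, and it relies on a SAS/STAT release whose PROC SURVEYSELECT supports the CLUSTER statement.

```sas
/* Step 1: pick 100 users as clusters; every claim of a picked user is kept */
proc surveyselect data=have out=temp method=srs n=100 seed=20130108;
  cluster user;
run;

/* STRATA expects its input sorted by the strata variable */
proc sort data=temp;
  by user;
run;

/* Step 2: pick one claim at random within each selected user */
proc surveyselect data=temp out=want method=srs n=1 seed=20130109;
  strata user;
run;
```

Fixing SEED= makes the draw repeatable, which is handy if the sample ever has to be audited. Each user has the same chance of being picked in step 1, regardless of how many claims they processed.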