I've been reading through the documentation on this procedure and I cannot find (or perhaps recognize) the appropriate syntax to accommodate a specific sampling objective. I want to draw a random sample without replacement from a dataset that contains claims adjudicated by different users. Because some users process more claims than others, I want the sampling methodology to give each user an equal probability of being chosen for the random sample. (I do not need a specific number or proportion of observations from each user, as many of the documentation examples show.) This would remove the bias that arises when claims examiners who process more claims have a higher chance of being "randomly" chosen simply because more of their claim lines exist in the dataset.
I will provide a couple of arbitrary parameters so someone can illustrate the appropriate syntax.
Claimlines: 1,000,000
Random Sample Size: 100
Variable identifying the claims examiner: User
Thanks!
If you want equal probability of being selected based on User, why not do something like:
/* Make up some data */
data have;
    do user=1 to 1000;
        do claim=1 to 2000;
            flag=mod(user,2);
            output;
        end;
    end;
run;
/* Make it so that some of the users have more claims than others, so that this is a little more general */
data have;
    set have;
    if flag=0 and claim>1000 then delete;
run;
/* Give every record a unique random number */
/* This is the point where you can use your own "have" dataset */
data firstpass;
    call streaminit(111);
    set have;
    ranno=rand('UNIFORM');
run;
/* Sort now so that records are randomized within each user */
proc sort data=firstpass out=secondpass;
    by user ranno;
run;
/* Select the record for each user with the lowest random value */
data thirdpass;
    set secondpass;
    by user ranno;
    if first.user;
run;
/* Now re-sort across all users so that every user has an equal chance of selection */
proc sort data=thirdpass out=fourthpass;
    by ranno;
run;
/* Select 100 random users, each with one randomly chosen record still attached */
data want;
    set fourthpass;
    if _n_<=100;
run;
If you want all or multiple records for the selected users, you can then merge this back against the original data set.
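A minimal sketch of that merge, assuming the data sets "have" and "want" from the steps above. The names sampled_users and want_all are made up for illustration, and "have" is assumed to already be in user order (it is in the made-up data; sort your own data by user first if it is not).

```sas
/* Keep one row per sampled user, in user order (want holds one record per user) */
proc sort data=want(keep=user) out=sampled_users;
    by user;
run;

/* Bring back every claim that belongs to a sampled user */
/* Assumes have is sorted by user; add a PROC SORT step if yours is not */
data want_all;
    merge have(in=in_have) sampled_users(in=in_sample);
    by user;
    if in_have and in_sample;
run;
```

From want_all you could then draw more than one claim per user if you need it.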
(And I am sure there is a way to do this in PROC SURVEYSELECT, but this is pretty fast and pretty straightforward. I think someone can come up with something better than this. Despite all the sorting and subsetting, this took about 1.7 seconds of CPU time, and a little less real time due to multithreading.)
Steve Denham
Thanks Steve! Your example was very intuitive and straightforward.
For anyone else reading this, I am looking for a SAS book that focuses entirely on sampling, as I will be doing a lot more of this type of work going forward.
Thanks!
Hi! If you are going to be involved in a lot of sampling, then you should definitely get familiar with SURVEYSELECT. I don't have full knowledge of the procedure but, anyway, here is how I would accomplish your sampling:
/* Make up some data: 30 users, each with a random number of claims */
data test;
    call streaminit(3746);
    do user = 1 to 30;
        do j = 1 to 1 + rand("POISSON",3);
            claim = user * 100 + j;
            output;
        end;
    end;
    drop j;
run;
/* Select 10 users at random, treating each user's claims as one cluster */
proc surveyselect data=test out=temp n=10;
    samplingunit user;
run;
/* Select one claim per selected user */
proc surveyselect data=temp out=want n=1;
    strata user;
run;
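Adapted to the numbers in the original question (100 sampled claims, examiner identified by User), the same two steps might look like the sketch below. The data set names claims, claims_sorted and picked_users are hypothetical, the seed values are arbitrary, and the SEED= option is added only so that the sample can be reproduced. It assumes the data contain at least 100 distinct users.

```sas
/* claims is a stand-in name for the real claims data set */
/* Sort by user so the STRATA step below sees each user's rows together and in order */
proc sort data=claims out=claims_sorted;
    by user;
run;

/* Step 1: pick 100 users with equal probability, each user's claims forming one cluster */
proc surveyselect data=claims_sorted out=picked_users n=100 seed=1234;
    samplingunit user;
run;

/* Step 2: pick one claim at random within each selected user */
/* Assumes picked_users is still in user order; re-sort by user if it is not */
proc surveyselect data=picked_users out=want n=1 seed=5678;
    strata user;
run;
```

The result should hold 100 claim lines, one per selected user, no matter how many claims each user processed.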
PG