Proc Surveyselect

Data_Detective_23219 · Posted 01-08-2013 09:59 AM

I've been reading through the documentation on this procedure and cannot find (or perhaps recognize) the appropriate syntax to accommodate a specific sampling objective. I want to draw a random sample without replacement from a dataset of claims adjudicated by different users. Because some users process more claims than others, I want the sampling method to give each user an equal probability of being chosen (I do not need a specific number or proportion of observations from each user, as many of the documentation examples show). This removes the bias whereby claims examiners who process more claims have a higher chance of being "randomly" chosen simply because more of their claim lines exist in the dataset.

I will provide a couple of arbitrary parameters so someone can illustrate the appropriate syntax.

Claimlines: 1,000,000

Random Sample Size: 100

Variable identifying the claims examiner: User

Thanks!

SteveDenham · Posted 01-08-2013 01:32 PM

If you want equal probability of being selected based on User, why not do something like:

/* Make up some data */

data have;
do user=1 to 1000;
do claim=1 to 2000;
flag=mod(user,2);
output;
end;
end;
run;

/* Make it so that some of the users have more claims than others, so that this is a little more general */
data have;
set have;
if flag=0 and claim>1000 then delete;
run;

/* Give every record a unique random number */

/* This is the point where you can use your own "have" dataset */

data firstpass;

call streaminit(111);

set have;

ranno=rand('UNIFORM');

run;

/* Sort now so that records are randomized within each user */

proc sort data=firstpass out=secondpass;

by user ranno;

run;

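/* Select the record for each user with the lowest random value, */

/* then draw a fresh random number per user. Reusing ranno for the next sort would */

/* favor high-volume users, because the minimum of many uniform draws tends to be smaller */

data thirdpass;

call streaminit(222);

set secondpass;

by user ranno;

if first.user;

userno=rand('UNIFORM');

run;

/* Now re-sort across all users by the fresh number so that every user has an equal chance of selection */

proc sort data=thirdpass out=fourthpass;

by userno;

run;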

/* Select 100 random users, with one of several records still attached */

data want;

set fourthpass;

if _n_<=100;

run;

If you want all or multiple records for the selected users, you can then merge this back against the original data set.
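A minimal sketch of that merge-back step, assuming the have and want data sets created above and that user is the only key needed. The names selected_users, have_sorted, and want_all are made up for illustration:

```sas
/* Unique list of the selected users, sorted for the merge */
proc sort data=want(keep=user) out=selected_users nodupkey;
by user;
run;

/* MERGE with a BY statement needs both inputs sorted by user */
proc sort data=have out=have_sorted;
by user;
run;

/* Keep every claim whose user appears in the selected list */
data want_all;
merge have_sorted(in=in_have) selected_users(in=in_sel);
by user;
if in_have and in_sel;
run;
```

Only the user variable is kept from want, so no claim-level values in have get overwritten during the merge.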

(And I am sure there is a way to do this in PROC SURVEYSELECT, but this is pretty fast and pretty straightforward. I think someone can come up with something better than this. Despite all the sorting and subsetting, this took about 1.7 seconds of CPU time, and a little less real time due to multithreading.)

Steve Denham

Data_Detective_23219 · Posted 01-08-2013 03:48 PM

Thanks Steve! Your example was very intuitive and straightforward.

For anyone else reading this, I am looking for a SAS book that focuses entirely on sampling, since I will be doing a lot more of this type of work going forward.

Thanks!

PGStats · Posted 01-08-2013 09:20 PM

Hi! If you are going to be involved in a lot of sampling, then you should definitely get familiar with SURVEYSELECT. I don't know the procedure fully, but here is how I would accomplish your sampling:

/* Make up some data: 30 users, each with a varying number of claims */

data test;
call streaminit(3746);
do user = 1 to 30;
    do j = 1 to 1 + rand("POISSON",3);
        claim = user * 100 + j;
        output;
    end;
end;
drop j;
run;

/* Select 10 users at random, treating each user's claims as a cluster */

proc surveyselect data=test out=temp n=10;
samplingunit user;
run;

/* Select one claim per selected user */

proc surveyselect data=temp out=want n=1;
strata user;
run;
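
Scaled to the numbers in the original question, the same two-stage idea might look like the sketch below. It assumes the real data set is called claims (a hypothetical name), that it contains at least 100 distinct values of user, and that the seed values are arbitrary choices made only for reproducibility:

```sas
/* Stage 1: simple random sample of 100 users; all claims of each selected user are kept */
proc surveyselect data=claims out=sampled_users method=srs n=100 seed=20130108;
samplingunit user;
run;

/* STRATA processing expects the input grouped by user, so sort to be safe */
proc sort data=sampled_users;
by user;
run;

/* Stage 2: one randomly chosen claim line per selected user */
proc surveyselect data=sampled_users out=want method=srs n=1 seed=20130109;
strata user;
run;
```

Because the first stage samples users rather than claim lines, a user's chance of selection does not depend on how many claims that user processed.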

PG
