Solved: Re: PROC SURVEYSELECT When Strata Overlap

doesper · Posted 12-16-2016 04:22 PM

Let's say you have a group of individuals, each uniquely identified by the variable CID (customer ID). Now, you want to assign each customer to one or more groups based on several attributes (TRAITs). Since these groups are not mutually exclusive, how can you tell PROC SURVEYSELECT to not select the same CID twice when generating a random sample for each TRAIT? In the simple example program I've included here, you'll see that CIDs 2 and 4 have both TRAITs A and B, and sometimes the luck of the draw is that CID 2 and/or 4 are included in the samples for both traits A and B. I don't want that to happen. I know this is an easy problem to solve with data step programming and multiple passes through PROC SURVEYSELECT, but I was hoping to do this with a single pass through PROC SURVEYSELECT.

Thanks,

Dave

data groups;
   input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;

proc print data=groups;
   title 'groups';
run;

proc surveyselect data=groups out=groups_sample sampsize=2 selectall;
   strata trait;
run;

proc print data=groups_sample;
   title 'groups_sample';
run;

PGStats · Posted 12-22-2016 11:32 PM

Make the sampling probability proportional to the number of traits then:

data groups;
   input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;

data groups2;
set groups;
rnd = rand("uniform");
run;

proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;

proc surveyselect data=groups3 out=groups_sample 
    method=pps sampsize=2 selectall;
strata trait;
size n;
run;

I think there might be an equivalent way of doing this with cluster sampling.

PG

View solution in original post

ballardw · Posted 12-16-2016 04:31 PM

Use multiple levels for strata until they don't overlap. You can have more than one variable. You may be better off with some indicator variables like TraitA = 1 when it has trait=A and 0 other wise, TraitB and so on. Then Strata TraitA TraitB TraitC....

Though if you want different proportions of each stratum you may get to spend some time building either a Sampsize or Samprate dataset or the value list for teh Sampsize or Samprate option.

PGStats · Posted 12-17-2016 11:44 PM

Since a client cannot be more than once in your sample, pick a trait at random for every cid, then pick a stratified sample.

data groups2;
set groups;
rnd = rand("uniform");
run;

proc sort data=groups2; by cid rnd; run;

data groups3;
set groups2; by cid;
if first.cid;
drop rnd;
run;

proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
   strata trait;
run;

PG

doesper · Posted 12-22-2016 01:38 PM

Thanks, PGStats and ballardw. Using PGStats' approach gets me closer to the solution, but since CIDs 2 and 4 have two traits and the other CIDs only one, I want CID 2 and 4 to be twice as likely to be selected. In other words, weights. I was hoping I could use FREQ cid_weight in PROC SURVEYSELECT to do this, but, alas, it cannot be used in this manner.

Thanks,

Dave

data groups;
   input cid trait $ orders;
   rnd = rand('uniform');
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;

proc sql;
   create table groups_weights as
    select cid
          ,count(*) as cid_weight
    from groups
    group by cid
;quit;

proc sort data=groups out=groups2;
   by cid rnd;
run;

data groups3;
   merge groups2 (in=a) groups_weights (in=b);
   by cid;
   if first.cid;
   drop rnd;
run;

proc sort data=groups3;
   by trait cid;
run;

proc print data=groups3;
   title 'groups3';
run;

proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
   strata trait;
run;

proc print data=groups_sample;
   title 'groups_sample';
run;

PGStats · Posted 12-22-2016 11:32 PM

Make the sampling probability proportional to the number of traits then:

data groups;
   input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;

data groups2;
set groups;
rnd = rand("uniform");
run;

proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;

proc surveyselect data=groups3 out=groups_sample 
    method=pps sampsize=2 selectall;
strata trait;
size n;
run;

I think there might be an equivalent way of doing this with cluster sampling.

PG

PROC SURVEYSELECT When Strata Overlap

Re: PROC SURVEYSELECT When Strata Overlap

Re: PROC SURVEYSELECT When Strata Overlap

Re: PROC SURVEYSELECT When Strata Overlap

Re: PROC SURVEYSELECT When Strata Overlap

Re: PROC SURVEYSELECT When Strata Overlap

Click image to register for webinar

Classroom Training Available!