## PROC SURVEYSELECT When Strata Overlap

Solved
Occasional Contributor
Posts: 15

# PROC SURVEYSELECT When Strata Overlap

Let's say you have a group of individuals, each uniquely identified by the variable CID (customer ID).  Now, you want to assign each customer to one or more groups based on several attributes (TRAITs).  Since these groups are not mutually exclusive, how can you tell PROC SURVEYSELECT to not select the same CID twice when generating a random sample for each TRAIT?  In the simple example program I've included here, you'll see that CIDs 2 and 4 have both TRAITs A and B, and sometimes the luck of the draw is that CID 2 and/or 4 are included in the samples for both traits A and B.  I don't want that to happen.  I know this is an easy problem to solve with data step programming and multiple passes through PROC SURVEYSELECT, but I was hoping to do this with a single pass through PROC SURVEYSELECT.

Thanks,

Dave

``````data groups;
input cid trait \$ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;

proc print data=groups;
title 'groups';
run;

proc surveyselect data=groups out=groups_sample sampsize=2 selectall;
strata trait;
run;

proc print data=groups_sample;
title 'groups_sample';
run;``````

Accepted Solutions
Solution
‎01-03-2017 07:03 PM
Posts: 5,526

## Re: PROC SURVEYSELECT When Strata Overlap

Make the sampling probability proportional to the number of traits then:

``````data groups;
input cid trait \$ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;

data groups2;
set groups;
rnd = rand("uniform");
run;

proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;

proc surveyselect data=groups3 out=groups_sample
method=pps sampsize=2 selectall;
strata trait;
size n;
run;
``````

I think there might be an equivalent way of doing this with cluster sampling.

PG

All Replies
Super User
Posts: 13,528

## Re: PROC SURVEYSELECT When Strata Overlap

Use multiple levels for strata until they don't overlap. You can have more than one variable. You may be better off with some indicator variables like TraitA = 1 when it has trait=A and 0 other wise, TraitB and so on. Then Strata TraitA TraitB TraitC....

Though if you want different proportions of each stratum you may get to spend some time building either a Sampsize or Samprate dataset or the value  list for teh Sampsize or Samprate option.

Posts: 5,526

## Re: PROC SURVEYSELECT When Strata Overlap

Since a client cannot be more than once in your sample, pick a trait at random for every cid, then pick a stratified sample.

``````data groups2;
set groups;
rnd = rand("uniform");
run;

proc sort data=groups2; by cid rnd; run;

data groups3;
set groups2; by cid;
if first.cid;
drop rnd;
run;

proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
strata trait;
run;
``````
PG
Occasional Contributor
Posts: 15

## Re: PROC SURVEYSELECT When Strata Overlap

Thanks, PGStats and ballardw.  Using PGStats' approach gets me closer to the solution, but since CIDs 2 and 4 have two traits and the other CIDs only one, I want CID 2 and 4 to be twice as likely to be selected.  In other words, weights.  I was hoping I could use FREQ cid_weight in PROC SURVEYSELECT to do this, but, alas, it cannot be used in this manner.

Thanks,

Dave

``````data groups;
input cid trait \$ orders;
rnd = rand('uniform');
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;

proc sql;
create table groups_weights as
select cid
,count(*) as cid_weight
from groups
group by cid
;quit;

proc sort data=groups out=groups2;
by cid rnd;
run;

data groups3;
merge groups2 (in=a) groups_weights (in=b);
by cid;
if first.cid;
drop rnd;
run;

proc sort data=groups3;
by trait cid;
run;

proc print data=groups3;
title 'groups3';
run;

proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
strata trait;
run;

proc print data=groups_sample;
title 'groups_sample';
run;``` ```

Solution
‎01-03-2017 07:03 PM
Posts: 5,526

## Re: PROC SURVEYSELECT When Strata Overlap

Make the sampling probability proportional to the number of traits then:

``````data groups;
input cid trait \$ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;

data groups2;
set groups;
rnd = rand("uniform");
run;

proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;

proc surveyselect data=groups3 out=groups_sample
method=pps sampsize=2 selectall;
strata trait;
size n;
run;
``````

I think there might be an equivalent way of doing this with cluster sampling.

PG
☑ This topic is solved.