Let's say you have a group of individuals, each uniquely identified by the variable CID (customer ID). Now, you want to assign each customer to one or more groups based on several attributes (TRAITs). Since these groups are not mutually exclusive, how can you tell PROC SURVEYSELECT to not select the same CID twice when generating a random sample for each TRAIT? In the simple example program I've included here, you'll see that CIDs 2 and 4 have both TRAITs A and B, and sometimes the luck of the draw is that CID 2 and/or 4 are included in the samples for both traits A and B. I don't want that to happen. I know this is an easy problem to solve with data step programming and multiple passes through PROC SURVEYSELECT, but I was hoping to do this with a single pass through PROC SURVEYSELECT.
Thanks,
Dave
data groups;
input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;
proc print data=groups;
title 'groups';
run;
proc surveyselect data=groups out=groups_sample sampsize=2 selectall;
strata trait;
run;
proc print data=groups_sample;
title 'groups_sample';
run;
Make the sampling probability proportional to the number of traits then:
data groups;
input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;
data groups2;
set groups;
rnd = rand("uniform");
run;
proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;
proc surveyselect data=groups3 out=groups_sample
method=pps sampsize=2 selectall;
strata trait;
size n;
run;
I think there might be an equivalent way of doing this with cluster sampling.
Use multiple levels for strata until they don't overlap. You can have more than one variable. You may be better off with some indicator variables like TraitA = 1 when it has trait=A and 0 other wise, TraitB and so on. Then Strata TraitA TraitB TraitC....
Though if you want different proportions of each stratum you may get to spend some time building either a Sampsize or Samprate dataset or the value list for teh Sampsize or Samprate option.
Since a client cannot be more than once in your sample, pick a trait at random for every cid, then pick a stratified sample.
data groups2;
set groups;
rnd = rand("uniform");
run;
proc sort data=groups2; by cid rnd; run;
data groups3;
set groups2; by cid;
if first.cid;
drop rnd;
run;
proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
strata trait;
run;
Thanks, PGStats and ballardw. Using PGStats' approach gets me closer to the solution, but since CIDs 2 and 4 have two traits and the other CIDs only one, I want CID 2 and 4 to be twice as likely to be selected. In other words, weights. I was hoping I could use FREQ cid_weight in PROC SURVEYSELECT to do this, but, alas, it cannot be used in this manner.
Thanks,
Dave
data groups;
input cid trait $ orders;
rnd = rand('uniform');
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;
proc sql;
create table groups_weights as
select cid
,count(*) as cid_weight
from groups
group by cid
;quit;
proc sort data=groups out=groups2;
by cid rnd;
run;
data groups3;
merge groups2 (in=a) groups_weights (in=b);
by cid;
if first.cid;
drop rnd;
run;
proc sort data=groups3;
by trait cid;
run;
proc print data=groups3;
title 'groups3';
run;
proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
strata trait;
run;
proc print data=groups_sample;
title 'groups_sample';
run;
Make the sampling probability proportional to the number of traits then:
data groups;
input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;
data groups2;
set groups;
rnd = rand("uniform");
run;
proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;
proc surveyselect data=groups3 out=groups_sample
method=pps sampsize=2 selectall;
strata trait;
size n;
run;
I think there might be an equivalent way of doing this with cluster sampling.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.