doesper
Obsidian | Level 7

Let's say you have a group of individuals, each uniquely identified by the variable CID (customer ID).  Now, you want to assign each customer to one or more groups based on several attributes (TRAITs).  Since these groups are not mutually exclusive, how can you tell PROC SURVEYSELECT to not select the same CID twice when generating a random sample for each TRAIT?  In the simple example program I've included here, you'll see that CIDs 2 and 4 have both TRAITs A and B, and sometimes the luck of the draw is that CID 2 and/or 4 are included in the samples for both traits A and B.  I don't want that to happen.  I know this is an easy problem to solve with data step programming and multiple passes through PROC SURVEYSELECT, but I was hoping to do this with a single pass through PROC SURVEYSELECT.

 

Thanks,

 

Dave

 

data groups;
   input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;
run;

proc print data=groups;
   title 'groups';
run;

proc surveyselect data=groups out=groups_sample sampsize=2 selectall;
   strata trait;
run;

proc print data=groups_sample;
   title 'groups_sample';
run;
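
To see the overlap described above, one option is to count how often each CID shows up in a sample. This is only a sketch: `sample_demo` is a hand-made, hypothetical sample (not real PROC SURVEYSELECT output) in which CID 2 was drawn for both traits, and `repeated_cids` is a made-up table name.

```sas
/* Hypothetical sample: CID 2 was drawn for trait A and for trait B. */
data sample_demo;
   input trait $ cid orders;
datalines;
A 2 4
A 5 10
B 2 4
B 7 18
;
run;

/* Keep only the CIDs that appear in more than one stratum. */
proc sql;
   create table repeated_cids as
   select cid, count(*) as times_selected
   from sample_demo
   group by cid
   having count(*) > 1;
quit;

proc print data=repeated_cids;
   title 'CIDs selected for more than one trait';
run;
```

For this demo data the table holds a single row, CID 2 with times_selected = 2. Pointing the FROM clause at groups_sample instead would run the same check on a real draw; an empty result means no CID was selected twice.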

 

ACCEPTED SOLUTION
PGStats
Opal | Level 21

Make the sampling probability proportional to the number of traits then:

 

data groups;
   input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;

data groups2;
set groups;
rnd = rand("uniform");
run;

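proc sql;
/* Keep one randomly chosen row per cid; n = number of traits for that cid. */
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd)
order by trait, cid; /* the STRATA statement expects input sorted by trait */
quit;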

proc surveyselect data=groups3 out=groups_sample 
    method=pps sampsize=2 selectall;
strata trait;
size n;
run;

I think there might be an equivalent way of doing this with cluster sampling.

PG
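
As a rough check on the "proportional to the number of traits" idea: with METHOD=PPS, the chance that a unit is selected should be about sampsize * n / (sum of n in its stratum), provided that value does not exceed 1. The sketch below computes that figure for one possible outcome of the random trait assignment. `groups3_demo` and `pps_check` are hypothetical names, and the rows are hand-made (CID 2 landed in trait A, CID 4 in trait B).

```sas
/* One row per cid after a hypothetical random trait assignment. */
data groups3_demo;
   input cid trait $ n;
datalines;
1 A 1
2 A 2
3 A 1
5 A 1
4 B 2
6 B 1
7 B 1
;
run;

/* Expected selection probability for sampsize=2 in each stratum. */
/* PROC SQL remerges sum(n) back onto every row of the stratum.   */
proc sql;
   create table pps_check as
   select cid, trait, n,
          2 * n / sum(n) as expected_prob
   from groups3_demo
   group by trait
   order by trait, cid;
quit;

proc print data=pps_check;
   title 'Expected selection probabilities under PPS';
run;
```

In stratum A this gives 0.8 for CID 2 and 0.4 for CIDs 1, 3 and 5, so CID 2 is twice as likely to be drawn. In stratum B, CID 4 gets 1.0 and CIDs 6 and 7 get 0.5. These values should be comparable to the SelectionProb variable that PROC SURVEYSELECT writes to the OUT= data set.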

4 REPLIES
ballardw
Super User

Use multiple strata variables until the strata no longer overlap; the STRATA statement accepts more than one variable. You may be better off with indicator variables, such as TraitA = 1 when the customer has trait A and 0 otherwise, then TraitB and so on. The statement then becomes STRATA TraitA TraitB TraitC ...

Though if you want different proportions from each stratum, you may have to spend some time building either a SAMPSIZE= or SAMPRATE= data set, or the value list for the SAMPSIZE= or SAMPRATE= option.
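
A minimal sketch of the indicator-variable idea, assuming only traits A and B as in the question. The data set names (`groups_demo`, `cid_flags`, `flags_sample`) and the SEED= value are made up for illustration; `groups_demo` repeats the rows of the original GROUPS data so the block can run on its own.

```sas
data groups_demo;
   input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;
run;

proc sort data=groups_demo;
   by cid;
run;

/* Collapse to one row per cid with a 0/1 flag for each trait. */
data cid_flags;
   set groups_demo;
   by cid;
   retain traitA traitB;
   if first.cid then do;
      traitA = 0;
      traitB = 0;
   end;
   if trait = 'A' then traitA = 1;
   else if trait = 'B' then traitB = 1;
   if last.cid then output;
   keep cid traitA traitB;
run;

/* The strata are now the flag combinations, which cannot overlap. */
proc sort data=cid_flags;
   by traitA traitB;
run;

/* SEED= is an arbitrary value, used only to make the draw repeatable. */
proc surveyselect data=cid_flags out=flags_sample
                  sampsize=2 selectall seed=12345;
   strata traitA traitB;
run;
```

With this data there are three strata: B only (CIDs 6 and 7), A only (CIDs 1, 3 and 5), and both traits (CIDs 2 and 4). Each CID sits in exactly one stratum, so it cannot be selected twice. The trade-off is that the samples are now per combination of traits rather than per trait.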

PGStats
Opal | Level 21

Since a client cannot appear more than once in your sample, pick one trait at random for every cid, then draw a stratified sample.

 

data groups2;
set groups;
rnd = rand("uniform");
run;

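proc sort data=groups2; by cid rnd; run;

/* Keep one randomly chosen trait (the row with the smallest rnd) per cid. */
data groups3;
set groups2; by cid;
if first.cid;
drop rnd;
run;

/* The STRATA statement expects the input sorted by the strata variable. */
proc sort data=groups3; by trait cid; run;

proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
   strata trait;
run;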
PG
doesper
Obsidian | Level 7

Thanks, PGStats and ballardw.  Using PGStats' approach gets me closer to the solution, but since CIDs 2 and 4 have two traits and the other CIDs only one, I want CID 2 and 4 to be twice as likely to be selected.  In other words, weights.  I was hoping I could use FREQ cid_weight in PROC SURVEYSELECT to do this, but, alas, it cannot be used in this manner.

 

Thanks,

 

Dave

 

data groups;
   input cid trait $ orders;
   rnd = rand('uniform');
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;
run;

proc sql;
   create table groups_weights as
    select cid
          ,count(*) as cid_weight
    from groups
    group by cid
;quit;

proc sort data=groups out=groups2;
   by cid rnd;
run;

data groups3;
   merge groups2 (in=a) groups_weights (in=b);
   by cid;
   if first.cid;
   drop rnd;
run;

proc sort data=groups3;
   by trait cid;
run;

proc print data=groups3;
   title 'groups3';
run;

proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
   strata trait;
run;

proc print data=groups_sample;
   title 'groups_sample';
run; 

 
