turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- PROC SURVEYSELECT When Strata Overlap

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

12-16-2016 04:22 PM

Let's say you have a group of individuals, each uniquely identified by the variable CID (customer ID). Now, you want to assign each customer to one or more groups based on several attributes (TRAITs). Since these groups are not mutually exclusive, how can you tell PROC SURVEYSELECT to not select the same CID twice when generating a random sample for each TRAIT? In the simple example program I've included here, you'll see that CIDs 2 and 4 have both TRAITs A and B, and sometimes the luck of the draw is that CID 2 and/or 4 are included in the samples for both traits A and B. I don't want that to happen. I know this is an easy problem to solve with data step programming and multiple passes through PROC SURVEYSELECT, but I was hoping to do this with a single pass through PROC SURVEYSELECT.

Thanks,

Dave

```
data groups;
input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;
proc print data=groups;
title 'groups';
run;
proc surveyselect data=groups out=groups_sample sampsize=2 selectall;
strata trait;
run;
proc print data=groups_sample;
title 'groups_sample';
run;
```

Accepted Solutions

Solution

01-03-2017
07:03 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to doesper

12-22-2016 11:32 PM

Make the sampling probability proportional to the number of traits then:

```
data groups;
input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;
data groups2;
set groups;
rnd = rand("uniform");
run;
proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;
proc surveyselect data=groups3 out=groups_sample
method=pps sampsize=2 selectall;
strata trait;
size n;
run;
```

I think there might be an equivalent way of doing this with cluster sampling.

PG

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to doesper

12-16-2016 04:31 PM

Use multiple levels for strata until they don't overlap. You can have more than one variable. You may be better off with some indicator variables like TraitA = 1 when it has trait=A and 0 other wise, TraitB and so on. Then Strata TraitA TraitB TraitC....

Though if you want different proportions of each stratum you may get to spend some time building either a Sampsize or Samprate dataset or the value list for teh Sampsize or Samprate option.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to doesper

12-17-2016 11:44 PM

Since a client cannot be more than once in your sample, pick a trait at random for every cid, then pick a stratified sample.

```
data groups2;
set groups;
rnd = rand("uniform");
run;
proc sort data=groups2; by cid rnd; run;
data groups3;
set groups2; by cid;
if first.cid;
drop rnd;
run;
proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
strata trait;
run;
```

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

12-22-2016 01:38 PM

Thanks, PGStats and ballardw. Using PGStats' approach gets me closer to the solution, but since CIDs 2 and 4 have two traits and the other CIDs only one, I want CID 2 and 4 to be twice as likely to be selected. In other words, weights. I was hoping I could use FREQ cid_weight in PROC SURVEYSELECT to do this, but, alas, it cannot be used in this manner.

Thanks,

Dave

```
data groups;
input cid trait $ orders;
rnd = rand('uniform');
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
run;
proc sql;
create table groups_weights as
select cid
,count(*) as cid_weight
from groups
group by cid
;quit;
proc sort data=groups out=groups2;
by cid rnd;
run;
data groups3;
merge groups2 (in=a) groups_weights (in=b);
by cid;
if first.cid;
drop rnd;
run;
proc sort data=groups3;
by trait cid;
run;
proc print data=groups3;
title 'groups3';
run;
proc surveyselect data=groups3 out=groups_sample sampsize=2 selectall;
strata trait;
run;
proc print data=groups_sample;
title 'groups_sample';
run;
```

Solution

01-03-2017
07:03 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to doesper

12-22-2016 11:32 PM

Make the sampling probability proportional to the number of traits then:

```
data groups;
input cid trait $ orders;
datalines;
1 A 2
2 A 4
3 A 6
4 A 8
5 A 10
2 B 4
4 B 8
6 B 16
7 B 18
;
data groups2;
set groups;
rnd = rand("uniform");
run;
proc sql;
create table groups3 as
select cid, trait, orders, count(*) as n
from groups2
group by cid
having rnd = min(rnd);
quit;
proc surveyselect data=groups3 out=groups_sample
method=pps sampsize=2 selectall;
strata trait;
size n;
run;
```

I think there might be an equivalent way of doing this with cluster sampling.

PG