Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- How to get unequal sized stratified groups from PROC SURVEYSELECT?

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 04-29-2019 06:56 PM
(3302 views)

Hi!

I have a very simple sampling request. I want to randomly partition a dataset into two groups, stratified on a binary variable. The original data has 1000 observations, 300 with binvar=0, 700 with binvar=1. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=2;

STRATA binvar;

RUN;

I get two groups, identified in the output data set by the variable GroupID. Each group has 500 observations, 150 with binvar=0 and 350 with binvar=1.

Now, how do I get the groups to be of different sizes but still stratified? I want one group of size 600, with 180 binvar=0 and 420 binvar=1, and a second group of size 400, with 120 binvar=0 and 280 binvar=1. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=(600,400);

RUN;

I get the two right size groups, but not the exact binvar counts in each one. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=(600,400);

STRATA binvar;

RUN;

I get an error message:

ERROR: The sum of the GROUPS= values must equal the total number of units.

NOTE: The above message was for the following stratum:

binvar=0.

ERROR: The sum of the GROUPS= values must equal the total number of units.

NOTE: The above message was for the following stratum:

binvar=1.

Can PROC SURVEYSELECT do what I want?

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

For two groups, you can use N=(180,420) together with the OUTALL option (which includes all observations, selected and not selected, in the output data set).

```
proc surveyselect data=inds out=outds n=(180,420) outall;
strata binvar;
```

For binvar=0, this selects 180 obs for one group (output data set variable 'Selected'=1) and the remaining 120 obs ('Selected'=0) for the other group..

For binvar=1, this selects 420 obs for one group and the remaining 280 obs for the other group.

7 REPLIES 7

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

For strata you need to add the allocation option - check the docs for the exact syntax. And...because you’re essentially doing two reps with different sizes, not two groups the easiest solution I can see is multiple PROC SURVEYSELECT calls.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Seems like the GROUPS=(n1 n2) option in SURVEYSELECT requires all strata to be the same size. Probably simpler to use a data step program:

```
data class;
call streaminit(768631);
set sashelp.class;
rnd = rand("uniform");
run;
proc sort data=class; by sex rnd; run;
data sample;
do _i = 1 by 1 until(last.sex);
set class; by sex;
end;
do _j = 1 by 1 until(last.sex);
set class; by sex;
grp = (_j / _i) > 0.3;
output;
end;
drop rnd _i _j;
run;
```

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

For two groups, you can use N=(180,420) together with the OUTALL option (which includes all observations, selected and not selected, in the output data set).

```
proc surveyselect data=inds out=outds n=(180,420) outall;
strata binvar;
```

For binvar=0, this selects 180 obs for one group (output data set variable 'Selected'=1) and the remaining 120 obs ('Selected'=0) for the other group..

For binvar=1, this selects 420 obs for one group and the remaining 280 obs for the other group.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you @Watts !

That does the trick. Thank you also to @Reeza and @PGStats for contributing.

But the method of choosing one group and then doing OUTALL to get the rest to come along only works for two unequal sized stratified groups. I'm curious about whether there is a way to get more than two unequal sized groups while maintaining the strata proportions.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Top_Katz wrote:

Thank you @Watts !

That does the trick. Thank you also to @Reeza and @PGStats for contributing.

But the method of choosing one group and then doing OUTALL to get the rest to come along only works for two unequal sized stratified groups. I'm curious about whether there is a way to get more than two unequal sized groups while maintaining the strata proportions.

Details about what you might mean by "unequal group: and "maintaining strata proportions".

You can specify a sample size (n or sampsize), number of observations, or a sample rate (samprate or rate) for each level of your STRATA variable. If the strata variable has 5 levels and I want to sample at 20%, 15%, 33%, 10% and 18% for the strata then

samprate = (20 15 33 10 18) or samprate=(.2 .15 .33 .1 .18) would work.

If you have two strata variables the listing order gets a bit more complicated but each combination of the strata variables gets a single rate.

Or N if you want to sample 5 different numbers of observations.

There are also some options on the STRATA statement such as ALLOC or ALLOCMIN or MARGIN that may be of interest.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi @ballardw !

Thank you for responding. My strata variable, binvar, has two levels. The original data set has 1,000 records, 300 with binvar=0 and 700 with binvar=1. Suppose I want to create three sample groups. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=3;

STRATA binvar;

RUN;

then I get two groups with 100 binvar=0 and 233 binvar=1, and a third group with 100 binvar=0 and 234 binvar=1. It tries to make the groups of equal size, and it maintains the overall strata variable proportionality within each group.

But suppose I want three groups:

group one with 500 total records, 150 binvar=0 and 350 binvar=1

group two with 300 total records, 90 binvar=0 and 210 binvar=1

group three with 200 total records, 60 binvar=0 and 140 binvar=1

Can I do that with one call to PROC SURVEYSELECT? Or do I have to do successive calls? For example, I could do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS1 GROUPS=2;

STRATA binvar;

RUN;

PROC SURVEYSELECT DATA=OUTDS1 (WHERE=(GroupID=2)) OUT=OUTDS2 N=(90,21) OUTALL;

STRATA binvar;

RUN;

DATA OUTDS;

SET OUTDS1 (WHERE=(GroupID=1)) OUTDS2;

GroupID=IFN((Selected=0),3,GroupID);

RUN;

This gives me the three groups I sought, but it's clunky. Is it possible to do all this with one PROC SURVEYSELECT call?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

In its current version, proc surveyselect isn't capable of handling your request for more than 2 groups. However, the workaround I provided can easily be extended to three or more groups. For example :

`grp = ((_j / _i) > 0.3) + ((_j / _i) > 0.6);`

PG

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. **Registration is now open through August 30th**. Visit the SAS Hackathon homepage.

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.