BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Top_Katz
Quartz | Level 8

Hi!

 

I have a very simple sampling request.  I want to randomly partition a dataset into two groups, stratified on a binary variable.  The original data has 1000 observations, 300 with binvar=0, 700 with binvar=1.  If I do:

 

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=2;

STRATA binvar;

RUN;

 

I get two groups, identified in the output data set by the variable GroupID.  Each group has 500 observations, 150 with binvar=0 and 350 with binvar=1.

 

Now, how do I get the groups to be of different sizes but still stratified?  I want one group of size 600, with 180 binvar=0 and 420 binvar=1, and a second group of size 400, with 120 binvar=0 and 280 binvar=1.  If I do:

 

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=(600,400);

RUN;

 

I get the two right size groups, but not the exact binvar counts in each one.  If I do:

 

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=(600,400);

STRATA binvar;

RUN;

 

I get an error message:

 

ERROR: The sum of the GROUPS= values must equal the total number of units.
NOTE: The above message was for the following stratum:
binvar=0.
ERROR: The sum of the GROUPS= values must equal the total number of units.
NOTE: The above message was for the following stratum:
binvar=1.

 

Can PROC SURVEYSELECT do what I want?

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Watts
SAS Employee

For two groups, you can use N=(180,420) together with the OUTALL option (which includes all observations, selected and not selected, in the output data set).

 

proc surveyselect data=inds out=outds n=(180,420) outall;
strata binvar;

 

For binvar=0, this selects 180 obs for one group (output data set variable 'Selected'=1) and the remaining 120 obs ('Selected'=0) for the other group..

 

For binvar=1, this selects 420 obs for one group and the remaining 280 obs for the other group.

 

View solution in original post

7 REPLIES 7
Reeza
Super User
For strata you need to add the allocation option - check the docs for the exact syntax. And...because you’re essentially doing two reps with different sizes, not two groups the easiest solution I can see is multiple PROC SURVEYSELECT calls.
PGStats
Opal | Level 21

Seems like the GROUPS=(n1 n2) option in SURVEYSELECT requires all strata to be the same size. Probably simpler to use a data step program:

 

data class;
call streaminit(768631);
set sashelp.class;
rnd = rand("uniform");
run;

proc sort data=class; by sex rnd; run;

data sample;
do _i = 1 by 1 until(last.sex); 
    set class; by sex;
    end;
do _j = 1 by 1 until(last.sex);
    set class; by sex;
    grp = (_j / _i) > 0.3;
    output;
    end;
drop rnd _i _j;
run;
PG
Watts
SAS Employee

For two groups, you can use N=(180,420) together with the OUTALL option (which includes all observations, selected and not selected, in the output data set).

 

proc surveyselect data=inds out=outds n=(180,420) outall;
strata binvar;

 

For binvar=0, this selects 180 obs for one group (output data set variable 'Selected'=1) and the remaining 120 obs ('Selected'=0) for the other group..

 

For binvar=1, this selects 420 obs for one group and the remaining 280 obs for the other group.

 

Top_Katz
Quartz | Level 8

Thank you @Watts !

 

That does the trick.  Thank you also to @Reeza and @PGStats for contributing.

 

But the method of choosing one group and then doing OUTALL to get the rest to come along only works for two unequal sized stratified groups.  I'm curious about whether there is a way to get more than two unequal sized groups while maintaining the strata proportions.

ballardw
Super User

@Top_Katz wrote:

Thank you @Watts !

 

That does the trick.  Thank you also to @Reeza and @PGStats for contributing.

 

But the method of choosing one group and then doing OUTALL to get the rest to come along only works for two unequal sized stratified groups.  I'm curious about whether there is a way to get more than two unequal sized groups while maintaining the strata proportions.


Details about what you might mean by "unequal group: and "maintaining strata proportions".

You can specify a sample size (n or sampsize), number of observations, or a sample rate (samprate or rate) for each level of your STRATA variable. If the strata variable has 5 levels and I want to sample at 20%, 15%, 33%, 10% and 18% for the strata then

samprate = (20 15 33 10 18) or samprate=(.2 .15 .33 .1 .18) would work.

If you have two strata variables the listing order gets a bit more complicated but each combination of the strata variables gets a single rate.

Or N if you want to sample 5 different numbers of observations.

There are also some options on the STRATA statement such as ALLOC or ALLOCMIN or MARGIN that may be of interest.

Top_Katz
Quartz | Level 8

Hi @ballardw !

 

Thank you for responding.  My strata variable, binvar, has two levels.  The original data set has 1,000 records, 300 with binvar=0 and 700 with binvar=1.  Suppose I want to create three sample groups.  If I do:


PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=3;

STRATA binvar;

RUN;

 

then I get two groups with 100 binvar=0 and 233 binvar=1, and a third group with 100 binvar=0 and 234 binvar=1.  It tries to make the groups of equal size, and it maintains the overall strata variable proportionality within each group.

 

But suppose I want three groups:

group one with 500 total records, 150 binvar=0 and 350 binvar=1

group two with 300 total records, 90 binvar=0 and 210 binvar=1

group three with 200 total records, 60 binvar=0 and 140 binvar=1

 

Can I do that with one call to PROC SURVEYSELECT?  Or do I have to do successive calls?  For example, I could do:

 

 

PROC SURVEYSELECT DATA=INDS OUT=OUTDS1 GROUPS=2;

STRATA binvar;

RUN;

 

PROC SURVEYSELECT DATA=OUTDS1 (WHERE=(GroupID=2)) OUT=OUTDS2 N=(90,21) OUTALL;

STRATA binvar;

RUN;

 

DATA OUTDS;

SET OUTDS1 (WHERE=(GroupID=1)) OUTDS2;

GroupID=IFN((Selected=0),3,GroupID);

RUN;

 

 

This gives me the three groups I sought, but it's clunky.  Is it possible to do all this with one PROC SURVEYSELECT call?

 

PGStats
Opal | Level 21

In its current version, proc surveyselect isn't capable of handling your request for more than 2 groups. However, the workaround I provided can easily be extended to three or more groups. For example :

 

grp = ((_j / _i) > 0.3) + ((_j / _i) > 0.6);
PG

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 3139 views
  • 5 likes
  • 5 in conversation