Solved: Stratified sampling

Kraussvan · Posted 04-04-2023 06:57 PM

Hello,

I have what I think is a simple task, or, a set of simple tasks all together and I can't figure out how to do it.

a) Imagine that I have one table with 2 columns: "ID", "cluster" (cluster={1,2}) with 200 rows.

b) I want a new table with: "ID","cluster", "GROUP". that is composed by 3 samples of random rows, with "strata" by cluster, BUT:

-I want different sizes for each cluster

-I want to determine that sizes (not with percentage)

-The ID's are not replaceable (they are key in the first and last table)

In the end, I will get some counts like this:

GROUP:1 CLUSTER:1 Number of ID's: 15

GROUP:1 CLUSTER:2 Number of ID's: 25

GROUP:2 CLUSTER:1 Number of ID's: 15

GROUP:2 CLUSTER:2 Number of ID's: 25

GROUP:3 CLUSTER:1 Number of ID's: 15

GROUP:3 CLUSTER:2 Number of ID's: 25

The closer I got was using the "proc surveyselect", but can't guarantee that ID's are not repeated and can't choose different sizes for different cluster.

Can anyone out there help me, please?

Thank you in advance

KV

FreelanceReinh · Posted 04-05-2023 07:24 AM

Hello @Kraussvan and welcome to the SAS Support Communities!

You can also use the GROUPS= option of the PROC SURVEYSELECT statement to perform the group assignments:

/* Create example data for demonstration */

data have;
do id=1 to 200;
  cluster=2-~mod(id,3);
  output;
end;
run; /* 66 obs. in cluster 1, 134 in cluster 2 */

/* Draw stratified random sample of
     3*15=45 obs. from cluster 1,
     3*25=75 obs. from cluster 2 */

proc sort data=have;
by cluster;
run;

proc surveyselect data=have
method=srs n=(45 75)
seed=2718 out=samp;
strata cluster;
run;

/* Randomly assign cluster samples to three groups */

proc surveyselect data=samp groups=3
seed=3141 out=want(drop=s:);
strata cluster;
run;

/* Verify that frequencies match the requirement */

proc freq data=want;
tables groupID*cluster / list;
run;

View solution in original post

Ksharp · Posted 04-05-2023 06:24 AM

You want this ?

data have;
call streaminit(123);
do id=1 to 200;
 cluster=rand('bern',0.5)+1;
 output;
end;
run;


proc sort data=have out=have1;by cluster;run;
proc surveyselect data=have1 out=temp1 seed=123 sampsize=(15 25);
strata  cluster;
run;
data want1;
 set temp1;
 group=1;
 keep group cluster id;
run;


proc sql;
create table have2 as
select * from have1 where id not in (select id from want1);
quit;
proc surveyselect data=have2 out=temp2 seed=123 sampsize=(15 25);
strata  cluster;
run;
data want2;
 set temp2;
 group=2;
 keep group cluster id;
run;



proc sql;
create table have3 as
select * from have2 where id not in (select id from want2);
quit;
proc surveyselect data=have3 out=temp3 seed=123 sampsize=(15 25);
strata  cluster;
run;
data want3;
 set temp3;
 group=3;
 keep group cluster id;
run;


data all;
 set want1-want3;
run;
proc freq data=all;
table group*cluster/list;
run;

Kraussvan · Posted 04-05-2023 09:53 AM

Hi Ksharp,
Thanks for your willingness to help. Your approach is to build blocks that eventually come together and use "surveyselect" only for random choice. My question is more about how to get the prerequisites compactly within the "surveyselect".
But it works, no question about it. Thanks a lot for your time!

FreelanceReinh · Posted 04-05-2023 07:24 AM

Hello @Kraussvan and welcome to the SAS Support Communities!

You can also use the GROUPS= option of the PROC SURVEYSELECT statement to perform the group assignments:

/* Create example data for demonstration */

data have;
do id=1 to 200;
  cluster=2-~mod(id,3);
  output;
end;
run; /* 66 obs. in cluster 1, 134 in cluster 2 */

/* Draw stratified random sample of
     3*15=45 obs. from cluster 1,
     3*25=75 obs. from cluster 2 */

proc sort data=have;
by cluster;
run;

proc surveyselect data=have
method=srs n=(45 75)
seed=2718 out=samp;
strata cluster;
run;

/* Randomly assign cluster samples to three groups */

proc surveyselect data=samp groups=3
seed=3141 out=want(drop=s:);
strata cluster;
run;

/* Verify that frequencies match the requirement */

proc freq data=want;
tables groupID*cluster / list;
run;

Kraussvan · Posted 04-05-2023 10:02 AM

Hi FreelanceReinh,

This approach is much closer to what I was looking for.
The "surveyselect" parameters already ensure non-replacement and division by stratified groups.
The only improvement that could be made (I don't know if it's possible) is in the "surveyselect" that creates the 3 groups using a table with a greater number of records but it chooses only the number that is intended. (15 20 per group).
Briefly, put together what the 1st and 2nd "surveyselect" does.

Regardless of your answer, thank you for your time and availability and congratulations on the elegance of the solution.

Best 
KV

FreelanceReinh · Posted 04-05-2023 10:37 AM

@Kraussvan wrote:

The only improvement that could be made (I don't know if it's possible) is in the "surveyselect" that creates the 3 groups using a table with a greater number of records but it chooses only the number that is intended. (15 20 per group).
Briefly, put together what the 1st and 2nd "surveyselect" does.

Good idea. Indeed, you can get the desired result with only one PROC SURVEYSELECT step (using the HAVE dataset, sorted by CLUSTER) -- plus a simple post-processing DATA step:

proc surveyselect data=have
method=srs n=(45 75)
seed=2718 out=samp(drop=s:) outrandom;
strata cluster;
run;

data want;
set samp;
groupID=mod(_n_,3)+1;
run;

Kraussvan · Posted 04-05-2023 12:20 PM

😁

Basically, you use a second step to create the 3 groups. You just don't do it using "surveyselect".
My intention was, in the first "surveyselect", to create the group flag.

Thank you anyway

FreelanceReinh · Posted 04-05-2023 01:23 PM

You're welcome. I don't think you can get the group labels from PROC SURVEYSELECT in only one step from the sorted HAVE dataset (unless there's a corresponding new feature in more recent SAS versions; I'm using SAS/STAT 14.3).

But, in a sense, the new solution creates the groups in the first step. They are just not labeled. The OUTRANDOM option is the key. In the earlier solution the group assignment required a genuine PROC SURVEYSELECT capability (randomization) and a random seed had to be provided. The DATA step in the new solution, however, does not involve randomization. It just applies a (deterministic) function to the observation numbers in order to obtain the group labels. This function (or a similar function for that purpose) could have been defined in advance so that the observation numbers would imply the group labels. Assigning them explicitly in a DATA step would then be merely a matter of convenience.

ballardw · Posted 04-05-2023 08:49 AM

"The closer I got was using the "proc surveyselect", but can't guarantee that ID's are not repeated and can't choose different sizes for different cluster."

Depends on your data. Are any Id values associated with different cluster values? Repeated within a cluster?

You might share the code you attempted. Best would be to include data that behaves like your working set.

And yes you can select different sample sizes for different strata.

if your Strata variable has 3 levels then you use something like SAMPSIZE=(10 15 8). Which will select 10 observations from the first strata, 15 from the second and 8 from the third. OR use SAMPSIZE= <somedataset> that if structured correctly will also select different numbers of observations per strata.

-The ID's are not replaceable (they are key in the first and last table)

That is null content because you have not provided any details about any "first" or "last" table or how they are used or relate to the problem. And what do you mean by "not replaceable"?

Kraussvan · Posted 04-05-2023 10:07 AM

Hi Ballardw,

"Depends on your data. Are any Id values associated with different cluster values? Repeated within a cluster?"

No, ID are key on the input data set.

"You might share the code you attempted. Best would be to include data that behaves like your working set."

Ksharp solution was pretty similar to my aproach.

"And yes you can select different sample sizes for different strata."

Yes, the problem is to make all groups

"-The ID's are not replaceable (they are key in the first and last table)"

Sorry, I mean the source data set and the final output.

Thank you for your time.

Best

KV

Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

Re: Stratified sampling

SAS Innovate 2025: Save the Date