This is a general problem that comes up frequently for me, and I am in search of a more elegant solution.
To illustrate, consider a simplified example where we have 35 observations distributed across 11 groups. Nine of the groups have 3 observations, and two of the groups have 4 observations. Here is what the data we have looks like:
proc plan seed=34721; factors group=11 / noprint; output out=have; run; data have(drop=size); set have curobs=tmp; if tmp<=2 then size = 4; else size = 3; do observation = 1 to size; output; end; run; proc sort data=have; by group; run;
We want to randomly remove 15 observations from these groups, under the constraint that we cannot have only ONE observation remaining in any group. So, for example, for a group with 3 observations in it, we can either remove 1 observation from it (leaving 2 behind) or remove all 3 observations (eliminating the group entirely), but we cannot remove 2 observations from it (leaving only 1 behind).
The way we have done this in the past would be to do something like the following: randomly remove 2 observations from each group of 4, then randomly remove 1 observation from each group of 3, giving us 13 removed observations and all groups with exactly 2 observations remaining. We then randomly select a group from which to remove the final 2 observations, giving us our desired 15 (and only eliminating a single group).
This approach can be clunky and time-consuming, because it is not always predictable the exact number of groups/group sizes we will have, and a solution for set of groups/sizes won't be easily generalizable to the next dataset (in fact, the situation can get even more complicated with multiple rounds of removing observations from the same set of groups, and a requirement to keep a minimum number of groups, but I want to keep this example simple).
I have tried playing around with PROC SURVEYSELECT nested within a %do %while macro loop and just run it until an allocation meets our criteria (no groups with 1 observation) but I haven't been able to get this to work (at least in a reasonable amount of time). For example, something like:
proc surveyselect data=have n=15 out=want outall noprint;
strata group / alloc=prop;
run;Using ALLOCMIN=2 doesn't help (in fact in this case it returns an error, regardless of whether you use n=15 or n=20, depending on whether you are trying to sample the 'removed' or 'remaining' observations). Slightly simplified and with a couple corrections (first step just makes test data). Change the SEED at the top to change both the input data and the randomization.
%let mingrps=7;
%let rmobs=15;
%let seed=3135398;
* make test data ;
data test0;
call streaminit(&seed);
do gn=1 to 10;
grp=rand('integer',100,500);
nmem=rand('integer',3,5);
do memnum=1 to nmem;
output;
end;
end;
drop gn nmem;
run;
* add # of members per group if not already part of data (hence dropped above) ;
proc sql noprint undo_policy=none;
create table test1 as
select a.*, b.nmem
from
test0 A
inner join
(select grp, count(*) as nmem from test0 group by grp) B
on a.grp=b.grp
order by a.grp;
quit;
* add a random number to each obs and also make a group -> array location lookup ;
data
test2 (drop=fmtname type label)
grploc (keep=fmtname type grp label rename=(grp=start))
;
set test1 end=last;
by grp;
length fmtname $8 type $1 label 5;
retain fmtname 'fgrploc' type 'N';
call streaminit(&seed);
r=rand('uniform');
if first.grp then do;
label+1;
output grploc;
end;
output test2;
if last then call symputx("ngrps",label);
run;
proc format cntlin=grploc; run;
* sort by r, ignoring group ;
proc sort data=test2; by r; run;
* carry out logic as specified ;
data
dropped
kept
;
set test2;
array T {&ngrps} _temporary_;
array C {&ngrps} _temporary_;
retain groups_remaining &ngrps T 0 C 0;
nremoved=sum(of T[*]); * total # removed already ;
ncomm=sum(of C[*]); * total # committed to remove downstream ;
gloc=put(grp,fgrploc.)*1;
remain=nmem-T[gloc]; * # left in this group ;
if nremoved+(ncomm-C[gloc])>=&rmobs then output kept;
else if remain=2 then do; * special case - only remove if final member can also be removed ;
if nremoved+ncomm+1<&rmobs and groups_remaining>&mingrps then do;
T[gloc]+1;
C[gloc]=1;
groups_remaining+(-1); * decrement now d/t commitment to remove final member ;
output dropped;
end;
else output kept;
end;
else do;
T[gloc]+1;
C[gloc]=0;
output dropped;
end;
keep grp memnum nmem;
run;
title "kept";
proc print data=kept; run;
title "dropped";
proc print data=dropped; run;
OK. Assuming each group at least has 3 obs.
And this code could give you a start. I think there were many scenarios you would not like to see.
E.X. this code is work for n=10 not for n=15. How would you to do when n=14 or n=15 or n=16?
proc plan seed=34721;
factors group=11 / noprint;
output out=have;
run;
data have(drop=size);
set have curobs=tmp;
if tmp<=2 then size = 4;
else size = 3;
do observation = 1 to size;
output;
end;
run;
proc sort data=have;
by group;
run;
%macro sample(n=);
data have2;
set have;
run;
%do i=1 %to &n.;
/*Randomly pick up an obs and output it*/
proc surveyselect data=have2 out=one sampsize=1 seed=123 noprint;
run;
proc append base=want data=one;run;
proc sql; /*exclude this obs from have*/
create table rest as
select * from have2
except
select * from one
;
quit;
/*calcuate the count of each group*/
proc freq data=rest noprint;
table group/out=freq nocum nopercent;
run;
proc sql; /*exclue the group when its count equal 2*/
create table have2 as
select * from rest where group not in (select distinct group from freq where count=2);
quit;
%end;
%mend;
proc delete data=want;run;
%sample(n=10)
/*Check the result,
The want dataset is the obs you want to exclude*/
proc print data=want;run;
And the most simple way is if you know the number of obs in each group you need to remove, just assign them in PROC SURVEYSELECT.
/******************/
data samplesize;
input group samplesize;
cards;
1 2
2 1
3 1
4 1
5 1
6 2
7 3
8 1
9 1
10 1
11 1
;
proc surveyselect data=have sampsize=samplesize seed=123 out=want;
strata group;
run;
You could try something like this. Haven't thoroughly tested the logic, but should be pretty close. This also handles the condition that you may have a minimum number of allowable groups.
First, some test data:
data test0;
call streaminit(3135398);
do gn=1 to 10;
grp=rand('integer',100,500);
nmem=rand('integer',3,5);
do memnum=1 to nmem;
output;
end;
end;
drop gn nmem;
run;
...add in a count of members for each group (this is just assuming this count wasn't in the original data - hence dropped above):
proc sql noprint undo_policy=none;
create table test1 as
select a.*, b.nmem
from
test0 A
inner join
(select grp, count(*) as nmem from test0 group by grp) B
on a.grp=b.grp
order by a.grp;
quit;
...assign a random number to each row, and also create a lookup format to associate each group with a sequential number:
data
test2 (drop=fmtname type label)
grploc (keep=fmtname type grp label rename=(grp=start))
;
set test1 end=last;
by grp;
length fmtname $8 type $1 label 5;
retain fmtname 'fgrploc' type 'N';
call streaminit(3135398);
r=rand('uniform');
if first.grp then do;
label+1;
output grploc;
end;
output test2;
if last then call symputx("ngrps",label);
run;
proc format cntlin=grploc; run;
* sort by r, ignoring group ;
proc sort data=test2; by r; run;
...finally, carry out the process you described:
%let mingrps=7;
%let rmobs=15;
data
dropped
kept
;
set test2;
array T {&ngrps} _temporary_;
array C {&ngrps} _temporary_;
retain groups_remaining &ngrps T 0 C 0;
nremoved=sum(of T[*]);
ncomm=sum(of C[*]);
gloc=put(grp,fgrploc.)*1;
remain=nmem-T[gloc];
if nremoved=&rmobs then output kept;
else if remain=2 then do;
if nremoved+ncomm<&rmobs and groups_remaining>&mingrps then do;
T[gloc]+1;
C[gloc]=1;
output dropped;
end;
else output kept;
end;
else if remain=1 then do;
T[gloc]+1;
C[gloc]=0;
groups_remaining+(-1);
output dropped;
end;
else do;
T[gloc]+1;
output dropped;
end;
run;
proc sort data=kept; by grp; run;
title "kept";
proc print data=kept; run;
title "dropped";
proc print data=dropped; run;
title;
Slightly simplified and with a couple corrections (first step just makes test data). Change the SEED at the top to change both the input data and the randomization.
%let mingrps=7;
%let rmobs=15;
%let seed=3135398;
* make test data ;
data test0;
call streaminit(&seed);
do gn=1 to 10;
grp=rand('integer',100,500);
nmem=rand('integer',3,5);
do memnum=1 to nmem;
output;
end;
end;
drop gn nmem;
run;
* add # of members per group if not already part of data (hence dropped above) ;
proc sql noprint undo_policy=none;
create table test1 as
select a.*, b.nmem
from
test0 A
inner join
(select grp, count(*) as nmem from test0 group by grp) B
on a.grp=b.grp
order by a.grp;
quit;
* add a random number to each obs and also make a group -> array location lookup ;
data
test2 (drop=fmtname type label)
grploc (keep=fmtname type grp label rename=(grp=start))
;
set test1 end=last;
by grp;
length fmtname $8 type $1 label 5;
retain fmtname 'fgrploc' type 'N';
call streaminit(&seed);
r=rand('uniform');
if first.grp then do;
label+1;
output grploc;
end;
output test2;
if last then call symputx("ngrps",label);
run;
proc format cntlin=grploc; run;
* sort by r, ignoring group ;
proc sort data=test2; by r; run;
* carry out logic as specified ;
data
dropped
kept
;
set test2;
array T {&ngrps} _temporary_;
array C {&ngrps} _temporary_;
retain groups_remaining &ngrps T 0 C 0;
nremoved=sum(of T[*]); * total # removed already ;
ncomm=sum(of C[*]); * total # committed to remove downstream ;
gloc=put(grp,fgrploc.)*1;
remain=nmem-T[gloc]; * # left in this group ;
if nremoved+(ncomm-C[gloc])>=&rmobs then output kept;
else if remain=2 then do; * special case - only remove if final member can also be removed ;
if nremoved+ncomm+1<&rmobs and groups_remaining>&mingrps then do;
T[gloc]+1;
C[gloc]=1;
groups_remaining+(-1); * decrement now d/t commitment to remove final member ;
output dropped;
end;
else output kept;
end;
else do;
T[gloc]+1;
C[gloc]=0;
output dropped;
end;
keep grp memnum nmem;
run;
title "kept";
proc print data=kept; run;
title "dropped";
proc print data=dropped; run;
Thank you! This does appear to work as intended, though I admit I don't 100% follow the logic yet. I'll have to spend some time working through those array statements to make sure I actually understand what they are doing, and then I will try extending this to some of the more complex situations we face.
I'll accept the solution, but could you just clarify why you randomly sort the test2 dataset ignoring group? I'm not sure I understand why it needs to be sorted that way, though from some quick testing it appears to be important for the results to come out right.
Quick note that this method assumes that the input dataset does not already have groups with only one member to begin with. If that were a possibility, you might want to add a quick check to the beginning of the whole process (prior to the step that creates the &ngrps count) that removes any such groups... and perhaps a further check that the number of groups is, at the start, >= the minimum number of required groups.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.