This is a general problem that comes up frequently for me, and I am in search of a more elegant solution.
To illustrate, consider a simplified example where we have 35 observations distributed across 11 groups. Nine of the groups have 3 observations, and two of the groups have 4 observations. Here is what the data we have looks like:
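/* Generate the 11 group IDs in random order */
proc plan seed=34721;
  factors group=11 / noprint;
  output out=have;
run;
/* Give the first two groups read 4 observations and the rest 3 (35 total) */
data have(drop=size);
  set have curobs=tmp;   /* tmp = number of the observation just read */
  if tmp<=2 then size = 4;
  else size = 3;
  do observation = 1 to size;
    output;
  end;
run;
proc sort data=have;
  by group;
run;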
We want to randomly remove 15 observations from these groups, under the constraint that we cannot have only ONE observation remaining in any group. So, for example, for a group with 3 observations in it, we can either remove 1 observation from it (leaving 2 behind) or remove all 3 observations (eliminating the group entirely), but we cannot remove 2 observations from it (leaving only 1 behind).
In the past we have done something like the following: randomly remove 2 observations from each group of 4, then randomly remove 1 observation from each group of 3. That gives us 13 removed observations and leaves every group with exactly 2 observations. We then randomly select one group from which to remove the final 2 observations, giving us our desired 15 (and eliminating only a single group).
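For concreteness, the manual procedure for this particular example could be sketched roughly as follows. This is only an illustration under the group sizes above, not a general solution; the dataset names keyed, reduced, groups, dropped and manual_want and the helper variables u and k are hypothetical.

```sas
/* Attach a random sort key to every observation */
data keyed;
  set have;
  if _n_ = 1 then call streaminit(34721);
  u = rand('uniform');
run;

proc sort data=keyed;
  by group u;
run;

/* Keep the first two observations (in random order) of each group.
   For this example that removes 13 observations and leaves
   11 groups of size 2. */
data reduced;
  set keyed;
  by group;
  if first.group then k = 0;
  k + 1;
  if k <= 2;
  drop k u;
run;

/* One row per group, then pick one whole group at random to eliminate */
proc sort data=reduced(keep=group) out=groups nodupkey;
  by group;
run;

proc surveyselect data=groups method=srs n=1 out=dropped
                  seed=34721 noprint;
run;

/* Final data: 20 observations in 10 groups of size 2 */
data manual_want;
  merge reduced dropped(in=drop_it keep=group);
  by group;
  if not drop_it;
run;
```

The hard-coded "keep two per group, then drop one group" logic is exactly what has to be reworked whenever the group sizes or the removal target change.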
This approach can be clunky and time-consuming, because we cannot always predict the exact number of groups and group sizes we will have, and a solution for one set of groups/sizes won't generalize easily to the next dataset. (In fact, the situation can get even more complicated, with multiple rounds of removing observations from the same set of groups and a requirement to keep a minimum number of groups, but I want to keep this example simple.)
I have tried playing around with PROC SURVEYSELECT nested within a %do %while macro loop, rerunning it until an allocation meets our criteria (no groups with 1 observation), but I haven't been able to get this to work (at least not in a reasonable amount of time). For example, something like:
proc surveyselect data=have n=15 out=want outall noprint;
strata group / alloc=prop;
run;
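A minimal sketch of such a rejection loop is below. It swaps the stratified call above for an unstratified simple random sample (METHOD=SRS), on the assumption that with ALLOC=PROP the per-group counts are fixed by the group sizes, so redrawing would only change which observations are chosen within each group. The macro name %remove_random, its parameters, and the MAXTRIES cap are hypothetical.

```sas
%macro remove_random(data=have, n=15, maxtries=1000, seed=34721);
  %local try n_bad;
  %let try = 0;
  %let n_bad = 1;
  %do %while (&n_bad > 0 and &try < &maxtries);
    %let try = %eval(&try + 1);

    /* Draw n observations to remove; OUTALL keeps every row and
       adds Selected = 1 for the rows that were drawn */
    proc surveyselect data=&data method=srs n=&n out=want outall
                      seed=%eval(&seed + &try) noprint;
    run;

    proc sort data=want;
      by group;
    run;

    /* Count the groups that would be left with exactly one observation */
    data _null_;
      set want end=last;
      by group;
      if first.group then remaining = 0;
      remaining + (selected = 0);
      if last.group and remaining = 1 then n_bad + 1;
      if last then call symputx('n_bad', n_bad, 'L');
    run;
  %end;

  %if &n_bad > 0 %then %put WARNING: no valid draw found in &maxtries tries.;
  %else %put NOTE: valid draw found on try &try..;
%mend remove_random;

%remove_random()
```

If a valid draw is found, WANT holds all observations with Selected=1 marking those to remove, so `where Selected=0` gives the remaining data. Because every draw is a simple random sample, each valid removal set is equally likely, but the number of tries needed may grow quickly as the number of groups increases.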
Using ALLOCMIN=2 doesn't help. In this case it actually returns an error, whether you use n=15 (sampling the 'removed' observations) or n=20 (sampling the 'remaining' observations).
Can anyone think of a good way of approaching this problem? In short, we want to be able to randomly remove a given number of observations from across a set of groups, subject to the constraint that no group is ever left with only a single observation.