RyNye
Calcite | Level 5

This is a general problem that comes up frequently for me, and I am in search of a more elegant solution. 

 

To illustrate, consider a simplified example where we have 35 observations distributed across 11 groups. Nine of the groups have 3 observations, and two of the groups have 4 observations. Here is what the data we have looks like: 

 

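/* Put the 11 group IDs in random order (one row per group) */
proc plan seed=34721;
	factors group=11 / noprint;
	output out=have;
run;
/* Expand each group into rows: the first two groups read get 4 observations, the other nine get 3 */
data have(drop=size);
	set have curobs=tmp;
	if tmp<=2 then size = 4;
	else size = 3;
	do observation = 1 to size;
		output;
	end;
run;
/* Sort so the data can be processed BY group */
proc sort data=have;
	by group;
run;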

We want to randomly remove 15 observations from these groups, under the constraint that we cannot have only ONE observation remaining in any group. So, for example, for a group with 3 observations in it, we can either remove 1 observation from it (leaving 2 behind) or remove all 3 observations (eliminating the group entirely), but we cannot remove 2 observations from it (leaving only 1 behind). 

 

 

In the past we have handled this roughly as follows: randomly remove 2 observations from each group of 4 (4 removed), then randomly remove 1 observation from each group of 3 (9 removed). That gives 13 removed observations and leaves every group with exactly 2. We then randomly select one group and remove its final 2 observations, which gives the desired 15 and eliminates only a single group.
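For concreteness, the manual procedure just described could be sketched roughly like this for the example data. This is only an illustration tied to this particular set of group sizes, not a general solution; the dataset names (keyed, step1, groups, dropgroup, want), the variables u, k and removed, and the seed values are all made up for the sketch.

```sas
/* Attach a random sort key to every row (the seed is arbitrary) */
data keyed;
	set have;
	call streaminit(34721);
	u = rand('uniform');
run;

proc sort data=keyed;
	by group u;
run;

/* Within each group, keep the first 2 rows in random order and flag the rest as removed */
data step1;
	set keyed;
	by group;
	if first.group then k = 0;
	k + 1;
	removed = (k > 2);
run;

/* Pick one group at random to eliminate entirely */
proc sort data=have(keep=group) out=groups nodupkey;
	by group;
run;
proc surveyselect data=groups n=1 out=dropgroup seed=34721 noprint;
run;

/* Flag the remaining rows of the chosen group as removed too */
data want(drop=u k);
	merge step1 dropgroup(keep=group in=dropped);
	by group;
	if dropped then removed = 1;
run;
```

In this example the first pass flags 13 rows (2 from each group of 4, 1 from each group of 3) and the final step flags the last 2 rows of one group, so a PROC FREQ on removed in want should show 15 removed and 20 kept. The hard-coded "keep 2 per group, then drop one group" logic is exactly the part that does not carry over to other group sizes or other removal targets.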

 

This approach can be clunky and time-consuming because the number of groups and their sizes are not always predictable, and a solution for one set of groups/sizes won't generalize easily to the next dataset. (In fact, the situation can get even more complicated, with multiple rounds of removing observations from the same set of groups and a requirement to keep a minimum number of groups, but I want to keep this example simple.)

 

I have tried nesting PROC SURVEYSELECT inside a %do %while macro loop and rerunning it until an allocation meets our criteria (no groups with 1 observation), but I haven't been able to get this to work (at least not in a reasonable amount of time). For example, something like:

proc surveyselect data=have n=15 out=want outall noprint;
	strata group / alloc=prop;
run;
Using ALLOCMIN=2 doesn't help. In fact, in this case it returns an error whether you use n=15 (sampling the 'removed' observations) or n=20 (sampling the 'remaining' observations).
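For reference, the retry loop described above might look something like the sketch below. One assumption to flag loudly: as far as I understand it, ALLOC=PROP derives the per-group sample sizes from the group sizes alone, so rerunning with a new seed would change which rows are drawn but not how many come from each group. The sketch therefore swaps the stratified draw for an unstratified simple random sample, so the per-group counts can vary between tries. The macro name remove_random, its parameters, and the variables kept and bad are hypothetical.

```sas
%macro remove_random(n=15, maxtries=1000, seed=34721);
	%local try bad;
	%let try = 0;
	%let bad = 1;
	%do %while (&bad > 0 and &try < &maxtries);
		%let try = %eval(&try + 1);

		/* Unstratified simple random sample of rows to remove.        */
		/* OUTALL keeps every row and adds Selected (1 = removed).     */
		proc surveyselect data=have n=&n out=want outall noprint
			seed=%eval(&seed + &try);
		run;

		proc sort data=want;
			by group;
		run;

		/* Count the groups left with exactly one unselected row */
		data _null_;
			set want end=last;
			by group;
			if first.group then kept = 0;
			kept + (Selected = 0);
			if last.group and kept = 1 then bad + 1;
			if last then call symputx('bad', bad);
		run;
	%end;
	%put NOTE: tries=&try groups_left_with_one_row=&bad;
%mend remove_random;

%remove_random()
```

If the log note reports groups_left_with_one_row=0, the last want dataset satisfies the constraint; if the loop stops at maxtries with a nonzero count, no valid draw was found. Each try runs three steps, so this can still be slow when valid draws are rare, and nothing here guarantees it finishes in a reasonable time on larger or more constrained data.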
 
Can anyone think of a good way to approach this problem? In short, we want to randomly remove a given number of observations from across a set of groups, subject to the constraint that no group is ever left with only a single observation.
