Hello,
I'm trying to create 4 equal groups(or close to equal - doesn't have to be exact) using the data below as an example. What I would like to accomplish is have the 4 groups have similar means (or sums) of all the variables. So basically I'm trying to minimize the mean, or sum, difference between the groups.
I'm trying to create a generic template also for the code, so there could be cases where there are 2,4, or other number of groups, and could also have more variables to average or sum, if you want to keep this in mind when thinking of the solution...I would like it to be as dynmaic as possible.
Any advice is appreciated!
Thanks,
Tom
data have;
input x y z;
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;
The "want" dataset would simply be the one from above with a "groupID" variable next to each one.
So you have multiple objectives. Here is one way to combine them into a single objective that minimizes the sum of the ranges:
data have;
input x1-x3;
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;
%let num_vars = 3;
%let num_groups = 4;
proc optmodel;
set VARS = 1..&num_vars;
set GROUPS = 1..&num_groups;
set OBS;
num a {OBS, VARS};
read data have into OBS=[_N_] {j in VARS} <a[_N_,j]=col('x'||j)>;
/* Assign[i,g] = 1 if observation i assigned to group g, 0 otherwise */
var Assign {OBS, GROUPS} binary;
con AssignOnce {i in OBS}:
sum {g in GROUPS} Assign[i,g] = 1;
con NearlyEqual {g in GROUPS}:
floor(card(OBS)/&num_groups) <= sum {i in OBS} Assign[i,g] <= ceil(card(OBS)/&num_groups);
impvar GroupSum {g in GROUPS, j in VARS} = sum {i in OBS} a[i,j] * Assign[i,g];
var MinSum {VARS}, MaxSum {VARS};
con MinSumCon {g in GROUPS, j in VARS}:
MinSum[j] <= GroupSum[g,j];
con MaxSumCon {g in GROUPS, j in VARS}:
MaxSum[j] >= GroupSum[g,j];
impvar Range {j in VARS} = MaxSum[j] - MinSum[j];
min Objective = sum {j in VARS} Range[j];
solve;
print Assign;
print GroupSum;
num groupID {OBS};
for {i in OBS} do;
for {g in GROUPS: Assign[i,g].sol > 0.5} do;
groupID[i] = g;
leave;
end;
end;
create data want(drop=i) from [i] {j in VARS} <col('x'||j)=a[i,j]> groupID;
quit;
The resulting optimal solution has an objective value of (42 - 41) + (55 - 52) + (44 - 43) = 5:
SAS Output
Assign | ||||
---|---|---|---|---|
1 | 2 | 3 | 4 | |
1 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 |
3 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0 | 1 |
5 | 0 | 0 | 1 | 0 |
6 | 0 | 1 | 0 | 0 |
7 | 0 | 0 | 0 | 1 |
8 | 1 | 0 | 0 | 0 |
9 | 0 | 0 | 1 | 0 |
10 | 0 | -0 | 1 | 0 |
11 | 1 | 0 | 0 | 0 |
12 | 0 | 0 | 1 | 0 |
13 | 0 | 1 | 0 | 0 |
14 | 0 | 0 | 0 | 1 |
15 | 1 | 0 | 0 | 0 |
16 | 0 | 0 | -0 | 1 |
GroupSum | |||
---|---|---|---|
1 | 2 | 3 | |
1 | 41 | 52 | 44 |
2 | 41 | 55 | 44 |
3 | 42 | 52 | 43 |
4 | 41 | 52 | 44 |
If the variables have different scales, you might want to instead minimize a weighted sum of ranges:
min Objective = sum {j in VARS} weight[j] * Range[j];
For example, you might take the weight of variable j to be 1 / max {i in OBS} a[i,j].
Would you expect some of the GroupId to be missing for some obs that don't belong to any group?
no, I would expect every observation to have a GroupID
To minimize the difference of means or other statistic of mulltiple variables simulaneously is likely to be difficult exercise.
Here is a grouping approach that might work for some purposes. If the order of the varaibles is reasonally random the means or such should be somewhat similar for large enough data sets. This does have the advantage of simplicity and extensibility if not precision:
Data want;
Set have;
GroupId = mod(_n_,4);
run;
This approach does mean that the largest difference in size of groups should be one (some groups may have one more member than others).
Note, your example data set is a bit small.
This seems to be a clustering problem. If you are okay with SAS/STAT try proc cluster or proc fastclus to assign a label to each observation.
I would agree with @stat_sas that clustering would be a good approach. But getting equal sized groups might prove difficult. A simpler approach would be, as @ballardw suggest, to form equal-sized groups along a single variable. Make that variable the first principal component and you are in business:
%let vars=x y z;
%let nbGroups=2;
proc princomp data=have n=1 out=haveprin;
var &vars;
run;
proc rank data=haveprin groups=&nbGroups out=want(rename=prin1=groupId);
var prin1;
run;
@PGStats: Interesting approach. To learn more about this just a few questions:
Why PCA with only one principal component?
Why only two groups and are these two groups sufficient to make sure variation between groups is large enough to conclude that means are significantly different?
Oops! You make me realize that the ranking approach will maximize the difference between group means, which is contrary to what is required. @ballardw's approach makes more sense.
%let vars=x y z;
%let nbGroups=2;
proc princomp data=have n=1 out=haveprin;
var &vars;
run;
proc sort data=haveprin; by prin1; run;
data want;
set haveprin;
groupId = mod(_n_, &nbGroups);
drop prin1;
run;
But in the end, when you try to balance on many factors, it may be safer to use simple randomisation.
Still not significantly different:
proc glm data=want;
class groupid;
model x y z=groupid;
run;
I suspect a wizard in the IML world might already have an n-dimension difference minimization program but my head started hurting very briefly after contemplating that, especially with the "equal group size" requirement. I figured out pretty quickly I could likely build a data set that would cause issues for any given group size.
It might be time for Triley to expand on the uses of the data after the groupid is added.
It is more like a SAS/OR problem . Can you post it at OR forum. And Can you explain this " 4 groups have similar means (or sums) of all the variables. ". Give us a simple example to explain that. How you calculated these sums, sum of all data of each group ? or sum of each variable ? Here is an example solution by Genetic Algorithm . This is example only have a variable( AGE ),and take dataset sashelp.class as for example. Maybe I should call some OR guys to take a look at this question. proc iml; use sashelp.class; read all var {age} ; close ; start function(x) global(age,group); if ncol(x) ^= ncol(unique(x)) then obj=9999999; else do; y=age[x,]; obj=range(shape(y,group,0,0)[,+]); end; return (obj); finish; start switch_mut(s) global(nswitches,nrow); if ncol(s) ^= ncol(unique(s)) then s=ranperm(nrow,1); else do; n = ncol(s); do i = 1 to nswitches; k1 = int(uniform(1234)*n) + 1; k2 = int(uniform(1234)*n) + 1; if k1^=k2 then do; temp=s[k2]; s[k2] = s[k1]; s[k1] = temp; end; end; end; finish; start uniform_cross(child1, child2, parent1, parent2) global(nrow); child1 = parent1; child2 = parent2; do i = 1 to ncol(parent1); r = uniform(1234); if r<=0.5 then do; child1[i] = parent2[i]; child2[i] = parent1[i]; end; end; if ncol(child1) ^= ncol(unique(child1)) then child1=ranperm(nrow,1); else if ncol(child2) ^= ncol(unique(child2)) then child2=ranperm(nrow,1); finish; nrow=nrow(age); group=4; nswitches = 3; encoding=j(2,nrow,1); encoding[2,]=nrow; id=gasetup(2,nrow,1234); call gasetobj(id,0,"function"); call gasetcro(id,0.95,0,"uniform_cross"); call gasetmut(id,0.95,0,"switch_mut"); call gasetsel(id,100,1,1); call gainit(id,1000,encoding); niter = 10000; summary = j(niter,2); mattrib summary [c = {"Min Value", "Avg Value"} l=""]; do i = 1 to niter; call garegen(id); call gagetval(value, id); summary[i,1] = value[1]; summary[i,2] = value[:]; end; call gagetmem(mem, value, id, 1); Memebers=shape(age[mem,],group,0,0); print "Members:" Memebers[l=""], "Group sum:" (Memebers[,+]), "Min Value: " value[l = ""] ; call gaend(id); quit; OUTPUT: Members: 13 12 14 11 14 11 15 14 12 12 13 13 12 12 14 15 16 15 15 0 Group sum: 64 64 64 61 Min Value: 3
Thank you for your response.
To answer your question, what i'm trying to do is create 4 groups that are as similar as possible (among all variables). So in this case, i have 3 variables that i want to create the groups based on.
You can think of the variables on a customer basis by "# of site visits" (x), "# of products viewed" (y), and "# of products bought" (z). I just made this example up so the numbers i provided previously don't make sense for it, but i want to create 4 groups to test different website designs to determine a "winner".
I want the 4 groups to be as similar as possible, so that when they are in the final groups, all groups have a similar mean, or sum, of each of the variables. So Group 1 - 4 should have a similar mean of x, y, and z.
I don't know if this will change any of the answers, as I have not had time to go through and test them all yet, but thought I would clarify this first to give anyone a chance to re-visit their solution if it didn't match what I was hoping for.
Thank you again for your response.
Here is a flexible (maybe too flexible ) approach to the problem of creating groups of equal sizes, means and variances. The idea is to create some well balanced experimental groups.
/* Goal : divide data into equal size groups with similar means
and variances and define clusters of matched observations */
/* Example data, including id */
data have;
length id $8;
input x y z;
id = cats("OBS_", _n_);
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;
/* Number of groups to form */
%let nbGroups=4;
/* Define a distance between observations. Here I use the Chebychev
distance which is the maximum difference Dx, Dy, or Dz between the
standardized variables */
proc distance data=have method=Chebychev out=haveDist;
var interval(x y z / std=STD);
id id;
run;
/* Multidimentional scaling: Derive a single variable that approximates
the distances between observations. */
proc mds data=haveDist dimension=1
out=haveScaled(where=(_TYPE_="CONFIG"))
plots=none nophist; run;
/* Order observations by the new variable */
proc sort data=haveScaled; by dim1; run;
/* Define clusters of close by observations. Within clusters, assign observations
to separate groups. Do this assignment in a random order. Clusters could be used
to implement pairing. */
data haveGroups;
array grp grp1-grp&nbGroups (1 : &nbGroups);
seed = 8568;
do until (done);
clusterId + 1;
call ranperm(seed, of grp{*});
do i = 1 to &nbGroups until(done);
set haveScaled end=done;
groupId = grp{i};
output;
end;
end;
keep _name_ groupId clusterId;
run;
/* Rejoin the group assignments with original data, using id */
proc sql;
create table want as
select
input(scan(id, 2, "_"), best.) as obs,
have.*,
haveGroups.groupId,
haveGroups.clusterId
from
have inner join
haveGroups on have.id=haveGroups._NAME_
order by obs;
quit;
OK. Here is what I got. Assuming you want compare sum of each group(all data of each group) . It looks like the minimize difference of sum between groups is 1 (divide into 4 groups). Maybe some SAS/OR guys could offer you some OR code. data have; input x y z; obs+1; datalines; 10 15 8 11 14 9 12 16 10 9 16 8 8 12 10 8 10 11 11 14 12 13 10 9 8 16 15 10 11 8 11 15 14 16 13 10 10 15 14 12 8 14 7 12 13 9 14 10 ; run; proc iml; use have; read all var {x y z} into age[c=vname] ; read all var {obs}; close ; start function(x) global(age,group,nrow); if ncol(x) ^= ncol(unique(x)) then x=ranperm(nrow,1); y=shape(x,group,0,0); sum_group=j(group,1,.); do i=1 to group; sum_group[i]=age[y[i,],][+]; end; obj=range(sum_group); return (obj); finish; start switch_mut(s) global(nswitches,nrow); if ncol(s) ^= ncol(unique(s)) then s=ranperm(nrow,1); else do; n = ncol(s); do i = 1 to nswitches; k1 = int(uniform(1234)*n) + 1; k2 = int(uniform(1234)*n) + 1; if k1^=k2 then do; temp=s[k2]; s[k2] = s[k1]; s[k1] = temp; end; end; end; finish; start uniform_cross(child1, child2, parent1, parent2) global(nrow); child1 = parent1; child2 = parent2; do i = 1 to ncol(parent1); r = uniform(1234); if r<=0.5 then do; child1[i] = parent2[i]; child2[i] = parent1[i]; end; end; if ncol(child1) ^= ncol(unique(child1)) then child1=ranperm(nrow,1); else if ncol(child2) ^= ncol(unique(child2)) then child2=ranperm(nrow,1); finish; nrow=nrow(age); group=4; /*divide into 4 groups*/ nswitches = 3; encoding=j(2,nrow,1); encoding[2,]=nrow; id=gasetup(2,nrow,1234); call gasetobj(id,0,"function"); call gasetcro(id,0.95,0,"uniform_cross"); call gasetmut(id,0.95,0,"switch_mut"); call gasetsel(id,100,1,1); call gainit(id,1000,encoding); niter = 10000; do i = 1 to niter; call garegen(id); call gagetval(value, id); end; call gagetmem(mem, value, id, 1); y=shape(mem,group,0,0); sum_group=j(group,2,.); do i=1 to group; sum_group[i,1]=i; sum_group[i,2]=age[y[i,],][+]; end; Memebers=colvec(row(y))||obs[mem,]||age[mem,]; vnames={group obs}||vname; print "Members:" Memebers[c=vnames], "Group sum:" sum_group[c={group sum}], "Min Value: " value[l = ""] ; call gaend(id); quit; OUTPUT: GROUP OBS x y z 1 4 9 16 8 1 16 9 14 10 1 11 11 15 14 1 8 13 10 9 2 13 10 15 14 2 5 8 12 10 2 10 10 11 8 2 9 8 16 15 3 12 16 13 10 3 6 8 10 11 3 7 11 14 12 3 1 10 15 8 4 14 12 8 14 4 2 11 14 9 4 15 7 12 13 4 3 12 16 10 GROUP SUM 1 138 2 137 3 138 4 138 Min Value: 1
Here's one way to minimize the range by using PROC OPTMODEL in SAS/OR:
data have;
input x1-x3;
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;
%let num_vars = 3;
%let num_groups = 4;
proc optmodel;
set VARS = 1..&num_vars;
set GROUPS = 1..&num_groups;
set OBS;
num a {OBS, VARS};
read data have into OBS=[_N_] {j in VARS} <a[_N_,j]=col('x'||j)>;
/* Assign[i,g] = 1 if observation i assigned to group g, 0 otherwise */
var Assign {OBS, GROUPS} binary;
con AssignOnce {i in OBS}:
sum {g in GROUPS} Assign[i,g] = 1;
con NearlyEqual {g in GROUPS}:
floor(card(OBS)/&num_groups) <= sum {i in OBS} Assign[i,g] <= ceil(card(OBS)/&num_groups);
impvar GroupSum {g in GROUPS} = sum {i in OBS, j in VARS} a[i,j] * Assign[i,g];
var MinSum, MaxSum;
con MinSumCon {g in GROUPS}:
MinSum <= GroupSum[g];
con MaxSumCon {g in GROUPS}:
MaxSum >= GroupSum[g];
min Range = MaxSum - MinSum;
solve;
print Assign;
print GroupSum;
num groupID {OBS};
for {i in OBS} do;
for {g in GROUPS: Assign[i,g].sol > 0.5} do;
groupID[i] = g;
leave;
end;
end;
create data want(drop=i) from [i] {j in VARS} <col('x'||j)=a[i,j]> groupID;
quit;
SAS Output
[1] | GroupSum |
---|---|
1 | 137 |
2 | 138 |
3 | 138 |
4 | 138 |
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.