topic Re: Creating Groups of Equal Size in SAS Programming

Creating Groups of Equal Size

triley — Fri, 06 May 2016 17:24:33 GMT

Hello,

I'm trying to create 4 equal groups(or close to equal - doesn't have to be exact) using the data below as an example. What I would like to accomplish is have the 4 groups have similar means (or sums) of all the variables. So basically I'm trying to minimize the mean, or sum, difference between the groups.

I'm trying to create a generic template also for the code, so there could be cases where there are 2,4, or other number of groups, and could also have more variables to average or sum, if you want to keep this in mind when thinking of the solution...I would like it to be as dynmaic as possible.

Any advice is appreciated!

Thanks,

Tom

data have;
input x y z;
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;

The "want" dataset would simply be the one from above with a "groupID" variable next to each one.

Re: Creating Groups of Equal Size

PGStats — Fri, 06 May 2016 17:45:32 GMT

Would you expect some of the GroupId to be missing for some obs that don't belong to any group?

Re: Creating Groups of Equal Size

triley — Fri, 06 May 2016 17:47:07 GMT

no, I would expect every observation to have a GroupID

Re: Creating Groups of Equal Size

ballardw — Fri, 06 May 2016 17:53:47 GMT

To minimize the difference of means or other statistic of mulltiple variables simulaneously is likely to be difficult exercise.

Here is a grouping approach that might work for some purposes. If the order of the varaibles is reasonally random the means or such should be somewhat similar for large enough data sets. This does have the advantage of simplicity and extensibility if not precision:

Data want;

Set have;

GroupId = mod(_n_,4);

run;

This approach does mean that the largest difference in size of groups should be one (some groups may have one more member than others).

Note, your example data set is a bit small.

Re: Creating Groups of Equal Size

stat_sas — Fri, 06 May 2016 18:47:20 GMT

This seems to be a clustering problem. If you are okay with SAS/STAT try proc cluster or proc fastclus to assign a label to each observation.

Re: Creating Groups of Equal Size

PGStats — Fri, 06 May 2016 20:18:39 GMT

I would agree with @stat_sas that clustering would be a good approach. But getting equal sized groups might prove difficult. A simpler approach would be, as @ballardw suggest, to form equal-sized groups along a single variable. Make that variable the first principal component and you are in business:

%let vars=x y z;
%let nbGroups=2;

proc princomp data=have n=1 out=haveprin;
var &vars;
run;

proc rank data=haveprin groups=&nbGroups out=want(rename=prin1=groupId);
var prin1;
run;

Re: Creating Groups of Equal Size

stat_sas — Fri, 06 May 2016 20:48:26 GMT

@PGStats: Interesting approach. To learn more about this just a few questions:

Why PCA with only one principal component?

Why only two groups and are these two groups sufficient to make sure variation between groups is large enough to conclude that means are significantly different?

Re: Creating Groups of Equal Size

PGStats — Fri, 06 May 2016 21:15:04 GMT

Oops! You make me realize that the ranking approach will maximize the difference between group means, which is contrary to what is required. @ballardw's approach makes more sense.

%let vars=x y z;
%let nbGroups=2;

proc princomp data=have n=1 out=haveprin;
var &vars;
run;

proc sort data=haveprin; by prin1; run;

data want;
set haveprin;
groupId = mod(_n_, &nbGroups);
drop prin1;
run;

But in the end, when you try to balance on many factors, it may be safer to use simple randomisation.

Re: Creating Groups of Equal Size

stat_sas — Fri, 06 May 2016 21:25:51 GMT

Still not significantly different:

proc glm data=want;
class groupid;
model x y z=groupid;
run;

Re: Creating Groups of Equal Size

ballardw — Fri, 06 May 2016 22:05:23 GMT

I suspect a wizard in the IML world might already have an n-dimension difference minimization program but my head started hurting very briefly after contemplating that, especially with the "equal group size" requirement. I figured out pretty quickly I could likely build a data set that would cause issues for any given group size.

It might be time for Triley to expand on the uses of the data after the groupid is added.

Re: Creating Groups of Equal Size

Ksharp — Sat, 07 May 2016 03:01:22 GMT

It is more like a SAS/OR problem . Can you post it at OR forum.
And Can you explain this " 4 groups have similar means (or sums) of all the variables. ". 
Give us a simple example to explain that. How you calculated these sums, 
sum of all data of each group ? or sum of each variable ?

Here is an example solution by Genetic Algorithm . 
This is example only have a variable( AGE ),and take dataset sashelp.class as for example.
Maybe I should call some OR guys to take a look at this question.




proc iml;
use sashelp.class;
read all var {age} ;
close ;

start function(x) global(age,group);
if ncol(x) ^= ncol(unique(x)) then obj=9999999;
else do;
 y=age[x,];
 obj=range(shape(y,group,0,0)[,+]);
end;
 return (obj);
finish;

start switch_mut(s) global(nswitches,nrow);
if ncol(s) ^= ncol(unique(s)) then s=ranperm(nrow,1);
else do;
  n = ncol(s);
  do i = 1 to nswitches;
   k1 = int(uniform(1234)*n) + 1;
   k2 = int(uniform(1234)*n) + 1;
   if k1^=k2 then do;
    temp=s[k2];
    s[k2] = s[k1];
    s[k1] = temp;
   end;
  end;
end;
finish;

start uniform_cross(child1, child2, parent1, parent2) global(nrow);

child1 = parent1;
child2 = parent2;
do i = 1 to ncol(parent1);
  r = uniform(1234);
  if r<=0.5 then do;
   child1[i] = parent2[i];
   child2[i] = parent1[i];
  end;
end;

if ncol(child1) ^= ncol(unique(child1)) then child1=ranperm(nrow,1);
 else if ncol(child2) ^= ncol(unique(child2)) then child2=ranperm(nrow,1);

finish;



nrow=nrow(age);
group=4;
nswitches = 3;
encoding=j(2,nrow,1);
encoding[2,]=nrow;    

id=gasetup(2,nrow,1234);
call gasetobj(id,0,"function");
call gasetcro(id,0.95,0,"uniform_cross");
call gasetmut(id,0.95,0,"switch_mut");
call gasetsel(id,100,1,1);
call gainit(id,1000,encoding);


niter = 10000;
summary = j(niter,2);
mattrib summary [c = {"Min Value", "Avg Value"} l=""];
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
 summary[i,1] = value[1];
 summary[i,2] = value[:];
end;
call gagetmem(mem, value, id, 1);


Memebers=shape(age[mem,],group,0,0);
print "Members:" Memebers[l=""],
      "Group sum:" (Memebers[,+]),
      "Min Value: " value[l = ""] ;
call gaend(id);
quit;




OUTPUT:
Members:	
        13	12	14	11	14
 	11	15	14	12	12
 	13	13	12	12	14
 	15	16	15	15	0
Group sum:	
        64
 	64
 	64
 	61
Min Value:	
        3

Re: Creating Groups of Equal Size

PGStats — Sat, 07 May 2016 05:25:07 GMT

Here is a flexible (maybe too flexible ) approach to the problem of creating groups of equal sizes, means and variances. The idea is to create some well balanced experimental groups.

/* Goal : divide data into equal size groups with similar means 
 and variances and define clusters of matched observations */
/* Example data, including id */
data have;
length id $8;
input x y z;
id = cats("OBS_", _n_);
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;

/* Number of groups to form */
%let nbGroups=4;

/* Define a distance between observations. Here I use the Chebychev 
 distance which is the maximum difference Dx, Dy, or Dz between the 
 standardized variables */ 
proc distance data=have method=Chebychev out=haveDist;
var interval(x y z / std=STD);
id id;
run;

/* Multidimentional scaling: Derive a single variable that approximates 
 the distances between observations. */ 
proc mds data=haveDist dimension=1 
    out=haveScaled(where=(_TYPE_="CONFIG")) 
    plots=none nophist; run;

/* Order observations by the new variable */
proc sort data=haveScaled; by dim1; run;

/* Define clusters of close by observations. Within clusters, assign observations 
 to separate groups. Do this assignment in a random order. Clusters could be used 
 to implement pairing. */
data haveGroups;
array grp grp1-grp&nbGroups (1 : &nbGroups);
seed = 8568;
do until (done);
    clusterId + 1;
    call ranperm(seed, of grp{*});
    do i = 1 to &nbGroups until(done);
        set haveScaled end=done;
        groupId = grp{i};
        output;
        end;
    end;
keep _name_ groupId clusterId;
run;

/* Rejoin the group assignments with original data, using id */
proc sql;
create table want as
select 
    input(scan(id, 2, "_"), best.) as obs,
    have.*, 
    haveGroups.groupId,
    haveGroups.clusterId
from 
    have inner join 
    haveGroups on have.id=haveGroups._NAME_
order by obs;
quit;

Re: Creating Groups of Equal Size

Ksharp — Sat, 07 May 2016 08:54:45 GMT

OK. Here is what I got.
Assuming you want compare sum of each group(all data of each group) .
It looks like the minimize difference of sum between groups is 1 (divide into 4 groups).
Maybe some SAS/OR guys could offer you some OR code. 


data have;
input x y z;
obs+1;
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;
run;



proc iml;
use have;
read all var {x y z} into age[c=vname] ;
read all var {obs};
close ;

start function(x) global(age,group,nrow);
if ncol(x) ^= ncol(unique(x)) then x=ranperm(nrow,1);
y=shape(x,group,0,0);
sum_group=j(group,1,.);

do i=1 to group;
 sum_group[i]=age[y[i,],][+];
end;
obj=range(sum_group);
return (obj);
finish;

start switch_mut(s) global(nswitches,nrow);
if ncol(s) ^= ncol(unique(s)) then s=ranperm(nrow,1);
else do;
  n = ncol(s);
  do i = 1 to nswitches;
   k1 = int(uniform(1234)*n) + 1;
   k2 = int(uniform(1234)*n) + 1;
   if k1^=k2 then do;
    temp=s[k2];
    s[k2] = s[k1];
    s[k1] = temp;
   end;
  end;
end;
finish;

start uniform_cross(child1, child2, parent1, parent2) global(nrow);

child1 = parent1;
child2 = parent2;
do i = 1 to ncol(parent1);
  r = uniform(1234);
  if r<=0.5 then do;
   child1[i] = parent2[i];
   child2[i] = parent1[i];
  end;
end;

if ncol(child1) ^= ncol(unique(child1)) then child1=ranperm(nrow,1);
 else if ncol(child2) ^= ncol(unique(child2)) then child2=ranperm(nrow,1);

finish;



nrow=nrow(age);
group=4;  /*divide into 4 groups*/
nswitches = 3;
encoding=j(2,nrow,1);
encoding[2,]=nrow;    

id=gasetup(2,nrow,1234);
call gasetobj(id,0,"function");
call gasetcro(id,0.95,0,"uniform_cross");
call gasetmut(id,0.95,0,"switch_mut");
call gasetsel(id,100,1,1);
call gainit(id,1000,encoding);


niter = 10000;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);


y=shape(mem,group,0,0);
sum_group=j(group,2,.);
do i=1 to group;
 sum_group[i,1]=i;
 sum_group[i,2]=age[y[i,],][+];
end;
Memebers=colvec(row(y))||obs[mem,]||age[mem,];
vnames={group obs}||vname;
print "Members:" Memebers[c=vnames],
      "Group sum:" sum_group[c={group sum}],
      "Min Value: " value[l = ""] ;
call gaend(id);
quit;










OUTPUT:
 	
GROUP OBS	x	y	z
	1	4	9	16	8
 	1	16	9	14	10
 	1	11	11	15	14
 	1	8	13	10	9
 	2	13	10	15	14
 	2	5	8	12	10
 	2	10	10	11	8
 	2	9	8	16	15
 	3	12	16	13	10
 	3	6	8	10	11
 	3	7	11	14	12
 	3	1	10	15	8
 	4	14	12	8	14
 	4	2	11	14	9
 	4	15	7	12	13
 	4	3	12	16	10


GROUP	SUM
	1	138
 	2	137
 	3	138
 	4	138

Min Value:	1

Re: Creating Groups of Equal Size

RobPratt — Sat, 07 May 2016 15:32:23 GMT

Here's one way to minimize the range by using PROC OPTMODEL in SAS/OR:

data have;
   input x1-x3;
   datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;

%let num_vars   = 3;
%let num_groups = 4;
proc optmodel;
   set VARS   = 1..&num_vars;
   set GROUPS = 1..&num_groups;
   set OBS;
   num a {OBS, VARS};
   read data have into OBS=[_N_] {j in VARS} <a[_N_,j]=col('x'||j)>;

   /* Assign[i,g] = 1 if observation i assigned to group g, 0 otherwise */
   var Assign {OBS, GROUPS} binary;

   con AssignOnce {i in OBS}:
      sum {g in GROUPS} Assign[i,g] = 1;

   con NearlyEqual {g in GROUPS}:
      floor(card(OBS)/&num_groups) <= sum {i in OBS} Assign[i,g] <= ceil(card(OBS)/&num_groups);

   impvar GroupSum {g in GROUPS} = sum {i in OBS, j in VARS} a[i,j] * Assign[i,g];

   var MinSum, MaxSum;
   con MinSumCon {g in GROUPS}:
      MinSum <= GroupSum[g];
   con MaxSumCon {g in GROUPS}:
      MaxSum >= GroupSum[g];
   min Range = MaxSum - MinSum;

   solve;

   print Assign;
   print GroupSum;
   num groupID {OBS};
   for {i in OBS} do;
      for {g in GROUPS: Assign[i,g].sol > 0.5} do;
         groupID[i] = g;
         leave;
      end;
   end;
   create data want(drop=i) from [i] {j in VARS} <col('x'||j)=a[i,j]> groupID;
quit;

SAS Output

[1]	GroupSum
1	137
2	138
3	138
4	138

Re: Creating Groups of Equal Size

Ksharp — Mon, 09 May 2016 05:29:53 GMT

The following code is for any number of groups. Just change it at :
group=5;  /*divide into 5 groups*/

NOTE: the obs with zeros in all X Y Z are dummy obs. You can ignore them.


data have;
input x y z;
datalines;
10 15 8
11 14 9
12 16 10
9 16 8
8 12 10
8 10 11
11 14 12
13 10 9
8 16 15
10 11 8
11 15 14
16 13 10
10 15 14
12 8 14
7 12 13
9 14 10
;
run;



proc iml;
use have;
read all var _num_ into age[c=vname] ;
close ;

start function(x) global(age,group,nrow);
if ncol(x) ^= ncol(unique(x)) then x=ranperm(nrow,1);
y=shape(x,group,0,0);
sum_group=j(group,1,.);

do i=1 to group;
 sum_group[i]=age[y[i,],][+];
end;
obj=range(sum_group);
return (obj);
finish;

start switch_mut(s) global(nswitches,nrow);
if ncol(s) ^= ncol(unique(s)) then s=ranperm(nrow,1);
else do;
  n = ncol(s);
  do i = 1 to nswitches;
   k1 = int(uniform(1234)*n) + 1;
   k2 = int(uniform(1234)*n) + 1;
   if k1^=k2 then do;
    temp=s[k2];
    s[k2] = s[k1];
    s[k1] = temp;
   end;
  end;
end;
finish;

start uniform_cross(child1, child2, parent1, parent2) global(nrow);

child1 = parent1;
child2 = parent2;
do i = 1 to ncol(parent1);
  r = uniform(1234);
  if r<=0.5 then do;
   child1[i] = parent2[i];
   child2[i] = parent1[i];
  end;
end;

if ncol(child1) ^= ncol(unique(child1)) then child1=ranperm(nrow,1);
 else if ncol(child2) ^= ncol(unique(child2)) then child2=ranperm(nrow,1);

finish;



nrow=nrow(age);
obs=t(1:nrow);
group=5;  /*divide into 5 groups*/

mod=mod(nrow,group);
if mod^=0 then do;
 remain=group-mod;
 age=age//j(remain,ncol(age),0);
 nrow=nrow(age);
 obs=t(1:nrow);
end;
nswitches = 3;
encoding=j(2,nrow,1);
encoding[2,]=nrow;    

id=gasetup(2,nrow,1234);
call gasetobj(id,0,"function");
call gasetcro(id,0.95,0,"uniform_cross");
call gasetmut(id,0.95,0,"switch_mut");
call gasetsel(id,100,1,1);
call gainit(id,1000,encoding);


niter = 10000;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);


y=shape(mem,group,0,0);
sum_group=j(group,2,.);
do i=1 to group;
 sum_group[i,1]=i;
 sum_group[i,2]=age[y[i,],][+];
end;
Memebers=colvec(row(y))||obs[mem,]||age[mem,];
vnames={group obs}||vname;
print "Members:" Memebers[c=vnames],
      "Group sum:" sum_group[c={group sum}],
      "Min Value: " value[l = ""] ;
call gaend(id);
quit;






OUTPUT:

GROUP	OBS	x	y	z
	1	14	12	8	14
 	1	12	16	13	10
 	1	17	0	0	0
 	1	4	9	16	8
 	2	10	10	11	8
 	2	6	8	10	11
 	2	8	13	10	9
 	2	5	8	12	10
 	3	7	11	14	12
 	3	2	11	14	9
 	3	20	0	0	0
 	3	13	10	15	14
 	4	11	11	15	14
 	4	19	0	0	0
 	4	16	9	14	10
 	4	1	10	15	8
 	5	3	12	16	10
 	5	18	0	0	0
 	5	9	8	16	15
 	5	15	7	12	13
 	
GROUP	SUM
	1	106
 	2	120
 	3	110
 	4	106
 	5	109
Min Value:	14

Re: Creating Groups of Equal Size

RobPratt — Sun, 08 May 2016 13:07:56 GMT

With five groups, the PROC OPTMODEL code finds that 14 is the minimum range:

[1]	GroupSum
1	107
2	110
3	108
4	106
5	120

Re: Creating Groups of Equal Size

Ksharp — Mon, 09 May 2016 00:55:57 GMT

Rob,

OR well done. I can't image SAS/OR can do solve so many different kind of challenge .

Re: Creating Groups of Equal Size

triley — Mon, 09 May 2016 12:46:57 GMT

Thank you for your response.

To answer your question, what i'm trying to do is create 4 groups that are as similar as possible (among all variables). So in this case, i have 3 variables that i want to create the groups based on.

You can think of the variables on a customer basis by "# of site visits" (x), "# of products viewed" (y), and "# of products bought" (z). I just made this example up so the numbers i provided previously don't make sense for it, but i want to create 4 groups to test different website designs to determine a "winner".

I want the 4 groups to be as similar as possible, so that when they are in the final groups, all groups have a similar mean, or sum, of each of the variables. So Group 1 - 4 should have a similar mean of x, y, and z.

I don't know if this will change any of the answers, as I have not had time to go through and test them all yet, but thought I would clarify this first to give anyone a chance to re-visit their solution if it didn't match what I was hoping for.

Thank you again for your response.

Re: Creating Groups of Equal Size

triley — Mon, 09 May 2016 12:58:33 GMT

Rob, I appreciate your response as I am very interested in creating an OPTMODEL for this. However, I think that the result is slightly different from what I was actually looking for (my fault) and have clarified my question in a reply in a previous post. If you get a chance to modify for the output I am looking for I would appreciate it.

Basically, instead of combing the sum of all variables together, I want to treat them independently, and make the mean (or sum) of each variable similar between the 4 groups.

So for example, group1 would have a mean of 10 for "x", 15 for "y" and 12 for "z".

Group2 would have a mean of 11 for "x", 14 for "y" and 13 for "z" (so that the means of each variable are similar among groups).

This is for designing a pilot test, so the means or sums of each group need to be similar to remove variability from the design experiment.

Thanks,

Tom

Re: Creating Groups of Equal Size

triley — Mon, 09 May 2016 13:07:19 GMT

All,

I have modified my question slightly since I don't believe I explained it properly the first time.

I am trying to create an experiment and want to design it in a way that I have 4 groups, that are similar among a few variables. These variables can be different things, so "x" could be "# of site visits", and "y" could be "conversion rate" for example. So the scale could be vastly different for each variable (e.g. x=10, y=15%, z= 1,000). I am not trying to sum the 3 variables together, I want to treat them individually and have similar variable means (or sums) across groups.

So for Group1, I want "x" to have a similar mean as Group2,3,and 4's "x" variable. Same with the "y" and "z" variables as well.

I hope this helps, sorry for the misunderstanding (I will try and be more clear explaining the example in the future).

Thanks,

Tom