Hello,
I need some help with allocating the below records between the 'emp' groups(in this case 5) and have equal number of records between each emp group and similar average balance. So in my below example, I have 5 different emp groups (J1 - J5). I would like each emp have roughly the same number of accounts and average balance. I tried using ranuni method and i was close in getting the even distribution of records but I am getting variance of +/-15% in average balance .
Any help is much appreciated. Thanks for your time!
PS: I use SAS 9.3 Level 1M2 version
Data have;
infile cards truncover ;
input emp $ @4 seq_no @8 balance;
cards;
J1 1 5506.62
J1 2 4445.34
J1 3 4982.28
J1 4 1068.32
J1 5 10144.5
J1 6 1868.69
J1 7 7946.56
J1 8 4943.05
J1 9 4699.76
J1 10 1910.09
J1 11 10553.82
J1 12 5541.65
J1 13 9949.29
J1 14 5844.05
J1 15 2833.19
J1 16 938.32
J1 17 4330.14
J1 18 4752.49
J1 19 5392.26
J1 20 3886.05
J1 21 10284.1
J1 22 1552.63
J1 23 3959.4
J1 24 1304.99
J1 25 6872.43
J1 26 7150.08
J1 27 5800.23
J1 28 9106.95
J1 29 11821.93
J1 30 4457.54
J1 31 3703.41
J1 32 1344.87
J1 33 11077.5
J1 34 5272.23
J1 35 10115.14
J1 36 9628.68
J1 37 3260.5
J1 38 6515.95
J1 39 8632.22
J1 40 7060.3
J1 41 13702.8
J1 42 4244.19
J1 43 5049.49
J1 44 6956.69
J1 45 8012.18
J1 46 4276.94
J1 47 7387.27
J1 48 9576.85
J1 49 8530.5
J1 50 10029.27
J1 51 9178.78
J1 52 3284.69
J1 53 4685.45
J1 54 5914.98
J1 55 8945.14
J1 56 5845.33
J1 57 5526.22
J1 58 4247.63
J1 59 1721.22
J1 60 7892.49
J1 61 3017.27
J1 62 13174.94
J1 63 4682.69
J1 64 3099.51
J1 65 7927.11
J1 66 2266.1
J2 67 10093.46
J2 68 1490.46
J2 69 3492.93
J2 70 2840
J2 71 5429.05
J2 72 6341.22
J2 73 5459.4
J2 74 6399.17
J2 75 4088.12
J2 76 2900.45
J2 77 11685.52
J2 78 9115.25
J2 79 8330.79
J2 80 9618.94
J2 81 3753.83
J2 82 4959.81
J2 83 7779.67
J2 84 2017.49
J2 85 4264.38
J2 86 6499.82
J2 87 1785.12
J2 88 12105.83
J2 89 5614.56
J2 90 14225.61
J2 91 8052.63
J2 92 3254.24
J2 93 2490.86
J2 94 13576.87
J2 95 2028.83
J2 96 7902.67
J2 97 4645.47
J2 98 8842.38
J2 99 4042.58
J2 100 8694.63
J2 101 3456.95
J2 102 4615.43
J2 103 7713.53
J2 104 718.39
J2 105 4395.64
J2 106 2259.29
J2 107 4646.16
J2 108 2983.14
J2 109 10162.12
J2 110 5574.53
J2 111 3484.22
J2 112 5474.14
J2 113 5658.38
J2 114 7439.19
J2 115 8482.85
J2 116 6630.83
J2 117 9068.14
J2 118 3474.45
J2 119 4287.26
J2 120 2594.75
J2 121 4282.12
J2 122 6231.82
J2 123 7366.49
J2 124 7550.73
J2 125 2129.51
J2 126 2121.11
J2 127 1932.94
J2 128 1762.94
J2 129 1671.35
J2 130 3339.73
J2 131 8972.37
J2 132 2046.71
J2 133 1458.02
J3 134 7520.15
J3 135 5484.75
J3 136 3330.05
J3 137 1116.4
J3 138 5859.14
J3 139 1497.65
J3 140 4775.76
J3 141 8555.46
J3 142 7996.91
J3 143 6613.29
J3 144 9360.45
J3 145 7169.56
J3 146 5206.92
J3 147 2285.22
J3 148 3409.55
J3 149 5558.47
J3 150 3414.76
J3 151 4941.93
J3 152 5026.55
J3 153 17441.91
J3 154 11996.15
J3 155 2214.9
J3 156 13379.87
J3 157 13279.78
J3 158 3902.16
J3 159 13927.65
J3 160 1627.2
J3 161 2923.37
J3 162 6196.44
J3 163 916.52
J3 164 5030.07
J3 165 13169.17
J3 166 8701.66
J3 167 6075.43
J3 168 6926.28
J3 169 6852.55
J3 170 2247.94
J3 171 10829.27
J3 172 551.79
J3 173 1929.3
J3 174 3731.24
J3 175 9776.04
J3 176 5181.06
J3 177 3682.39
J3 178 2156.49
J3 179 4669.62
J3 180 13885.67
J3 181 3865.06
J3 182 1091.13
J3 183 3393.74
J3 184 3250.23
J3 185 5767.59
J3 186 3028.08
J3 187 3844.98
J3 188 6529.34
J3 189 5587.37
J3 190 5372.47
J3 191 5010.33
J3 192 7486.74
J3 193 3792.56
J3 194 3276.91
J3 195 6841.16
J3 196 1547.58
J3 197 10617.52
J3 198 5289.83
J3 199 6797.35
J3 200 1552.81
J4 201 5464.11
J4 202 5250.88
J4 203 2967.15
J4 204 3001.34
J4 205 2469.81
J4 206 6159.1
J4 207 18236.87
J4 208 4146.01
J4 209 4181.2
J4 210 4845.92
J4 211 3900.74
J4 212 4044.76
J4 213 5583.74
J4 214 11171.42
J4 215 6559.74
J4 216 2113.06
J4 217 7706.08
J4 218 2265.58
J4 219 11675
J4 220 2330.78
J4 221 1839.42
J4 222 950.52
J4 223 8187.52
J4 224 1618.09
J4 225 12827.78
J4 226 11823.57
J4 227 11283.39
J4 228 2557.51
J4 229 7271.31
J4 230 14313.68
J4 231 938.83
J4 232 11395.42
J4 233 1885.73
J4 234 2749.82
J4 235 6208.5
J4 236 15990.48
J4 237 3800.09
J4 238 11830.88
J4 239 8177.65
J4 240 7573.48
J4 241 7108.93
J4 242 11072.6
J4 243 786.69
J4 244 10015.09
J4 245 8225.05
J4 246 11301.45
J4 247 5019.94
J4 248 7336.35
J4 249 3376.27
J4 250 9291.02
J4 251 5172.86
J4 252 14177.38
J4 253 5145.48
J4 254 5521.57
J4 255 9790.25
J4 256 6854.89
J4 257 7230.76
J4 258 12018.08
J4 259 3122.39
J4 260 10447.19
J4 261 7411.74
J4 262 2145.38
J4 263 4758.72
J4 264 11449.59
J4 265 4042.03
J4 266 2276.59
J4 267 4173.79
J5 268 3324.67
J5 269 4004.29
J5 270 6244.66
J5 271 2837.2
J5 272 1994.16
J5 273 14528.07
J5 274 6016.3
J5 275 4001.56
J5 276 3864
J5 277 11359.22
J5 278 17987.13
J5 279 5065
J5 280 5003.77
J5 281 7351.86
J5 282 4258.13
J5 283 5071.63
J5 284 1491.92
J5 285 14992.9
J5 286 4688.95
J5 287 6406.01
J5 288 14341.07
J5 289 4798.86
J5 290 4489.54
J5 291 6307.95
J5 292 878.45
J5 293 15118.05
J5 294 10127.4
J5 295 6448.01
J5 296 2663.35
J5 297 8334.01
J5 298 2106.6
J5 299 14050.38
J5 300 10188.66
J5 301 2247.35
J5 302 4366.07
J5 303 6054.6
J5 304 6239.24
J5 305 7737.8
J5 306 12247.55
J5 307 2480.11
J5 308 9215.99
J5 309 4460.02
J5 310 5917.55
J5 311 9174.03
J5 312 13528.88
J5 313 10732.59
J5 314 2943.84
J5 315 5585.36
J5 316 3789.01
J5 317 4470.24
J5 318 1817
J5 319 6863.53
J5 320 6092.49
J5 321 5595.23
J5 322 2951.34
J5 323 5719.28
J5 324 7412.93
J5 325 10039.05
J5 326 5626.5
J5 327 8118.97
J5 328 9282.35
J5 329 3877.66
J5 330 8677.22
J5 331 6899.75
J5 332 7139.57
J5 333 5346.63
J5 334 7921.07
J5 335 7784.76
J5 336 45464.87
J5 337 8950.05
J5 338 12569.27
;
run;
Here are two simple heuristics to get groups of positive numbers (balance) with approximately equal means.
The idea for the first one is to build groups by adding the largest value to the smallest running sum, the second largest value to the second smallest running sum, and so on.
The second algorithm simply spreads like-values among the groups in alternating descending and ascending sequences.
/* Number of groups to create */
%let ng=5;
proc sort data=have; by descending balance; run;
data want;
array s{&ng} _temporary_;
array _sm{&ng};
/* Initialize running sums with small tie-breaking values */
if _n_=1 then
do i = 1 to &ng;
s{i} = i * 0.000001;
end;
/* Find the order of each group running sum */
do i = 1 to &ng;
_sm{whichn(smallest(i, of s{*}), of s{*})} = i;
end;
/* Read balances (in decreasing order) and assign them to groups
in ascending order of running sums. Update running sums. */
do i = 1 to &ng until(done);
set have end=done;
group = _sm{i};
s{group} = sum(s{group}, balance);
output;
end;
drop _sm: i;
run;
title "Group means and spread";
proc sql;
create table means as
select group, mean(balance) as Mean, count(balance) as N
from want group by group;
select * from means;
select range(mean) as meansRange, cv(mean) as meansCVpct from means;
quit;
/* Simpler version */
data want2;
/* Read balances (in decreasing order) and assign them to groups
alternatively in sequence and reverse sequence. */
do i = 1 to &ng, &ng to 1 until(done);
set have end=done;
group = i;
output;
end;
drop i;
run;
title "Group means and spread, version 2";
proc sql;
create table means2 as
select group, mean(balance) as Mean, count(balance) as N
from want2 group by group;
select * from means2;
select range(mean) as meansRange, cv(mean) as meansCVpct from means2;
quit;
Group means and spread group Mean N 1 6619.149 67 2 6193.301 68 3 6193.111 68 4 6206.792 67 5 6240.393 68 Group means and spread meansRange meansCVpct 426.0376 2.936192 Group means and spread, version 2 group Mean N 1 6673.84 68 2 6232.763 68 3 6178.101 68 4 6211.497 67 5 6149.711 67 Group means and spread, version 2 meansRange meansCVpct 524.1291 3.455997
Not as good as @FreelanceReinh results, but these algorithms are deterministic (they don't involve random numbers) and require far less computation. I couldn't test the code with SAS version 9.3, but it should work.
Hello @vicky07,
Do you have a SAS/OR license? I don't, but I'm sure the optimizers available there could produce an excellent solution.
Otherwise, is SAS/STAT available at your site? Then you could let PROC SURVEYSELECT create a number of random group assignments (with virtually equal group sizes, in your case: 67, 67, 68, 68, 68) and then pick the assignment with the smallest variance between the five group means of balance. Of course, increasing the number of test assignments will tend to improve (i.e. decrease) the minimum variance.
With your sample data (read with the below INPUT statement, though)
input emp $ seq_no balance;
and the following steps
/* Create 10000 random group assignments */
proc surveyselect data=have(drop=emp) out=test rep=10000 groups=5 seed=3141592;
run; /* 10000*338 = 3380000 obs. */
/* Compute mean balance for each group */
proc summary data=test nway;
by replicate;
class groupid;
var balance;
output out=stats mean=mbal;
run; /* 10000*5 = 50000 obs. */
/* Compute variance of the 5 mean balances for each assignment */
proc summary data=stats;
by replicate;
var mbal;
output out=mstats var=var_mbal;
run; /* 10000 obs. */
/* Determine the assignment with minimum variance */
proc summary data=mstats;
var var_mbal;
output out=opt min= minid(var_mbal(replicate))=repopt;
run; /* 1 obs. */
proc print data=opt;
run; /* var_mbal=460.768, repopt=2378 */
/* Select the "optimum" assignment */
data want(drop=r: groupid);
length emp $2;
if _n_=1 then set opt(keep=r:);
set test;
if replicate=repopt;
emp=cats('J',groupid);
run; /* 338 obs. */
/* Final check of group sizes and group means */
proc means data=want maxdec=2;
class emp;
var balance;
run;
I obtained:
Analysis Variable : balance N emp Obs N Mean Std Dev Minimum Maximum --------------------------------------------------------------------------------- J1 67 67 6302.88 3782.57 786.69 17987.13 J2 67 67 6256.40 5918.76 718.39 45464.87 J3 68 68 6297.45 3580.75 551.79 14225.61 J4 68 68 6310.54 3343.95 1304.99 14992.90 J5 68 68 6281.56 3942.35 938.83 18236.87 ---------------------------------------------------------------------------------
So, the five means differ by <1%. Would this be close enough? If not, you could increase rep= or try different seed= values.
If only Base SAS is available to you, you'll need to replace the PROC SURVEYSELECT step by suitable DATA and possibly PROC SORT steps. Please don't hesitate to ask if you need help with that.
ADDENDUM
A little bird told me that the convenient GROUPS= option of PROC SURVEYSELECT was not yet available in SAS version 9.3. Sorry that I hadn't checked this in the old documentation. For @vicky07 and all other users still on v9.3 (or even v9.2) and also for users without access to SAS/STAT here comes a workaround (as mentioned above): Simply replace the PROC SURVEYSELECT step by the code below.
/* Define number of groups and number of replicates */
%let ngroups=5;
%let nrep=10000;
/* Create list of group sizes and group IDs (1, 2, ...) for array definition */
data _null_;
length c $999;
call symputx('_n', n);
r=mod(n,&ngroups);
s=floor(n/&ngroups);
do i=1 to &ngroups-r;
c=catx(' ',c,cats(s,'*',i));
end;
do i=i to &ngroups;
c=catx(' ',c,cats(s+1,'*',i));
end;
call symputx('_gs', c);
stop;
set have nobs=n;
run;
%put Group sizes * group IDs: &_gs;
/* Create &nrep random group assignments */
data test;
array _g[&_n] _temporary_ (&_gs);
_iorc_=31415927; /* random seed */
do replicate=1 to &nrep;
call ranperm(_iorc_, of _g[*]);
do _n_=1 to &_n;
set have(drop=emp) point=_n_;
groupid=_g[_n_];
output;
end;
end;
stop;
run; /* &nrep*&_n obs. */
Due to the different approach and the random nature of the group assignments the results are not exactly the same as before. For @vicky07's sample data the new results are even better in terms of variance reduction: var_mbal=297.134 for repopt=1608.
(Okay, I tested a few more random seeds and was lucky to find a good one ... 🙂 )
Analysis Variable : balance N emp Obs N Mean Std Dev Minimum Maximum --------------------------------------------------------------------------------- J1 67 67 6297.45 3718.49 786.69 14225.61 J2 67 67 6293.21 6003.82 718.39 45464.87 J3 68 68 6299.85 3591.68 551.79 15990.48 J4 68 68 6299.39 3712.96 916.52 18236.87 J5 68 68 6259.38 3512.09 950.52 15118.05 ---------------------------------------------------------------------------------
Here are two simple heuristics to get groups of positive numbers (balance) with approximately equal means.
The idea for the first one is to build groups by adding the largest value to the smallest running sum, the second largest value to the second smallest running sum, and so on.
The second algorithm simply spreads like-values among the groups in alternating descending and ascending sequences.
/* Number of groups to create */
%let ng=5;
proc sort data=have; by descending balance; run;
data want;
array s{&ng} _temporary_;
array _sm{&ng};
/* Initialize running sums with small tie-breaking values */
if _n_=1 then
do i = 1 to &ng;
s{i} = i * 0.000001;
end;
/* Find the order of each group running sum */
do i = 1 to &ng;
_sm{whichn(smallest(i, of s{*}), of s{*})} = i;
end;
/* Read balances (in decreasing order) and assign them to groups
in ascending order of running sums. Update running sums. */
do i = 1 to &ng until(done);
set have end=done;
group = _sm{i};
s{group} = sum(s{group}, balance);
output;
end;
drop _sm: i;
run;
title "Group means and spread";
proc sql;
create table means as
select group, mean(balance) as Mean, count(balance) as N
from want group by group;
select * from means;
select range(mean) as meansRange, cv(mean) as meansCVpct from means;
quit;
/* Simpler version */
data want2;
/* Read balances (in decreasing order) and assign them to groups
alternatively in sequence and reverse sequence. */
do i = 1 to &ng, &ng to 1 until(done);
set have end=done;
group = i;
output;
end;
drop i;
run;
title "Group means and spread, version 2";
proc sql;
create table means2 as
select group, mean(balance) as Mean, count(balance) as N
from want2 group by group;
select * from means2;
select range(mean) as meansRange, cv(mean) as meansCVpct from means2;
quit;
Group means and spread group Mean N 1 6619.149 67 2 6193.301 68 3 6193.111 68 4 6206.792 67 5 6240.393 68 Group means and spread meansRange meansCVpct 426.0376 2.936192 Group means and spread, version 2 group Mean N 1 6673.84 68 2 6232.763 68 3 6178.101 68 4 6211.497 67 5 6149.711 67 Group means and spread, version 2 meansRange meansCVpct 524.1291 3.455997
Not as good as @FreelanceReinh results, but these algorithms are deterministic (they don't involve random numbers) and require far less computation. I couldn't test the code with SAS version 9.3, but it should work.
Thanks, @PGStats, for chiming in. This is a great contribution. It appears that your heuristics are particularly effective for larger numbers of smaller groups. I'm sure @vicky07 will appreciate this in view of her real data.
Here's a comparison of the two heuristics and the best-of-10000-random-assignments approach (CALL RANPERM version) using the sample data (N=338) with the number of groups ranging from 2 to 50:
@PGStats : I tried your version1 solution on various sample populations(increased number of groups/decreased group size and vice versa) and it worked perfectly in each scenario. The max mean range I got in my different test scenarios was 800, which is acceptable in my case.
Thank you very much for your help!!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.