Analysis Variable : Cholesterol | |||
---|---|---|---|
COL_MEM | N Obs | Mean | Std Dev |
1 | 630 | 221.9544715 | 43.0901775 |
2 | 662 | 221.9627907 | 43.1227683 |
3 | 629 | 221.9457237 | 43.0978582 |
4 | 648 | 221.9667722 | 43.0939030 |
5 | 649 | 221.9685039 | 43.0944852 |
@Rick_SAS post a optimizal problem. but he solve it by SAS/QC.
@RobPratt want give it a try ?
http://blogs.sas.com/content/iml/2017/05/01/split-data-groups-mean-variance.html
Just FYI, I always alert Rob to any blog posts that involve optimization. If you see a paragraph at the end of one of my posts that says, "and you can also solve this with PROC OPTMODEL...", that information probably came from Rob.
Splitting into groups that have the same sum (or minimize the maximum sum) is straightforward for the MILP solver in PROC OPTMODEL, but if you want the same mean the problem becomes mixed integer nonlinear programming (MINLP), even without considering the standard deviation. PROC OPTMODEL does not yet provide an MINLP solver, and so far the most natural workarounds don’t look promising.
@RobPratt, So Could you write some OR code to solve this problem ? And I also want know whether OR could solve this problem when the dataset has one million obs
Here is PROC OPTMODEL code that uses the MILP solver to minimize the range between maximum and minimum sums and forces the treatments to have equal or nearly equal numbers of units but ignores standard deviation:
proc optmodel;
set TREATMENTS;
read data Treatments into TREATMENTS=[Trt];
set UNITS;
num &Var {UNITS};
read data Units(where=(&Var ne .)) into UNITS=[_N_] &Var;
var Assign {UNITS, TREATMENTS} binary;
con AssignOnce {u in UNITS}:
sum {t in TREATMENTS} Assign[u,t] = 1;
impvar TreatmentSum {t in TREATMENTS} = sum {u in UNITS} &Var.[u] * Assign[u,t];
con Cardinality {t in TREATMENTS}:
floor(card(UNITS) / card(TREATMENTS))
<= sum {u in UNITS} Assign[u,t]
<= ceil(card(UNITS) / card(TREATMENTS));
var MinSum;
var MaxSum;
con MinSumCon {t in TREATMENTS}:
MinSum <= TreatmentSum[t];
con MaxSumCon {t in TREATMENTS}:
MaxSum >= TreatmentSum[t];
min Objective = MaxSum - MinSum;
solve with MILP / decomp;
print TreatmentSum;
num Trt {u in UNITS} = round(sum {t in TREATMENTS} t * Assign[u,t].sol);
create data Groups from [unit] &Var Trt;
quit;
proc means data=Groups mean std;
class Trt;
var &Var;
run;
proc sgplot data=Groups;
vbox &Var / category=Trt;
run;
It takes about a minute and yields the following results:
SAS Output
[1] | TreatmentSum |
---|---|
1 | 139169 |
2 | 139169 |
3 | 139169 |
4 | 139168 |
5 | 139169 |
Analysis Variable : Cholesterol | |||
---|---|---|---|
Trt | N Obs | Mean | Std Dev |
1 | 627 | 221.9601276 | 29.7794342 |
2 | 627 | 221.9601276 | 41.9241290 |
3 | 627 | 221.9601276 | 42.8058251 |
4 | 627 | 221.9585327 | 42.2811774 |
5 | 627 | 221.9601276 | 54.9869575 |
You can also perform an exact linearization to handle means instead of sums, but the resulting problem is much harder. See this blog post from yesterday.
Here is my GA .and got the similar result with @Rick_SAS
data Units;
set Sashelp.Heart(where=(status = "Alive"));
keep Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
run;
proc iml;
use Units nobs nobs;
read all var {Cholesterol};
close;
start function(x) global(Cholesterol,nobs,group,mean_std);
if countunique(x)=group then do;
do i=1 to group;
idx=loc(x=i);
temp=Cholesterol[idx];
mean_std[i,1]=mean(temp);
mean_std[i,2]=std(temp);
end;
sse=sum(mean_std[<>,]-mean_std[><,]);
end;
else sse=999999;
return (sse);
finish;
group=5; /* <--Change it(divide into 5 groups)*/
mean_std=j(group,2,.);
encoding=j(2,nobs,1);
encoding[2,]=group;
id=gasetup(2,nobs,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = 100;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
create group var {col_mem};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
data want;
merge group Units(keep=Cholesterol);
run;
proc means data=want mean std;
class col_mem;
var Cholesterol;
run;
The MEANS Procedure
Analysis Variable : Cholesterol | |||
---|---|---|---|
COL_MEM | N Obs | Mean | Std Dev |
1 | 630 | 221.9544715 | 43.0901775 |
2 | 662 | 221.9627907 | 43.1227683 |
3 | 629 | 221.9457237 | 43.0978582 |
4 | 648 | 221.9667722 | 43.0939030 |
5 | 649 | 221.9685039 | 43.0944852 |
I noticed there are some missing value in ,so I fixed it again and looks like better than Rick's result.
data Units;
set Sashelp.Heart(where=(status = "Alive" and Cholesterol is not missing));
n+1;
keep n Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
run;
proc iml;
use Units nobs nobs;
read all var {Cholesterol};
close;
start function(x) global(Cholesterol,group,mean_std);
xx=shape(x,group,0,.);
do i=1 to group;
idx=setdif(xx[i,],{.});
temp=Cholesterol[idx];
mean_std[i,1]=mean(temp);
mean_std[i,2]=std(temp);
end;
sse=sum(mean_std[<>,]-mean_std[><,]);
return (sse);
finish;
group=5; /* <--Change it(divide into 5 groups)*/
mean_std=j(group,2,.);
id=gasetup(3,nobs,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000);
niter = 100;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=shape(mem,group,0,.);
create group from col_mem;
append from col_mem;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
data group;
set group;
group+1;
run;
proc transpose data=group out=member(drop=_: index=(col1));
by group;
var col:;
run;
data want;
merge member(rename=(col1=n) where=(n is not missing)) Units(keep=n Cholesterol);
by n;
run;
proc means data=want mean std;
class group;
var Cholesterol;
run;
OUTPUT:
The MEANS Procedure
Analysis Variable : Cholesterol | |||
---|---|---|---|
group | N Obs | Mean | Std Dev |
1 | 627 | 221.9649123 | 43.1137738 |
2 | 627 | 221.9665072 | 43.1220553 |
3 | 627 | 221.9505582 | 43.0823468 |
4 | 627 | 221.9569378 | 43.0712286 |
5 | 627 | 221.9601276 | 43.1105645 |
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.