Hi,
in the dataset have, for each group, there is the distance between each subject and many areas.
in the dataset have_contraints, for each group, there is the number of subjects that belongs to each area.
I would like to assign each subject to only one area minimizing the sum of distances within each group given the constraint that the number of subjects in each area within the group is that of table
have_contraints.
I would appreciate any solution with proc optmodel and/or with IML. As I am not proficient in any of them I would also appreciate very much a description of the code.
thank you very much in advance
You can model this problem as mixed integer linear programming, with a binary decision variable that indicates whether each group-subject-area is assigned. It turns out that the resulting problem is totally unimodular, which means that you can instead solve the linear programming relaxation and still get integer values for the variables. The following PROC OPTMODEL code illustrates four alternatives, but you need only one SOLVE statement:
proc optmodel;
/* declare parameters and read data */
set <num,str> GROUPS_AREAS;
num n {GROUPS_AREAS};
read data lib.have_constraints into GROUPS_AREAS=[group area] n;
set <num,num,str> GROUPS_SUBJECTS_AREAS;
num distance {GROUPS_SUBJECTS_AREAS};
read data lib.have into GROUPS_SUBJECTS_AREAS=[group subject area] distance;
set GROUPS_SUBJECTS = setof {<g,s,a> in GROUPS_SUBJECTS_AREAS} <g,s>;
/* declare variables */
/* Assign[g,s,a] = 1 if group g and area a is assigned subject s; 0 otherwise */
var Assign {GROUPS_SUBJECTS_AREAS} binary;
/* declare objective */
min TotalDistance = sum {<g,s,a> in GROUPS_SUBJECTS_AREAS} distance[g,s,a] * Assign[g,s,a];
/* declare constraints */
con OneAreaPerGroupSubject {<g,s> in GROUPS_SUBJECTS}:
sum {<(g),(s),a> in GROUPS_SUBJECTS_AREAS} Assign[g,s,a] = 1;
con CardinalityPerGroupArea {<g,a> in GROUPS_AREAS}:
sum {<(g),s,(a)> in GROUPS_SUBJECTS_AREAS} Assign[g,s,a] = n[g,a];
/* call MILP solver */
solve;
/* call MILP solver with decomposition algorithm */
solve with milp / decomp=(method=concomp);
/* call LP solver */
solve relaxint;
/* call LP solver with network simplex algorithm */
solve relaxint with lp / algorithm=ns;
/* create output data */
create data lib.want from [group subject area]={<g,s,a> in GROUPS_SUBJECTS_AREAS: Assign[g,s,a].sol > 0.5} distance;
quit;
You can also model this problem as minimum-cost network flow and use the network solver with the MINCOSTFLOW option:
proc optmodel;
/* declare parameters and read data */
set <num,str> GROUPS_AREAS;
num n {GROUPS_AREAS};
read data lib.have_constraints into GROUPS_AREAS=[group area] n;
set <num,num,str> GROUPS_SUBJECTS_AREAS;
num distance {GROUPS_SUBJECTS_AREAS};
read data lib.have into GROUPS_SUBJECTS_AREAS=[group subject area] distance;
set GROUPS_SUBJECTS = setof {<g,s,a> in GROUPS_SUBJECTS_AREAS} <g,s>;
/* construct bipartite network */
num numNodes init 0;
set NODES = 1..numNodes;
num group {NODES};
num subject {NODES} init .;
str area {NODES} init '';
num supply {NODES};
num nodeId_g_s {GROUPS_SUBJECTS};
num nodeId_g_a {GROUPS_AREAS};
for {<g,s> in GROUPS_SUBJECTS} do;
numNodes = numNodes + 1;
group[numNodes] = g;
subject[numNodes] = s;
supply[numNodes] = 1;
nodeId_g_s[g,s] = numNodes;
end;
for {<g,a> in GROUPS_AREAS} do;
numNodes = numNodes + 1;
group[numNodes] = g;
area[numNodes] = a;
supply[numNodes] = -n[g,a];
nodeId_g_a[g,a] = numNodes;
end;
set <num,num> ARCS init {};
num weight {ARCS};
for {<g,s,a> in GROUPS_SUBJECTS_AREAS} do;
ARCS = ARCS union {<nodeId_g_s[g,s],nodeId_g_a[g,a]>};
weight[nodeId_g_s[g,s],nodeId_g_a[g,a]] = distance[g,s,a];
end;
num flow {ARCS};
/* call network solver */
solve with network / mincostflow
direction=directed nodes=(lower=supply) links=(weight=weight) out=(flow=flow);
/* create output data */
create data lib.want from [from to]={<from,to> in ARCS: flow[from,to] > 0.5}
group[from] subject[from] area[to] distance=weight;
quit;
In SAS Viya, because you have a separate problem for each group, you could also use the BY statement in PROC OPTNETWORK (with MINCOSTFLOW) or the groupBy parameter in the runOptmodel action.
Hi,
in the dataset have, for each group, there is the distance between each subject and many areas.
in the dataset have_contraints, for each group, there is the number of subjects that belongs to each area.
I would like to assign each subject to only one area minimizing the sum of distances within each group given the constraint that the number of subjects in each area within the group is that of table
have_contraints.
I would appreciate any solution with proc optmodel and/or with IML. As I am not proficient in any of them I would also appreciate very much a description of the code.
thank you very much in advance
Why not post it at OR forum :
https://communities.sas.com/t5/Mathematical-Optimization/bd-p/operations_research
or IML forum:
https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/bd-p/sas_iml
calling @RobPratt and @Rick_SAS
I think the best choice is using PROC OPTMODEL .
In statistics, this is a clustering problem with a constraint on the size of each cluster. In optimization, you could think of this as a kind of multiple knapsack problem, where you assign objects to "knapsacks" (=areas) so as to optimize some criterion, relative to some constraint. If you are unfamiliar with the basic knapsack problem, perform an internet search for
knapsack problem site:blogs.sas.com
you can find several articles by myself (using IML) and several responses from Rob (using PROC OPTMODEL).
I think your data are incomplete. To solve this problem, I believe you need the distance from every subject to each of the 19 areas. Your data is missing some distances, as shown by running PROC FREQ:
proc freq data=D.have;
tables area;
run;
You can see that some of the areas (11, 14, 17,...) only have about 100 subjects for which the distance is known whereas most areas have 228 subjects whose distance is known.
I'm sure that you want to learn how to solve these problems yourself, so let me provide some advice. When I don't know how to solve a complicated problem, I simplify the problem and learn how to solve the simpler problem. For this problem, the simplest situation is the case of two areas. Try to solve the problem with two areas and 5 obs, where the first area must have N1=2 obs and the second area must have N2=3 obs. When you get a solution for the simple situation, post it. You will then be ready to understand the larger problem.
Thank you Rick for your message.
the reason why the freq does not return the same number of records for each area is that each group has different areas, or in other terms the fact the two groups has some areas in common is just a coincidence.
rerunning the freq as follows all areas has the same number of observations, within each group.
proc freq data=prova.have;
tables group*area /list missing;
run;
The problem I posted is a minimization problem within group
I will try to follow you suggestion, but since I am short of time I would appreciate very much any code.
> The problem I posted is a minimization problem within group
Another suggestion: think of this as an assignment problem.
You have S subjects and G groups. Each group has a specified capacity: Group G1 must contain N1 subjects; G2 must contain N2 subjects, and so on.
You are looking for a discrete vector of length S with values in the range 1-G such that the vector contains exactly N1 1s, exactly N2 2s, and so forth, and such that the sum of the distances for the subject-group assignment is minimized.
You can model this problem as mixed integer linear programming, with a binary decision variable that indicates whether each group-subject-area is assigned. It turns out that the resulting problem is totally unimodular, which means that you can instead solve the linear programming relaxation and still get integer values for the variables. The following PROC OPTMODEL code illustrates four alternatives, but you need only one SOLVE statement:
proc optmodel;
/* declare parameters and read data */
set <num,str> GROUPS_AREAS;
num n {GROUPS_AREAS};
read data lib.have_constraints into GROUPS_AREAS=[group area] n;
set <num,num,str> GROUPS_SUBJECTS_AREAS;
num distance {GROUPS_SUBJECTS_AREAS};
read data lib.have into GROUPS_SUBJECTS_AREAS=[group subject area] distance;
set GROUPS_SUBJECTS = setof {<g,s,a> in GROUPS_SUBJECTS_AREAS} <g,s>;
/* declare variables */
/* Assign[g,s,a] = 1 if group g and area a is assigned subject s; 0 otherwise */
var Assign {GROUPS_SUBJECTS_AREAS} binary;
/* declare objective */
min TotalDistance = sum {<g,s,a> in GROUPS_SUBJECTS_AREAS} distance[g,s,a] * Assign[g,s,a];
/* declare constraints */
con OneAreaPerGroupSubject {<g,s> in GROUPS_SUBJECTS}:
sum {<(g),(s),a> in GROUPS_SUBJECTS_AREAS} Assign[g,s,a] = 1;
con CardinalityPerGroupArea {<g,a> in GROUPS_AREAS}:
sum {<(g),s,(a)> in GROUPS_SUBJECTS_AREAS} Assign[g,s,a] = n[g,a];
/* call MILP solver */
solve;
/* call MILP solver with decomposition algorithm */
solve with milp / decomp=(method=concomp);
/* call LP solver */
solve relaxint;
/* call LP solver with network simplex algorithm */
solve relaxint with lp / algorithm=ns;
/* create output data */
create data lib.want from [group subject area]={<g,s,a> in GROUPS_SUBJECTS_AREAS: Assign[g,s,a].sol > 0.5} distance;
quit;
You can also model this problem as minimum-cost network flow and use the network solver with the MINCOSTFLOW option:
proc optmodel;
/* declare parameters and read data */
set <num,str> GROUPS_AREAS;
num n {GROUPS_AREAS};
read data lib.have_constraints into GROUPS_AREAS=[group area] n;
set <num,num,str> GROUPS_SUBJECTS_AREAS;
num distance {GROUPS_SUBJECTS_AREAS};
read data lib.have into GROUPS_SUBJECTS_AREAS=[group subject area] distance;
set GROUPS_SUBJECTS = setof {<g,s,a> in GROUPS_SUBJECTS_AREAS} <g,s>;
/* construct bipartite network */
num numNodes init 0;
set NODES = 1..numNodes;
num group {NODES};
num subject {NODES} init .;
str area {NODES} init '';
num supply {NODES};
num nodeId_g_s {GROUPS_SUBJECTS};
num nodeId_g_a {GROUPS_AREAS};
for {<g,s> in GROUPS_SUBJECTS} do;
numNodes = numNodes + 1;
group[numNodes] = g;
subject[numNodes] = s;
supply[numNodes] = 1;
nodeId_g_s[g,s] = numNodes;
end;
for {<g,a> in GROUPS_AREAS} do;
numNodes = numNodes + 1;
group[numNodes] = g;
area[numNodes] = a;
supply[numNodes] = -n[g,a];
nodeId_g_a[g,a] = numNodes;
end;
set <num,num> ARCS init {};
num weight {ARCS};
for {<g,s,a> in GROUPS_SUBJECTS_AREAS} do;
ARCS = ARCS union {<nodeId_g_s[g,s],nodeId_g_a[g,a]>};
weight[nodeId_g_s[g,s],nodeId_g_a[g,a]] = distance[g,s,a];
end;
num flow {ARCS};
/* call network solver */
solve with network / mincostflow
direction=directed nodes=(lower=supply) links=(weight=weight) out=(flow=flow);
/* create output data */
create data lib.want from [from to]={<from,to> in ARCS: flow[from,to] > 0.5}
group[from] subject[from] area[to] distance=weight;
quit;
In SAS Viya, because you have a separate problem for each group, you could also use the BY statement in PROC OPTNETWORK (with MINCOSTFLOW) or the groupBy parameter in the runOptmodel action.
Thank you very much Rob!
I have a question, maybe just trivial. the objective function seems to be defined over all groups and not within each group. Isn't the solution different from the within group minimization?
Unfortunately I don'have Sas Viya and cannot perform "by group processing".
Another question. my real data is much bigger: a larger number of groups (say about 10k), a huge number of subjects within (some of the groups- up to 1m) and, say, about 8k areas in few groups.
do you foresee problems with the resources and have any suggestion to optimize the code?
First a clarification about BY-group processing for other users who do have SAS Viya: the BY statement for PROC OPTNETWORK does not apply when there are multiple input tables, as in this use case. The groupBy parameter for runOptmodel applies for any number of tables.
Regarding the objective being a sum of groups, when you have a bunch of disjoint problems like this, it is mathematically equivalent to solve them individually or together as one big problem. But it is usually computationally better (both time and memory) to solve them individually, and the decomp=(method=concomp), algorithm=ns, and solve with network approaches all automatically split the big problem into one problem per group. All the approaches I have illustrated yield the same optimal objective value of 9069.4.
In SAS 9.4, the most memory-efficient approach is to use PROC OPTNET (which for minimum-cost network flow problems calls the same network simplex algorithm):
/* node data */
proc sql;
create table lib.group_subject as
select distinct group, subject, cat(group,'_',subject) as node, 1 as weight
from lib.have;
quit;
data lib.group_area;
set lib.have_constraints;
node = cat(group,'_',area);
weight = -n;
run;
data NodeSetIn;
set lib.group_subject lib.group_area;
run;
/* link data */
data LinkSetIn;
set lib.have;
from = cat(group,'_',subject);
to = cat(group,'_',area);
run;
proc optnet direction=directed nodes=NodeSetIn links=LinkSetIn out_links=out_links;
links_var weight=distance;
mincostflow;
run;
But for large enough data, you will run out of memory. So I recommend splitting your 10k groups into smaller batches and calling PROC OPTNET once per batch.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.