I am trying to write a model that selects items for an assessment. We have a measure of lexical similarity between item pairs that we would like to minimize. The intent is to select a set of items that are as diverse as possible. The similarity index is essentially a correlation matrix.
The code below seems to do what I want for simple problems (e.g., selecting 15 items from a pool of 100), but it does not scale well when either the pool or the number of items selected increase. Are there any suggestions on how to make this model more efficient? A large percentage of the elements in the correlation matrix are 0, if that affects any suggestions.
Thank you.
proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];
num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;
var x{QuestionIDs} BINARY;
constraint TotItems: sum{i in QuestionIDs}x[i]=15;
var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]1;
min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];
solve with milp;
Assuming CSI[i,j] >= 0, your constraints lin1 and lin2 will naturally be satisfied because of the objective, so you could reduce the problem size by omitting them. If that doesn't help, please share the data.
The selection variable X determines whether an individual item will appear on the test. Because CSI is about correlations (i.e., relationships between item pairs), I created Ynew to determine whether the pair X[i] and X[j] were both selected. Logically, Ynew[i,j] = X[i]*X[j]. In words, if both i and j are selected, then the item pair comprised of i and j is also selected. If either i or j is not selected, then the pair [i,j] is also not selected. The problem with Ynew[i,j]=X[i]*X[j] is that it is nonlinear. Therefore, I used the constraints lin1, lin2, and lin3 to linearize that relationship. Is there a better way to do this part?
I added some data to the code below. I reduced the example to just 5 items for simplicity.
data Itemdata;
input questionid;
datalines;
227447
227450
227451
227452
227481
;
data CSI;
input Q1 Q2 CSI;
datalines;
227447 227447 1
227450 227450 1
227451 227447 0.084515426
227451 227451 1
227452 227447 0.07100716
227452 227450 0.019418391
227452 227452 1
227481 227447 0.135526185
227481 227481 1
;
proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];
num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;
var x{QuestionIDs} BINARY;
constraint TotItems: sum{i in QuestionIDs}x[i]=3;
var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]1;
min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];
solve with milp;
Yes, that is the standard way to model the product of two binary variables. But because of your minimization objective, the solver naturally "wants" to make Ynew[i,j] small, so without loss of optimality you can omit constraints lin1 and lin2 and enforce the nonlinear relationship Ynew[i,j] >= x[i]*x[j] via the linear constraint lin3. You will get the same optimal objective value.
If CSI[i,j] = 0, the solver might return Ynew[i,j] = 1 when x[i] = 0 and x[j] = 0. If it is important that Ynew[i,j] = x[i]*x[j], you can postprocess as follows:
proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];
num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data CSI into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;
var x{QuestionIDs} BINARY;
constraint TotItems: sum{i in QuestionIDs}x[i]=3;
var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
/* constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/* constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]1;
min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];
solve with milp;
print X Ynew;
/* postprocess */
for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
print X Ynew;
print totCSI;
quit;
Omitting constraints lin1 and lin2 reduces the problem size and will likely speed up the solver. Please try it on your larger instance and see if it helps. If it is still too slow, I have a couple of further ideas.
Thank you for that suggestion. I was previously getting out of memory errors almost instantly, and with this modification it ran for more than an hour before giving an out of memory error. If you have other suggestions, I would appreciate them. Thank you.
Please share your data, and I'll take a look.
The attached data files are more representative of the larger problem. Change the selection in the previous code to 150 instead of 3.
Here are three additional ideas you can try individually or in combination:
1. Relax the Ynew variables to >= 0 instead of binary. They will automatically take {0,1} values at optimality.
2. Add the following valid constraint, which can be derived by multiplying both sides of TotItems by x[j]:
constraint Cut {j in QuestionIDs}:
sum {i in QuestionIDs: j>i} Ynew[i,j]
+ sum {i in QuestionIDs: j<i} Ynew[j,i]
= (1501)*x[j];
3. Solve the (nonconvex) NLP relaxation of the original quadratic problem and round the resulting solution to get an integer feasible solution, which you can use as a warm start for the MILP solver:
proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];
num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;
var x{QuestionIDs} BINARY;
constraint TotItems: sum{i in QuestionIDs}x[i]=150;
min totCSI_quadratic = sum{i in questionIDs, j in questionIDs: j>i} CSI[i,j]*x[i]*x[j];
solve with nlp relaxint / ms;
print {i in QuestionIDs: x[i].sol > 1e6} x;
/* var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;*/
var Ynew{i in questionIDs, j in questionIDs: j>i} >= 0;
/* constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/* constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]1;
for {i in QuestionIDs} x[i] = round(x[i]);
for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];
/* optional */
* constraint Cut {j in QuestionIDs}:
sum {i in QuestionIDs: j>i} Ynew[i,j]
+ sum {i in QuestionIDs: j<i} Ynew[j,i]
= (1501)*x[j];
solve with milp / primalin;
create data outdata from [i]={i in QuestionIDs: x[i].sol > 0.5} x;
quit;
Thank you for your help, Rob.
Glad to help. In case you want to do a literature search, your problem is equivalent to the maximum edgeweight clique problem (MEWCP), which is NPhard. To see the correspondence, just negate your objective coefficients and change min to max.
Hi Robpratt
My problem is on maximizing the correlation with PROC OPTMODEL.
I am new to Proc optmodel.
In your one of the previous post in the year 2012 subject line is Proc OptModel Help on Maximize Correlation), you had provided the solution.
When i tried running the same code in SAS 9.4, I am getting an error message. could you please help on the same
following are the data and code
data cor;
input org dep;
cards;
11785895 7.42
2335492 7.58
2345245 7.58
2392912 7.53
12755890 7.63
2918402 7.67
2773183 7.68
2824198 7.65
12263433 7.53
;
run;
proc optmodel;
set w;
number org{w};
number ads{w};
number dep{w};
var s{w} >= 0 <= .8 init 1;
read data cor into w = [_n_] org dep;
impvar ads {i in w} = org * s;
max correlation = (10*sum{i in w}(ads*dep)  (sum{i in w}(ads)) * (sum{i in w}(dep)))
/ (sqrt(10*sum{i in w}(ads^2)  (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2)  (sum{i in w}(dep))^2));
solve;
print s correlation;
quit;
Log message after running the code
96 proc optmodel;
NOTE: Writing HTML Body file: sashtml20.htm
597 set w;
598 number org{w};
599 number ads{w};
600 number dep{w};
601 var s{w} >= 0 <= .8 init 1;
602 read data cor into w = [_n_] org dep;
NOTE: There were 9 observations read from the data set WORK.COR.
603
604 impvar ads {i in w} = org * s;
 
517 614
ERROR 517782: The name 'ads' is already declared.
ERROR 614782: The name 'org' is an array.
604! impvar ads {i in w} = org * s;

614
ERROR 614782: The name 's' is an array.
605
606 max correlation = (10*sum{i in w}(ads*dep)  (sum{i in w}(ads)) * (sum{i in w}(dep)))
 
614 614
ERROR 614782: The name 'ads' is an array.
606! max correlation = (10*sum{i in w}(ads*dep)  (sum{i in w}(ads)) * (sum{i in w}(dep)))
 
614 614
ERROR 614782: The name 'dep' is an array.
607 / (sqrt(10*sum{i in w}(ads^2)  (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) 
 
614 614
ERROR 614782: The name 'ads' is an array.
607! / (sqrt(10*sum{i in w}(ads^2)  (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) 


614
614
607! (sum{i in w}(dep))^2));
ERROR 614782: The name 'dep' is an array.
608 solve;
NOTE: Problem generation will use 4 threads.
NOTE: Previous errors might cause the problem to be resolved incorrectly.
NOTE: The problem has 9 variables (0 free, 0 fixed).
NOTE: The problem has 0 linear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver is disabled for linear problems.
WARNING: No objective has been specified. A constant zero objective will be used.
NOTE: The problem is a pure network instance. The ALGORITHM=NETWORK option is recommended for
solving problems with this structure.
NOTE: The LP presolver value AUTOMATIC is applied.
NOTE: The LP presolver removed all variables and constraints.
NOTE: Optimal.
NOTE: Objective = 0.
609 print s correlation;
ERROR: The symbol 'correlation' has no value at line 609 column 9.
610 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE OPTMODEL used (Total process time):
real time 1.71 seconds
cpu time 0.79 seconds
Could you please look into that. thanks Ajit
Hi,
1) You have declared ads as both a number and an impvar. You probably want to comment out the number statement
2) I don't know if this has changed from the example you mention, but the 'org' and 's' are arrays so you need the suffix org[i] and s[i] for the impvar statement to work.
3) All 'ads' and 'dep' references in the 'correlation' objective need to specify their array index too, i.e. ads[i] and dep[i]
That should help.
Thank you so much for the quick solution.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.