## Minimize Correlation between Selected Observations

Occasional Contributor
Posts: 5

# Minimize Correlation between Selected Observations

I am trying to write a model that selects items for an assessment. We have a measure of lexical similarity between item pairs that we would like to minimize. The intent is to select a set of items that are as diverse as possible. The similarity index is essentially a correlation matrix.

The code below seems to do what I want for simple problems (e.g., selecting 15 items from a pool of 100), but it does not scale well when either the pool or the number of items selected increase. Are there any suggestions on how to make this model more efficient? A large percentage of the elements in the correlation matrix are 0, if that affects any suggestions.

Thank you.

``````proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=15;

var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

solve with milp;

``````
SAS Employee
Posts: 571

## Re: Minimize Correlation between Selected Observations

Assuming CSI[i,j] >= 0, your constraints lin1 and lin2 will naturally be satisfied because of the objective, so you could reduce the problem size by omitting them.  If that doesn't help, please share the data.

Occasional Contributor
Posts: 5

## Re: Minimize Correlation between Selected Observations

The selection variable X determines whether an individual item will appear on the test. Because CSI is about correlations (i.e., relationships between item pairs), I created Ynew to determine whether the pair X[i] and X[j] were both selected. Logically, Ynew[i,j] = X[i]*X[j]. In words, if both i and j are selected, then the item pair comprised of i and j is also selected. If either i or j is not selected, then the pair [i,j] is also not selected. The problem with Ynew[i,j]=X[i]*X[j] is that it is nonlinear. Therefore, I used the constraints lin1, lin2, and lin3 to linearize that relationship. Is there a better way to do this part?

I added some data to the code below. I reduced the example to just 5 items for simplicity.

``````data Itemdata;
input questionid;
datalines;
227447
227450
227451
227452
227481
;

data CSI;
input Q1 Q2 CSI;
datalines;
227447	227447	1
227450	227450	1
227451	227447	0.084515426
227451	227451	1
227452	227447	0.07100716
227452	227450	0.019418391
227452	227452	1
227481	227447	0.135526185
227481	227481	1
;

proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=3;

var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

solve with milp;``````
SAS Employee
Posts: 571

## Re: Minimize Correlation between Selected Observations

Yes, that is the standard way to model the product of two binary variables.  But because of your minimization objective, the solver naturally "wants" to make Ynew[i,j] small, so without loss of optimality you can omit constraints lin1 and lin2 and enforce the nonlinear relationship Ynew[i,j] >= x[i]*x[j] via the linear constraint lin3.  You will get the same optimal objective value.

If CSI[i,j] = 0, the solver might return Ynew[i,j] = 1 when x[i] = 0 and x[j] = 0.  If it is important that Ynew[i,j] = x[i]*x[j], you can postprocess as follows:

``````proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data CSI into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=3;

var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
/*	constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/*	constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

solve with milp;
print X Ynew;

/* postprocess */
for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
print X Ynew;
print totCSI;
quit;``````

Omitting constraints lin1 and lin2 reduces the problem size and will likely speed up the solver.  Please try it on your larger instance and see if it helps.  If it is still too slow, I have a couple of further ideas.

Occasional Contributor
Posts: 5

## Re: Minimize Correlation between Selected Observations

Thank you for that suggestion. I was previously getting out of memory errors almost instantly, and with this modification it ran for more than an hour before giving an out of memory error. If you have other suggestions, I would appreciate them. Thank you.

SAS Employee
Posts: 571

## Re: Minimize Correlation between Selected Observations

Occasional Contributor
Posts: 5

## Re: Minimize Correlation between Selected Observations

The attached data files are more representative of the larger problem. Change the selection in the previous code to 150 instead of 3.

SAS Employee
Posts: 571

## Re: Minimize Correlation between Selected Observations

[ Edited ]

Here are three additional ideas you can try individually or in combination:

1. Relax the Ynew variables to >= 0 instead of binary.  They will automatically take {0,1} values at optimality.

2. Add the following valid constraint, which can be derived by multiplying both sides of TotItems by x[j]:

``````constraint Cut {j in QuestionIDs}:
sum {i in QuestionIDs: j>i} Ynew[i,j]
+ sum {i in QuestionIDs: j<i} Ynew[j,i]
= (150-1)*x[j];
``````

3. Solve the (nonconvex) NLP relaxation of the original quadratic problem and round the resulting solution to get an integer feasible solution, which you can use as a warm start for the MILP solver:

``````proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=150;

min totCSI_quadratic = sum{i in questionIDs, j in questionIDs: j>i} CSI[i,j]*x[i]*x[j];

solve with nlp relaxint / ms;
print {i in QuestionIDs: x[i].sol > 1e-6} x;

/*   var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;*/
var Ynew{i in questionIDs, j in questionIDs: j>i} >= 0;
/* constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/* constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

for {i in QuestionIDs} x[i] = round(x[i]);
for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

/* optional */
*   constraint Cut {j in QuestionIDs}:
sum {i in QuestionIDs: j>i} Ynew[i,j]
+ sum {i in QuestionIDs: j<i} Ynew[j,i]
= (150-1)*x[j];

solve with milp / primalin;

create data outdata from [i]={i in QuestionIDs: x[i].sol > 0.5} x;
quit;``````

Occasional Contributor
Posts: 5

## Re: Minimize Correlation between Selected Observations

Thank you for your help, Rob.

SAS Employee
Posts: 571

## Re: Minimize Correlation between Selected Observations

Glad to help.  In case you want to do a literature search, your problem is equivalent to the maximum edge-weight clique problem (MEWCP), which is NP-hard.  To see the correspondence, just negate your objective coefficients and change min to max.

New Contributor
Posts: 2

## Re: Minimize Correlation between Selected Observations

Hi Robpratt

My problem is on maximizing the correlation with PROC OPTMODEL.

I am new to Proc optmodel.

In your one of the previous post in the year 2012 subject line is Proc OptModel Help on Maximize Correlation), you had provided the solution.

When i tried running the same code in SAS 9.4, I am getting an error message. could you please help on the same

following are  the data and code

data cor;

input org dep;

cards;

11785895 7.42

2335492 7.58

2345245 7.58

2392912 7.53

12755890 7.63

2918402 7.67

2773183 7.68

2824198 7.65

12263433 7.53

;

run;

proc optmodel;
set w;
number org{w};
number dep{w};
var s{w} >= 0 <= .8 init 1;
read data cor into w = [_n_] org dep;

impvar ads {i in w} = org * s;

max correlation = (10*sum{i in w}(ads*dep) - (sum{i in w}(ads)) * (sum{i in w}(dep)))
/ (sqrt(10*sum{i in w}(ads^2) - (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) - (sum{i in w}(dep))^2));
solve;
print s correlation;
quit;

Log message after running the code

96 proc optmodel;
NOTE: Writing HTML Body file: sashtml20.htm
597 set w;
598 number org{w};
600 number dep{w};
601 var s{w} >= 0 <= .8 init 1;
602 read data cor into w = [_n_] org dep;
NOTE: There were 9 observations read from the data set WORK.COR.
603
604 impvar ads {i in w} = org * s;
--- ---
517 614

ERROR 614-782: The name 'org' is an array.

604! impvar ads {i in w} = org * s;
-
614
ERROR 614-782: The name 's' is an array.

605
606 max correlation = (10*sum{i in w}(ads*dep) - (sum{i in w}(ads)) * (sum{i in w}(dep)))
--- ---
614 614
ERROR 614-782: The name 'ads' is an array.

606! max correlation = (10*sum{i in w}(ads*dep) - (sum{i in w}(ads)) * (sum{i in w}(dep)))
--- ---
614 614
ERROR 614-782: The name 'dep' is an array.
607 / (sqrt(10*sum{i in w}(ads^2) - (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) -
--- ---
614 614
ERROR 614-782: The name 'ads' is an array.

607! / (sqrt(10*sum{i in w}(ads^2) - (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) -
---
---
614
614
607! (sum{i in w}(dep))^2));
ERROR 614-782: The name 'dep' is an array.

608 solve;
NOTE: Problem generation will use 4 threads.
NOTE: Previous errors might cause the problem to be resolved incorrectly.
NOTE: The problem has 9 variables (0 free, 0 fixed).
NOTE: The problem has 0 linear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver is disabled for linear problems.
WARNING: No objective has been specified. A constant zero objective will be used.
NOTE: The problem is a pure network instance. The ALGORITHM=NETWORK option is recommended for
solving problems with this structure.
NOTE: The LP presolver value AUTOMATIC is applied.
NOTE: The LP presolver removed all variables and constraints.
NOTE: Optimal.
NOTE: Objective = 0.
609 print s correlation;
ERROR: The symbol 'correlation' has no value at line 609 column 9.
610 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE OPTMODEL used (Total process time):
real time 1.71 seconds
cpu time 0.79 seconds

Could you please look into that. thanks Ajit

New Contributor
Posts: 3

## Re: Minimize Correlation between Selected Observations

Hi,

1) You have declared ads as both a number and an impvar. You probably want to comment out the number statement

2) I don't know if this has changed from the example you mention, but the 'org' and 's' are arrays so you need the suffix org[i] and s[i] for the impvar statement to work.

3) All 'ads' and 'dep' references in the 'correlation' objective need to specify their array index too, i.e. ads[i] and dep[i]

That should help.

Highlighted
New Contributor
Posts: 2