Operations Research topics: SAS/OR,
SAS Optimization, and SAS Simulation Studio

Minimize Correlation between Selected Observations

Reply
Occasional Contributor
Posts: 5

Minimize Correlation between Selected Observations

I am trying to write a model that selects items for an assessment. We have a measure of lexical similarity between item pairs that we would like to minimize. The intent is to select a set of items that are as diverse as possible. The similarity index is essentially a correlation matrix.

 

The code below seems to do what I want for simple problems (e.g., selecting 15 items from a pool of 100), but it does not scale well when either the pool or the number of items selected increase. Are there any suggestions on how to make this model more efficient? A large percentage of the elements in the correlation matrix are 0, if that affects any suggestions. 

 

Thank you.

 

proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=15;

var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
	constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
	constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
	constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

solve with milp;

SAS Employee
Posts: 499

Re: Minimize Correlation between Selected Observations

Posted in reply to CaseyCodd

Assuming CSI[i,j] >= 0, your constraints lin1 and lin2 will naturally be satisfied because of the objective, so you could reduce the problem size by omitting them.  If that doesn't help, please share the data.

Occasional Contributor
Posts: 5

Re: Minimize Correlation between Selected Observations

The selection variable X determines whether an individual item will appear on the test. Because CSI is about correlations (i.e., relationships between item pairs), I created Ynew to determine whether the pair X[i] and X[j] were both selected. Logically, Ynew[i,j] = X[i]*X[j]. In words, if both i and j are selected, then the item pair comprised of i and j is also selected. If either i or j is not selected, then the pair [i,j] is also not selected. The problem with Ynew[i,j]=X[i]*X[j] is that it is nonlinear. Therefore, I used the constraints lin1, lin2, and lin3 to linearize that relationship. Is there a better way to do this part?

 

I added some data to the code below. I reduced the example to just 5 items for simplicity.

 

data Itemdata;
input questionid;
datalines;
227447
227450
227451
227452
227481
;

data CSI;
input Q1 Q2 CSI;
datalines;
227447	227447	1
227450	227450	1
227451	227447	0.084515426
227451	227451	1
227452	227447	0.07100716
227452	227450	0.019418391
227452	227452	1
227481	227447	0.135526185
227481	227481	1
;

proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=3;

var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
	constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
	constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
	constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

solve with milp;
SAS Employee
Posts: 499

Re: Minimize Correlation between Selected Observations

Posted in reply to CaseyCodd

Yes, that is the standard way to model the product of two binary variables.  But because of your minimization objective, the solver naturally "wants" to make Ynew[i,j] small, so without loss of optimality you can omit constraints lin1 and lin2 and enforce the nonlinear relationship Ynew[i,j] >= x[i]*x[j] via the linear constraint lin3.  You will get the same optimal objective value.

 

If CSI[i,j] = 0, the solver might return Ynew[i,j] = 1 when x[i] = 0 and x[j] = 0.  If it is important that Ynew[i,j] = x[i]*x[j], you can postprocess as follows:

proc optmodel;
   set QuestionIDs;
   read data ItemData into QuestionIDs = [questionid];

   num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
   read data CSI into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

   var x{QuestionIDs} BINARY;

   constraint TotItems: sum{i in QuestionIDs}x[i]=3;

   var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
/*	constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/*	constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
   constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

   min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

   solve with milp;
   print X Ynew;

   /* postprocess */
   for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
   print X Ynew;
   print totCSI;
quit;

Omitting constraints lin1 and lin2 reduces the problem size and will likely speed up the solver.  Please try it on your larger instance and see if it helps.  If it is still too slow, I have a couple of further ideas.

Occasional Contributor
Posts: 5

Re: Minimize Correlation between Selected Observations

Thank you for that suggestion. I was previously getting out of memory errors almost instantly, and with this modification it ran for more than an hour before giving an out of memory error. If you have other suggestions, I would appreciate them. Thank you.

SAS Employee
Posts: 499

Re: Minimize Correlation between Selected Observations

Posted in reply to CaseyCodd

Please share your data, and I'll take a look.

Occasional Contributor
Posts: 5

Re: Minimize Correlation between Selected Observations

The attached data files are more representative of the larger problem. Change the selection in the previous code to 150 instead of 3.

SAS Employee
Posts: 499

Re: Minimize Correlation between Selected Observations

[ Edited ]
Posted in reply to CaseyCodd

Here are three additional ideas you can try individually or in combination:

1. Relax the Ynew variables to >= 0 instead of binary.  They will automatically take {0,1} values at optimality.

2. Add the following valid constraint, which can be derived by multiplying both sides of TotItems by x[j]:

constraint Cut {j in QuestionIDs}: 
      sum {i in QuestionIDs: j>i} Ynew[i,j] 
    + sum {i in QuestionIDs: j<i} Ynew[j,i] 
    = (150-1)*x[j];

3. Solve the (nonconvex) NLP relaxation of the original quadratic problem and round the resulting solution to get an integer feasible solution, which you can use as a warm start for the MILP solver:

proc optmodel;
   set QuestionIDs;
   read data ItemData into QuestionIDs = [questionid];

   num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
   read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

   var x{QuestionIDs} BINARY;

   constraint TotItems: sum{i in QuestionIDs}x[i]=150;

   min totCSI_quadratic = sum{i in questionIDs, j in questionIDs: j>i} CSI[i,j]*x[i]*x[j];

   solve with nlp relaxint / ms;
   print {i in QuestionIDs: x[i].sol > 1e-6} x;

/*   var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;*/
   var Ynew{i in questionIDs, j in questionIDs: j>i} >= 0;
/* constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/* constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
   constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

   for {i in QuestionIDs} x[i] = round(x[i]);
   for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
   min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

   /* optional */
*   constraint Cut {j in QuestionIDs}: 
      sum {i in QuestionIDs: j>i} Ynew[i,j] 
    + sum {i in QuestionIDs: j<i} Ynew[j,i] 
    = (150-1)*x[j];

   solve with milp / primalin;

   create data outdata from [i]={i in QuestionIDs: x[i].sol > 0.5} x;
quit;

 

Occasional Contributor
Posts: 5

Re: Minimize Correlation between Selected Observations

Thank you for your help, Rob.

SAS Employee
Posts: 499

Re: Minimize Correlation between Selected Observations

Posted in reply to CaseyCodd

Glad to help.  In case you want to do a literature search, your problem is equivalent to the maximum edge-weight clique problem (MEWCP), which is NP-hard.  To see the correspondence, just negate your objective coefficients and change min to max.

Ask a Question
Discussion stats
  • 9 replies
  • 189 views
  • 2 likes
  • 2 in conversation