Re: Minimize Correlation between Selected Observations

CaseyCodd · Posted 12-06-2017 11:22 AM

I am trying to write a model that selects items for an assessment. We have a measure of lexical similarity between item pairs that we would like to minimize. The intent is to select a set of items that are as diverse as possible. The similarity index is essentially a correlation matrix.

The code below seems to do what I want for simple problems (e.g., selecting 15 items from a pool of 100), but it does not scale well when either the pool or the number of items selected increase. Are there any suggestions on how to make this model more efficient? A large percentage of the elements in the correlation matrix are 0, if that affects any suggestions.

Thank you.

proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=15;

var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
	constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
	constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
	constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

solve with milp;

RobPratt · Posted 12-06-2017 12:48 PM

Assuming CSI[i,j] >= 0, your constraints lin1 and lin2 will naturally be satisfied because of the objective, so you could reduce the problem size by omitting them. If that doesn't help, please share the data.

CaseyCodd · Posted 12-06-2017 02:09 PM

The selection variable X determines whether an individual item will appear on the test. Because CSI is about correlations (i.e., relationships between item pairs), I created Ynew to determine whether the pair X[i] and X[j] were both selected. Logically, Ynew[i,j] = X[i]*X[j]. In words, if both i and j are selected, then the item pair comprised of i and j is also selected. If either i or j is not selected, then the pair [i,j] is also not selected. The problem with Ynew[i,j]=X[i]*X[j] is that it is nonlinear. Therefore, I used the constraints lin1, lin2, and lin3 to linearize that relationship. Is there a better way to do this part?

I added some data to the code below. I reduced the example to just 5 items for simplicity.

data Itemdata;
input questionid;
datalines;
227447
227450
227451
227452
227481
;

data CSI;
input Q1 Q2 CSI;
datalines;
227447	227447	1
227450	227450	1
227451	227447	0.084515426
227451	227451	1
227452	227447	0.07100716
227452	227450	0.019418391
227452	227452	1
227481	227447	0.135526185
227481	227481	1
;

proc optmodel;
set QuestionIDs;
read data ItemData into QuestionIDs = [questionid];

num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

var x{QuestionIDs} BINARY;

constraint TotItems: sum{i in QuestionIDs}x[i]=3;

var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
	constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];
	constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];
	constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

solve with milp;

RobPratt · Posted 12-06-2017 02:30 PM

Yes, that is the standard way to model the product of two binary variables. But because of your minimization objective, the solver naturally "wants" to make Ynew[i,j] small, so without loss of optimality you can omit constraints lin1 and lin2 and enforce the nonlinear relationship Ynew[i,j] >= x[i]*x[j] via the linear constraint lin3. You will get the same optimal objective value.

If CSI[i,j] = 0, the solver might return Ynew[i,j] = 1 when x[i] = 0 and x[j] = 0. If it is important that Ynew[i,j] = x[i]*x[j], you can postprocess as follows:

proc optmodel;
   set QuestionIDs;
   read data ItemData into QuestionIDs = [questionid];

   num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
   read data CSI into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

   var x{QuestionIDs} BINARY;

   constraint TotItems: sum{i in QuestionIDs}x[i]=3;

   var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;
/*	constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/*	constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
   constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

   min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

   solve with milp;
   print X Ynew;

   /* postprocess */
   for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
   print X Ynew;
   print totCSI;
quit;

Omitting constraints lin1 and lin2 reduces the problem size and will likely speed up the solver. Please try it on your larger instance and see if it helps. If it is still too slow, I have a couple of further ideas.

CaseyCodd · Posted 12-07-2017 09:32 AM

Thank you for that suggestion. I was previously getting out of memory errors almost instantly, and with this modification it ran for more than an hour before giving an out of memory error. If you have other suggestions, I would appreciate them. Thank you.

RobPratt · Posted 12-07-2017 11:09 AM

Please share your data, and I'll take a look.

CaseyCodd · Posted 12-07-2017 12:09 PM

The attached data files are more representative of the larger problem. Change the selection in the previous code to 150 instead of 3.

RobPratt · Posted 12-08-2017 04:00 PM

Here are three additional ideas you can try individually or in combination:

1. Relax the Ynew variables to >= 0 instead of binary. They will automatically take {0,1} values at optimality.

2. Add the following valid constraint, which can be derived by multiplying both sides of TotItems by x[j]:

constraint Cut {j in QuestionIDs}: 
      sum {i in QuestionIDs: j>i} Ynew[i,j] 
    + sum {i in QuestionIDs: j<i} Ynew[j,i] 
    = (150-1)*x[j];

3. Solve the (nonconvex) NLP relaxation of the original quadratic problem and round the resulting solution to get an integer feasible solution, which you can use as a warm start for the MILP solver:

proc optmodel;
   set QuestionIDs;
   read data ItemData into QuestionIDs = [questionid];

   num CSI {i in QuestionIDs,j in QuestionIDs} init 0;
   read data tall into [Q1 Q2] CSI CSI[Q2,Q1]=CSI;

   var x{QuestionIDs} BINARY;

   constraint TotItems: sum{i in QuestionIDs}x[i]=150;

   min totCSI_quadratic = sum{i in questionIDs, j in questionIDs: j>i} CSI[i,j]*x[i]*x[j];

   solve with nlp relaxint / ms;
   print {i in QuestionIDs: x[i].sol > 1e-6} x;

/*   var Ynew{i in questionIDs, j in questionIDs: j>i} BINARY;*/
   var Ynew{i in questionIDs, j in questionIDs: j>i} >= 0;
/* constraint lin1{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[i];*/
/* constraint lin2{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] <= x[j];*/
   constraint lin3{i in QuestionIDs, j in QuestionIDs: j>i}: Ynew[i,j] >= x[i]+x[j]-1;

   for {i in QuestionIDs} x[i] = round(x[i]);
   for {i in QuestionIDs, j in QuestionIDs: j>i} Ynew[i,j] = x[i]*x[j];
   min totCSI = sum{i in questionIDs, j in questionIDs: j>i}Ynew[i,j]*CSI[i,j];

   /* optional */
*   constraint Cut {j in QuestionIDs}: 
      sum {i in QuestionIDs: j>i} Ynew[i,j] 
    + sum {i in QuestionIDs: j<i} Ynew[j,i] 
    = (150-1)*x[j];

   solve with milp / primalin;

   create data outdata from [i]={i in QuestionIDs: x[i].sol > 0.5} x;
quit;

CaseyCodd · Posted 12-12-2017 09:15 AM

Thank you for your help, Rob.

RobPratt · Posted 12-12-2017 10:25 AM

Glad to help. In case you want to do a literature search, your problem is equivalent to the maximum edge-weight clique problem (MEWCP), which is NP-hard. To see the correspondence, just negate your objective coefficients and change min to max.

deadsea · Posted 02-26-2018 07:03 AM

Hi Robpratt

My problem is on maximizing the correlation with PROC OPTMODEL.

I am new to Proc optmodel.

In your one of the previous post in the year 2012 subject line is Proc OptModel Help on Maximize Correlation), you had provided the solution.

When i tried running the same code in SAS 9.4, I am getting an error message. could you please help on the same

following are the data and code

data cor;

input org dep;

cards;

11785895 7.42

2335492 7.58

2345245 7.58

2392912 7.53

12755890 7.63

2918402 7.67

2773183 7.68

2824198 7.65

12263433 7.53

;

run;

proc optmodel;
set w;
number org{w};
number ads{w};
number dep{w};
var s{w} >= 0 <= .8 init 1;
read data cor into w = [_n_] org dep;

impvar ads {i in w} = org * s;

max correlation = (10*sum{i in w}(ads*dep) - (sum{i in w}(ads)) * (sum{i in w}(dep)))
/ (sqrt(10*sum{i in w}(ads^2) - (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) - (sum{i in w}(dep))^2));
solve;
print s correlation;
quit;

Log message after running the code

96 proc optmodel;
NOTE: Writing HTML Body file: sashtml20.htm
597 set w;
598 number org{w};
599 number ads{w};
600 number dep{w};
601 var s{w} >= 0 <= .8 init 1;
602 read data cor into w = [_n_] org dep;
NOTE: There were 9 observations read from the data set WORK.COR.
603
604 impvar ads {i in w} = org * s;
--- ---
517 614
ERROR 517-782: The name 'ads' is already declared.

ERROR 614-782: The name 'org' is an array.

604! impvar ads {i in w} = org * s;
-
614
ERROR 614-782: The name 's' is an array.

605
606 max correlation = (10*sum{i in w}(ads*dep) - (sum{i in w}(ads)) * (sum{i in w}(dep)))
--- ---
614 614
ERROR 614-782: The name 'ads' is an array.

606! max correlation = (10*sum{i in w}(ads*dep) - (sum{i in w}(ads)) * (sum{i in w}(dep)))
--- ---
614 614
ERROR 614-782: The name 'dep' is an array.
607 / (sqrt(10*sum{i in w}(ads^2) - (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) -
--- ---
614 614
ERROR 614-782: The name 'ads' is an array.

607! / (sqrt(10*sum{i in w}(ads^2) - (sum{i in w}(ads))^2) * sqrt(10*sum{i in w}(dep^2) -
---
---
614
614
607! (sum{i in w}(dep))^2));
ERROR 614-782: The name 'dep' is an array.

608 solve;
NOTE: Problem generation will use 4 threads.
NOTE: Previous errors might cause the problem to be resolved incorrectly.
NOTE: The problem has 9 variables (0 free, 0 fixed).
NOTE: The problem has 0 linear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver is disabled for linear problems.
WARNING: No objective has been specified. A constant zero objective will be used.
NOTE: The problem is a pure network instance. The ALGORITHM=NETWORK option is recommended for
solving problems with this structure.
NOTE: The LP presolver value AUTOMATIC is applied.
NOTE: The LP presolver removed all variables and constraints.
NOTE: Optimal.
NOTE: Objective = 0.
609 print s correlation;
ERROR: The symbol 'correlation' has no value at line 609 column 9.
610 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE OPTMODEL used (Total process time):
real time 1.71 seconds
cpu time 0.79 seconds

Could you please look into that. thanks Ajit

Berliner_Ørsted · Posted 02-26-2018 07:15 AM

Hi,

1) You have declared ads as both a number and an impvar. You probably want to comment out the number statement

2) I don't know if this has changed from the example you mention, but the 'org' and 's' are arrays so you need the suffix org[i] and s[i] for the impvar statement to work.

3) All 'ads' and 'dep' references in the 'correlation' objective need to specify their array index too, i.e. ads[i] and dep[i]

That should help.

deadsea · Posted 02-26-2018 07:36 AM

Thank you so much for the quick solution.