BookmarkSubscribeRSS Feed
jennifer_s_
Calcite | Level 5

I currently have a small dataset of z-scores, around 67 of them.

I have an Excel Solver formula set up, to create a probability for each z-score, according to certain constraints (the main one is that the average PD needs to be a certain value, say 0.08%). I'm also using scale and shift variables on my z-score before converting them to probabilities. 

 

I've used Proc Optmodel before, and happy with how it works, however in this instance, I can't quite get the procedure to output the same probabilities as I'm seeing in Excel Solver. This may be either 1) Correct and SAS is just more accurate and can get to a "better" solution than Excel (likely), or 2) I'm missing something in my sas procedure code. I'm finding the resultant probabilities from the sas procedure see significant "peaks" at certain high z-scores. This makes sense that they are higher, but this pattern is not seen as strongly in the Excel output.

 

Would someone be able to see what I'm potentially missing (or not)! I have attached the Excel Solver which also includes the probabilities seen in both. Any help appreciated!

 

TIA

 

proc optmodel printlevel = 0;
set modelin;
    /* Define parameters (arrays) indexed by xn */
    num zscore{modelin};
    /* Read data from SAS dataset into parameters */
    read data rawdata into modelin = [xn] zscore = zscore;

var a <= 0;
var b >= 0.05;

/* Constraint on average PDs */
con y : sum{i in 1..&count_x.} (cdf("Normal", (a + b*zscore[i])))/&count_x. >= 0.0008;

/* calculating the PD based on the Z score, along with a and b */
min p = sum{i in 1..&count_x.} (cdf("Normal", (a + b*zscore[i])));

solve;
create data sol_data from a b;
run;

data _null_;
set sol_data;
call symput("a", a);
call symput("b", b);
run;
%put &a.;
%put &b.;


/* Apply model */
data model_applied;
set rawdata;
a = &a.*1;
b = &b.*1;
modelled_PD = cdf("Normal", a + (b*zscore))*100;
run;
1 REPLY 1
RobPratt
SAS Super FREQ

It looks like both results are correct, and this is just a case of multiple optimal solutions.  I have several suggestions for you.

 

1. The code does not define count_x, which I assume is 67.  But it is simpler to use the modelin index set instead.  That is, replace 1..&count_x with modelin and replace &count_x with card(modelin).

 

2. Rather than repeat the sum formula, you can first declare p and then use p in the constraint, as follows:

   min p = sum{i in modelin} (cdf("Normal", (a + b*zscore[i])));
   con y : p / card(modelin) >= 0.0008;

3.  The problem is nonconvex, so to discourage getting stuck in a local minimum, you should use the MULTISTART option (alias MS):

   solve with nlp / ms;

4. If you prefer to avoid the spikes, you can use a different objective that minimizes the maximum value:

   min p = sum{i in modelin} cdf("Normal", (a + b*zscore[i]));
   con y : p / card(modelin) = 0.0008;
   min MinMaxObj = max{i in modelin} cdf("Normal", (a + b*zscore[i]));

 

5. To avoid the nondifferentiablility of the max{} operator, you can use an "epigraph" reformulation:

   min p = sum{i in modelin} cdf("Normal", (a + b*zscore[i]));
   con y : p / card(modelin) = 0.0008;
   var MinMaxVar;
   min MinMaxObj = MinMaxVar;
   con MinMaxCon {i in modelin}: MinMaxVar >= cdf("Normal", (a + b*zscore[i]));

 

6. You can avoid the two DATA steps after PROC OPTMODEL by including another CREATE DATA statement:

   create data model_applied from [xn]=modelin zscore a b modelled_PD=(cdf("Normal", a + (b*zscore[xn]))*100);

7. The RUN statement is ignored.  Use QUIT instead.

 

8. A PROC SGPLOT call shows that the minimax approach avoids the spikes:

proc sgplot data=model_applied;
   series x=xn y=zscore;
   series x=xn y=modelled_PD;
   yaxis display=(nolabel);
run;

RobPratt_0-1779300029743.png