
🔒 This topic is **solved** and **locked**.

Posted 08-13-2016 04:52 PM
(3208 views)

Hello,

This is my first attempt at PROC OPTMODEL, and I would appreciate any help solving the problem at hand. Through optimization, I am trying to determine the optimal cut-off for each bucket such that the error between the average PD and the average Def is minimized. Below are the datasets I have:

Dataset 1:

| Customer | PD | Def |
|---|---|---|
| 0001 | 0.24% | 0 |
| 0002 | 0.02% | 0 |
| 0003 | 0.89% | 0 |
| 0004 | 3.50% | 0 |
| 0005 | 6.80% | 1 |
| 0006 | 0.008% | 0 |
| 0007 | 15.20% | 1 |
| 0008 | 28.20% | 1 |
| 0009 | 2.23% | 0 |
| 0010 | 1.25% | 1 |

Dataset 2:

| Bucket |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |

I want to determine the optimal PD cut-off value for each bucket such that: 1. the error (the difference between the average PD and the average Def) is minimized for each bucket, and 2. the sum of errors across buckets is minimized.

Thanks for your help

Accepted Solution


It turns out that calling the network solver to compute a shortest path in a directed acyclic network is a bit of overkill. You can instead apply dynamic programming as follows, and this approach uses less memory:

```
proc optmodel;
   /* assumes macro variables &num_buckets, &pct_lb, &pct_ub are already set,
      e.g. %let num_buckets = 5; %let pct_lb = 0.01; %let pct_ub = 0.25; */

   /* read data */
   set OBS;
   str obs_id {OBS};
   num pd {OBS};
   num def {OBS};
   num cust_cnt {OBS};
   read data have into OBS=[_N_] obs_id=customer pd def cust_cnt;

   /* build network */
   num dummy_obs init card(OBS)+1;
   OBS = OBS union {dummy_obs};
   num num_cust_total = sum {i in OBS diff {dummy_obs}} cust_cnt[i];
   num num_cust_lb = ceil( &pct_lb*num_cust_total);
   num num_cust_ub = floor(&pct_ub*num_cust_total);
   set <num,num> OBS_PAIRS init {};
   num num_cust_ij {OBS_PAIRS};
   num count;
   for {i in OBS} do;
      count = 0;
      for {j in OBS: i < j} do;
         count = count + cust_cnt[j-1];
         if count < num_cust_lb then continue;
         else if count > num_cust_ub then leave;
         else do;
            OBS_PAIRS = OBS_PAIRS union {<i,j>};
            num_cust_ij[i,j] = count;
         end;
      end;
   end;

   /* avg_def[i,j] = (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j],
      computed incrementally from avg_def[i,j-1] where possible */
   num avg_def {<i,j> in OBS_PAIRS} = (
      if <i,j-1> not in OBS_PAIRS
         then (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j]
      else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1] * cust_cnt[j-1]) / num_cust_ij[i,j]
   );
   num error {<i,j> in OBS_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);
   set OBS_BUCKETS = OBS cross 1..&num_buckets+1;

   /* forward pass to compute shortest-path distances */
   num dist_from_source {OBS_BUCKETS} init constant('BIG');
   dist_from_source[1,1] = 0;
   for {b in 1..&num_buckets} do;
      put b=;
      if b = 1 then do;
         for {<(1),j> in OBS_PAIRS} do;
            if dist_from_source[j,b+1] > dist_from_source[1,b] + error[1,j] then do;
               dist_from_source[j,b+1] = dist_from_source[1,b] + error[1,j];
            end;
         end;
      end;
      else if b = &num_buckets then do;
         for {<i,(dummy_obs)> in OBS_PAIRS} do;
            if dist_from_source[dummy_obs,b+1] > dist_from_source[i,b] + error[i,dummy_obs] then do;
               dist_from_source[dummy_obs,b+1] = dist_from_source[i,b] + error[i,dummy_obs];
            end;
         end;
      end;
      else do;
         for {<i,j> in OBS_PAIRS: i ne 1 and j ne dummy_obs} do;
            if dist_from_source[j,b+1] > dist_from_source[i,b] + error[i,j] then do;
               dist_from_source[j,b+1] = dist_from_source[i,b] + error[i,j];
            end;
         end;
      end;
   end;

   /* backward pass to extract shortest path */
   set <num,num,num> PATH init {}; /* order, from, to */
   num curr_i init dummy_obs;
   num curr_j;
   /* note: &num_buckets...1 is the macro delimiter period followed by the .. range operator */
   for {b in &num_buckets...1 by -1} do;
      curr_j = curr_i;
      for {<i,(curr_j)> in OBS_PAIRS} do;
         if dist_from_source[curr_j,b+1] = dist_from_source[i,b] + error[i,curr_j] then do;
            curr_i = i;
            leave;
         end;
      end;
      PATH = {<b,curr_i,curr_j>} union PATH;
   end;
   put dist_from_source[dummy_obs,&num_buckets+1]=;
   put PATH;

   /* highest pd value in each bucket */
   num cutoff {1..&num_buckets};
   num avg_def_b {1..&num_buckets};
   num error_b {1..&num_buckets};
   num num_cust_b {1..&num_buckets};
   num pct_cust_b {b in 1..&num_buckets} = num_cust_b[b] / num_cust_total;
   set <str> OBS_b {1..&num_buckets};
   for {<order,from,to> in PATH} do;
      cutoff[order] = pd[to-1];
      avg_def_b[order] = avg_def[from,to];
      error_b[order] = error[from,to];
      num_cust_b[order] = num_cust_ij[from,to];
      OBS_b[order] = setof {k in from..to-1} obs_id[k];
   end;
   print dist_from_source[dummy_obs,&num_buckets+1] cutoff avg_def_b error_b num_cust_b pct_cust_b;

   /* observations in each bucket */
   /* for {b in 1..&num_buckets} put OBS_b[b]=; */
quit;
```
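For readers more comfortable outside OPTMODEL, the same dynamic-programming recurrence can be sketched in Python (a minimal illustration, not part of the accepted SAS code; the function and variable names are hypothetical, and the lower/upper bounds on customers per bucket are omitted for brevity):

```python
import math

def optimal_buckets(pd, deff, num_buckets):
    """pd, deff: parallel lists sorted ascending by pd.
    Returns (total error, list of (i, j) half-open bucket ranges)."""
    n = len(pd)

    def err(i, j):
        # cost of making observations i..j-1 one bucket:
        # sum of |pd[k] - average def of the bucket|, as in the SAS code
        avg_def = sum(deff[i:j]) / (j - i)
        return sum(abs(p - avg_def) for p in pd[i:j])

    INF = math.inf
    # dist[b][j] = minimum error covering observations 0..j-1 with b buckets
    dist = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    prev = [[None] * (n + 1) for _ in range(num_buckets + 1)]
    dist[0][0] = 0.0
    # forward pass over (bucket, observation) states
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cand = dist[b - 1][i] + err(i, j)
                if cand < dist[b][j]:
                    dist[b][j] = cand
                    prev[b][j] = i
    # backward pass: recover the bucket boundaries
    cuts, j = [], n
    for b in range(num_buckets, 0, -1):
        i = prev[b][j]
        cuts.append((i, j))
        j = i
    cuts.reverse()
    return dist[num_buckets][n], cuts
```

For instance, `optimal_buckets([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], 2)` splits the four observations into two buckets of two, mirroring the forward/backward passes of the SAS code.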


Hi Rob,

Sorry for not clearly explaining the problem. The table below shows a sample solution and error calculation:

| Bucket | Cut-off Val | Avg. PD | Avg. Def |
|---|---|---|---|
| 1 | 0.24% | 0.014% | 0.00% |
| 2 | 1.25% | 0.57% | 0.00% |
| 3 | 3.50% | 1.74% | 50% |
| 4 | 7.00% | 5.15% | 50% |
| 5 | 30.00% | 21.70% | 100% |

error = |0.014% - 0.00%| + |0.57% - 0.00%| + |1.74% - 50%| + |5.15% - 50%| + |21.70% - 100%|

The goal is to optimally determine the cut-off value for each bucket. All customers with PD less than a bucket's cut-off (and not already in an earlier bucket) fall in that bucket. For example, customers 0009 and 0010 fall in bucket 3, and their average PD is 1.74%. Customer 0010 defaulted, giving that bucket an average default rate of 50%. The cut-off value for each bucket should be determined such that the difference between the average PD and the average default rate for each bucket is minimized.
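As a quick illustration of the error calculation (in Python rather than SAS; the cut-offs and the strict less-than convention are taken from the sample table above):

```python
# Illustration (not SAS): reproducing the per-bucket error for the sample
# cut-offs. A customer falls in the first bucket whose cut-off exceeds its PD,
# so the buckets partition the customers sorted by PD.
customers = [  # (pd, def_flag), sorted ascending by PD
    (0.00008, 0), (0.0002, 0), (0.0024, 0), (0.0089, 0), (0.0125, 1),
    (0.0223, 0), (0.0350, 0), (0.0680, 1), (0.1520, 1), (0.2820, 1),
]
cutoffs = [0.0024, 0.0125, 0.0350, 0.0700, 0.3000]  # one per bucket

total_error = 0.0
start = 0
for cutoff in cutoffs:
    bucket = [c for c in customers[start:] if c[0] < cutoff]
    start += len(bucket)
    avg_pd = sum(p for p, _ in bucket) / len(bucket)
    avg_def = sum(d for _, d in bucket) / len(bucket)
    total_error += abs(avg_pd - avg_def)
    print(f"cutoff={cutoff:.4f} avg_pd={avg_pd:.4%} avg_def={avg_def:.2%}")
print(f"total_error={total_error:.5f}")
```

Bucket 3, for example, comes out with an average PD of 1.74% against an average default rate of 50%, matching the sample table.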

Please do let me know if anything is unclear, and thanks for taking a look at the problem.

Thanks


For @Ksharp's random data, the (exact) shortest path approach instantly yields an optimal objective value of 0.46356:

| [1] | cutoff | avg_pd_b | avg_def_b | error_b |
|---|---|---|---|---|
| 1 | 0.71595 | 0.34215 | 0.44595 | 0.1037958 |
| 2 | 0.77749 | 0.75554 | 0.75000 | 0.0055448 |
| 3 | 0.98974 | 0.89445 | 0.55000 | 0.3444495 |
| 4 | 0.99431 | 0.99431 | 1.00000 | 0.0056878 |
| 5 | 0.99592 | 0.99592 | 1.00000 | 0.0040805 |


Yes, I understand what you want now. You can model the problem as a shortest path problem and use the network solver in PROC OPTMODEL, as described in the "Bucket Optimization" section of this SAS/OR presentation:

http://support.sas.com/rnd/app/or/papers/INFORMS_2014_Spring.pdf

I'll modify the arc costs to implement your absolute error objective and provide the code soon.


The following code implements the shortest path approach:

```
data have;
   input Customer $ PD percent. Def;
   datalines;
0001 0.24% 0
0002 0.02% 0
0003 0.89% 0
0004 3.50% 0
0005 6.80% 1
0006 0.008% 0
0007 15.20% 1
0008 28.20% 1
0009 2.23% 0
0010 1.25% 1
;

proc sort data=have;
   by pd;
run;

%let num_buckets = 5;

proc optmodel;
   set CUSTOMERS;
   str customer_id {CUSTOMERS};
   num pd {CUSTOMERS};
   num def {CUSTOMERS};
   read data have into CUSTOMERS=[_N_] customer_id=customer pd def;

   num dummy_customer init card(CUSTOMERS)+1;
   CUSTOMERS = CUSTOMERS union {dummy_customer};
   set CUSTOMER_PAIRS = {i in CUSTOMERS, j in CUSTOMERS: i < j};
   num avg_pd {<i,j> in CUSTOMER_PAIRS} =
      (sum {k in i..j-1} pd[k]) / (j-i);
   num avg_def {<i,j> in CUSTOMER_PAIRS} =
      (sum {k in i..j-1} def[k]) / (j-i);
   num error {<i,j> in CUSTOMER_PAIRS} =
      abs(avg_pd[i,j] - avg_def[i,j]);

   set CUSTOMERS_BUCKETS = CUSTOMERS cross 1..&num_buckets+1;
   num node_id {CUSTOMERS_BUCKETS};
   set NODES = 0..card(CUSTOMERS_BUCKETS)-1;
   num node_to_customer {NODES};
   num id init 0;
   for {<i,b> in CUSTOMERS_BUCKETS} do;
      node_id[i,b] = id;
      node_to_customer[id] = i;
      id = id + 1;
   end;

   set <num,num> ARCS init {};
   num weight {ARCS};
   for {<i,j> in CUSTOMER_PAIRS} do;
      for {b in 1..&num_buckets} do;
         ARCS = ARCS union {<node_id[i,b],node_id[j,b+1]>};
         weight[node_id[i,b],node_id[j,b+1]] = error[i,j];
      end;
   end;

   set <num,num,num,num,num> PATHS; /* source, sink, order, from, to */
   set SOURCES = {node_id[1,1]};
   set SINKS = {node_id[dummy_customer,&num_buckets+1]};
   solve with network /
      direction = directed
      links = (weight=weight)
      shortpath = (source=SOURCES sink=SINKS)
      out = (sppaths=PATHS)
   ;
   put _NETWORK_OBJECTIVE_=;
   put PATHS;

   /* highest pd value in each bucket */
   num cutoff {1..&num_buckets};
   num avg_pd_b {1..&num_buckets};
   num avg_def_b {1..&num_buckets};
   num error_b {1..&num_buckets};
   num customer_from, customer_to;
   for {<source, sink, order, from, to> in PATHS} do;
      customer_from = node_to_customer[from];
      customer_to = node_to_customer[to];
      cutoff[order] = pd[customer_to-1];
      avg_pd_b[order] = avg_pd[customer_from,customer_to];
      avg_def_b[order] = avg_def[customer_from,customer_to];
      error_b[order] = error[customer_from,customer_to];
   end;
   print cutoff avg_pd_b avg_def_b error_b;

   /* customers in each bucket */
   set <str> CUSTOMERS_b {1..&num_buckets};
   for {<source, sink, order, from, to> in PATHS}
      CUSTOMERS_b[order] = setof {k in node_to_customer[from]..node_to_customer[to]-1} customer_id[k];
   for {b in 1..&num_buckets} put CUSTOMERS_b[b]=;
quit;
```

| [1] | cutoff | avg_pd_b | avg_def_b | error_b |
|---|---|---|---|---|
| 1 | 0.00008 | 0.00008 | 0.00000 | 0.00008 |
| 2 | 0.00020 | 0.00020 | 0.00000 | 0.00020 |
| 3 | 0.00240 | 0.00240 | 0.00000 | 0.00240 |
| 4 | 0.00890 | 0.00890 | 0.00000 | 0.00890 |
| 5 | 0.28200 | 0.09530 | 0.66667 | 0.57137 |


Thanks for directing me to your presentation on Bucket Optimization.


Thank you Ksharp and Rob for providing solutions to the problem. Since this is my first attempt at optimization, I am still trying to fully comprehend them.

@RobPratt: Out of curiosity, I ran the code on my actual dataset and got an out-of-memory error. The dataset has ~200k observations. Would you expect an out-of-memory error with that many observations?


Yes, 200k observations would yield about 120 billion arcs, so I am not surprised you ran out of memory. But there are ways to reduce the size of the network.

Are there limits (lower and/or upper) on the number of customers in each bucket?

You might consider pre-binning customers that have the same def value and whose pd values are very close to each other. That is, replace a set of similar customers with a single observation. To correctly compute the averages, you would need to include a new column that indicates the number of customers associated with each observation.

Are you able to share the 200k data set?
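The pre-binning idea can be sketched as follows (a hypothetical Python helper, not from this thread): merge consecutive PD-sorted customers that share the same def value and have nearly equal pd, keeping a count per merged row so the averages can still be computed. Since the shortest-path network has on the order of num_buckets x n^2/2 arcs, shrinking n from ~200k to ~16k reduces the arc count by a factor of roughly 150.

```python
def prebin(rows, tol=1e-4):
    """rows: (pd, def_flag) pairs sorted ascending by pd."""
    binned = []  # one (avg_pd, def_flag, count) triple per merged observation
    for pd, d in rows:
        if binned:
            avg_pd, last_d, cnt = binned[-1]
            # merge only if def matches and pd is within tol of the bin's running average
            if last_d == d and pd - avg_pd < tol:
                binned[-1] = ((avg_pd * cnt + pd) / (cnt + 1), d, cnt + 1)
                continue
        binned.append((pd, d, 1))
    return binned
```

The count column of the merged rows is exactly what a cust_cnt variable would carry into the optimization.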


Rob,

In fact, limits should be imposed on the number of customers in each bucket: the lower bound is 1% and the upper bound is 25%. I am working on reducing the size of the dataset.

Thanks


@Ksharp After spending a considerable part of yesterday on it, I think I understand what the code is doing. Thank you for initiating me into the world of optimization; it is fascinating. One thought: is it possible to tweak the objective function a little? Instead of minimizing the difference between the average PD and the average Def for a bucket, is it possible to first minimize the distance between each customer's PD and the average Def for its bucket, and then minimize the sum of those errors over all buckets? Essentially, it becomes a double minimization problem.

@RobPratt I was able to reduce the dataset from ~200k to ~16k and will try to implement the Network solution. Thanks for the suggestion.
