SAS Optimization, and SAS Simulation Studio


08-13-2016 04:52 PM

Hello,

This is my first attempt at PROC OPTMODEL, and I would appreciate any help solving the problem at hand. Through optimization, I am trying to determine the optimal cut-off for each bucket such that the error between the average PD and the average Def is minimized. Below are the datasets I have:

Dataset 1:

| Customer | PD | Def |
|---|---|---|
| 0001 | 0.24% | 0 |
| 0002 | 0.02% | 0 |
| 0003 | 0.89% | 0 |
| 0004 | 3.50% | 0 |
| 0005 | 6.80% | 1 |
| 0006 | 0.008% | 0 |
| 0007 | 15.20% | 1 |
| 0008 | 28.20% | 1 |
| 0009 | 2.23% | 0 |
| 0010 | 1.25% | 1 |

Dataset 2:

| Bucket |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |

I want to determine the optimal PD value for each bucket such that:

1. The error (the difference between the average PD and the average Def) is minimized for each bucket.
2. The sum of errors across buckets is minimized.

Thanks for your help

Accepted Solutions

Solution

09-01-2016 01:36 PM


Posted in reply to RobPratt

08-18-2016 04:45 PM - edited 08-19-2016 12:04 AM

It turns out that calling the network solver to compute a shortest path in a directed acyclic network is a bit of overkill. You can instead apply dynamic programming as follows, and this approach uses less memory:

```
proc optmodel;
/* read data */
set OBS;
str obs_id {OBS};
num pd {OBS};
num def {OBS};
num cust_cnt {OBS};
read data have into OBS=[_N_] obs_id=customer pd def cust_cnt;
/* build network */
num dummy_obs init card(OBS)+1;
OBS = OBS union {dummy_obs};
num num_cust_total = sum {i in OBS diff {dummy_obs}} cust_cnt[i];
num num_cust_lb = ceil( &pct_lb*num_cust_total);
num num_cust_ub = floor(&pct_ub*num_cust_total);
set <num,num> OBS_PAIRS init {};
num num_cust_ij {OBS_PAIRS};
num count;
for {i in OBS} do;
count = 0;
for {j in OBS: i < j} do;
count = count + cust_cnt[j-1];
if count < num_cust_lb then continue;
else if count > num_cust_ub then leave;
else do;
OBS_PAIRS = OBS_PAIRS union {<i,j>};
num_cust_ij[i,j] = count;
end;
end;
end;
/* avg_def[i,j] = (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j] */
num avg_def {<i,j> in OBS_PAIRS} = (
if <i,j-1> not in OBS_PAIRS then (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j]
else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1] * cust_cnt[j-1]) / num_cust_ij[i,j]
);
num error {<i,j> in OBS_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);
set OBS_BUCKETS = OBS cross 1..&num_buckets+1;
/* forward pass to compute shortest-path distances */
num dist_from_source {OBS_BUCKETS} init constant('BIG');
dist_from_source[1,1] = 0;
for {b in 1..&num_buckets} do;
put b=;
if b = 1 then do;
for {<(1),j> in OBS_PAIRS} do;
if dist_from_source[j,b+1] > dist_from_source[1,b] + error[1,j] then do;
dist_from_source[j,b+1] = dist_from_source[1,b] + error[1,j];
end;
end;
end;
else if b = &num_buckets then do;
for {<i,(dummy_obs)> in OBS_PAIRS} do;
if dist_from_source[dummy_obs,b+1] > dist_from_source[i,b] + error[i,dummy_obs] then do;
dist_from_source[dummy_obs,b+1] = dist_from_source[i,b] + error[i,dummy_obs];
end;
end;
end;
else do;
for {<i,j> in OBS_PAIRS: i ne 1 and j ne dummy_obs} do;
if dist_from_source[j,b+1] > dist_from_source[i,b] + error[i,j] then do;
dist_from_source[j,b+1] = dist_from_source[i,b] + error[i,j];
end;
end;
end;
end;
/* backward pass to extract shortest path */
set <num,num,num> PATH init {}; /* order, from, to */
num curr_i init dummy_obs;
num curr_j;
for {b in &num_buckets...1 by -1} do;
curr_j = curr_i;
for {<i,(curr_j)> in OBS_PAIRS} do;
if dist_from_source[curr_j,b+1] = dist_from_source[i,b] + error[i,curr_j] then do;
curr_i = i;
leave;
end;
end;
PATH = {<b,curr_i,curr_j>} union PATH;
end;
put dist_from_source[dummy_obs,&num_buckets+1]=;
put PATH;
/* highest pd value in each bucket */
num cutoff {1..&num_buckets};
num avg_def_b {1..&num_buckets};
num error_b {1..&num_buckets};
num num_cust_b {1..&num_buckets};
num pct_cust_b {b in 1..&num_buckets} = num_cust_b[b] / num_cust_total;
set <str> OBS_b {1..&num_buckets};
for {<order,from,to> in PATH} do;
cutoff[order] = pd[to-1];
avg_def_b[order] = avg_def[from,to];
error_b[order] = error[from,to];
num_cust_b[order] = num_cust_ij[from,to];
OBS_b[order] = setof {k in from..to-1} obs_id[k];
end;
print dist_from_source[dummy_obs,&num_buckets+1] cutoff avg_def_b error_b num_cust_b pct_cust_b;
/* observations in each bucket */
/* for {b in 1..&num_buckets} put OBS_b[b]=;*/
quit;
```
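To make the recurrence explicit, here is a minimal dynamic-programming sketch in Python (my own illustrative code, not part of the thread's SAS solutions; it assumes one customer per observation and omits the bucket-size bounds that the OPTMODEL code enforces through `num_cust_lb` and `num_cust_ub`). The state `dist[j][b]` is the minimum total error for assigning the first `j` sorted observations to exactly `b` buckets, using the same per-observation error `sum_k |pd[k] - avg_def|` as the code above:

```python
import math

def min_total_error(pd, default, num_buckets):
    """DP over (observations covered, buckets used); O(n^2 * buckets)."""
    n = len(pd)

    def error(i, j):  # cost of one bucket covering observations i..j-1
        avg_def = sum(default[i:j]) / (j - i)
        return sum(abs(pd[k] - avg_def) for k in range(i, j))

    # dist[j][b]: best total error for first j observations in b buckets
    dist = [[math.inf] * (num_buckets + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):          # need at least b observations
            for i in range(b - 1, j):      # last bucket covers [i, j)
                cand = dist[i][b - 1] + error(i, j)
                if cand < dist[j][b]:
                    dist[j][b] = cand
    return dist[n][num_buckets]

# Perfectly separable toy data: the total error is 0
print(min_total_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1], 2))  # -> 0.0
```

Because each state is updated once per incoming arc, this uses O(n × buckets) memory instead of materializing the full arc set, which is the memory saving the post describes.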

All Replies


Posted in reply to sai_ch

08-13-2016 06:59 PM

I'd like to help, but I don't quite understand what you want to do. Can you please provide a sample solution (not necessarily optimal), together with the calculation of the errors?


Posted in reply to RobPratt

08-13-2016 09:01 PM

Hi Rob,

Sorry for not explaining the problem clearly. The table below shows a sample solution and the error calculation:

| Bucket | Cut-off Val | Avg. PD | Avg. Def |
|---|---|---|---|
| 1 | 0.24% | 0.014% | 0.00% |
| 2 | 1.25% | 0.57% | 0.00% |
| 3 | 3.50% | 1.74% | 50% |
| 4 | 7.00% | 5.15% | 50% |
| 5 | 30.00% | 21.70% | 100% |

error = (0.014% - 0.00%) + (0.57% - 0.00%) + (1.74% - 50%) + (5.15% - 50%) + (21.70% - 100%)

The goal is to optimally determine the cut-off value for each bucket. All customers with PD less than a bucket's cut-off will fall in that bucket. For example, customers 9 and 10 fall in bucket 3, and their average PD is 1.74%. Customer 10 defaulted in that bucket, giving it an average default rate of 50%. The cut-off value for each bucket should be determined such that the difference between the average PD and the average default rate for each bucket is minimized.

Please do let me know if things are not clear, and thanks for taking a look at the problem.
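The error calculation described here can be sketched outside of SAS. The following Python snippet is my own illustration (the function name and the fraction-valued lists are constructed from the sample data above, with 0.008% written as 0.00008, etc.) and reproduces the sample solution's total error:

```python
# Illustrative sketch (not part of the thread's SAS solutions): compute
# the total bucket error for given cut points over customers sorted by PD.

def bucket_error(pd, default, cut_points):
    """Sum of |avg PD - avg Def| over contiguous buckets.

    cut_points are the exclusive end indices of each bucket, e.g.
    [2, 4, 6, 8, 10] splits 10 sorted customers into 5 pairs.
    """
    total, start = 0.0, 0
    for end in cut_points:
        avg_pd = sum(pd[start:end]) / (end - start)
        avg_def = sum(default[start:end]) / (end - start)
        total += abs(avg_pd - avg_def)
        start = end
    return total

# The 10 sample customers, sorted by PD
pd_sorted = [0.00008, 0.0002, 0.0024, 0.0089, 0.0125,
             0.0223, 0.035, 0.068, 0.152, 0.282]
def_sorted = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

# Two customers per bucket reproduces the sample error terms
# (0.014%, 0.57%, |1.74% - 50%|, |5.15% - 50%|, |21.70% - 100%|)
print(bucket_error(pd_sorted, def_sorted, [2, 4, 6, 8, 10]))  # about 1.72
```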

Thanks


Posted in reply to sai_ch

08-14-2016 05:27 AM

If I understand correctly, you want to divide that table into 5 groups according to PD value (from min to max) and minimize that error? I am also assuming you want ABS(error) to be minimized, not the raw error. Here is my IML code:

```
data have;
   call streaminit(1234);
   do customer=1 to 100;
      pd=rand('uniform');
      def=rand('bern',0.5);
      output;
   end;
   format pd percent8.2;
run;
proc sort data=have;
   by pd;
run;

proc iml;
use have;
read all var {pd def} into have;
close;

start function(x) global(have);
   xx=t(x);
   n=nrow(xx);
   call sort(xx,1);
   start_end=t(loc(t(xx)^={.}||remove(xx,n))) || t(loc(t(xx)^=remove(xx,1)||{.}));
   temp=j(nrow(start_end),2,.);
   do i=1 to nrow(start_end);
      idx=start_end[i,1]:start_end[i,2];
      temp[i,1]=mean(have[idx,1]);
      temp[i,2]=mean(have[idx,2]);
   end;
   obj=abs(sum(temp[,1]-temp[,2]));
   return (obj);
finish;

nrow=nrow(have);
group=5; /* <-- Change it (divide into 5 groups) */
encoding=j(2,nrow,1);
encoding[2,]=group;
id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,100,1,1);
call gainit(id,1000,encoding);
niter = 1000;
do i = 1 to niter;
   call garegen(id);
   call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"];
call gaend(id);
quit;

data temp;
   merge groups have;
run;
proc summary data=temp;
   by groups;
   var pd def;
   output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def;
run;
proc print noobs;
run;
```

OUTPUT:

Min Value: 5.5935E-6

| GROUPS | cut_off_value | avg_pd | avg_def |
|---|---|---|---|
| 1 | 14.64% | 8.98% | 0.68421 |
| 2 | 33.94% | 24.93% | 0.47619 |
| 3 | 61.37% | 49.23% | 0.36000 |
| 4 | 81.98% | 72.48% | 0.29412 |
| 5 | 99.59% | 92.50% | 0.66667 |


Posted in reply to Ksharp

08-14-2016 09:42 AM

Thanks for the reply, Ksharp. It will take me some time to digest the PROC IML approach. You described the problem quite well. I am trying to form 5 groups based on PD cut-off values. The determination of the cut-off value is driven by the minimization of the difference between the average PD and the average Def rate for each group, and then the minimization of the sum of errors for all groups.


Posted in reply to Ksharp

08-14-2016 10:06 AM

@Ksharp, your objective value does not match the solution. I think you need to change abs(sum( to sum(abs( in your calculation of obj.


Posted in reply to RobPratt

08-15-2016 12:30 AM

Ha. I changed the OBJ value and got this:

```
data have;
   call streaminit(1234);
   do customer=1 to 100;
      pd=rand('uniform');
      def=rand('bern',0.5);
      output;
   end;
   format pd percent8.2;
run;
proc sort data=have;
   by pd;
run;

proc iml;
use have;
read all var {pd def} into have;
close;

start function(x) global(have,nrow,group,v);
   xx=t(x);
   if countunique(xx) ^= group then xx=t(sample(v,nrow));
   call sort(xx,1);
   start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) || t(loc(t(xx)^=remove(xx,1)||{.}));
   n=nrow(start_end);
   temp=j(n,2,.);
   do i=1 to n;
      idx=start_end[i,1]:start_end[i,2];
      temp[i,1]=mean(have[idx,1]);
      temp[i,2]=mean(have[idx,2]);
   end;
   obj=sum(abs(temp[,1]-temp[,2]));
   return (obj);
finish;

nrow=nrow(have);
group=5; /* <-- Change it (divide into 5 groups) */
v=1:group;
encoding=j(2,nrow,1);
encoding[2,]=group;
id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,100,1,1);
call gainit(id,10000,encoding);
niter = 1000;
do i = 1 to niter;
   call garegen(id);
   call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"];
call gaend(id);
quit;

data temp;
   merge groups have;
run;
proc summary data=temp;
   by groups;
   var pd def;
   output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def;
run;
proc print noobs;
run;
```

OUTPUT:

Min Value: 0.7636106

| GROUPS | cut_off_value | avg_pd | avg_def |
|---|---|---|---|
| 1 | 25.07% | 13.59% | 0.58065 |
| 2 | 48.27% | 36.98% | 0.35000 |
| 3 | 49.28% | 48.96% | 0.50000 |
| 4 | 52.11% | 50.83% | 0.50000 |
| 5 | 99.59% | 76.93% | 0.48889 |


Posted in reply to Ksharp

08-15-2016 09:32 AM

For @Ksharp's random data, the (exact) shortest path approach instantly yields an optimal objective value of 0.46356:

| [1] | cutoff | avg_pd_b | avg_def_b | error_b |
|---|---|---|---|---|
| 1 | 0.71595 | 0.34215 | 0.44595 | 0.1037958 |
| 2 | 0.77749 | 0.75554 | 0.75000 | 0.0055448 |
| 3 | 0.98974 | 0.89445 | 0.55000 | 0.3444495 |
| 4 | 0.99431 | 0.99431 | 1.00000 | 0.0056878 |
| 5 | 0.99592 | 0.99592 | 1.00000 | 0.0040805 |


Posted in reply to sai_ch

08-14-2016 10:10 AM

Yes, I understand what you want now. You can model the problem as a shortest path problem and use the network solver in PROC OPTMODEL, as described in the "Bucket Optimization" section of this SAS/OR presentation:

http://support.sas.com/rnd/app/or/papers/INFORMS_2014_Spring.pdf

I'll modify the arc costs to implement your absolute error objective and provide the code soon.
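To see how the reduction works, here is a small Python sketch of the layered graph (illustrative only, with hypothetical names; the SAS implementation in this thread uses PROC OPTMODEL's network solver instead). A node `(i, b)` means "observation `i` starts bucket `b`", an arc `(i, b) -> (j, b+1)` costs the `|avg PD - avg Def|` error of the bucket `[i, j)`, and a shortest path from `(0, 1)` to `(n, num_buckets + 1)` selects the buckets:

```python
def shortest_path_buckets(pd, default, num_buckets):
    """Relax arcs layer by layer (the graph is a DAG, so this is exact)."""
    n = len(pd)

    def arc_cost(i, j):  # error of one bucket covering observations i..j-1
        avg_pd = sum(pd[i:j]) / (j - i)
        avg_def = sum(default[i:j]) / (j - i)
        return abs(avg_pd - avg_def)

    dist = {(0, 1): 0.0}   # source: observation 0 starts bucket 1
    pred = {}
    for b in range(1, num_buckets + 1):
        for i in range(n):
            if (i, b) not in dist:
                continue
            for j in range(i + 1, n + 1):
                cand = dist[(i, b)] + arc_cost(i, j)
                if cand < dist.get((j, b + 1), float("inf")):
                    dist[(j, b + 1)] = cand
                    pred[(j, b + 1)] = i
    # walk predecessors back from the sink to recover bucket boundaries
    ends, j = [], n
    for b in range(num_buckets + 1, 1, -1):
        ends.append(j)
        j = pred[(j, b)]
    return dist[(n, num_buckets + 1)], ends[::-1]
```

On the ten-customer sample sorted by PD, this returns a total error of about 0.58295, consistent with the OPTMODEL output later in the thread.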


Posted in reply to RobPratt

08-14-2016 01:51 PM - edited 08-14-2016 09:54 PM

The following code implements the shortest path approach:

```
data have;
input Customer $ PD percent. Def;
datalines;
0001 0.24% 0
0002 0.02% 0
0003 0.89% 0
0004 3.50% 0
0005 6.80% 1
0006 0.008% 0
0007 15.20% 1
0008 28.20% 1
0009 2.23% 0
0010 1.25% 1
;
proc sort data=have;
by pd;
run;
%let num_buckets = 5;
proc optmodel;
set CUSTOMERS;
str customer_id {CUSTOMERS};
num pd {CUSTOMERS};
num def {CUSTOMERS};
read data have into CUSTOMERS=[_N_] customer_id=customer pd def;
num dummy_customer init card(CUSTOMERS)+1;
CUSTOMERS = CUSTOMERS union {dummy_customer};
set CUSTOMER_PAIRS = {i in CUSTOMERS, j in CUSTOMERS: i < j};
num avg_pd {<i,j> in CUSTOMER_PAIRS} =
(sum {k in i..j-1} pd[k]) / (j-i);
num avg_def {<i,j> in CUSTOMER_PAIRS} =
(sum {k in i..j-1} def[k]) / (j-i);
num error {<i,j> in CUSTOMER_PAIRS} =
abs(avg_pd[i,j] - avg_def[i,j]);
set CUSTOMERS_BUCKETS = CUSTOMERS cross 1..&num_buckets+1;
num node_id {CUSTOMERS_BUCKETS};
set NODES = 0..card(CUSTOMERS_BUCKETS)-1;
num node_to_customer {NODES};
num id init 0;
for {<i,b> in CUSTOMERS_BUCKETS} do;
node_id[i,b] = id;
node_to_customer[id] = i;
id = id + 1;
end;
set <num,num> ARCS init {};
num weight {ARCS};
for {<i,j> in CUSTOMER_PAIRS} do;
for {b in 1..&num_buckets} do;
ARCS = ARCS union {<node_id[i,b],node_id[j,b+1]>};
weight[node_id[i,b],node_id[j,b+1]] = error[i,j];
end;
end;
set <num,num,num,num,num> PATHS; /* source, sink, order, from, to */
set SOURCES = {node_id[1,1]};
set SINKS = {node_id[dummy_customer,&num_buckets+1]};
solve with network /
direction = directed
links = (weight=weight)
shortpath = (source=SOURCES sink=SINKS)
out = (sppaths=PATHS)
;
put _NETWORK_OBJECTIVE_=;
put PATHS;
/* highest pd value in each bucket */
num cutoff {1..&num_buckets};
num avg_pd_b {1..&num_buckets};
num avg_def_b {1..&num_buckets};
num error_b {1..&num_buckets};
num customer_from, customer_to;
for {<source, sink, order, from, to> in PATHS} do;
customer_from = node_to_customer[from];
customer_to = node_to_customer[to];
cutoff[order] = pd[customer_to-1];
avg_pd_b[order] = avg_pd[customer_from,customer_to];
avg_def_b[order] = avg_def[customer_from,customer_to];
error_b[order] = error[customer_from,customer_to];
end;
print cutoff avg_pd_b avg_def_b error_b;
/* customers in each bucket */
set <str> CUSTOMERS_b {1..&num_buckets};
for {<source, sink, order, from, to> in PATHS}
CUSTOMERS_b[order] = setof {k in node_to_customer[from]..node_to_customer[to]-1} customer_id[k];
for {b in 1..&num_buckets} put CUSTOMERS_b[b]=;
quit;
```

| [1] | cutoff | avg_pd_b | avg_def_b | error_b |
|---|---|---|---|---|
| 1 | 0.00008 | 0.00008 | 0.00000 | 0.00008 |
| 2 | 0.00020 | 0.00020 | 0.00000 | 0.00020 |
| 3 | 0.00240 | 0.00240 | 0.00000 | 0.00240 |
| 4 | 0.00890 | 0.00890 | 0.00000 | 0.00890 |
| 5 | 0.28200 | 0.09530 | 0.66667 | 0.57137 |


Posted in reply to RobPratt

08-14-2016 01:52 PM

Thanks for directing me to your presentation on Bucket Optimization.


Posted in reply to sai_ch

08-15-2016 09:14 AM

Thank you K_Sharp and Rob for providing the solution to the problem. Since this is my first attempt at optimization, I am still trying to fully comprehend the solutions.

@RobPratt: Out of curiosity, I ran the code on my actual dataset and got an out-of-memory error. The dataset has ~200k observations. Would you expect an out-of-memory error with that many observations?


Posted in reply to sai_ch

08-15-2016 09:58 AM

Yes, 200k observations would yield about 120 billion arcs, so I am not surprised you ran out of memory. But there are ways to reduce the size of the network.
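As a rough sanity check on that figure (a back-of-the-envelope sketch; the exact count also depends on the dummy node and any bucket-size bounds):

```python
# Each ordered pair i < j of sorted observations becomes one arc per
# bucket layer, so the arc count grows quadratically with n.
n = 200_000
num_buckets = 5
arcs = n * (n - 1) // 2 * num_buckets
print(arcs)  # on the order of 1e11
```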

Are there limits (lower and/or upper) on the number of customers in each bucket?

You might consider pre-binning the customers with the same def value and whose pd values are very close to each other. That is, replace a set of similar customers with a single observation. To correctly compute the averages, you would need to include a new column that indicates the number of customers associated with each observation.
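That pre-binning idea can be sketched as follows (Python for illustration; `prebin` and its tolerance are hypothetical, and a real implementation would more likely be a DATA step over the sorted data):

```python
def prebin(records, tol=1e-4):
    """Merge adjacent sorted customers with the same def value whose pd
    values are within tol of the bin's running mean; the count column
    lets the downstream model compute weighted averages exactly.

    records: list of (pd, def) pairs sorted by pd.
    Returns a list of (mean_pd, def, count) bins.
    """
    bins = []
    for pd, d in records:
        if bins and bins[-1][1] == d and pd - bins[-1][0] < tol:
            mean_pd, _, cnt = bins[-1]
            bins[-1] = ((mean_pd * cnt + pd) / (cnt + 1), d, cnt + 1)
        else:
            bins.append((pd, d, 1))
    return bins

# Two near-identical non-defaulters collapse into one weighted bin
print(prebin([(0.1, 0), (0.10005, 0), (0.2, 1)], tol=1e-3))
```

Shrinking n this way pays off quadratically, since the network size grows with the square of the number of observations.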

Are you able to share the 200k data set?


Posted in reply to RobPratt

08-15-2016 12:01 PM

Rob,

In fact, limits should be imposed on the number of customers in each bucket: the lower bound is 1% and the upper bound is 25%. I am working on reducing the size of the dataset.

Thanks


Posted in reply to sai_ch

08-16-2016 08:52 AM

@Ksharp After spending a considerable part of yesterday on it, I think I understand what the code is doing. Thank you for initiating me into the world of optimization; it is fascinating. One thought: is it possible to tweak the objective function a little bit? Instead of minimizing the difference between the average PD and the average Def for a bucket, is it possible to first minimize the distance between each customer's PD and the average Def for a bucket, and then minimize the sum of errors for all the buckets? Essentially, it becomes a double minimization problem.

@RobPratt I was able to reduce the dataset from ~200k to ~16k and will try to implement the Network solution. Thanks for the suggestion.