
🔒 This topic is **solved** and **locked**.

Posted 08-13-2016 04:52 PM
(3208 views)

Hello,

This is my first attempt at PROC OPTMODEL, and I would appreciate any help solving the problem at hand. Through optimization, I am trying to determine the optimal cut-off for each bucket such that the error between the average PD and the average Def is minimized. Below are the datasets I have:

Dataset 1:

| Customer | PD | Def |
|---|---|---|
| 0001 | 0.24% | 0 |
| 0002 | 0.02% | 0 |
| 0003 | 0.89% | 0 |
| 0004 | 3.50% | 0 |
| 0005 | 6.80% | 1 |
| 0006 | 0.008% | 0 |
| 0007 | 15.20% | 1 |
| 0008 | 28.20% | 1 |
| 0009 | 2.23% | 0 |
| 0010 | 1.25% | 1 |

Dataset 2:

| Bucket |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |

I want to determine the optimal PD cut-off value for each bucket such that: 1. the error (the difference between the average PD and the average Def) is minimized for each bucket, and 2. the sum of errors across buckets is minimized.

Thanks for your help

Accepted Solution


It turns out that calling the network solver to compute a shortest path in a directed acyclic network is a bit of overkill. You can instead apply dynamic programming as follows, and this approach uses less memory:

```
proc optmodel;
   /* assumes macro variables &num_buckets, &pct_lb, &pct_ub are already set,
      e.g. %let num_buckets = 5; %let pct_lb = 0.01; %let pct_ub = 0.25; */

   /* read data */
   set OBS;
   str obs_id {OBS};
   num pd {OBS};
   num def {OBS};
   num cust_cnt {OBS};
   read data have into OBS=[_N_] obs_id=customer pd def cust_cnt;

   /* build network */
   num dummy_obs init card(OBS)+1;
   OBS = OBS union {dummy_obs};
   num num_cust_total = sum {i in OBS diff {dummy_obs}} cust_cnt[i];
   num num_cust_lb = ceil( &pct_lb*num_cust_total);
   num num_cust_ub = floor(&pct_ub*num_cust_total);
   set <num,num> OBS_PAIRS init {};
   num num_cust_ij {OBS_PAIRS};
   num count;
   for {i in OBS} do;
      count = 0;
      for {j in OBS: i < j} do;
         count = count + cust_cnt[j-1];
         if count < num_cust_lb then continue;
         else if count > num_cust_ub then leave;
         else do;
            OBS_PAIRS = OBS_PAIRS union {<i,j>};
            num_cust_ij[i,j] = count;
         end;
      end;
   end;

   /* avg_def[i,j] = (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j],
      computed incrementally from avg_def[i,j-1] where possible */
   num avg_def {<i,j> in OBS_PAIRS} = (
      if <i,j-1> not in OBS_PAIRS
         then (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j]
      else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1] * cust_cnt[j-1]) / num_cust_ij[i,j]
   );
   num error {<i,j> in OBS_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);
   set OBS_BUCKETS = OBS cross 1..&num_buckets+1;

   /* forward pass to compute shortest-path distances */
   num dist_from_source {OBS_BUCKETS} init constant('BIG');
   dist_from_source[1,1] = 0;
   for {b in 1..&num_buckets} do;
      put b=;
      if b = 1 then do;
         for {<(1),j> in OBS_PAIRS} do;
            if dist_from_source[j,b+1] > dist_from_source[1,b] + error[1,j] then do;
               dist_from_source[j,b+1] = dist_from_source[1,b] + error[1,j];
            end;
         end;
      end;
      else if b = &num_buckets then do;
         for {<i,(dummy_obs)> in OBS_PAIRS} do;
            if dist_from_source[dummy_obs,b+1] > dist_from_source[i,b] + error[i,dummy_obs] then do;
               dist_from_source[dummy_obs,b+1] = dist_from_source[i,b] + error[i,dummy_obs];
            end;
         end;
      end;
      else do;
         for {<i,j> in OBS_PAIRS: i ne 1 and j ne dummy_obs} do;
            if dist_from_source[j,b+1] > dist_from_source[i,b] + error[i,j] then do;
               dist_from_source[j,b+1] = dist_from_source[i,b] + error[i,j];
            end;
         end;
      end;
   end;

   /* backward pass to extract shortest path */
   set <num,num,num> PATH init {}; /* order, from, to */
   num curr_i init dummy_obs;
   num curr_j;
   /* note: &num_buckets...1 is the macro delimiter period followed by the .. range operator */
   for {b in &num_buckets...1 by -1} do;
      curr_j = curr_i;
      for {<i,(curr_j)> in OBS_PAIRS} do;
         if dist_from_source[curr_j,b+1] = dist_from_source[i,b] + error[i,curr_j] then do;
            curr_i = i;
            leave;
         end;
      end;
      PATH = {<b,curr_i,curr_j>} union PATH;
   end;
   put dist_from_source[dummy_obs,&num_buckets+1]=;
   put PATH;

   /* highest pd value in each bucket */
   num cutoff {1..&num_buckets};
   num avg_def_b {1..&num_buckets};
   num error_b {1..&num_buckets};
   num num_cust_b {1..&num_buckets};
   num pct_cust_b {b in 1..&num_buckets} = num_cust_b[b] / num_cust_total;
   set <str> OBS_b {1..&num_buckets};
   for {<order,from,to> in PATH} do;
      cutoff[order] = pd[to-1];
      avg_def_b[order] = avg_def[from,to];
      error_b[order] = error[from,to];
      num_cust_b[order] = num_cust_ij[from,to];
      OBS_b[order] = setof {k in from..to-1} obs_id[k];
   end;
   print dist_from_source[dummy_obs,&num_buckets+1] cutoff avg_def_b error_b num_cust_b pct_cust_b;

   /* observations in each bucket */
   /* for {b in 1..&num_buckets} put OBS_b[b]=; */
quit;
```
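For readers more comfortable outside OPTMODEL, the same dynamic-programming recurrence can be sketched in Python (a minimal illustration, not part of the accepted SAS code; the function and variable names are hypothetical, and the lower/upper bounds on customers per bucket are omitted for brevity):

```python
import math

def optimal_buckets(pd, deff, num_buckets):
    """pd, deff: parallel lists sorted ascending by pd.
    Returns (total error, list of (i, j) half-open bucket ranges)."""
    n = len(pd)

    def err(i, j):
        # cost of making observations i..j-1 one bucket:
        # sum of |pd[k] - average def of the bucket|, as in the SAS code
        avg_def = sum(deff[i:j]) / (j - i)
        return sum(abs(p - avg_def) for p in pd[i:j])

    INF = math.inf
    # dist[b][j] = minimum error covering observations 0..j-1 with b buckets
    dist = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    prev = [[None] * (n + 1) for _ in range(num_buckets + 1)]
    dist[0][0] = 0.0
    # forward pass over (bucket, observation) states
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cand = dist[b - 1][i] + err(i, j)
                if cand < dist[b][j]:
                    dist[b][j] = cand
                    prev[b][j] = i
    # backward pass: recover the bucket boundaries
    cuts, j = [], n
    for b in range(num_buckets, 0, -1):
        i = prev[b][j]
        cuts.append((i, j))
        j = i
    cuts.reverse()
    return dist[num_buckets][n], cuts
```

For instance, `optimal_buckets([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], 2)` splits the four observations into two buckets of two, mirroring the forward/backward passes of the SAS code.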


Hi Rob,

Sorry for not clearly explaining the problem. The table below shows a sample solution and error calculation:

| Bucket | Cut-off Val | Avg. PD | Avg. Def |
|---|---|---|---|
| 1 | 0.24% | 0.014% | 0.00% |
| 2 | 1.25% | 0.57% | 0.00% |
| 3 | 3.50% | 1.74% | 50% |
| 4 | 7.00% | 5.15% | 50% |
| 5 | 30.00% | 21.70% | 100% |

error = |0.014% - 0.00%| + |0.57% - 0.00%| + |1.74% - 50%| + |5.15% - 50%| + |21.70% - 100%|

The goal is to optimally determine the cut-off value for each bucket. All customers with PD less than a bucket's cut-off (and not already in an earlier bucket) fall in that bucket. For example, customers 0009 and 0010 fall in bucket 3, and their average PD is 1.74%. Customer 0010 defaulted, giving that bucket an average default rate of 50%. The cut-off value for each bucket should be determined such that the difference between the average PD and the average default rate for each bucket is minimized.
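As a quick illustration of the error calculation (in Python rather than SAS; the cut-offs and the strict less-than convention are taken from the sample table above):

```python
# Illustration (not SAS): reproducing the per-bucket error for the sample
# cut-offs. A customer falls in the first bucket whose cut-off exceeds its PD,
# so the buckets partition the customers sorted by PD.
customers = [  # (pd, def_flag), sorted ascending by PD
    (0.00008, 0), (0.0002, 0), (0.0024, 0), (0.0089, 0), (0.0125, 1),
    (0.0223, 0), (0.0350, 0), (0.0680, 1), (0.1520, 1), (0.2820, 1),
]
cutoffs = [0.0024, 0.0125, 0.0350, 0.0700, 0.3000]  # one per bucket

total_error = 0.0
start = 0
for cutoff in cutoffs:
    bucket = [c for c in customers[start:] if c[0] < cutoff]
    start += len(bucket)
    avg_pd = sum(p for p, _ in bucket) / len(bucket)
    avg_def = sum(d for _, d in bucket) / len(bucket)
    total_error += abs(avg_pd - avg_def)
    print(f"cutoff={cutoff:.4f} avg_pd={avg_pd:.4%} avg_def={avg_def:.2%}")
print(f"total_error={total_error:.5f}")
```

Bucket 3, for example, comes out with an average PD of 1.74% against an average default rate of 50%, matching the sample table.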

Please do let me know if anything is unclear, and thanks for taking a look at the problem.

Thanks


For @Ksharp's random data, the (exact) shortest path approach instantly yields an optimal objective value of 0.46356:

| [1] | cutoff | avg_pd_b | avg_def_b | error_b |
|---|---|---|---|---|
| 1 | 0.71595 | 0.34215 | 0.44595 | 0.1037958 |
| 2 | 0.77749 | 0.75554 | 0.75000 | 0.0055448 |
| 3 | 0.98974 | 0.89445 | 0.55000 | 0.3444495 |
| 4 | 0.99431 | 0.99431 | 1.00000 | 0.0056878 |
| 5 | 0.99592 | 0.99592 | 1.00000 | 0.0040805 |


Yes, I understand what you want now. You can model the problem as a shortest path problem and use the network solver in PROC OPTMODEL, as described in the "Bucket Optimization" section of this SAS/OR presentation:

http://support.sas.com/rnd/app/or/papers/INFORMS_2014_Spring.pdf

I'll modify the arc costs to implement your absolute error objective and provide the code soon.


The following code implements the shortest path approach:

```
data have;
   input Customer $ PD percent. Def;
   datalines;
0001 0.24% 0
0002 0.02% 0
0003 0.89% 0
0004 3.50% 0
0005 6.80% 1
0006 0.008% 0
0007 15.20% 1
0008 28.20% 1
0009 2.23% 0
0010 1.25% 1
;

proc sort data=have;
   by pd;
run;

%let num_buckets = 5;

proc optmodel;
   set CUSTOMERS;
   str customer_id {CUSTOMERS};
   num pd {CUSTOMERS};
   num def {CUSTOMERS};
   read data have into CUSTOMERS=[_N_] customer_id=customer pd def;

   num dummy_customer init card(CUSTOMERS)+1;
   CUSTOMERS = CUSTOMERS union {dummy_customer};
   set CUSTOMER_PAIRS = {i in CUSTOMERS, j in CUSTOMERS: i < j};
   num avg_pd {<i,j> in CUSTOMER_PAIRS} =
      (sum {k in i..j-1} pd[k]) / (j-i);
   num avg_def {<i,j> in CUSTOMER_PAIRS} =
      (sum {k in i..j-1} def[k]) / (j-i);
   num error {<i,j> in CUSTOMER_PAIRS} =
      abs(avg_pd[i,j] - avg_def[i,j]);

   set CUSTOMERS_BUCKETS = CUSTOMERS cross 1..&num_buckets+1;
   num node_id {CUSTOMERS_BUCKETS};
   set NODES = 0..card(CUSTOMERS_BUCKETS)-1;
   num node_to_customer {NODES};
   num id init 0;
   for {<i,b> in CUSTOMERS_BUCKETS} do;
      node_id[i,b] = id;
      node_to_customer[id] = i;
      id = id + 1;
   end;

   set <num,num> ARCS init {};
   num weight {ARCS};
   for {<i,j> in CUSTOMER_PAIRS} do;
      for {b in 1..&num_buckets} do;
         ARCS = ARCS union {<node_id[i,b],node_id[j,b+1]>};
         weight[node_id[i,b],node_id[j,b+1]] = error[i,j];
      end;
   end;

   set <num,num,num,num,num> PATHS; /* source, sink, order, from, to */
   set SOURCES = {node_id[1,1]};
   set SINKS = {node_id[dummy_customer,&num_buckets+1]};
   solve with network /
      direction = directed
      links = (weight=weight)
      shortpath = (source=SOURCES sink=SINKS)
      out = (sppaths=PATHS)
   ;
   put _NETWORK_OBJECTIVE_=;
   put PATHS;

   /* highest pd value in each bucket */
   num cutoff {1..&num_buckets};
   num avg_pd_b {1..&num_buckets};
   num avg_def_b {1..&num_buckets};
   num error_b {1..&num_buckets};
   num customer_from, customer_to;
   for {<source, sink, order, from, to> in PATHS} do;
      customer_from = node_to_customer[from];
      customer_to = node_to_customer[to];
      cutoff[order] = pd[customer_to-1];
      avg_pd_b[order] = avg_pd[customer_from,customer_to];
      avg_def_b[order] = avg_def[customer_from,customer_to];
      error_b[order] = error[customer_from,customer_to];
   end;
   print cutoff avg_pd_b avg_def_b error_b;

   /* customers in each bucket */
   set <str> CUSTOMERS_b {1..&num_buckets};
   for {<source, sink, order, from, to> in PATHS}
      CUSTOMERS_b[order] = setof {k in node_to_customer[from]..node_to_customer[to]-1} customer_id[k];
   for {b in 1..&num_buckets} put CUSTOMERS_b[b]=;
quit;
```

| [1] | cutoff | avg_pd_b | avg_def_b | error_b |
|---|---|---|---|---|
| 1 | 0.00008 | 0.00008 | 0.00000 | 0.00008 |
| 2 | 0.00020 | 0.00020 | 0.00000 | 0.00020 |
| 3 | 0.00240 | 0.00240 | 0.00000 | 0.00240 |
| 4 | 0.00890 | 0.00890 | 0.00000 | 0.00890 |
| 5 | 0.28200 | 0.09530 | 0.66667 | 0.57137 |


Thanks for directing me to your presentation on Bucket Optimization.


Thank you Ksharp and Rob for providing solutions to the problem. Since this is my first attempt at optimization, I am still trying to fully comprehend them.

@RobPratt: Out of curiosity, I ran the code on my actual dataset and got an out-of-memory error. The dataset has ~200k observations. Would you expect an out-of-memory error with that many observations?


Yes, 200k observations would yield about 120 billion arcs, so I am not surprised you ran out of memory. But there are ways to reduce the size of the network.

Are there limits (lower and/or upper) on the number of customers in each bucket?

You might consider pre-binning customers that have the same def value and whose pd values are very close to each other. That is, replace a set of similar customers with a single observation. To correctly compute the averages, you would need to include a new column that indicates the number of customers associated with each observation.

Are you able to share the 200k data set?
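The pre-binning idea can be sketched as follows (a hypothetical Python helper, not from this thread): merge consecutive PD-sorted customers that share the same def value and have nearly equal pd, keeping a count per merged row so the averages can still be computed. Since the shortest-path network has on the order of num_buckets x n^2/2 arcs, shrinking n from ~200k to ~16k reduces the arc count by a factor of roughly 150.

```python
def prebin(rows, tol=1e-4):
    """rows: (pd, def_flag) pairs sorted ascending by pd."""
    binned = []  # one (avg_pd, def_flag, count) triple per merged observation
    for pd, d in rows:
        if binned:
            avg_pd, last_d, cnt = binned[-1]
            # merge only if def matches and pd is within tol of the bin's running average
            if last_d == d and pd - avg_pd < tol:
                binned[-1] = ((avg_pd * cnt + pd) / (cnt + 1), d, cnt + 1)
                continue
        binned.append((pd, d, 1))
    return binned
```

The count column of the merged rows is exactly what a cust_cnt variable would carry into the optimization.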


Rob,

In fact, limits should be imposed on the number of customers in each bucket: the lower bound is 1% and the upper bound is 25%. I am working on reducing the size of the dataset.

Thanks


@Ksharp After spending a considerable part of yesterday on it, I think I understand what the code is doing. Thank you for initiating me into the world of optimization; it is fascinating. One thought: is it possible to tweak the objective function a little? Instead of minimizing the difference between the average PD and the average Def for a bucket, is it possible to first minimize the distance between each customer's PD and the average Def for its bucket, and then minimize the sum of those errors over all buckets? Essentially, it becomes a double minimization problem.

@RobPratt I was able to reduce the dataset from ~200k to ~16k and will try to implement the Network solution. Thanks for the suggestion.
