BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
RobPratt
SAS Super FREQ

The following PROC OPTMODEL code reads the additional cust_cnt column, handles the lower and upper limits on customers per bucket, and minimizes your new objective.  It also improves efficiency in several places, for example by recursively computing avg_def.

 

data have;
call streaminit(1234);
 do c=1 to 99;
  customer = put(c,z4.);
  pd=rand('uniform');
  def=rand('bern',0.5);
  cust_cnt=ceil(10*rand('uniform'));
  output;
 end;
 format pd percent8.2;
run;

proc sort data=have;
   by pd;
run;

%let num_buckets = 5;
%let pct_lb      = 0.01;
%let pct_ub      = 0.25;
proc optmodel;
   /* read data */
   set OBS;
   str obs_id {OBS};
   num pd {OBS};
   num def {OBS};
   num cust_cnt {OBS};
   read data have into OBS=[_N_] obs_id=customer pd def cust_cnt;

   /* build network */
   num dummy_obs init card(OBS)+1;
   OBS = OBS union {dummy_obs};
   num num_cust_total = sum {i in OBS diff {dummy_obs}} cust_cnt[i];
   num num_cust_lb = ceil( &pct_lb*num_cust_total);
   num num_cust_ub = floor(&pct_ub*num_cust_total);
   set <num,num> OBS_PAIRS init {};
   num num_cust_ij {OBS_PAIRS};
   num count;
   for {i in OBS} do;
      count = 0;
      for {j in OBS: i < j} do;
         count = count + cust_cnt[j-1];
         if count < num_cust_lb then continue;
         else if count > num_cust_ub then leave;
         else do;
            OBS_PAIRS = OBS_PAIRS union {<i,j>};
            num_cust_ij[i,j] = count;
         end;
      end;
   end;
   /* avg_def[i,j] = (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j] */
   num avg_def {<i,j> in OBS_PAIRS} = (
      if <i,j-1> not in OBS_PAIRS then (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j] 
      else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1] * cust_cnt[j-1]) / num_cust_ij[i,j] 
   );
   num error {<i,j> in OBS_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);
   set OBS_BUCKETS = OBS cross 1..&num_buckets+1;
   num node_id {OBS_BUCKETS};
   set NODES = 0..card(OBS_BUCKETS)-1;
   num node_to_obs {NODES};
   num id init 0;
   for {<i,b> in OBS_BUCKETS} do;
      node_id[i,b] = id;
      node_to_obs[id] = i;
      id = id + 1;
   end;
   set ARCS = 
      setof {<(1),j> in OBS_PAIRS, b in {1}} <node_id[1,b],node_id[j,b+1]>
      union
      setof {<i,j> in OBS_PAIRS, b in 2..&num_buckets-1: i ne 1 and j ne dummy_obs} <node_id[i,b],node_id[j,b+1]>
      union
      setof {<i,(dummy_obs)> in OBS_PAIRS, b in {&num_buckets}} <node_id[i,b],node_id[dummy_obs,b+1]>;
   num weight {<from,to> in ARCS} = error[node_to_obs[from],node_to_obs[to]];

   set <num,num,num,num,num> PATHS; /* source, sink, order, from, to */
   set SOURCES = {node_id[1,1]};
   set SINKS   = {node_id[dummy_obs,&num_buckets+1]};

   /* call network solver using shortest path algorithm */
   solve with network /
      direction = directed
      links     = (weight=weight)
      shortpath = (source=SOURCES sink=SINKS)
      out       = (sppaths=PATHS)
   ;
   put _NETWORK_OBJECTIVE_=;
   put PATHS;

   /* highest pd value in each bucket */
   num cutoff      {1..&num_buckets};
   num avg_def_b   {1..&num_buckets};
   num error_b     {1..&num_buckets};
   num num_cust_b  {1..&num_buckets};
   num pct_cust_b  {b in 1..&num_buckets} = num_cust_b[b] / num_cust_total;
   set <str> OBS_b {1..&num_buckets};
   num obs_from, obs_to;
   for {<source, sink, order, from, to> in PATHS} do;
      obs_from          = node_to_obs[from];
      obs_to            = node_to_obs[to];
      cutoff[order]     = pd[obs_to-1];
      avg_def_b[order]  = avg_def[obs_from,obs_to];
      error_b[order]    = error[obs_from,obs_to];
      num_cust_b[order] = num_cust_ij[obs_from,obs_to];
      OBS_b[order]      = setof {k in node_to_obs[from]..node_to_obs[to]-1} obs_id[k];
   end;
   print _NETWORK_OBJECTIVE_ cutoff avg_def_b error_b num_cust_b pct_cust_b;

   /* observations in each bucket */
/*   for {b in 1..&num_buckets} put OBS_b[b]=;*/
quit;

SAS Output

_NETWORK_OBJECTIVE_
23.104

[1] cutoff avg_def_b error_b num_cust_b pct_cust_b
1 0.22811 0.62016 12.4959 129 0.24157
2 0.48268 0.39231 1.7456 130 0.24345
3 0.71467 0.54054 1.4654 111 0.20787
4 0.94800 0.63025 4.6940 119 0.22285
5 0.99839 0.53333 2.7026 45 0.08427

 

RobPratt
SAS Super FREQ

It turns out that calling the network solver to compute a shortest path in a directed acyclic network is a bit of overkill.  You can instead apply dynamic programming as follows, and this approach uses less memory:

 

proc optmodel;
   /* read data */
   set OBS;
   str obs_id {OBS};
   num pd {OBS};
   num def {OBS};
   num cust_cnt {OBS};
   read data have into OBS=[_N_] obs_id=customer pd def cust_cnt;

   /* build network */
   num dummy_obs init card(OBS)+1;
   OBS = OBS union {dummy_obs};
   num num_cust_total = sum {i in OBS diff {dummy_obs}} cust_cnt[i];
   num num_cust_lb = ceil( &pct_lb*num_cust_total);
   num num_cust_ub = floor(&pct_ub*num_cust_total);
   set <num,num> OBS_PAIRS init {};
   num num_cust_ij {OBS_PAIRS};
   num count;
   for {i in OBS} do;
      count = 0;
      for {j in OBS: i < j} do;
         count = count + cust_cnt[j-1];
         if count < num_cust_lb then continue;
         else if count > num_cust_ub then leave;
         else do;
            OBS_PAIRS = OBS_PAIRS union {<i,j>};
            num_cust_ij[i,j] = count;
         end;
      end;
   end;
   /* avg_def[i,j] = (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j] */
   num avg_def {<i,j> in OBS_PAIRS} = (
      if <i,j-1> not in OBS_PAIRS then (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j] 
      else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1] * cust_cnt[j-1]) / num_cust_ij[i,j] 
   );
   num error {<i,j> in OBS_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);
   set OBS_BUCKETS = OBS cross 1..&num_buckets+1;

   /* forward pass to compute shortest-path distances */
   num dist_from_source {OBS_BUCKETS} init constant('BIG');
   dist_from_source[1,1] = 0;
   for {b in 1..&num_buckets} do;
      put b=;
      if b = 1 then do;
         for {<(1),j> in OBS_PAIRS} do;
            if dist_from_source[j,b+1] > dist_from_source[1,b] + error[1,j] then do;
               dist_from_source[j,b+1] = dist_from_source[1,b] + error[1,j];
            end;
         end;
      end;
      else if b = &num_buckets then do;
         for {<i,(dummy_obs)> in OBS_PAIRS} do;
            if dist_from_source[dummy_obs,b+1] > dist_from_source[i,b] + error[i,dummy_obs] then do;
               dist_from_source[dummy_obs,b+1] = dist_from_source[i,b] + error[i,dummy_obs];
            end;
         end;
      end;
      else do;
         for {<i,j> in OBS_PAIRS: i ne 1 and j ne dummy_obs} do;
            if dist_from_source[j,b+1] > dist_from_source[i,b] + error[i,j] then do;
               dist_from_source[j,b+1] = dist_from_source[i,b] + error[i,j];
            end;
         end;
      end;
   end;

   /* backward pass to extract shortest path */
   set <num,num,num> PATH init {}; /* order, from, to */
   num curr_i init dummy_obs;
   num curr_j;
   for {b in &num_buckets...1 by -1} do;
      curr_j = curr_i;
      for {<i,(curr_j)> in OBS_PAIRS} do;
         if dist_from_source[curr_j,b+1] = dist_from_source[i,b] + error[i,curr_j] then do;
            curr_i = i;
            leave;
         end;
      end;
      PATH = {<b,curr_i,curr_j>} union PATH;
   end;
   put dist_from_source[dummy_obs,&num_buckets+1]=;
   put PATH;

   /* highest pd value in each bucket */
   num cutoff      {1..&num_buckets};
   num avg_def_b   {1..&num_buckets};
   num error_b     {1..&num_buckets};
   num num_cust_b  {1..&num_buckets};
   num pct_cust_b  {b in 1..&num_buckets} = num_cust_b[b] / num_cust_total;
   set <str> OBS_b {1..&num_buckets};
   for {<order,from,to> in PATH} do;
      cutoff[order]     = pd[to-1];
      avg_def_b[order]  = avg_def[from,to];
      error_b[order]    = error[from,to];
      num_cust_b[order] = num_cust_ij[from,to];
      OBS_b[order]      = setof {k in from..to-1} obs_id[k];
   end;
   print dist_from_source[dummy_obs,&num_buckets+1] cutoff avg_def_b error_b num_cust_b pct_cust_b;

   /* observations in each bucket */
/*   for {b in 1..&num_buckets} put OBS_b[b]=;*/
quit;
sai_ch
Obsidian | Level 7

@RobPratt Thanks for posting another solution to the problem. Appreciate it.

Ksharp
Super User
"bounds on the count of customers in a group instead of the number of observations in the group?"
They both are the same thing if Customer ID is unique in a group.
But I can do that for you if Customer ID has duplicate value in a group/bucket.(if you want it.)

The following code could get you faster , But still as original one :
BTW: you can set  -memsize 100G   in config file sasv9.conf  to make your SAS has more memory.







data have;
call streaminit(1234);
 do customer=1 to 200;
  pd=rand('uniform');
  def=rand('bern',0.5);
  cust_cnt=ceil(10*rand('uniform'));
  output;
 end;
 format pd percent8.2;
run;
proc sort data=have;by pd;run;



%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));

%let low=%sysevalf(0.01*&nobs,i);
%let high=%sysevalf(0.25*&nobs,i);
%let group=5;

proc iml;
use have;
read all var {pd def cust_cnt} into have ;
close ;

start function(x) global(have,nrow);
 xx=t(x);
 /*if countunique(xx) ^= group then xx=t(sample(v,nrow));*/
 call sort(xx,1);
 start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) ||
           t(loc(t(xx)^=remove(xx,1)||{.}));
 range=start_end[,2]-start_end[,1];
 if any(range<&low) | any(range>&high) then obj=999999999;
 
 else do;
 n=nrow(start_end);
 temp=j(n,1,.);
 do i=1 to n;
  idx=start_end[i,1]:start_end[i,2];
  temp[i,1]=sum(abs(have[idx,1]-(sum(have[idx,2])/sum(have[idx,3]))));
 end;
 obj=sum(temp);
 end;
 
return (obj);
finish;

nrow=nrow(have);
group=&group;
 
encoding=j(2,nrow,1);
encoding[2,]=&group;    

id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,1000,1,1);
call gainit(id,10000,encoding);


niter = 1000; /*<-- Change it as large as you can to get the optimal value*/
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
 

data temp;
 merge groups have;
run;
proc summary data=temp;
 by groups;
 var pd def;
 output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def n=n;
run;
proc print noobs;run;


sai_ch
Obsidian | Level 7

@Ksharp The customer ID is unique within a group. But each customer ID in the rolled up dataset could have multiple underlying customers. For example, the first customer_ID of 1, which is unique, could have 100 corresponding customers. At the customer ID level the bounds are satisfied but once we disaggregate the bounds are violated.

sai_ch
Obsidian | Level 7

@RobPratt In reference to the code pasted below, do we have to multiply the sum of defaults by total customer count. The idea is to calculate simple average, not weighted. When I roll up the data, I am counting the total number defaults as wells customers.

 

num avg_def {<i,j> in OBS_PAIRS} = (

if <i,j-1> not in OBS_PAIRS then (sum {k in i..j-1} def[k] * cust_cnt[k]) / num_cust_ij[i,j]

else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1] * cust_cnt[j-1]) / num_cust_ij[i,j]

);

 

I am changing the part highlighted in red with the below code, please let me know what you think.

 

num avg_def {<i,j> in OBS_PAIRS} = (

if <i,j-1> not in OBS_PAIRS then (sum {k in i..j-1} def[k]) / num_cust_ij[i,j]

else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1] * cust_cnt[j-1]) / num_cust_ij[i,j]

);

 

 

RobPratt
SAS Super FREQ

OK, I was still treating your def column to still be just 0 or 1.  If you are already aggregating def, then your suggested change is the right idea, but you also need to eliminate the multiplication by cust_cnt[j-1] in the next line, too.  The corrected code is:

 

   /* avg_def[i,j] = (sum {k in i..j-1} def[k]) / num_cust_ij[i,j] */
   num avg_def {<i,j> in OBS_PAIRS} = (
      if <i,j-1> not in OBS_PAIRS then (sum {k in i..j-1} def[k]) / num_cust_ij[i,j] 
      else (avg_def[i,j-1] * num_cust_ij[i,j-1] + def[j-1]) / num_cust_ij[i,j] 
   );

 

sai_ch
Obsidian | Level 7

@RobPratt My bad, should have stated that the Def column has changed from 0 and 1s to total count. I made the change in the next line too. With further rolling up the data and using dynamic programming, I have been able to execute the code. Thanks again for that. Is there any available resource that will help me conceptually understand as to how the shortest path algorithm fits my problem.

 

Thanks

RobPratt
SAS Super FREQ

Glad to help.

 

The presentation I mentioned earlier in the thread is a good resource.

 

I also just now looked in Network Flows (1993) by Ahuja, Magnanti, and Orlin, and this problem appears as Exercise 4.6 with squared errors instead of absolute errors.  The same approach works for any objective that is a sum over buckets.

 

I suspect that the following paper describes this application, but I don't have a copy of it.  (Bellman is the originator of dynamic programming.)

Bellman, Richard. "A note on cluster analysis and dynamic programming." Mathematical Biosciences 18.3 (1973): 311-312.

 

For a fun application of side-constrained shortest paths, you might be interested in this blog post and accompanying SAS Global Forum paper:

http://blogs.sas.com/content/operations/2015/04/03/the-traveling-baseball-fan-problem/

sai_ch
Obsidian | Level 7

@RobPratt @Ksharp Is there a way to accept the both approaches as solutions. Thanks again for all the help and the patience.

Ksharp
Super User
Mark @RobPratt code as correct answer. My GA can't guarantee you to get the optimal value ,but Rob's could.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 40 replies
  • 5048 views
  • 3 likes
  • 3 in conversation