RobPratt
SAS Super FREQ
Can you please supply the 16k observation data set?


sai_ch
Obsidian | Level 7

@RobPratt If it were up to me, I would have happily supplied the data. Since this data contains sensitive information, it's difficult for me to share it. Sorry about that.

 

@Ksharp I was able to modify the objective function to first minimize the error for each observation in the bucket and then for all the buckets together. Is it possible to apply a couple of constraints to the optimization? No bucket should have more than 25% of observations or fewer than 1%. I have been trying to use the repair strategy, but with no success.

 

Thanks

RobPratt
SAS Super FREQ

To impose bounds on the percentage of customers in each bucket, you can introduce macro variables:

%let pct_lb = 0.01;
%let pct_ub = 0.25;


And then in the PROC OPTMODEL code, replace the declaration of CUSTOMER_PAIRS with these three lines:

num numcust_lb = ceil( &pct_lb*(card(CUSTOMERS)-1));
num numcust_ub = floor(&pct_ub*(card(CUSTOMERS)-1));
set CUSTOMER_PAIRS = {i in CUSTOMERS, j in i+numcust_lb..min(i+numcust_ub,dummy_customer)};
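For context, here is a hedged sketch of the declarations those lines refer to; the full OPTMODEL formulation appears earlier in the thread and is not reproduced in this excerpt, so numcust and the set definitions below are assumptions chosen to mirror the snippet above:

/* sketch only -- not the actual model from the earlier post                   */
num numcust;                             /* number of sorted customers, read from the data */
num dummy_customer = numcust + 1;        /* sentinel marking the end of a path  */
set CUSTOMERS = 1..dummy_customer;
/* a pair <i,j> represents a bucket of customers i..j-1, so restricting j to   */
/* i+numcust_lb..i+numcust_ub forces every bucket size j-i into                */
/* [numcust_lb, numcust_ub]                                                    */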

For @Ksharp's random data, the resulting optimal solution has objective value 1.1086:

 

[1] cutoff avg_pd_b avg_def_b error_b num_cust_b pct_cust_b
1 0.21685 0.11294 0.68000 0.5670646 25 0.25
2 0.45716 0.32589 0.29167 0.0342267 24 0.24
3 0.71595 0.58697 0.36000 0.2269716 25 0.25
4 0.99431 0.87622 0.60000 0.2762192 25 0.25
5 0.99592 0.99592 1.00000 0.0040805 1 0.01
Ksharp
Super User
Technically, I can do that for you if you really need it.
I suggest you run Rob's code first.
If that still doesn't work, I will post IML code for your request.

RobPratt
SAS Super FREQ

@sai_ch, please check whether changing the error formula as follows matches your new objective for a small instance:

 

num error {<i,j> in CUSTOMER_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);

Then I'll look into making the code more scalable.

sai_ch
Obsidian | Level 7

@RobPratt @Ksharp I really appreciate all the help in solving this problem. I am still getting an out-of-memory error with ~16k observations in OPTMODEL. Since this is my first exposure to OPTMODEL, it's taking me longer to understand and make small tweaks. I have used IML previously. I fully intend to get both solutions working.

 

I am going to randomly select 1000 observations and run the OPTMODEL code on that sample.

Ksharp
Super User
OK. Here is what I got. I also changed the objective value calculation as you described.
The GA can't guarantee that you get the best objective value.
Good luck.


data have;
call streaminit(1234);
 do customer=1 to 99;
  pd=rand('uniform');
  def=rand('bern',0.5);
  output;
 end;
 format pd percent8.2;
run;
proc sort data=have;by pd;run;



%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));

%let low=%sysevalf(0.01*&nobs,i);
%let high=%sysevalf(0.25*&nobs,i);
%let group=5;

proc iml;
use have;
read all var {pd def} into have ;
close ;

start function(x) global(have,nrow,group,v,index);
 xx=t(x);
 /*if countunique(xx) ^= group then xx=t(sample(v,nrow));*/
 call sort(xx,1);
 /* first and last observation index of each group in the sorted assignment */
 start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) ||
           t(loc(t(xx)^=remove(xx,1)||{.}));
 /* repair: if any group is too small or too large, fall back to the default split */
 range=start_end[,2]-start_end[,1];
 if any(range<&low) | any(range>&high) then start_end=index;
 
 /* objective: sum of |pd - def| over the observations in each group */
 n=nrow(start_end);
 temp=j(n,1,.);
 do i=1 to n;
  idx=start_end[i,1]:start_end[i,2];
  temp[i,1]=sum(abs(have[idx,1]-have[idx,2]));
 end;
 obj=sum(temp);
return (obj);
finish;

nrow=nrow(have);
group=&group;
 
/* default (roughly even) split into &group consecutive groups; used as a */
/* fallback by the repair step in the objective function                  */
tmp=do(1,nrow,int(nrow/&group));
index=t({1}||remove(tmp,1||ncol(tmp))+1)||t(remove(tmp,1||ncol(tmp))||&nobs);

v=1:&group;
/* each observation gets an integer gene in 1..&group (its group label) */
encoding=j(2,nrow,1);
encoding[2,]=&group;    

id=gasetup(2,nrow,123456789);      /* integer-vector encoding of length nrow     */
call gasetobj(id,0,"function");    /* 0 = minimize the user-defined module       */
call gasetsel(id,1000,1,1);        /* carry the 1000 best members forward        */
call gainit(id,10000,encoding);    /* initial population of 10000 random members */


niter = 1000; /*<-- Change it as large as you can to get the optimal value*/
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
 

data temp;
 merge groups have;
run;
proc summary data=temp;
 by groups;
 var pd def;
 output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def n=n;
run;
proc print noobs;run;
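As a quick check on the bounds, the number and percentage of observations falling in each group can be tabulated, for example:

proc freq data=temp;
 tables groups / nocum;   /* count and percent of observations per group */
run;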




RobPratt
SAS Super FREQ

@Ksharp, when I run your latest code, IML reports an objective value of 51.263465, with the following solution:

 

GROUPS cut_off_value avg_pd avg_def n
1 21.62% 10.86% 0.70833 24
2 31.39% 25.67% 0.30769 13
3 58.30% 46.09% 0.40000 25
4 79.20% 68.63% 0.29412 17
5 99.59% 90.70% 0.60000 20

 

Group 3 slightly violates the 25% bound (there are only 99 observations total).  But the objective value also doesn't seem to match.  Here's my understanding of the desired error calculation for the above solution, which yields an objective value of 29.95092598:

 

[1] cutoff avg_pd_b avg_def_b error_b num_cust_b pct_cust_b
1 0.21622 0.10861 0.70833 14.39347 24 0.24242
2 0.31388 0.25668 0.30769 0.67551 13 0.13131
3 0.58304 0.46088 0.40000 2.07486 25 0.25253
4 0.79205 0.68635 0.29412 6.66792 17 0.17172
5 0.99592 0.90696 0.60000 6.13917 20 0.20202

 

@sai_ch, is either one of these objective values correct?

sai_ch
Obsidian | Level 7

@RobPratt @Ksharp Maybe I didn't explain the objective function too well. Hopefully, the modified code below will explain it better, assuming it's correct. I changed the input dataset slightly to account for the roll-up (~16k rows as opposed to ~200k) and to calculate the def rate accurately.

 

pd      def  cust_cnt
0.001%  0    1
0.002%  0    3
0.003%  0    2

 

n=nrow(start_end);
n1=nrow(xx);
temp=j(n1,3,.);
temp1=j(n,1,.);
do i=1 to n;
 idx=start_end[i,1]:start_end[i,2];
 do j=start_end[i,1] to start_end[i,2];
  temp[j,1]=have[j,1];                          /* pd for this observation                 */
  temp[j,2]=sum(have[idx,2])/sum(have[idx,3]);  /* bucket def rate: sum(def)/sum(cust_cnt) */
  temp[j,3]=abs(temp[j,1]-temp[j,2]);           /* per-observation error                   */
 end;
 temp1[i,1]=sum(temp[idx,3]);                   /* total error for this bucket             */
end;
obj=sum(temp1);
return (obj);

 

thanks

 

RobPratt
SAS Super FREQ

To get the weighted average of def for a bucket, I think you need to replace:

 

temp[j,2]=sum(have[idx,2])/sum(have[idx,3]);

 

With this:

 

temp[j,2]=sum(have[idx,2]#have[idx,3])/sum(have[idx,3]);
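As a quick illustration of the difference, using made-up numbers and assuming def holds a per-row default rate (note that # is the elementwise multiplication operator in IML):

proc iml;
/* hypothetical rolled-up bucket: two rows with default rates 0 and 0.5 */
/* covering 1 and 9 customers respectively (illustrative values only)   */
def      = {0, 0.5};
cust_cnt = {1, 9};
unweighted = sum(def)/sum(cust_cnt);           /* (0 + 0.5)/10   = 0.05 */
weighted   = sum(def#cust_cnt)/sum(cust_cnt);  /* (0*1+0.5*9)/10 = 0.45 */
print unweighted weighted;
quit;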
Ksharp
Super User
OK. The objective value is clear now, but you didn't include 'cust_cnt' in your original data.
I also used vector operations to speed up the code.




data have;
call streaminit(1234);
 do customer=1 to 99;
  pd=rand('uniform');
  def=rand('bern',0.5);
  cust_cnt=ceil(10*rand('uniform'));
  output;
 end;
 format pd percent8.2;
run;
proc sort data=have;by pd;run;



%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));

%let low=%sysevalf(0.01*&nobs,i);
%let high=%sysevalf(0.25*&nobs,i);
%let group=5;

proc iml;
use have;
read all var {pd def cust_cnt} into have ;
close ;

start function(x) global(have,nrow,group,v,index);
 xx=t(x);
 /*if countunique(xx) ^= group then xx=t(sample(v,nrow));*/
 call sort(xx,1);
 start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) ||
           t(loc(t(xx)^=remove(xx,1)||{.}));
 range=start_end[,2]-start_end[,1];
 if any(range<&low) | any(range>&high) then start_end=index;
 
 n=nrow(start_end);
 temp=j(n,1,.);
 do i=1 to n;
  idx=start_end[i,1]:start_end[i,2];
  temp[i,1]=sum(abs(have[idx,1]-(sum(have[idx,2])/sum(have[idx,3]))));  /* |pd - bucket def rate|, where the rate is sum(def)/sum(cust_cnt) */
 end;
 obj=sum(temp);
return (obj);
finish;

nrow=nrow(have);
group=&group;
 
tmp=do(1,nrow,int(nrow/&group));
index=t({1}||remove(tmp,1||ncol(tmp))+1)||t(remove(tmp,1||ncol(tmp))||&nobs);

v=1:&group;
encoding=j(2,nrow,1);
encoding[2,]=&group;    

id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,1000,1,1);
call gainit(id,10000,encoding);


niter = 1000; /*<-- Change it as large as you can to get the optimal value*/
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
 

data temp;
 merge groups have;
run;
proc summary data=temp;
 by groups;
 var pd def;
 output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def n=n;
run;
proc print noobs;run;


sai_ch
Obsidian | Level 7

@Ksharp When I originally posted the question, I wasn't rolling up the input dataset, but I then ran into memory issues. To circumvent them, I rolled up the dataset, which necessitated introducing the total count of customers.

 

@RobPratt I am thinking through the calculation of the def rate that you suggested.

RobPratt
SAS Super FREQ

It looks like @Ksharp's latest code imposes bounds on the percentage of observations per bucket rather than on the percentage of customers per bucket.

 

@sai_ch, please clarify which one you want.

sai_ch
Obsidian | Level 7

@RobPratt You are right, the code is imposing a restriction on the number of observations in each bucket, and the number of customers per grade is violating the constraints. I posted the new dataset with the count of customers after @Ksharp had posted the solution with constraints.

 

If there is a way to put a constraint on the number of customers per bucket, that would be ideal.

 

Thanks 

sai_ch
Obsidian | Level 7

@Ksharp Is it possible to put bounds on the count of customers in a group instead of the number of observations in the group?
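For reference, a hedged sketch of one way to do that inside the fitness function from the latest code (assuming the third column of have is cust_cnt, and that &low and &high are recomputed as bounds on the customer count rather than on the observation count) would replace the range check with something like:

/* hedged sketch: per-group customer-count check instead of the row-count check */
n=nrow(start_end);
cust_in_group=j(n,1,.);
do i=1 to n;
 idx=start_end[i,1]:start_end[i,2];
 cust_in_group[i,1]=sum(have[idx,3]);   /* total customers assigned to group i */
end;
if any(cust_in_group<&low) | any(cust_in_group>&high) then start_end=index;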
