@RobPratt If it were up to me, I would happily supply the data. Since the data contains sensitive information, it's difficult for me to share it. Sorry about that.
@Ksharp I was able to modify the objective function to first minimize the error for each observation in a bucket and then sum over all the buckets. Is it possible to apply a couple of constraints to the optimization? No bucket should have more than 25% of the observations or less than 1%. I have been trying to use the repair strategy, but with no success.
Thanks
To impose bounds on the percentage of customers in each bucket, you can introduce macro variables:
%let pct_lb = 0.01;
%let pct_ub = 0.25;
And then in the PROC OPTMODEL code, replace the declaration of CUSTOMER_PAIRS with these three lines:
num numcust_lb = ceil( &pct_lb*(card(CUSTOMERS)-1));
num numcust_ub = floor(&pct_ub*(card(CUSTOMERS)-1));
set CUSTOMER_PAIRS = {i in CUSTOMERS, j in i+numcust_lb..min(i+numcust_ub,dummy_customer)};
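Each pair <i,j> corresponds to a bucket containing customers i..j-1, so its size is j - i. As a quick sanity check (a sketch you could drop in right after the set declaration), print the bounds along with the smallest and largest candidate bucket sizes:
put numcust_lb= numcust_ub=;
put 'smallest candidate bucket: ' (min {<i,j> in CUSTOMER_PAIRS} (j - i));
put 'largest candidate bucket: ' (max {<i,j> in CUSTOMER_PAIRS} (j - i));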
For @Ksharp's random data, the resulting optimal solution has objective value 1.1086:
bucket | cutoff | avg_pd_b | avg_def_b | error_b | num_cust_b | pct_cust_b |
---|---|---|---|---|---|---|
1 | 0.21685 | 0.11294 | 0.68000 | 0.5670646 | 25 | 0.25 |
2 | 0.45716 | 0.32589 | 0.29167 | 0.0342267 | 24 | 0.24 |
3 | 0.71595 | 0.58697 | 0.36000 | 0.2269716 | 25 | 0.25 |
4 | 0.99431 | 0.87622 | 0.60000 | 0.2762192 | 25 | 0.25 |
5 | 0.99592 | 0.99592 | 1.00000 | 0.0040805 | 1 | 0.01 |
Technically, I can do that for you if you really need it. I suggest you run Rob's code first. If that still doesn't work, I will post IML code for your request.
@sai_ch, please check whether changing the error formula as follows matches your new objective for a small instance:
num error {<i,j> in CUSTOMER_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);
Then I'll look into making the code more scalable.
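For a concrete check on a made-up bucket of three customers (numbers invented purely for illustration), the per-observation error works out like this:
proc iml;
pd ={0.10, 0.30, 0.40};        /* sorted PDs of the customers in one bucket */
def={0, 1, 0};                 /* their observed defaults */
avg_def=mean(def);             /* bucket default rate = 1/3 */
error=sum(abs(pd-avg_def));    /* 0.2333 + 0.0333 + 0.0667 = 0.3333 */
print avg_def error;
quit;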
@RobPratt @Ksharp I really appreciate all the help in solving this problem. I am still getting an out-of-memory error with ~16k observations in OPTMODEL. Since this is my first exposure to OPTMODEL, it's taking me longer to understand and make small tweaks. I have used IML previously. I fully intend to get both solutions working.
I am going to randomly select 1000 observations and run the OPTMODEL code.
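For example, the sample could be drawn with something like this (a sketch; mydata stands in for the full ~16k-row dataset and the seed is arbitrary):
proc surveyselect data=mydata out=mydata_sample method=srs sampsize=1000 seed=12345;
run;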
OK. Here is what I got. I also changed the objective value according to what you described. Note that the GA can't guarantee the best objective value. Good luck.

data have;
call streaminit(1234);
do customer=1 to 99;
 pd=rand('uniform');
 def=rand('bern',0.5);
 output;
end;
format pd percent8.2;
run;
proc sort data=have;by pd;run;

%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%let low=%sysevalf(0.01*&nobs,i);
%let high=%sysevalf(0.25*&nobs,i);
%let group=5;

proc iml;
use have;
read all var {pd def} into have;
close;

start function(x) global(have,nrow,group,v,index);
 xx=t(x);
 /*if countunique(xx) ^= group then xx=t(sample(v,nrow));*/
 call sort(xx,1);
 start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) || t(loc(t(xx)^=remove(xx,1)||{.}));
 range=start_end[,2]-start_end[,1];
 /* repair: fall back to the default equal-sized split when the size bounds are violated */
 if any(range<&low) | any(range>&high) then start_end=index;
 n=nrow(start_end);
 temp=j(n,1,.);
 do i=1 to n;
  idx=start_end[i,1]:start_end[i,2];
  temp[i,1]=sum(abs(have[idx,1]-have[idx,2]));
 end;
 obj=sum(temp);
 return (obj);
finish;

nrow=nrow(have);
group=&group;
tmp=do(1,nrow,int(nrow/&group));
index=t({1}||remove(tmp,1||ncol(tmp))+1)||t(remove(tmp,1||ncol(tmp))||&nobs);
v=1:&group;
encoding=j(2,nrow,1);
encoding[2,]=&group;

id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,1000,1,1);
call gainit(id,10000,encoding);

niter = 1000; /*<-- Change it as large as you can to get the optimal value*/
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"];
call gaend(id);
quit;

data temp;
 merge groups have;
run;
proc summary data=temp;
 by groups;
 var pd def;
 output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def n=n;
run;
proc print noobs;run;
@Ksharp, when I run your latest code, IML reports an objective value of 51.263465, with the following solution:
GROUPS | cut_off_value | avg_pd | avg_def | n |
---|---|---|---|---|
1 | 21.62% | 10.86% | 0.70833 | 24 |
2 | 31.39% | 25.67% | 0.30769 | 13 |
3 | 58.30% | 46.09% | 0.40000 | 25 |
4 | 79.20% | 68.63% | 0.29412 | 17 |
5 | 99.59% | 90.70% | 0.60000 | 20 |
Group 3 slightly violates the 25% bound: with only 99 observations total, its 25 observations amount to 25/99 ≈ 25.3%. But the objective value also doesn't seem to match. Here's my understanding of the desired error calculation for the above solution, which yields an objective value of 29.95092598:
bucket | cutoff | avg_pd_b | avg_def_b | error_b | num_cust_b | pct_cust_b |
---|---|---|---|---|---|---|
1 | 0.21622 | 0.10861 | 0.70833 | 14.39347 | 24 | 0.24242 |
2 | 0.31388 | 0.25668 | 0.30769 | 0.67551 | 13 | 0.13131 |
3 | 0.58304 | 0.46088 | 0.40000 | 2.07486 | 25 | 0.25253 |
4 | 0.79205 | 0.68635 | 0.29412 | 6.66792 | 17 | 0.17172 |
5 | 0.99592 | 0.90696 | 0.60000 | 6.13917 | 20 | 0.20202 |
@sai_ch, is either one of these objective values correct?
@RobPratt @Ksharp Maybe I didn't explain the objective function too well. Hopefully the modified code below will explain it better, assuming it's correct. I changed the input dataset slightly to account for the roll-up (~16k rows as opposed to ~200k) and to calculate the def rate accurately.
pd | def | cust_cnt |
---|---|---|
0.001% | 0 | 1 |
0.002% | 0 | 3 |
0.003% | 0 | 2 |
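As an aside, the roll-up itself could be built along these lines (a sketch; have_detail is a stand-in name for the original one-row-per-customer table, and def is assumed to hold the number of defaults at each pd value, which matches the sum(def)/sum(cust_cnt) formula in the function below):
proc sql;
  create table have_rollup as
  select pd,
         sum(def) as def,        /* assumed: number of defaults at this pd value */
         count(*) as cust_cnt    /* number of customers at this pd value */
  from have_detail
  group by pd
  order by pd;
quit;
Under that assumption, sum(def)/sum(cust_cnt) inside the objective function is the bucket-level default rate.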
n=nrow(start_end);       /* number of buckets */
n1=nrow(xx);             /* number of observations */
temp=j(n1,3,.);          /* per observation: pd, bucket def rate, absolute error */
temp1=j(n,1,.);          /* per bucket: total error */
do i=1 to n;
 idx=start_end[i,1]:start_end[i,2];
 do j=start_end[i,1] to start_end[i,2];
  temp[j,1]=have[j,1];
  temp[j,2]=sum(have[idx,2])/sum(have[idx,3]);
  temp[j,3]=abs(temp[j,1]-temp[j,2]);
 end;
 temp1[i,1]=sum(temp[idx,3]);
end;
obj=sum(temp1);
return (obj);
Thanks
To get the weighted average of def for a bucket, I think you need to replace:
temp[j,2]=sum(have[idx,2])/sum(have[idx,3]);
With this:
temp[j,2]=sum(have[idx,2]#have[idx,3])/sum(have[idx,3]);
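Here is a tiny made-up illustration of the difference (assuming def stores the default rate at each pd level, which is the premise of the weighted average):
proc iml;
/* made-up bucket: def = default rate at each pd level, cust_cnt = customers there */
def     ={0.0, 1.0, 0.5};
cust_cnt={1, 3, 2};
unweighted=sum(def)/sum(cust_cnt);           /* 1.5/6 = 0.25  */
weighted  =sum(def#cust_cnt)/sum(cust_cnt);  /* 4/6 ≈ 0.667   */
print unweighted weighted;
quit;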
OK, the objective value is clear now, but you didn't include 'cust_cnt' in your original data. I also used vector operations to speed up the code.

data have;
call streaminit(1234);
do customer=1 to 99;
 pd=rand('uniform');
 def=rand('bern',0.5);
 cust_cnt=ceil(10*rand('uniform'));
 output;
end;
format pd percent8.2;
run;
proc sort data=have;by pd;run;

%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%let low=%sysevalf(0.01*&nobs,i);
%let high=%sysevalf(0.25*&nobs,i);
%let group=5;

proc iml;
use have;
read all var {pd def cust_cnt} into have;
close;

start function(x) global(have,nrow,group,v,index);
 xx=t(x);
 /*if countunique(xx) ^= group then xx=t(sample(v,nrow));*/
 call sort(xx,1);
 start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) || t(loc(t(xx)^=remove(xx,1)||{.}));
 range=start_end[,2]-start_end[,1];
 if any(range<&low) | any(range>&high) then start_end=index;
 n=nrow(start_end);
 temp=j(n,1,.);
 do i=1 to n;
  idx=start_end[i,1]:start_end[i,2];
  temp[i,1]=sum(abs(have[idx,1]-(sum(have[idx,2])/sum(have[idx,3]))));
 end;
 obj=sum(temp);
 return (obj);
finish;

nrow=nrow(have);
group=&group;
tmp=do(1,nrow,int(nrow/&group));
index=t({1}||remove(tmp,1||ncol(tmp))+1)||t(remove(tmp,1||ncol(tmp))||&nobs);
v=1:&group;
encoding=j(2,nrow,1);
encoding[2,]=&group;

id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,1000,1,1);
call gainit(id,10000,encoding);

niter = 1000; /*<-- Change it as large as you can to get the optimal value*/
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"];
call gaend(id);
quit;

data temp;
 merge groups have;
run;
proc summary data=temp;
 by groups;
 var pd def;
 output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def n=n;
run;
proc print noobs;run;
@Ksharp When I originally posted the question, I wasn't rolling up the input dataset, but I then ran into memory issues. To circumvent them, I rolled up the dataset, which necessitated introducing the total count of customers.
@RobPratt I am thinking through the def rate calculation you suggested.
@RobPratt You are right, the code is imposing the restriction on the number of observations in each bucket, but the number of customers per grade is violating the constraints. I posted the new dataset with the count of customers after @Ksharp had posted the solution with constraints.
If there is a way to put a constraint on the number of customers per bucket, that would be ideal.
Thanks
@Ksharp Is it possible to put bounds on the count of customers in a group instead of the number of observations in the group?
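One possible direction (just a sketch, untested): derive the bounds from the total of cust_cnt rather than from &nobs, and inside the objective function measure each bucket by its total customer count instead of its row count, for example:
proc sql noprint;
  select sum(cust_cnt) into :tot_cust from have;
quit;
%let low_cust =%sysevalf(0.01*&tot_cust,i);
%let high_cust=%sysevalf(0.25*&tot_cust,i);

/* inside the objective function, replace the row-count check with something like: */
ncust=j(nrow(start_end),1,.);
do i=1 to nrow(start_end);
 idx=start_end[i,1]:start_end[i,2];
 ncust[i]=sum(have[idx,3]);    /* total cust_cnt in bucket i */
end;
if any(ncust<&low_cust) | any(ncust>&high_cust) then start_end=index;
Note that the fallback split (index) is still based on row counts, so it might need its own adjustment to respect the customer-count bounds.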