@RobPratt If it were up to me, I would happily supply the data. Since the data contains sensitive information, it's difficult for me to share it. Sorry about that.
@Ksharp I was able to modify the objective function to first minimize the error for each observation in a bucket and then sum over all the buckets. Is it possible to apply a couple of constraints to the optimization? No bucket should have more than 25% of the observations or less than 1%. I have been trying to use the repair strategy, but with no success.
Thanks
To impose bounds on the percentage of customers in each bucket, you can introduce macro variables:
%let pct_lb = 0.01;
%let pct_ub = 0.25;
And then in the PROC OPTMODEL code, replace the declaration of CUSTOMER_PAIRS with these three lines:
num numcust_lb = ceil( &pct_lb*(card(CUSTOMERS)-1));
num numcust_ub = floor(&pct_ub*(card(CUSTOMERS)-1));
set CUSTOMER_PAIRS = {i in CUSTOMERS, j in i+numcust_lb..min(i+numcust_ub,dummy_customer)};
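Each pair <i,j> corresponds to a bucket containing customers i..j-1, so its size is j - i. As a quick sanity check (a sketch you could drop in right after the set declaration), print the bounds along with the smallest and largest candidate bucket sizes:
put numcust_lb= numcust_ub=;
put 'smallest candidate bucket: ' (min {<i,j> in CUSTOMER_PAIRS} (j - i));
put 'largest candidate bucket: ' (max {<i,j> in CUSTOMER_PAIRS} (j - i));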
For @Ksharp's random data, the resulting optimal solution has objective value 1.1086:
bucket | cutoff | avg_pd_b | avg_def_b | error_b | num_cust_b | pct_cust_b |
---|---|---|---|---|---|---|
1 | 0.21685 | 0.11294 | 0.68000 | 0.5670646 | 25 | 0.25 |
2 | 0.45716 | 0.32589 | 0.29167 | 0.0342267 | 24 | 0.24 |
3 | 0.71595 | 0.58697 | 0.36000 | 0.2269716 | 25 | 0.25 |
4 | 0.99431 | 0.87622 | 0.60000 | 0.2762192 | 25 | 0.25 |
5 | 0.99592 | 0.99592 | 1.00000 | 0.0040805 | 1 | 0.01 |
Technically, I can do that for you if you really need it. I suggest you run Rob's code first. If that still doesn't work, I will post IML code for your request.
@sai_ch, please check whether changing the error formula as follows matches your new objective for a small instance:
num error {<i,j> in CUSTOMER_PAIRS} = sum {k in i..j-1} abs(pd[k] - avg_def[i,j]);
Then I'll look into making the code more scalable.
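For a concrete check on a made-up bucket of three customers (numbers invented purely for illustration), the per-observation error works out like this:
proc iml;
pd ={0.10, 0.30, 0.40};        /* sorted PDs of the customers in one bucket */
def={0, 1, 0};                 /* their observed defaults */
avg_def=mean(def);             /* bucket default rate = 1/3 */
error=sum(abs(pd-avg_def));    /* 0.2333 + 0.0333 + 0.0667 = 0.3333 */
print avg_def error;
quit;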
@RobPratt @Ksharp I really appreciate all the help in solving this problem. I am still getting an out-of-memory error with ~16k observations in OPTMODEL. Since this is my first exposure to OPTMODEL, it's taking me longer to understand and make small tweaks. I have used IML previously. I fully intend to get both solutions working.
I am going to randomly select 1000 observations and run the OPTMODEL code.
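For example, the sample could be drawn with something like this (a sketch; mydata stands in for the full ~16k-row dataset and the seed is arbitrary):
proc surveyselect data=mydata out=mydata_sample method=srs sampsize=1000 seed=12345;
run;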
OK. Here is what I got. I also changed the objective value according to what you described. Note that the GA can't guarantee the best objective value. Good luck.

data have;
call streaminit(1234);
do customer=1 to 99;
 pd=rand('uniform');
 def=rand('bern',0.5);
 output;
end;
format pd percent8.2;
run;
proc sort data=have;by pd;run;

%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%let low=%sysevalf(0.01*&nobs,i);
%let high=%sysevalf(0.25*&nobs,i);
%let group=5;

proc iml;
use have;
read all var {pd def} into have;
close;

start function(x) global(have,nrow,group,v,index);
 xx=t(x);
 /*if countunique(xx) ^= group then xx=t(sample(v,nrow));*/
 call sort(xx,1);
 start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) || t(loc(t(xx)^=remove(xx,1)||{.}));
 range=start_end[,2]-start_end[,1];
 /* repair: fall back to the default equal-sized split when the size bounds are violated */
 if any(range<&low) | any(range>&high) then start_end=index;
 n=nrow(start_end);
 temp=j(n,1,.);
 do i=1 to n;
  idx=start_end[i,1]:start_end[i,2];
  temp[i,1]=sum(abs(have[idx,1]-have[idx,2]));
 end;
 obj=sum(temp);
 return (obj);
finish;

nrow=nrow(have);
group=&group;
tmp=do(1,nrow,int(nrow/&group));
index=t({1}||remove(tmp,1||ncol(tmp))+1)||t(remove(tmp,1||ncol(tmp))||&nobs);
v=1:&group;
encoding=j(2,nrow,1);
encoding[2,]=&group;

id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,1000,1,1);
call gainit(id,10000,encoding);

niter = 1000; /*<-- Change it as large as you can to get the optimal value*/
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"];
call gaend(id);
quit;

data temp;
 merge groups have;
run;
proc summary data=temp;
 by groups;
 var pd def;
 output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def n=n;
run;
proc print noobs;run;
@Ksharp, when I run your latest code, IML reports an objective value of 51.263465, with the following solution:
GROUPS | cut_off_value | avg_pd | avg_def | n |
---|---|---|---|---|
1 | 21.62% | 10.86% | 0.70833 | 24 |
2 | 31.39% | 25.67% | 0.30769 | 13 |
3 | 58.30% | 46.09% | 0.40000 | 25 |
4 | 79.20% | 68.63% | 0.29412 | 17 |
5 | 99.59% | 90.70% | 0.60000 | 20 |
Group 3 slightly violates the 25% bound: with only 99 observations total, its 25 observations amount to 25/99 ≈ 25.3%. But the objective value also doesn't seem to match. Here's my understanding of the desired error calculation for the above solution, which yields an objective value of 29.95092598:
bucket | cutoff | avg_pd_b | avg_def_b | error_b | num_cust_b | pct_cust_b |
---|---|---|---|---|---|---|
1 | 0.21622 | 0.10861 | 0.70833 | 14.39347 | 24 | 0.24242 |
2 | 0.31388 | 0.25668 | 0.30769 | 0.67551 | 13 | 0.13131 |
3 | 0.58304 | 0.46088 | 0.40000 | 2.07486 | 25 | 0.25253 |
4 | 0.79205 | 0.68635 | 0.29412 | 6.66792 | 17 | 0.17172 |
5 | 0.99592 | 0.90696 | 0.60000 | 6.13917 | 20 | 0.20202 |
@sai_ch, is either one of these objective values correct?
@RobPratt @Ksharp Maybe I didn't explain the objective function too well. Hopefully the modified code below will explain it better, assuming it's correct. I changed the input dataset slightly to account for the roll-up (~16k rows as opposed to ~200k) and to calculate the def rate accurately.
pd | def | cust_cnt |
---|---|---|
0.001% | 0 | 1 |
0.002% | 0 | 3 |
0.003% | 0 | 2 |
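As an aside, the roll-up itself could be built along these lines (a sketch; have_detail is a stand-in name for the original one-row-per-customer table, and def is assumed to hold the number of defaults at each pd value, which matches the sum(def)/sum(cust_cnt) formula in the function below):
proc sql;
  create table have_rollup as
  select pd,
         sum(def) as def,        /* assumed: number of defaults at this pd value */
         count(*) as cust_cnt    /* number of customers at this pd value */
  from have_detail
  group by pd
  order by pd;
quit;
Under that assumption, sum(def)/sum(cust_cnt) inside the objective function is the bucket-level default rate.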
n=nrow(start_end);       /* number of buckets */
n1=nrow(xx);             /* number of observations */
temp=j(n1,3,.);          /* per observation: pd, bucket def rate, absolute error */
temp1=j(n,1,.);          /* per bucket: total error */
do i=1 to n;
 idx=start_end[i,1]:start_end[i,2];
 do j=start_end[i,1] to start_end[i,2];
  temp[j,1]=have[j,1];
  temp[j,2]=sum(have[idx,2])/sum(have[idx,3]);
  temp[j,3]=abs(temp[j,1]-temp[j,2]);
 end;
 temp1[i,1]=sum(temp[idx,3]);
end;
obj=sum(temp1);
return (obj);
Thanks
To get the weighted average of def for a bucket, I think you need to replace:
temp[j,2]=sum(have[idx,2])/sum(have[idx,3]);
With this:
temp[j,2]=sum(have[idx,2]#have[idx,3])/sum(have[idx,3]);
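Here is a tiny made-up illustration of the difference (assuming def stores the default rate at each pd level, which is the premise of the weighted average):
proc iml;
/* made-up bucket: def = default rate at each pd level, cust_cnt = customers there */
def     ={0.0, 1.0, 0.5};
cust_cnt={1, 3, 2};
unweighted=sum(def)/sum(cust_cnt);           /* 1.5/6 = 0.25  */
weighted  =sum(def#cust_cnt)/sum(cust_cnt);  /* 4/6 ≈ 0.667   */
print unweighted weighted;
quit;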
OK, the objective value is clear now, but you didn't include 'cust_cnt' in your original data. I also used vector operations to speed up the code.

data have;
call streaminit(1234);
do customer=1 to 99;
 pd=rand('uniform');
 def=rand('bern',0.5);
 cust_cnt=ceil(10*rand('uniform'));
 output;
end;
format pd percent8.2;
run;
proc sort data=have;by pd;run;

%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%let low=%sysevalf(0.01*&nobs,i);
%let high=%sysevalf(0.25*&nobs,i);
%let group=5;

proc iml;
use have;
read all var {pd def cust_cnt} into have;
close;

start function(x) global(have,nrow,group,v,index);
 xx=t(x);
 /*if countunique(xx) ^= group then xx=t(sample(v,nrow));*/
 call sort(xx,1);
 start_end=t(loc(t(xx)^={.}||remove(xx,nrow))) || t(loc(t(xx)^=remove(xx,1)||{.}));
 range=start_end[,2]-start_end[,1];
 if any(range<&low) | any(range>&high) then start_end=index;
 n=nrow(start_end);
 temp=j(n,1,.);
 do i=1 to n;
  idx=start_end[i,1]:start_end[i,2];
  temp[i,1]=sum(abs(have[idx,1]-(sum(have[idx,2])/sum(have[idx,3]))));
 end;
 obj=sum(temp);
 return (obj);
finish;

nrow=nrow(have);
group=&group;
tmp=do(1,nrow,int(nrow/&group));
index=t({1}||remove(tmp,1||ncol(tmp))+1)||t(remove(tmp,1||ncol(tmp))||&nobs);
v=1:&group;
encoding=j(2,nrow,1);
encoding[2,]=&group;

id=gasetup(2,nrow,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,1000,1,1);
call gainit(id,10000,encoding);

niter = 1000; /*<-- Change it as large as you can to get the optimal value*/
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

groups=t(mem);
call sort(groups,1);
create groups var {groups};
append;
close;
print value[l = "Min Value:"];
call gaend(id);
quit;

data temp;
 merge groups have;
run;
proc summary data=temp;
 by groups;
 var pd def;
 output out=want(drop=_:) max(pd)=cut_off_value mean=avg_pd avg_def n=n;
run;
proc print noobs;run;
@Ksharp When I originally posted the question, I wasn't rolling up the input dataset, but I then ran into memory issues. To circumvent them, I rolled up the dataset, which necessitated introducing the total count of customers.
@RobPratt I am thinking through the def rate calculation you suggested.
@RobPratt You are right, the code is imposing the restriction on the number of observations in each bucket, but the number of customers per grade is violating the constraints. I posted the new dataset with the count of customers after @Ksharp had posted the solution with constraints.
If there is a way to put a constraint on the number of customers per bucket, that would be ideal.
Thanks
@Ksharp Is it possible to put bounds on the count of customers in a group instead of the number of observations in the group?
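One possible direction (just a sketch, untested): derive the bounds from the total of cust_cnt rather than from &nobs, and inside the objective function measure each bucket by its total customer count instead of its row count, for example:
proc sql noprint;
  select sum(cust_cnt) into :tot_cust from have;
quit;
%let low_cust =%sysevalf(0.01*&tot_cust,i);
%let high_cust=%sysevalf(0.25*&tot_cust,i);

/* inside the objective function, replace the row-count check with something like: */
ncust=j(nrow(start_end),1,.);
do i=1 to nrow(start_end);
 idx=start_end[i,1]:start_end[i,2];
 ncust[i]=sum(have[idx,3]);    /* total cust_cnt in bucket i */
end;
if any(ncust<&low_cust) | any(ncust>&high_cust) then start_end=index;
Note that the fallback split (index) is still based on row counts, so it might need its own adjustment to respect the customer-count bounds.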