run your code on my data on specific variable (Nr Years in bank )
In your code the best group should be 1 and worse should be 4
I don't see that Per increasing consistently by moving from group 1 to 2 and 2 to 3 and 3 to 4.
Is it not part of the requirement??? To have consistently increasing in per (percentage of bads)???
The code you send me is running 100%
It provide good results but my friend provide better results via Miner (IGN)
It would be best if can provide such results in SAS code.
By the way- what is the alogorithm name that you used in your code?
"It provide good results but my friend provide better results via Miner (IGN)"
You could try SAS/EM 's ScoreCard module,but that need money to pay.
And I never heard of Miner ,what kind of software it is?
"It would be best if can provide such results in SAS code."
1)You could ask @RobPratt to write SAS/OR code for it ,and also need to pay money to get SAS/OR.
2)Make %let n_iter=10000 ; as big as you can , but that maybe cost you an entire day to get better result.
And make the following 1000 as big as you can. E.X. 10000
call gainit(id,1000,encoding);
--->
call gainit(id,10000,encoding);
"what is the alogorithm name that you used in your code?"
Genetic Algorithm. Check @Rick_SAS blog:
https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html
https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html
That(Miner) is just the SAS/EM I mentioned before.
It looks like your friend use SAS/EM 's ScoreCard module get the better result.
Actually behind SAS/EM 's ScoreCard module, there are some SAS/OR code to get the better/optimal result . That does not surprise to me .
Except to make the follow number bigger to get better result:
%let n_iter=10000 ;
call gainit(id,10000,encoding)
You also could try to using another ENCODING, since your data contains decimal point, that might get you better result too.
id=gasetup(1,group-1,123456789);
You might be interested in PROC OPTBINNING:
Hello Ksharp,
I have some questions please:
There are 3 possible cases about the type of predictor variable.
Type1- continuous Var (For example- welalth
Type2- Ratio VAR(contain decimal point. for example: Ratio income to debt)
Type3- categorical var(For example: education type)
Can you please tell how to modify your code for each of the cases?
As I understand your code is for Type1
Can you explain
id=gasetup(2,group-1,123456789); is it for type1???
id=gasetup(1,group-1,123456789); is it for type2???
can you exmplain
call gainit(id,1000,encoding);call gainit(id,10000,encoding)
is the number 10000 should be equal to number of iterations??? (%let n_iter=10000)
proc delete data=work._all_;Run;
%let var= ithalv_expr_snif ; *the continuous variable you need to split;
%let group=5 ; *the number of group you want to bin to;
%let n_iter=100; *the number of iteration for Genetic Algorithm to get best WOE IV;
data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad ');
if not missing(&var);
keep &var good_bad KESHEL_FUTURE ;
run;
proc sql noprint;
select sum(good_bad='bad'),sum(good_bad='good'),
floor(min(&var)),ceil(max(&var)) into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad n_good=&n_good min=&min max=&max;
proc sort data=temp;
by &var ;
run;
proc iml;
use temp(where=(&var is not missing));
read all var {&var good_bad};
close;
start function(x) global(bin,&var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(&var ,cutpoints,'right');
if countunique(b)=group then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
encoding=j(2,group-1,&min );
encoding[2,]=&max ;
id=gasetup(2,group-1,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(&var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
data all_group;
set temp(keep=&var rename=(&var=b) where=(b is missing)) group;
run;
proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;
data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;
title;
title "&var" ;
proc sql;
create table woe_&var as
select group label=' ',
min(&var) as min label='min',
max(&var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='Tamhil_N',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='Tamhil_Bads' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='Tamhil_Goods' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe desc; /**Worse group first***/
create index group on woe_&var;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_&var ;
quit;
title ' ';
"
Type1- continuous Var (For example- welalth
Type2- Ratio VAR(contain decimal point. for example: Ratio income to debt)
Type3- categorical var(For example: education type)
Can you please tell how to modify your code for each of the cases?
As I understand your code is for Type1
"
Answer: Yes. My code is for Type1. Specially , my code is for integer value:
id=gasetup(2,group-1,123456789);
here 2 means integer encoding.
If your variable have decimal point, try the following two ways:
(or just using my original code if no need to consider about decimal point )
1) using real number encoding by 1 (I would not recomment due to cost more time than 2)
id=gasetup(1,group-1,123456789);
2)multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:
ithalv_expr_snif=100*ithalv_expr_snif;
.............
id=gasetup(2,group-1,123456789);
For Type2 , you could try "2)" I mentioned above.
For Type3, you could try the following code:
proc import datafile='D:\ifre_backup\Ksharp\1--German Credit.xlsx' dbms=xlsx out=have replace;
run;
%let var=purpose ;
%let group=5 ;
%let n_iter=100;
data temp;
set have;
keep &var good_bad ;
run;
proc sql noprint;
select sum(good_bad='bad'),sum(good_bad='good') into : n_bad,: n_good
from temp;
create index &var on temp;
quit;
%put &n_bad &n_good ;
proc iml;
use temp ;
read all var {&var good_bad};
close;
start function(x) global(bin,&var ,good_bad,group,level,woe);
if countunique(x)=group then do;
do i=1 to group;
idx=loc(x=i);
levels=level[idx];
index=loc(element(&var,levels));
temp=good_bad[index];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
level=unique(&var);
n_level=countunique(&var);
encoding=j(2,n_level,1 );
encoding[2,]=group ;
id=gasetup(2,n_level,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
create group var{col_mem level};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
data all;
merge temp group(rename=(level=&var col_mem=group)) ;
by &var;
run;
title "变量: &var" ;
proc sql;
create table woe_&var as
select group,
sum(good_bad='bad') as n_bad label='bad的个数',sum(good_bad='good') as n_good label='good的个数',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad的占比',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe;
create index group on woe_&var;
create index col_mem on group;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_&var ;
quit;
title ' ';
proc sgplot data=woe_&var noautolegend;
vbar group/response=woe nostatlabel missing;
vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled
size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
FILLEDOUTLINEDMARKERS;
run;
"
Can you explain
id=gasetup(2,group-1,123456789); is it for type1???
id=gasetup(1,group-1,123456789); is it for type2???
"
Answer: It is the initial code for GA.
id=gasetup(2,group-1,123456789); is it for type1???
Yes. Specially for integer number.If you don't care about decimal point, you could use my original code.
If you take into account of decimal point(E.X. your variable is a Rate),you could multiply 100 or 1000 or 10000 as "2)" I mentioned above and using my original code .
"
can you exmplain
call gainit(id,1000,encoding);call gainit(id,10000,encoding)
is the number 10000 should be equal to number of iterations??? (%let n_iter=10000)
"
Answer: It is the initial code for GA.
1000 means generate 1000 individuals in the first iteration.The bigger it is ,the better initial value you would get.
No.They would not be equal. They are two different thing . 1000 in call gainit() is the initial number of population , "%let n_iter=1000" is the number of iteration of GA, they could be different. Both are bigger,you could get better result.
Hi @Ronein et al!
As @Ksharp suggested, you an use PROC OPTMODEL to create monotonic bins that meet your criteria and maximize IV (also known as symmetric Kullback-Leibler Divergence). I posted some code to do that five years ago:
Trying to use PROC OPTMODEL for monotonic supervised optimal binning o... - SAS Support Communities
It uses integer programming and it's not very efficient, and it can take a very long time to run, even on small problems. That is the nature of combinatorial optimization.
I have attached some code that uses a slightly less optimal two-stage approach. The first stage is an isotonic regression, which creates the closest fitting monotonic sequence from the original data; you can think of it as very granular binning, for which you have no control over the number or sizes of the bins. The second stage uses a dynamic programming approach to create the bins with the properties you require (number and sizes of bins, but you'd have to add the constraint on the minimum event number in each bin). The dynamic programming doesn't run into the same computational issues as the integer programming approach, but it can't impose monotonicity; that's why you precede it with the isotonic regression, which forces the subsequent bins to remain monotonic.
The five files run in the following order:
1. _0_import_german_credit.csv_.txt reads in the data
2. _1_agg_gc_credit_amount_.txt aggregates to a single observation for each unique predictor variable value
3. _3_agg_ds_for_optmodel_.txt adds some cumulative variables to the aggregated data
4. _5_isoreg_decrease_data_step_.txt runs the isotonic regression on the aggregated data
5. _100_gcdp011_post_isoreg_4+-gps_40-minsz_ivx_.txt runs the dynamic programming algorithm on the output of the isotonic regression. In this case, it seeks at least four bins with a minimum size of forty records, and maximum IV.
I hope you find this helpful.
Hi Ksharp
You mentioned that if my var type is continuous with decimal (for example ratio var: Ratio height to weight)
then need to multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:
new_variable=100*variable;
My question:
Is it also need to use round function to verify that my var has no decimal point?
Is the action of multiply by 100 will remove the decimals ??? or need also use round???
new_variable_2 = round(new_variable, 1);
as I see need also use round function.
%let var= y_AvgNizOsh_MisgOsh ; *the continuous variable you need to split;
%let group=5 ; *the number of group you want to bin to;
%let n_iter=100; *the number of iteration for Genetic Algorithm to get best WOE IV;
data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad ');
if not missing(&var);
VAR=round(100*&var.,1);
keep &var. VAR good_bad KESHEL_FUTURE ;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.