Hello
I run a similar code that Mr.Ksharp added in another question I have posted. (Thank you Mr.Ksharp)
I chose 10 groups and 100 iterations.
I round the variable to discrete numbers because the numbers can have a decimal value.
The analyzed variable is ithalv_expr_snif (Loan balance)
There is something that I don't understand-
I see groups with very low number of customers and also very low number of bads.
In the code there is condition:
if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
So I dont understand why such groups were created????
Also, I expect that worst group (lowest WOE) be group 1 and best group be group 10.
In the output groups order is not 1 to 10 .
%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';
proc delete data=work._all_;Run;
%let var=ithalv_expr_snif ; *the continuous variable you need to split;
%let group=10 ; *the number of group you want to bin to;
%let n_iter=100; *the number of iteration for Genetic Algorithm to get best WOE IV;
Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR good_bad KESHEL_FUTURE;
Run;
proc sql noprint;
select sum(good_bad='bad'),
sum(good_bad='good'),
floor(min(var)),ceil(max(var))
into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad n_good=&n_good min=&min max=&max;
proc sort data=temp;
by var ;
run;
proc iml;
use temp(where=(var is not missing));
read all var {var good_bad};
close;
start function(x) global(bin,var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var ,cutpoints,'right');
if countunique(b)=group then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
encoding=j(2,group-1,&min );
encoding[2,]=&max ;
id=gasetup(2,group-1,123456789); /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like: */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/
data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;
proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;
data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;
title;
title "&var" ;
proc sql;
create table woe_var as
select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe desc; /**Worse group first***/
create index group on woe_var;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_var ;
quit;
title ' ';
"if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);"
That means Bad_Dist and Good_Dist are at least 5%.
Therefore, from your output , the result is not right .(Bad_Dist and Good_Dist 's value must be greater than 5%)
And in output of my code, you will see a MIN VALUE:
That stands for an ERROR term of GA. That means if it is near zero,the model is better,and you got GOOD bin/result.
And I guess your Min Value was very big.
The last thing is the number of groups you are binning is too big.
In general , 6 or 7 groups is good enough to get desired result.
Thus, choose 5 or 6 groups to get useful result.
Minimum value of the analyzed variable is 0.
Maximum value of the analyzed variable is 698163
I was asked to add also condition that minimum bads in a group is 100.
I run the code on 5 groups and 100 iterations.
Still get same problem
Here is the full code I run:
maybe something in my code is faulty???
%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';
proc delete data=work._all_;Run;
%let var=ithalv_expr_snif ; *the continuous variable you need to split;
%let group=5 ; *the number of group you want to bin to;
%let n_iter=100; *the number of iteration for Genetic Algorithm to get best WOE IV;
Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR good_bad KESHEL_FUTURE;
Run;
proc sql noprint;
select sum(good_bad='bad'),
sum(good_bad='good'),
floor(min(var)),ceil(max(var))
into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad n_good=&n_good min=&min max=&max;
proc sort data=temp;
by var ;
run;
proc iml;
use temp;
read all var {var good_bad};
close;
start function(x) global(bin,var,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var,cutpoints,'right');
if countunique(b)=group then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 & Good_Dist>0.05 & n_bad>=100 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
encoding=j(2,group-1,&min );
encoding[2,]=&max ;
id=gasetup(2,group-1,123456789); /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like: */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/
data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;
proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;
data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;
title;
title "&var" ;
proc sql;
create table woe_var as
select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe desc; /**Worse group first***/
create index group on woe_var;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_var ;
quit;
title ' ';
Here is the output
If your histogram graphic is long heavy tail , you should truncated or winsorized your data, like: (mentioned above)
Data temp; set r_r.TatPop_raw_Data_Of1_MB1_2; ...................... if ithalv_expr_snif >20000 then ithalv_expr_snif =20000; run;
Here, 20000 should be 95% percentile of ithalv_expr_snif variable (or less than 95% percentile).
After that, I thing you will get desired output!
BTW, If you really need 10 groups, then you need make number of iterations (like: 10000 or 100000) very big
and number of initial individuals(like:
call gainit(id,10000,encoding);
) very big
to make it happen.
But that would cost you a whole night to run code.
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.