Hello
I run a similar code that Mr.Ksharp added in another question I have posted. (Thank you Mr.Ksharp)
I chose 10 groups and 100 iterations.
I round the variable to discrete numbers because the numbers can have a decimal value.
The analyzed variable is ithalv_expr_snif (Loan balance)
There is something that I don't understand-
I see groups with very low number of customers and also very low number of bads.
In the code there is condition:
if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
So I dont understand why such groups were created????
Also, I expect that worst group (lowest WOE) be group 1 and best group be group 10.
In the output groups order is not 1 to 10 .
%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';
proc delete data=work._all_;Run;
%let var=ithalv_expr_snif ; *the continuous variable you need to split;
%let group=10 ; *the number of group you want to bin to;
%let n_iter=100; *the number of iteration for Genetic Algorithm to get best WOE IV;
Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR good_bad KESHEL_FUTURE;
Run;
proc sql noprint;
select sum(good_bad='bad'),
sum(good_bad='good'),
floor(min(var)),ceil(max(var))
into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad n_good=&n_good min=&min max=&max;
proc sort data=temp;
by var ;
run;
proc iml;
use temp(where=(var is not missing));
read all var {var good_bad};
close;
start function(x) global(bin,var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var ,cutpoints,'right');
if countunique(b)=group then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
encoding=j(2,group-1,&min );
encoding[2,]=&max ;
id=gasetup(2,group-1,123456789); /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like: */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/
data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;
proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;
data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;
title;
title "&var" ;
proc sql;
create table woe_var as
select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe desc; /**Worse group first***/
create index group on woe_var;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_var ;
quit;
title ' ';
"if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);"
That means Bad_Dist and Good_Dist are at least 5%.
Therefore, from your output , the result is not right .(Bad_Dist and Good_Dist 's value must be greater than 5%)
And in output of my code, you will see a MIN VALUE:
That stands for an ERROR term of GA. That means if it is near zero,the model is better,and you got GOOD bin/result.
And I guess your Min Value was very big.
The last thing is the number of groups you are binning is too big.
In general , 6 or 7 groups is good enough to get desired result.
Thus, choose 5 or 6 groups to get useful result.
Minimum value of the analyzed variable is 0.
Maximum value of the analyzed variable is 698163
I was asked to add also condition that minimum bads in a group is 100.
I run the code on 5 groups and 100 iterations.
Still get same problem
Here is the full code I run:
maybe something in my code is faulty???
%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';
proc delete data=work._all_;Run;
%let var=ithalv_expr_snif ; *the continuous variable you need to split;
%let group=5 ; *the number of group you want to bin to;
%let n_iter=100; *the number of iteration for Genetic Algorithm to get best WOE IV;
Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR good_bad KESHEL_FUTURE;
Run;
proc sql noprint;
select sum(good_bad='bad'),
sum(good_bad='good'),
floor(min(var)),ceil(max(var))
into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad n_good=&n_good min=&min max=&max;
proc sort data=temp;
by var ;
run;
proc iml;
use temp;
read all var {var good_bad};
close;
start function(x) global(bin,var,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var,cutpoints,'right');
if countunique(b)=group then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 & Good_Dist>0.05 & n_bad>=100 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
encoding=j(2,group-1,&min );
encoding[2,]=&max ;
id=gasetup(2,group-1,123456789); /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like: */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/
data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;
proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;
data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;
title;
title "&var" ;
proc sql;
create table woe_var as
select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe desc; /**Worse group first***/
create index group on woe_var;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_var ;
quit;
title ' ';
Here is the output
If your histogram graphic is long heavy tail , you should truncated or winsorized your data, like: (mentioned above)
Data temp; set r_r.TatPop_raw_Data_Of1_MB1_2; ...................... if ithalv_expr_snif >20000 then ithalv_expr_snif =20000; run;
Here, 20000 should be 95% percentile of ithalv_expr_snif variable (or less than 95% percentile).
After that, I thing you will get desired output!
BTW, If you really need 10 groups, then you need make number of iterations (like: 10000 or 100000) very big
and number of initial individuals(like:
call gainit(id,10000,encoding);
) very big
to make it happen.
But that would cost you a whole night to run code.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.