BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello

I run a similar code that Mr.Ksharp added in another question I have posted. (Thank you Mr.Ksharp)

I chose 10 groups and 100 iterations.

I round the variable to discrete numbers because the numbers can have  a decimal value.

The analyzed variable is ithalv_expr_snif (Loan balance)

There is something that I don't understand-

I see groups with very low number of customers and also very low number of bads.

In the code there is condition:

if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;

So I dont understand why such groups were created????

Also, I expect that worst group (lowest WOE) be group 1 and best group be group 10.

In the output groups order is not 1 to 10 .

Ronein_0-1724228247350.png

 

 

 

 

 

%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';

proc delete data=work._all_;Run;

%let var=ithalv_expr_snif ;  *the continuous variable you need to split;
%let group=10 ;    *the number of group you want to bin to;
%let n_iter=100;  *the number of iteration for Genetic Algorithm to get best WOE IV;

Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR   good_bad   KESHEL_FUTURE;
Run;
 

proc sql noprint;
select sum(good_bad='bad'),
        sum(good_bad='good'),
        floor(min(var)),ceil(max(var)) 
        into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad   n_good=&n_good   min=&min   max=&max;


proc sort data=temp;
by var ;
run;


proc iml;
use temp(where=(var is not missing));
read all var {var good_bad};
close;
start function(x) global(bin,var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var ,cutpoints,'right');
if countunique(b)=group  then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ; 
good_dist=n_good/&n_good ; 
if Bad_Dist>0.05 and Good_Dist>0.05  then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;  
bin=t(1:group);
woe=j(group,1,.);

encoding=j(2,group-1,&min );
encoding[2,]=&max ;    
	id=gasetup(2,group-1,123456789);  /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter =  &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;

/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:  */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/



data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;


proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;


data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;

title;
title "&var" ;
proc sql;
create table woe_var as
 select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe desc;  /**Worse group first***/
create index group on woe_var;
select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_var ;
quit;
title ' ';

 

 

7 REPLIES 7
Ksharp
Super User

"if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);"

That  means Bad_Dist and Good_Dist are at least 5%.

Therefore, from your output , the result is not right .(Bad_Dist and Good_Dist 's value must be greater than 5%)

 

And in output of my code, you will see a MIN VALUE:

Ksharp_0-1724229011255.png

 

That stands for an ERROR term of GA. That means if it is near zero,the model is better,and you got GOOD bin/result.

And I guess your Min Value was very big.

 

The last thing is the number of groups you are binning is too big. 

In general , 6 or 7 groups is good enough to get desired result.

Thus, choose  5 or 6 groups to get useful result.

 

Ronein
Meteorite | Level 14

Minimum value of the analyzed variable is 0.

Maximum value of the analyzed variable is    698163

I was asked to add also condition that minimum bads in a group is 100.

I run the code on 5 groups and 100 iterations.

Still get same problem

Here is the full code I run: 

maybe something in my code is faulty???

 

%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';

proc delete data=work._all_;Run;

%let var=ithalv_expr_snif ;  *the continuous variable you need to split;
%let group=5 ;    *the number of group you want to bin to;
%let n_iter=100;  *the number of iteration for Genetic Algorithm to get best WOE IV;

Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR   good_bad   KESHEL_FUTURE;
Run;
 
 

proc sql noprint;
select sum(good_bad='bad'),
        sum(good_bad='good'),
        floor(min(var)),ceil(max(var)) 
        into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad   n_good=&n_good   min=&min   max=&max;


proc sort data=temp;
by var ;
run;

 

proc iml;
use temp;
read all var {var good_bad};
close;
start function(x) global(bin,var,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var,cutpoints,'right');
if countunique(b)=group  then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ; 
good_dist=n_good/&n_good ; 
if Bad_Dist>0.05 & Good_Dist>0.05 & n_bad>=100 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;  
bin=t(1:group);
woe=j(group,1,.);

encoding=j(2,group-1,&min );
encoding[2,]=&max ;    
	id=gasetup(2,group-1,123456789);  /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter =  &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;

/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:  */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/



data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;


proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;


data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;

title;
title "&var" ;
proc sql;
create table woe_var as
 select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe desc;  /**Worse group first***/
create index group on woe_var;
select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_var ;
quit;
title ' ';

 

Here is the output  

Ronein_0-1724231088200.png

 

 

Ksharp
Super User
1)Post your Min Value I mentioned above.
I guess it is 999999.
2) You change my code ? Why ? Don't change my code unless you know SAS/IML syntax.
if Bad_Dist>0.05 & Good_Dist>0.05 & n_bad>=100 then
-->
if n_good>0 & n_bad>100 then

2) DO NOT CHANGE MY CODE.

3)Make the following number as big as you can. So you can get what you want.
%let n_iter=10000;
call gainit(id,10000,encoding);
Ksharp
Super User
"
Minimum value of the analyzed variable is 0.

Maximum value of the analyzed variable is 698163
"

One more thing:
Post a Histogram Graphic by
proc univariate .....
histogram ithalv_expr_snif ;
run;

And If I was right. you got a HEAVY LONG tail . That is not good for GA.
You can truncate/winsorize it to make my code more efficient and could get you better result. Like:
Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
......................
if ithalv_expr_snif >20000 then ithalv_expr_snif =20000;
Ksharp
Super User
Firstly ,Try to use my original code.
DO NOT change my code !

After you truncated/winsorized variable "ithalv_expr_snif " as I showed above and use my ORIGINAL code , I believe you will get the desired result .
Ksharp
Super User

If your histogram graphic is long heavy tail , you should truncated or winsorized your data, like: (mentioned above)

Data temp; 
set r_r.TatPop_raw_Data_Of1_MB1_2; 
...................... 
if ithalv_expr_snif >20000 then ithalv_expr_snif =20000;
run;

 

 Here, 20000 should be 95% percentile of ithalv_expr_snif variable (or less than 95% percentile).

After that, I thing you will get desired output!

Ksharp
Super User

BTW, If you really need 10 groups, then you need make number of iterations (like: 10000 or 100000) very big
and number of initial individuals(like:
call gainit(id,10000,encoding);
) very big 
to make it happen.
But that would cost you a whole night to run code.

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1347 views
  • 1 like
  • 2 in conversation