Binning coninous var into categorical var- score card credit risk mod...

Ronein · Posted 08-21-2024 04:15 AM

Hello

I run a similar code that Mr.Ksharp added in another question I have posted. (Thank you Mr.Ksharp)

I chose 10 groups and 100 iterations.

I round the variable to discrete numbers because the numbers can have a decimal value.

The analyzed variable is ithalv_expr_snif (Loan balance)

There is something that I don't understand-

I see groups with very low number of customers and also very low number of bads.

In the code there is condition:

if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;

So I dont understand why such groups were created????

Also, I expect that worst group (lowest WOE) be group 1 and best group be group 10.

In the output groups order is not 1 to 10 .

%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';

proc delete data=work._all_;Run;

%let var=ithalv_expr_snif ;  *the continuous variable you need to split;
%let group=10 ;    *the number of group you want to bin to;
%let n_iter=100;  *the number of iteration for Genetic Algorithm to get best WOE IV;

Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR   good_bad   KESHEL_FUTURE;
Run;
 

proc sql noprint;
select sum(good_bad='bad'),
        sum(good_bad='good'),
        floor(min(var)),ceil(max(var)) 
        into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad   n_good=&n_good   min=&min   max=&max;


proc sort data=temp;
by var ;
run;


proc iml;
use temp(where=(var is not missing));
read all var {var good_bad};
close;
start function(x) global(bin,var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var ,cutpoints,'right');
if countunique(b)=group  then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ; 
good_dist=n_good/&n_good ; 
if Bad_Dist>0.05 and Good_Dist>0.05  then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;  
bin=t(1:group);
woe=j(group,1,.);

encoding=j(2,group-1,&min );
encoding[2,]=&max ;    
	id=gasetup(2,group-1,123456789);  /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter =  &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;

/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:  */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/



data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;


proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;


data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;

title;
title "&var" ;
proc sql;
create table woe_var as
 select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe desc;  /**Worse group first***/
create index group on woe_var;
select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_var ;
quit;
title ' ';

Ksharp · Posted 08-21-2024 04:35 AM

"if Bad_Dist>0.05 and Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);"

That means Bad_Dist and Good_Dist are at least 5%.

Therefore, from your output , the result is not right .(Bad_Dist and Good_Dist 's value must be greater than 5%)

And in output of my code, you will see a MIN VALUE:

That stands for an ERROR term of GA. That means if it is near zero,the model is better,and you got GOOD bin/result.

And I guess your Min Value was very big.

The last thing is the number of groups you are binning is too big.

In general , 6 or 7 groups is good enough to get desired result.

Thus, choose 5 or 6 groups to get useful result.

Ronein · Posted 08-21-2024 05:04 AM

Minimum value of the analyzed variable is 0.

Maximum value of the analyzed variable is 698163

I was asked to add also condition that minimum bads in a group is 100.

I run the code on 5 groups and 100 iterations.

Still get same problem

Here is the full code I run:

maybe something in my code is faulty???

%include '!RSMEHOME/SASCode/SHARECode/Libname.sas';

proc delete data=work._all_;Run;

%let var=ithalv_expr_snif ;  *the continuous variable you need to split;
%let group=5 ;    *the number of group you want to bin to;
%let n_iter=100;  *the number of iteration for Genetic Algorithm to get best WOE IV;

Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad');
IF &var.=. then Var=0;
else Var=round(&Var.,1);
keep VAR   good_bad   KESHEL_FUTURE;
Run;
 
 

proc sql noprint;
select sum(good_bad='bad'),
        sum(good_bad='good'),
        floor(min(var)),ceil(max(var)) 
        into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad   n_good=&n_good   min=&min   max=&max;


proc sort data=temp;
by var ;
run;

 

proc iml;
use temp;
read all var {var good_bad};
close;
start function(x) global(bin,var,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(var,cutpoints,'right');
if countunique(b)=group  then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ; 
good_dist=n_good/&n_good ; 
if Bad_Dist>0.05 & Good_Dist>0.05 & n_bad>=100 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;  
bin=t(1:group);
woe=j(group,1,.);

encoding=j(2,group-1,&min );
encoding[2,]=&max ;    
	id=gasetup(2,group-1,123456789);  /***Only for continuous Var (For example- welalth) with integer values****/
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter =  &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(var,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;

/*If your variable have decimal point, try the following */
/*multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:  */
/*ithalv_expr_snif=100*ithalv_expr_snif;*/



data all_group;
set temp(keep=var rename=(var=b) where=(b is missing)) group;
run;


proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;


data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;

title;
title "&var" ;
proc sql;
create table woe_var as
 select group label=' ',
min(var) as min label='min',
max(var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='N_dist',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad_dist' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good_dist' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe desc;  /**Worse group first***/
create index group on woe_var;
select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_var ;
quit;
title ' ';

Here is the output

Ksharp · Posted 08-21-2024 05:23 AM

1)Post your Min Value I mentioned above.
I guess it is 999999.
2) You change my code ? Why ? Don't change my code unless you know SAS/IML syntax.
if Bad_Dist>0.05 & Good_Dist>0.05 & n_bad>=100 then
-->
if n_good>0 & n_bad>100 then

2) DO NOT CHANGE MY CODE.

3）Make the following number as big as you can. So you can get what you want.
%let n_iter=10000;
call gainit(id,10000,encoding);

Ksharp · Posted 08-21-2024 05:34 AM

"
Minimum value of the analyzed variable is 0.

Maximum value of the analyzed variable is 698163
"

One more thing:
Post a Histogram Graphic by
proc univariate .....
histogram ithalv_expr_snif ;
run;

And If I was right. you got a HEAVY LONG tail . That is not good for GA.
You can truncate/winsorize it to make my code more efficient and could get you better result. Like:
Data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
......................
if ithalv_expr_snif >20000 then ithalv_expr_snif =20000;

Ksharp · Posted 08-21-2024 05:38 AM

Firstly ,Try to use my original code.
DO NOT change my code !

After you truncated/winsorized variable "ithalv_expr_snif " as I showed above and use my ORIGINAL code , I believe you will get the desired result .

Ksharp · Posted 08-21-2024 08:19 PM

If your histogram graphic is long heavy tail , you should truncated or winsorized your data, like: (mentioned above)

Data temp; 
set r_r.TatPop_raw_Data_Of1_MB1_2; 
...................... 
if ithalv_expr_snif >20000 then ithalv_expr_snif =20000;
run;

Here, 20000 should be 95% percentile of ithalv_expr_snif variable (or less than 95% percentile).

After that, I thing you will get desired output!

Ksharp · Posted 08-21-2024 04:56 AM

BTW, If you really need 10 groups, then you need make number of iterations (like: 10000 or 100000) very big
and number of initial individuals(like:
call gainit(id,10000,encoding);
) very big
to make it happen.
But that would cost you a whole night to run code.

Binning coninous var into categorical var- score card credit risk model

Re: Binning coninous var into categorical var- score card credit risk model

Re: Binning coninous var into categorical var- score card credit risk model

Re: Binning coninous var into categorical var- score card credit risk model

Re: Binning coninous var into categorical var- score card credit risk model

Re: Binning coninous var into categorical var- score card credit risk model

Re: Binning coninous var into categorical var- score card credit risk model

Re: Binning coninous var into categorical var- score card credit risk model

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away