Solved: Binning (categorize continuous var into categories) - Page 2

Ronein · Posted 07-24-2024 01:12 AM

run your code on my data on specific variable (Nr Years in bank )

In your code the best group should be 1 and worse should be 4

I don't see that Per increasing consistently by moving from group 1 to 2 and 2 to 3 and 3 to 4.

Is it not part of the requirement??? To have consistently increasing in per (percentage of bads)???

Ksharp · Posted 07-24-2024 02:32 AM

"I don't see that Per increasing consistently by moving from group 1 to 2 and 2 to 3 and 3 to 4."
You should calcualte it on your own, my code did not include it.
bad_pct=n_bad/n;
====>
group bad_pct
1 0.062
2 0.036
3 0.02
4 0.011

Ronein · Posted 07-24-2024 03:26 AM

The code you send me is running 100%

It provide good results but my friend provide better results via Miner (IGN)

It would be best if can provide such results in SAS code.

By the way- what is the alogorithm name that you used in your code?

Ksharp · Posted 07-24-2024 03:35 AM

"It provide good results but my friend provide better results via Miner (IGN)"
You could try SAS/EM 's ScoreCard module,but that need money to pay.
And I never heard of Miner ,what kind of software it is?

"It would be best if can provide such results in SAS code."
1)You could ask @RobPratt to write SAS/OR code for it ,and also need to pay money to get SAS/OR.
2)Make %let n_iter=10000 ; as big as you can , but that maybe cost you an entire day to get better result.

And make the following 1000 as big as you can. E.X. 10000

call gainit(id,1000,encoding);
--->
call gainit(id,10000,encoding);

"what is the alogorithm name that you used in your code?"
Genetic Algorithm. Check @Rick_SAS blog:
https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html
https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html

Ksharp · Posted 07-24-2024 02:20 AM

"What is the meaning of values "M" and "I"?"
It is means negative infinity and positive infinity .
A.K.A you can map your data into bins by these cutpoints:
low-137='1'
137-144='2'
144-174='3'
174-high='4'

"I also added to final table percent of bads within each group"
And also glad to see you make my code running to get the perfect result . Congratulations!

Ksharp · Posted 07-24-2024 02:27 AM

And about your time costing problem,
You could start 10 sas session ,each session for ONE variable,that could save you lots of time.
P.S. if you could set "%let n_iter=100" as big as you can to get the better IV.

Ksharp · Posted 07-24-2024 02:38 AM

"M" and "I"
represent two missing value .M and .I ,
.M stands for negative infinity,
.I stands for positive infinity.

Ronein · Posted 07-24-2024 06:45 AM

I don't know exactly about miner .
I know that it belong to sas.
Sas enterprise miner

Ronein · Posted 07-24-2024 06:49 AM

My friend told me that SAS/EM 's ScoreCard module is within sas miner

Ksharp · Posted 07-24-2024 08:32 PM

That(Miner) is just the SAS/EM I mentioned before.
It looks like your friend use SAS/EM 's ScoreCard module get the better result.
Actually behind SAS/EM 's ScoreCard module, there are some SAS/OR code to get the better/optimal result . That does not surprise to me .

Except to make the follow number bigger to get better result:

%let n_iter=10000 ;

call gainit(id,10000,encoding)

You also could try to using another ENCODING, since your data contains decimal point, that might get you better result too.

id=gasetup(1,group-1,123456789);

RobPratt · Posted 07-24-2024 11:47 AM

You might be interested in PROC OPTBINNING:

SAS Help Center: OPTBINNING Procedure

Ronein

Hello Ksharp,

I have some questions please:

There are 3 possible cases about the type of predictor variable.

Type1- continuous Var (For example- welalth

Type2- Ratio VAR(contain decimal point. for example: Ratio income to debt)

Type3- categorical var(For example: education type)

Can you please tell how to modify your code for each of the cases?

As I understand your code is for Type1

Can you explain

id=gasetup(2,group-1,123456789); is it for type1???

id=gasetup(1,group-1,123456789); is it for type2???

can you exmplain

call gainit(id,1000,encoding);call gainit(id,10000,encoding)

is the number 10000 should be equal to number of iterations??? (%let n_iter=10000)

proc delete data=work._all_;Run;


%let var= ithalv_expr_snif    ;  *the continuous variable you need to split;
%let group=5 ;    *the number of group you want to bin to;
%let n_iter=100;  *the number of iteration for Genetic Algorithm to get best WOE IV;


data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad ');
if not missing(&var);
keep &var good_bad KESHEL_FUTURE ;
run;


proc sql noprint;
select sum(good_bad='bad'),sum(good_bad='good'),
        floor(min(&var)),ceil(max(&var)) into : n_bad,: n_good,: min,: max
from temp;
quit;
%put n_bad=&n_bad   n_good=&n_good   min=&min   max=&max;


proc sort data=temp;
by &var ;
run;


proc iml;
use temp(where=(&var is not missing));
read all var {&var good_bad};
close;
start function(x) global(bin,&var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(&var ,cutpoints,'right');
if countunique(b)=group  then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ; 
good_dist=n_good/&n_good ; 
if Bad_Dist>0.05 & Good_Dist>0.05  then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;  
bin=t(1:group);
woe=j(group,1,.);

encoding=j(2,group-1,&min );
encoding[2,]=&max ;    
id=gasetup(2,group-1,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter =  &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(&var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;


data all_group;
set temp(keep=&var rename=(&var=b) where=(b is missing)) group;
run;


proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;


data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;

title;
title "&var" ;
proc sql;
create table woe_&var as
 select group label=' ',
min(&var) as min label='min',
max(&var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='Tamhil_N',
sum(good_bad='bad') as n_bad label='n_bad',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='Tamhil_Bads' format=percent8.2,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='Tamhil_Goods' format=percent8.2,
calculated n_bad/calculated N as PCT_BAD label='PCT_Bads' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe desc;  /**Worse group first***/
create index group on woe_&var;
select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_&var ;
quit;
title ' ';

Ksharp

"

Type1- continuous Var (For example- welalth

Type2- Ratio VAR(contain decimal point. for example: Ratio income to debt)

Type3- categorical var(For example: education type)

Can you please tell how to modify your code for each of the cases?

As I understand your code is for Type1

"

Answer: Yes. My code is for Type1. Specially , my code is for integer value:

id=gasetup(2,group-1,123456789);

here 2 means integer encoding.

If your variable have decimal point, try the following two ways:

(or just using my original code if no need to consider about decimal point )

1) using real number encoding by 1 (I would not recomment due to cost more time than 2)

id=gasetup(1,group-1,123456789);

2)multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:

ithalv_expr_snif=100*ithalv_expr_snif;    
.............
id=gasetup(2,group-1,123456789);

For Type2 , you could try "2)" I mentioned above.

For Type3, you could try the following code:

proc import datafile='D:\ifre_backup\Ksharp\1--German Credit.xlsx' dbms=xlsx out=have replace;
run;


%let var=purpose   ;
%let group=5 ;
%let n_iter=100;


data temp;
 set have;
 keep &var good_bad ;
run;


proc sql noprint;
 select sum(good_bad='bad'),sum(good_bad='good') into : n_bad,: n_good
  from temp;
 create index &var on temp;
quit;
%put &n_bad &n_good ;

proc iml;
use temp ;
read all var {&var good_bad};
close;

start function(x) global(bin,&var ,good_bad,group,level,woe);
if countunique(x)=group then do;

do i=1 to group;
 idx=loc(x=i);
 levels=level[idx];
 index=loc(element(&var,levels));
 temp=good_bad[index];
 n_bad=sum(temp='bad');
 n_good=sum(temp='good');
 bad_dist=n_bad/&n_bad ; 
 good_dist=n_good/&n_good ; 
 if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
  else woe[i]=.;
end;

if countmiss(woe)=0 then do;
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;

end;
else sse=999999;

return (sse);
finish;

group=&group ; 
bin=t(1:group);
woe=j(group,1,.);


level=unique(&var);
n_level=countunique(&var);

encoding=j(2,n_level,1 );
encoding[2,]=group ;    

id=gasetup(2,n_level,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);


niter = &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

col_mem=t(mem);
create group var{col_mem level};
append;
close;

print value[l = "Min Value:"] ;
call gaend(id);
quit;

data all;
 merge temp group(rename=(level=&var col_mem=group)) ;
 by &var;
run;



title "变量: &var" ;
proc sql;
create table woe_&var as
 select group,
sum(good_bad='bad') as n_bad label='bad的个数',sum(good_bad='good') as n_good label='good的个数',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad的占比',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe;

create index group on woe_&var;
create index col_mem on group;


select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_&var ;

quit;
title ' ';




proc sgplot data=woe_&var noautolegend;
 vbar group/response=woe nostatlabel missing;
 vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled 
  size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
  FILLEDOUTLINEDMARKERS;
run;

"

Can you explain

id=gasetup(2,group-1,123456789); is it for type1???

id=gasetup(1,group-1,123456789); is it for type2???

"

Answer: It is the initial code for GA.

id=gasetup(2,group-1,123456789); is it for type1???

Yes. Specially for integer number.If you don't care about decimal point, you could use my original code.

If you take into account of decimal point(E.X. your variable is a Rate),you could multiply 100 or 1000 or 10000 as "2)" I mentioned above and using my original code .

"

can you exmplain

call gainit(id,1000,encoding);call gainit(id,10000,encoding)

is the number 10000 should be equal to number of iterations??? (%let n_iter=10000)

"

Answer: It is the initial code for GA.

1000 means generate 1000 individuals in the first iteration.The bigger it is ,the better initial value you would get.

No.They would not be equal. They are two different thing . 1000 in call gainit() is the initial number of population , "%let n_iter=1000" is the number of iteration of GA, they could be different. Both are bigger,you could get better result.

Top_Katz

Hi @Ronein et al!

As @Ksharp suggested, you an use PROC OPTMODEL to create monotonic bins that meet your criteria and maximize IV (also known as symmetric Kullback-Leibler Divergence). I posted some code to do that five years ago:
Trying to use PROC OPTMODEL for monotonic supervised optimal binning o... - SAS Support Communities

It uses integer programming and it's not very efficient, and it can take a very long time to run, even on small problems. That is the nature of combinatorial optimization.

I have attached some code that uses a slightly less optimal two-stage approach. The first stage is an isotonic regression, which creates the closest fitting monotonic sequence from the original data; you can think of it as very granular binning, for which you have no control over the number or sizes of the bins. The second stage uses a dynamic programming approach to create the bins with the properties you require (number and sizes of bins, but you'd have to add the constraint on the minimum event number in each bin). The dynamic programming doesn't run into the same computational issues as the integer programming approach, but it can't impose monotonicity; that's why you precede it with the isotonic regression, which forces the subsequent bins to remain monotonic.

The five files run in the following order:

1. _0_import_german_credit.csv_.txt reads in the data

2. _1_agg_gc_credit_amount_.txt aggregates to a single observation for each unique predictor variable value

3. _3_agg_ds_for_optmodel_.txt adds some cumulative variables to the aggregated data

4. _5_isoreg_decrease_data_step_.txt runs the isotonic regression on the aggregated data

5. _100_gcdp011_post_isoreg_4+-gps_40-minsz_ivx_.txt runs the dynamic programming algorithm on the output of the isotonic regression. In this case, it seeks at least four bins with a minimum size of forty records, and maximum IV.

I hope you find this helpful.

Ronein

Hi Ksharp

You mentioned that if my var type is continuous with decimal (for example ratio var: Ratio height to weight)

then need to multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:
new_variable=100*variable;

My question:

Is it also need to use round function to verify that my var has no decimal point?

Is the action of multiply by 100 will remove the decimals ??? or need also use round???

new_variable_2 = round(new_variable, 1);

as I see need also use round function.


%let var= y_AvgNizOsh_MisgOsh ;  *the continuous variable you need to split;
%let group=5 ;    *the number of group you want to bin to;
%let n_iter=100;  *the number of iteration for Genetic Algorithm to get best WOE IV;


data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
good_bad=ifc(KESHEL_FUTURE=0,'good','bad ');
if not missing(&var);
VAR=round(100*&var.,1);
keep &var. VAR good_bad KESHEL_FUTURE ;
run;

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Re: Binning (categorize continuous var into categories)

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!