☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Meteorite | Level 14

## Re: Binning (categorize continuous var into categories)

run your code on my data on specific variable (Nr Years in bank )

In your code the best group should be 1 and worse should be 4

I don't see that Per increasing consistently by moving from group 1 to 2 and 2 to 3 and 3 to 4.

Is it not part of the requirement??? To have consistently increasing in per (percentage of bads)???

Super User

## Re: Binning (categorize continuous var into categories)

"I don't see that Per increasing consistently by moving from group 1 to 2 and 2 to 3 and 3 to 4."
You should calcualte it on your own, my code did not include it.
====>
1 0.062
2 0.036
3 0.02
4 0.011
Meteorite | Level 14

## Re: Binning (categorize continuous var into categories)

The code you send me is running 100%

It provide good results but my friend provide better results via Miner (IGN)

It would be best if  can provide such results in SAS code.

By the way- what is the alogorithm name that you used in your code?

Super User

## Re: Binning (categorize continuous var into categories)

"It provide good results but my friend provide better results via Miner (IGN)"
You could try SAS/EM 's ScoreCard module,but that need money to pay.
And I never heard of Miner ,what kind of software it is?

"It would be best if can provide such results in SAS code."
1)You could ask @RobPratt to write SAS/OR code for it ,and also need to pay money to get SAS/OR.
2)Make %let n_iter=10000 ; as big as you can , but that maybe cost you an entire day to get better result.

And make the following 1000 as big as you can. E.X. 10000

``````call gainit(id,1000,encoding);
--->
call gainit(id,10000,encoding);``````

"what is the alogorithm name that you used in your code?"
Genetic Algorithm. Check @Rick_SAS blog:
https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html
https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html

Super User

## Re: Binning (categorize continuous var into categories)

"What is the meaning of values "M" and "I"?"
It is means negative infinity and positive infinity .
A.K.A you can map your data into bins by these cutpoints:
low-137='1'
137-144='2'
144-174='3'
174-high='4'

"I also added to final table percent of bads within each group"
And also glad to see you make my code running to get the perfect result . Congratulations!
Super User

## Re: Binning (categorize continuous var into categories)

You could start 10 sas session ,each session for ONE variable,that could save you lots of time.
P.S. if you could set "%let n_iter=100" as big as you can to get the better IV.
Super User

## Re: Binning (categorize continuous var into categories)

"M" and "I"
represent two missing value .M and .I ,
.M stands for negative infinity,
.I stands for positive infinity.
Meteorite | Level 14

## Re: Binning (categorize continuous var into categories)

I don't know exactly about miner .
I know that it belong to sas.
Sas enterprise miner
Meteorite | Level 14

## Re: Binning (categorize continuous var into categories)

My friend told me that SAS/EM 's ScoreCard module is within sas miner
Super User

## Re: Binning (categorize continuous var into categories)

That(Miner) is just the SAS/EM I mentioned before.
It looks like your friend use SAS/EM 's ScoreCard module get the better result.
Actually behind SAS/EM 's ScoreCard module, there are some SAS/OR code to get the better/optimal result . That does not surprise to me .

Except to make the follow number bigger to get better result:

%let n_iter=10000 ;

call gainit(id,10000,encoding)

You also could try to using another ENCODING, since your data contains decimal point, that might get you better result too.

id=gasetup(1,group-1,123456789);

SAS Super FREQ

## Re: Binning (categorize continuous var into categories)

You might be interested in PROC OPTBINNING:

SAS Help Center: OPTBINNING Procedure

Meteorite | Level 14

## Re: Binning (categorize continuous var into categories)

Hello Ksharp,

I have some questions please:

There are 3 possible cases about the type of predictor variable.

Type1- continuous Var (For example- welalth

Type2- Ratio VAR(contain decimal point. for example: Ratio income to debt)

Type3- categorical var(For example: education type)

Can you please tell how to modify your code for each of the cases?

As I understand your code is for Type1

Can you explain

id=gasetup(2,group-1,123456789);    is it for type1???

id=gasetup(1,group-1,123456789);    is it for type2???

can you exmplain

call gainit(id,1000,encoding);call gainit(id,10000,encoding)

is the number 10000 should be equal to number of iterations???  (%let n_iter=10000)

``````proc delete data=work._all_;Run;

%let var= ithalv_expr_snif    ;  *the continuous variable you need to split;
%let group=5 ;    *the number of group you want to bin to;
%let n_iter=100;  *the number of iteration for Genetic Algorithm to get best WOE IV;

data temp;
set r_r.TatPop_raw_Data_Of1_MB1_2;
if not missing(&var);
keep &var good_bad KESHEL_FUTURE ;
run;

proc sql noprint;
floor(min(&var)),ceil(max(&var)) into : n_bad,: n_good,: min,: max
from temp;
quit;

proc sort data=temp;
by &var ;
run;

proc iml;
use temp(where=(&var is not missing));
close;
start function(x) global(bin,&var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(&var ,cutpoints,'right');
if countunique(b)=group  then do;
do i=1 to group;
idx=loc(b=i);
n_good=sum(temp='good');
good_dist=n_good/&n_good ;
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);

encoding=j(2,group-1,&min );
encoding[2,]=&max ;
id=gasetup(2,group-1,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter =  &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(&var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;

data all_group;
set temp(keep=&var rename=(&var=b) where=(b is missing)) group;
run;

proc sql noprint;
select count(distinct b) as nr_groups into :nr_groups
from all_group
;
quit;
%put &nr_groups;

data all;
merge all_group temp;
/*b=(&nr_groups.-b)+1; */
/***I want that worse group be group 1 and so on****/
rename b=group;
Run;

title;
title "&var" ;
proc sql;
create table woe_&var as
select group label=' ',
min(&var) as min label='min',
max(&var) as max label='max',
count(*) as n label='N',
calculated n/(select count(*) from all) as per format=percent7.2 label='Tamhil_N',
sum(good_bad='good') as n_good label='n_good',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='Tamhil_Goods' format=percent8.2,
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe desc;  /**Worse group first***/
create index group on woe_&var;
select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
from woe_&var ;
quit;
title ' ';

``````
Super User

## Re: Binning (categorize continuous var into categories)

"

Type1- continuous Var (For example- welalth

Type2- Ratio VAR(contain decimal point. for example: Ratio income to debt)

Type3- categorical var(For example: education type)

Can you please tell how to modify your code for each of the cases?

As I understand your code is for Type1

"

Answer: Yes. My code is for Type1. Specially , my code is for integer value:

``id=gasetup(2,group-1,123456789);``

here 2 means integer encoding.

If your variable have decimal point, try the following two ways:

(or just using my original code if  no need to consider about decimal point )

1) using real number encoding by 1 (I would not recomment due to cost more time than 2)

``id=gasetup(1,group-1,123456789);``

2)multiply your variable by 100 or 1000 and using my original code( integer encoding by 2) like:

``````ithalv_expr_snif=100*ithalv_expr_snif;
.............
id=gasetup(2,group-1,123456789);``````

For Type2 , you could try "2)" I mentioned above.

For Type3, you could try the following code:

``````proc import datafile='D:\ifre_backup\Ksharp\1--German Credit.xlsx' dbms=xlsx out=have replace;
run;

%let var=purpose   ;
%let group=5 ;
%let n_iter=100;

data temp;
set have;
keep &var good_bad ;
run;

proc sql noprint;
from temp;
create index &var on temp;
quit;
%put &n_bad &n_good ;

proc iml;
use temp ;
close;

start function(x) global(bin,&var ,good_bad,group,level,woe);
if countunique(x)=group then do;

do i=1 to group;
idx=loc(x=i);
levels=level[idx];
index=loc(element(&var,levels));
n_good=sum(temp='good');
good_dist=n_good/&n_good ;
else woe[i]=.;
end;

if countmiss(woe)=0 then do;
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;

end;
else sse=999999;

return (sse);
finish;

group=&group ;
bin=t(1:group);
woe=j(group,1,.);

level=unique(&var);
n_level=countunique(&var);

encoding=j(2,n_level,1 );
encoding[2,]=group ;

id=gasetup(2,n_level,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

col_mem=t(mem);
create group var{col_mem level};
append;
close;

print value[l = "Min Value:"] ;
call gaend(id);
quit;

data all;
merge temp group(rename=(level=&var col_mem=group)) ;
by &var;
run;

title "变量: &var" ;
proc sql;
create table woe_&var as
select group,
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe;

create index group on woe_&var;
create index col_mem on group;

select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
from woe_&var ;

quit;
title ' ';

proc sgplot data=woe_&var noautolegend;
vbar group/response=woe nostatlabel missing;
vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled
size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
FILLEDOUTLINEDMARKERS;
run;

``````

"

Can you explain

id=gasetup(2,group-1,123456789);    is it for type1???

id=gasetup(1,group-1,123456789);    is it for type2???

"

Answer: It is the initial code for GA.

id=gasetup(2,group-1,123456789);    is it for type1???

Yes. Specially for integer number.If you don't care about decimal point, you could use my original code.

If you take into account of decimal point(E.X. your variable is a Rate),you could multiply 100 or 1000 or 10000 as "2)" I mentioned above and using my original code .

"

can you exmplain

call gainit(id,1000,encoding);call gainit(id,10000,encoding)

is the number 10000 should be equal to number of iterations???  (%let n_iter=10000)

"

Answer: It is the initial code for GA.

1000 means generate 1000 individuals in the first iteration.The bigger it is ,the better initial value you would get.

No.They would not be equal. They are two different thing . 1000 in call gainit() is the initial number of population , "%let n_iter=1000" is the number of iteration of GA, they could be different. Both are bigger,you could get better result.

Quartz | Level 8

## Re: Binning (categorize continuous var into categories)

Hi @Ronein  et al!

As @Ksharp  suggested, you an use PROC OPTMODEL to create monotonic bins that meet your criteria and maximize IV (also known as symmetric Kullback-Leibler Divergence).  I posted some code to do that five years ago:
Trying to use PROC OPTMODEL for monotonic supervised optimal binning o... - SAS Support Communities

It uses integer programming and it's not very efficient, and it can take a very long time to run, even on small problems.  That is the nature of combinatorial optimization.

I have attached some code that uses a slightly less optimal two-stage approach.  The first stage is an isotonic regression, which creates the closest fitting monotonic sequence from the original data; you can think of it as very granular binning, for which you have no control over the number or sizes of the bins.  The second stage uses a dynamic programming approach to create the bins with the properties you require (number and sizes of bins, but you'd have to add the constraint on the minimum event number in each bin).  The dynamic programming doesn't run into the same computational issues as the integer programming approach, but it can't impose monotonicity; that's why you precede it with the isotonic regression, which forces the subsequent bins to remain monotonic.

The five files run in the following order:

1. _0_import_german_credit.csv_.txt reads in the data

2. _1_agg_gc_credit_amount_.txt aggregates to a single observation for each unique predictor variable value

3. _3_agg_ds_for_optmodel_.txt adds some cumulative variables to the aggregated data

4. _5_isoreg_decrease_data_step_.txt runs the isotonic regression on the aggregated data

5. _100_gcdp011_post_isoreg_4+-gps_40-minsz_ivx_.txt runs the dynamic programming algorithm on the output of the isotonic regression.  In this case, it seeks at least four bins with a minimum size of forty records, and maximum IV.

I hope you find this helpful.

Discussion stats
• 28 replies
• 874 views
• 8 likes
• 4 in conversation