Define how we would recognize "maximize information value".
Details are very likely data dependent.
You might look at clustering or similar tools for identifying observations that are similar and see if that gives you usable bins.
Custom format(s) would likely be the easiest tool to implement for use as you wouldn't have to change your data at all just use the format with the variable in the analysis or reporting procedures.
It is used for credit risk model.
I have 3 categorical predictor variables and I need to combine them and then perform binning to get 5 categories.
The criteria for choosing the best binning is:
1- The worse category is 1 with the highest WOE
2-The best category is 5 with lowest WOE
3-There is ordinal gap between categories (it means that WOE for each category is lower than previous)
4- In each category at least 50 bads customers
As I said the binning method is maximizing the information value (IV)
Can anyone show a code?
How can you have "lowest" or "highest" anything if the variables are character???
Sorry, I don't speak credit risk. Maybe someone else can make guesses about what your data looks like that make sense.
Sorry,
I have a var with 20 nominal categories.
I need to find best binning way to 5 groups
@Ronein wrote:
Sorry,
I have a var with 20 nominal categories.
I need to find best binning way to 5 groups
Do you have a continuous outcome measure, an ordinal outcome measure, or a nominal outcome measure? If it's the later, how do you define the concept of maximizing information value? Please make the structure of your objective clear.
So you have frequencies of outcome=0 and outcome=1 for some 20 bins, yes?
And you want to maximize "information" by collapsing those 20 bins to 5 bins, right?
What's your definition of information? Chi-square? Entropy? I see several alternatives in Wikipedia's entry for Quantities of information.
And are the 20 bins you already have adjacent in any way? I ask this because your original posting said:
Please note that the 20 categories of the variable were created after combine 3 categorical variables
So if the 3 original categorical variables had adjacent categories to begin with, it may be that the 20 derivative categories had some type of adjacency. For example, if you had 3 variables, each with a L (low) M (medium) and H (high) category, you could end up with 27 combinations from LLL, LLM, LLH, ..... HHH. Would you be content in collapsing the number of bins in a way that included combining LLL with HHH? That seems a bit inconsistent with the article you cited, but it could happen if you treat your 20 bins as strictly nominal.
Here is the GA code,but that did not guarrentee you get the GLOBAL OPTIMAL solution .
I suggest you post your question at SAS/OR forum and calling @RobPratt
https://communities.sas.com/t5/Mathematical-Optimization/bd-p/operations_research
%let var=purpose ;
%let group=5 ;
%let n_iter=100;
data temp;
set have;
keep &var good_bad ;
run;
proc sql noprint;
select sum(good_bad='bad'),sum(good_bad='good') into : n_bad,: n_good
from temp;
create index &var on temp;
quit;
%put &n_bad &n_good ;
proc iml;
use temp ;
read all var {&var good_bad};
close;
start function(x) global(bin,&var ,good_bad,group,level,woe);
if countunique(x)=group then do;
do i=1 to group;
idx=loc(x=i);
levels=level[idx];
index=loc(element(&var,levels));
temp=good_bad[index];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=(Bad_Dist-Good_Dist)#log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then iv=sum(woe) ;
else iv=-999999;
end;
else iv=-999999;
return (iv);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
level=unique(&var);
n_level=countunique(&var);
encoding=j(2,n_level,1 );
encoding[2,]=group ;
id=gasetup(2,n_level,123456789);
call gasetobj(id,1,"function");
call gasetsel(id,10,1,1);
call gainit(id,10000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
create group var{col_mem level};
append;
close;
print value[l = "IV Max Value:"] ;
call gaend(id);
quit;
data all;
merge temp group(rename=(level=&var col_mem=group)) ;
by &var;
run;
title "变量: &var" ;
proc sql;
create table woe_&var as
select group,
sum(good_bad='bad') as n_bad label='bad的个数',sum(good_bad='good') as n_good label='good的个数',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist format=percent7.2 label='bad的占比',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist format=percent7.2 label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe;
create index group on woe_&var;
create index col_mem on group;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_&var ;
quit;
title ' ';
data fmt_&var(index=(group)) ;
merge group(rename=(col_mem=group)) woe_&var(keep=group woe);
by group;
retain fmtname "&var" type 'I';
rename level=start woe=label;
run;
proc sgplot data=woe_&var noautolegend;
vbar group/response=woe nostatlabel missing;
vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled
size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
FILLEDOUTLINEDMARKERS;
xaxis discreteorder=data;
run;
Could be both of them.
Just make sure this variable value is discrete ,never mind it is numeric or character type.
NOTE: the code has been changed. The 'good_bad' variable should have only two values : 'good' or 'bad' .
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.