BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14
Hello
Let's say that I have a categorical varaible that has 20 categories.
I want to perform binning to 5 categories.
The best binning is the binning that maximize information value.
Can anyone send a sas code that do it ?
Remember that the variable that I want to perform binning is categorical .
Please note that I need it for logistic regression.
Please note that the 20 categories of the variable were created after combine 3 categorical variables

Thanks
19 REPLIES 19
ballardw
Super User

Define how we would recognize "maximize information value".

 

Details are very likely data dependent.

 

You might look at clustering or similar tools for identifying observations that are similar and see if that gives you usable bins.

 

Custom format(s) would likely be the easiest tool to implement for use as you wouldn't have to change your data at all just use the format with the variable in the analysis or reporting procedures.

Ronein
Meteorite | Level 14

It is used for credit risk model.

I have 3 categorical predictor variables and I need to combine them and then perform binning to get 5 categories.

The criteria for choosing the best binning is:

1- The worse category is 1 with the highest WOE

2-The best category is  5 with lowest WOE

3-There is ordinal gap between categories (it means that WOE for each category is lower than previous)

4- In each category at least   50 bads customers

As I said the binning method is maximizing the information value (IV)

Can anyone show a code? 

 

ballardw
Super User

How can you have "lowest" or "highest" anything if the variables are character???

 

Sorry, I don't speak credit risk. Maybe someone else can make guesses about what your data looks like that make sense.

Ronein
Meteorite | Level 14

Sorry,

I have a var with 20 nominal categories.

I need to find best binning way to 5 groups

mkeintz
PROC Star

@Ronein wrote:

Sorry,

I have a var with 20 nominal categories.

I need to find best binning way to 5 groups


Do you have a continuous outcome measure, an ordinal outcome measure, or a nominal outcome measure?  If it's the later, how do you define the concept of maximizing information value?   Please make the structure of your objective clear.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ronein
Meteorite | Level 14
Binary outcome dependent varaible get value 1 or 0(1-fail 0-no fail)
mkeintz
PROC Star

So you have frequencies of outcome=0 and outcome=1 for some 20 bins, yes?

 

And you want to maximize "information" by collapsing those 20 bins to 5 bins, right?

 

What's your definition of information?      Chi-square?  Entropy?  I see several alternatives in Wikipedia's entry for Quantities of information.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

And are the 20 bins you already have adjacent in any way?  I ask this because your original posting said:

 

Please note that the 20 categories of the variable were created after combine 3 categorical variables

So if the 3 original categorical variables had adjacent categories to begin with, it may be that the 20 derivative categories had some type of adjacency.  For example, if you had 3 variables, each with a L (low) M (medium) and H (high) category, you could end up with 27 combinations from LLL, LLM, LLH, ..... HHH.  Would you be content in collapsing the number of bins in a way that included combining LLL with HHH?   That seems a bit inconsistent with the article you cited, but it could happen if you treat your 20 bins as strictly nominal.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ronein
Meteorite | Level 14
Yes .combine categories with similar WOE1
Ksharp
Super User

Here is the GA code,but that did not guarrentee you get the GLOBAL OPTIMAL solution  .

I suggest you post your question at SAS/OR forum and calling @RobPratt 

https://communities.sas.com/t5/Mathematical-Optimization/bd-p/operations_research


%let var=purpose   ;
%let group=5 ;
%let n_iter=100;


data temp;
 set have;
 keep &var good_bad ;
run;


proc sql noprint;
 select sum(good_bad='bad'),sum(good_bad='good') into : n_bad,: n_good
  from temp;
 create index &var on temp;
quit;
%put &n_bad &n_good ;

proc iml;
use temp ;
read all var {&var good_bad};
close;

start function(x) global(bin,&var ,good_bad,group,level,woe);
if countunique(x)=group then do;

do i=1 to group;
 idx=loc(x=i);
 levels=level[idx];
 index=loc(element(&var,levels));
 temp=good_bad[index];
 n_bad=sum(temp='bad');
 n_good=sum(temp='good');
 bad_dist=n_bad/&n_bad ; 
 good_dist=n_good/&n_good ; 
 if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=(Bad_Dist-Good_Dist)#log(Bad_Dist/Good_Dist);
  else woe[i]=.;
end;

if countmiss(woe)=0 then iv=sum(woe) ;


else iv=-999999;

end;
else iv=-999999;

return (iv);
finish;

group=&group ; 
bin=t(1:group);
woe=j(group,1,.);


level=unique(&var);
n_level=countunique(&var);

encoding=j(2,n_level,1 );
encoding[2,]=group ;    

id=gasetup(2,n_level,123456789);
call gasetobj(id,1,"function");
call gasetsel(id,10,1,1);
call gainit(id,10000,encoding);


niter = &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

col_mem=t(mem);
create group var{col_mem level};
append;
close;

print value[l = "IV Max Value:"] ;
call gaend(id);
quit;

data all;
 merge temp group(rename=(level=&var col_mem=group)) ;
 by &var;
run;



title "变量: &var" ;
proc sql;
create table woe_&var as
 select group,
sum(good_bad='bad') as n_bad label='bad的个数',sum(good_bad='good') as n_good label='good的个数',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist  format=percent7.2 label='bad的占比',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist  format=percent7.2 label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe;

create index group on woe_&var;
create index col_mem on group;


select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_&var ;

quit;
title ' ';

data fmt_&var(index=(group)) ;
 merge group(rename=(col_mem=group)) woe_&var(keep=group woe);
 by group;
 retain fmtname "&var" type 'I';
 rename level=start woe=label;
run;

proc sgplot data=woe_&var noautolegend;
 vbar group/response=woe nostatlabel missing;
 vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled 
  size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
  FILLEDOUTLINEDMARKERS;
xaxis discreteorder=data;
run;

Ksharp_0-1725694915686.pngKsharp_1-1725694927477.png

 

 

Ronein
Meteorite | Level 14
Thank you.
Should the varaible we perform binning on it be categorical nominal or can also be categorical char?
Ksharp
Super User

Could be both of them.
Just make sure this variable value is discrete ,never mind it is numeric or character type.

 

NOTE: the code has been changed. The 'good_bad' variable should have only two values : 'good'  or 'bad'  .

Ronein
Meteorite | Level 14
How can you see in output which categories are in each bin ?

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 889 views
  • 1 like
  • 4 in conversation