Re: Binning categorical variavle

Ronein · Posted 09-06-2024 10:36 AM

Hello
Let's say that I have a categorical varaible that has 20 categories.
I want to perform binning to 5 categories.
The best binning is the binning that maximize information value.
Can anyone send a sas code that do it ?
Remember that the variable that I want to perform binning is categorical .
Please note that I need it for logistic regression.
Please note that the 20 categories of the variable were created after combine 3 categorical variables

Thanks

ballardw · Posted 09-06-2024 11:27 AM

Define how we would recognize "maximize information value".

Details are very likely data dependent.

You might look at clustering or similar tools for identifying observations that are similar and see if that gives you usable bins.

Custom format(s) would likely be the easiest tool to implement for use as you wouldn't have to change your data at all just use the format with the variable in the analysis or reporting procedures.

Ronein · Posted 09-06-2024 01:37 PM

It is used for credit risk model.

I have 3 categorical predictor variables and I need to combine them and then perform binning to get 5 categories.

The criteria for choosing the best binning is:

1- The worse category is 1 with the highest WOE

2-The best category is 5 with lowest WOE

3-There is ordinal gap between categories (it means that WOE for each category is lower than previous)

4- In each category at least 50 bads customers

As I said the binning method is maximizing the information value (IV)

Can anyone show a code?

ballardw · Posted 09-06-2024 01:44 PM

How can you have "lowest" or "highest" anything if the variables are character???

Sorry, I don't speak credit risk. Maybe someone else can make guesses about what your data looks like that make sense.

Ronein · Posted 09-06-2024 01:59 PM

Sorry,

I have a var with 20 nominal categories.

I need to find best binning way to 5 groups

mkeintz · Posted 09-06-2024 02:11 PM

@Ronein wrote:

Sorry,

I have a var with 20 nominal categories.

I need to find best binning way to 5 groups

Do you have a continuous outcome measure, an ordinal outcome measure, or a nominal outcome measure? If it's the later, how do you define the concept of maximizing information value? Please make the structure of your objective clear.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ronein · Posted 09-06-2024 02:18 PM

Binary outcome dependent varaible get value 1 or 0(1-fail 0-no fail)

mkeintz · Posted 09-06-2024 02:35 PM

So you have frequencies of outcome=0 and outcome=1 for some 20 bins, yes?

And you want to maximize "information" by collapsing those 20 bins to 5 bins, right?

What's your definition of information? Chi-square? Entropy? I see several alternatives in Wikipedia's entry for Quantities of information.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ronein · Posted 09-06-2024 04:21 PM

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

mkeintz · Posted 09-07-2024 01:52 AM

And are the 20 bins you already have adjacent in any way? I ask this because your original posting said:

Please note that the 20 categories of the variable were created after combine 3 categorical variables

So if the 3 original categorical variables had adjacent categories to begin with, it may be that the 20 derivative categories had some type of adjacency. For example, if you had 3 variables, each with a L (low) M (medium) and H (high) category, you could end up with 27 combinations from LLL, LLM, LLH, ..... HHH. Would you be content in collapsing the number of bins in a way that included combining LLL with HHH? That seems a bit inconsistent with the article you cited, but it could happen if you treat your 20 bins as strictly nominal.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ronein · Posted 09-07-2024 02:33 AM

Yes .combine categories with similar WOE1

Ksharp · Posted 09-07-2024 03:05 AM

Here is the GA code,but that did not guarrentee you get the GLOBAL OPTIMAL solution .

I suggest you post your question at SAS/OR forum and calling @RobPratt

https://communities.sas.com/t5/Mathematical-Optimization/bd-p/operations_research


%let var=purpose   ;
%let group=5 ;
%let n_iter=100;


data temp;
 set have;
 keep &var good_bad ;
run;


proc sql noprint;
 select sum(good_bad='bad'),sum(good_bad='good') into : n_bad,: n_good
  from temp;
 create index &var on temp;
quit;
%put &n_bad &n_good ;

proc iml;
use temp ;
read all var {&var good_bad};
close;

start function(x) global(bin,&var ,good_bad,group,level,woe);
if countunique(x)=group then do;

do i=1 to group;
 idx=loc(x=i);
 levels=level[idx];
 index=loc(element(&var,levels));
 temp=good_bad[index];
 n_bad=sum(temp='bad');
 n_good=sum(temp='good');
 bad_dist=n_bad/&n_bad ; 
 good_dist=n_good/&n_good ; 
 if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=(Bad_Dist-Good_Dist)#log(Bad_Dist/Good_Dist);
  else woe[i]=.;
end;

if countmiss(woe)=0 then iv=sum(woe) ;


else iv=-999999;

end;
else iv=-999999;

return (iv);
finish;

group=&group ; 
bin=t(1:group);
woe=j(group,1,.);


level=unique(&var);
n_level=countunique(&var);

encoding=j(2,n_level,1 );
encoding[2,]=group ;    

id=gasetup(2,n_level,123456789);
call gasetobj(id,1,"function");
call gasetsel(id,10,1,1);
call gainit(id,10000,encoding);


niter = &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

col_mem=t(mem);
create group var{col_mem level};
append;
close;

print value[l = "IV Max Value:"] ;
call gaend(id);
quit;

data all;
 merge temp group(rename=(level=&var col_mem=group)) ;
 by &var;
run;



title "变量: &var" ;
proc sql;
create table woe_&var as
 select group,
sum(good_bad='bad') as n_bad label='bad的个数',sum(good_bad='good') as n_good label='good的个数',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist  format=percent7.2 label='bad的占比',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist  format=percent7.2 label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe;

create index group on woe_&var;
create index col_mem on group;


select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_&var ;

quit;
title ' ';

data fmt_&var(index=(group)) ;
 merge group(rename=(col_mem=group)) woe_&var(keep=group woe);
 by group;
 retain fmtname "&var" type 'I';
 rename level=start woe=label;
run;

proc sgplot data=woe_&var noautolegend;
 vbar group/response=woe nostatlabel missing;
 vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled 
  size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
  FILLEDOUTLINEDMARKERS;
xaxis discreteorder=data;
run;

Ronein · Posted 09-07-2024 03:24 AM

Thank you.
Should the varaible we perform binning on it be categorical nominal or can also be categorical char?

Ksharp · Posted 09-07-2024 03:44 AM

Could be both of them.
Just make sure this variable value is discrete ,never mind it is numeric or character type.

NOTE: the code has been changed. The 'good_bad' variable should have only two values : 'good' or 'bad' .

Ronein · Posted 09-07-2024 07:18 AM

How can you see in output which categories are in each bin ?

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away