Solved: Optimal Binning and WOE for binary target variable

dolldata · Posted 10-03-2017 12:40 PM

We are currently struggling with Optimal Binning in Enterprise Miner. Though we have used base SAS for years, when modeling, we always did our binning and variable selection within the Angoss Knowledge Studio program, but recently converted to Enterprise Miner (version 14.1).

In Angoss, through Optimal Binning and Weight of Evidence (WOE), the software essentially built a decision tree for each variable, basing the bins on the variable's relation to the binary target variable. The software chose the optimal number of bins, but we could go in and tweak them as we chose. Then we could code a single variable from the bins, where the values were the weight of evidence. For example:

/*Sample of a binned variable based on WOE*/
     if transactions <   0 then bin_transactions = -0.0617;
else if transactions <   1 then bin_transactions = -0.4566;
else if transactions <   2 then bin_transactions = -0.0617;
else if transactions <   7 then bin_transactions =  0.1214;
else if transactions <  16 then bin_transactions =  0.3625;
else if transactions <  27 then bin_transactions =  0.5316;
else if transactions >= 27 then bin_transactions =  0.9163;

We would then code these binned variables into base SAS and, along with other top variables we had selected, run a PROC LOGISTIC to create our model, which would result in a single coefficient for the binned variable.

/*Reduced example of the model equation (for inclusion in logistic conversion)*/
Y1 = - 0.00131*recency            
     + 0.2438*bin_transactions   /*binned variable*/
     + 0.1007*grandparent     
     - 1.5645;

p = 1/(1 + EXP(-Y1));

Not the most efficient method, but it created beautiful models! Using Enterprise Miner for the entire process, including the regression, we can only seem to create indicator variables for each bin within a variable, and then have a coefficient for each bin. So the transactions variable referenced above would now be seven unique variables, all with their own coefficients, which doesn't seem ideal.

Is there any way to get this same type of information within Enterprise Miner? After reading threads here, and after discussions we originally had with SAS when converting, we haven't really found a way to do this.

DougWielenga · Posted 10-04-2017 12:01 PM

(note: edited to clarify the need for SAS Enterprise Miner Credit Scoring add-on for the Interactive Grouping node)

Dolldata,

Thanks for your inquiry. SAS Enterprise Miner was designed to simplify the highly manual process of modeling including actions like splitting, imputing, transforming, binning, and calculating WOE for models. If you wish for SAS Enterprise Miner to compute WOE (weight of evidence) values and produce scorecard type output, you will need the SAS Credit Scoring add-on for SAS Enterprise Miner. This add-on provides an additional set of nodes including the Interactive Grouping node which allows the user to interactively edit how the bins are being created while maintaining the other desirable features of binning such as making sure the bins are monotonically increasing or decreasing. Most modeling processes including the Credit Scoring models can be deployed using only Base SAS since the score code that is generated is data step code. For regression models, you will not need a license for SAS Enterprise Miner to use the score code it produces on a server running Base SAS.

Regarding the simplicity of the code you mentioned -- while the code looks more efficient in its final state using your previous approach, it is not fair to compare only the final appearance of the score code. The SAS Enterprise Miner code is generated automatically after you have created and finalized bins through an interactive interface. You are not required to translate the final binned values into score code so the amount of work overall is typically far less with the price being the apparent complexity of the resulting score code. The user is not expected to edit the score code generated by SAS Enterprise Miner manually (and doing so is actually discouraged) since that work is handled automatically. These are simple binning transformations which are not computationally intensive so any additional run time is likely small in relation to the amount of manual effort you are saved by this approach which does not require modifying the score code. Should you wish to import bins created previously into a new data set with one or more previously analyzed variables, you can do so easily making the SAS Enterprise Miner process easier to translate to multiple modeling scenarios. Again, this scoring can be deployed on a system running only Base SAS for most modeling scenarios including those run by the SAS Credit Scoring nodes.

If this doesn't alleviate your concern, please help me to understand why the simpler score code is preferable and I will try and see if there are ways to accomplish those same goals.

Thanks!

Doug

View solution in original post

DougWielenga · Posted 10-04-2017 12:01 PM

(note: edited to clarify the need for SAS Enterprise Miner Credit Scoring add-on for the Interactive Grouping node)

Dolldata,

Thanks for your inquiry. SAS Enterprise Miner was designed to simplify the highly manual process of modeling including actions like splitting, imputing, transforming, binning, and calculating WOE for models. If you wish for SAS Enterprise Miner to compute WOE (weight of evidence) values and produce scorecard type output, you will need the SAS Credit Scoring add-on for SAS Enterprise Miner. This add-on provides an additional set of nodes including the Interactive Grouping node which allows the user to interactively edit how the bins are being created while maintaining the other desirable features of binning such as making sure the bins are monotonically increasing or decreasing. Most modeling processes including the Credit Scoring models can be deployed using only Base SAS since the score code that is generated is data step code. For regression models, you will not need a license for SAS Enterprise Miner to use the score code it produces on a server running Base SAS.

Regarding the simplicity of the code you mentioned -- while the code looks more efficient in its final state using your previous approach, it is not fair to compare only the final appearance of the score code. The SAS Enterprise Miner code is generated automatically after you have created and finalized bins through an interactive interface. You are not required to translate the final binned values into score code so the amount of work overall is typically far less with the price being the apparent complexity of the resulting score code. The user is not expected to edit the score code generated by SAS Enterprise Miner manually (and doing so is actually discouraged) since that work is handled automatically. These are simple binning transformations which are not computationally intensive so any additional run time is likely small in relation to the amount of manual effort you are saved by this approach which does not require modifying the score code. Should you wish to import bins created previously into a new data set with one or more previously analyzed variables, you can do so easily making the SAS Enterprise Miner process easier to translate to multiple modeling scenarios. Again, this scoring can be deployed on a system running only Base SAS for most modeling scenarios including those run by the SAS Credit Scoring nodes.

If this doesn't alleviate your concern, please help me to understand why the simpler score code is preferable and I will try and see if there are ways to accomplish those same goals.

Thanks!

Doug

dolldata · Posted 05-11-2018 12:05 PM

Hi Doug,

I know it's been awhile since your response, but we've still been working through this issue. I realize that doing the entire process in Miner is simpler and preferred. However, we only have a desktop version of Miner and are not licensed to connect it to our server. Therefore, scoring our entire database with Miner is not an option at this point. That is why we want to be able to pull the binning information out, so that we can code it back into Base SAS, where we are connected to our server, and can then score the full database. I also realize that for many, using Miner as a bit of a black box is ideal, but we have to be able to explain our models in detail, so it's rather hard to do when we can't see much of the information.

Recently, I have been working in the Interactive Binning Node and like the capabilities within it, except of course for still not being able to see the WOE information. The results do create the graph with the Calculated WOE / New WOE, but I've yet to figure out how to see what those actual values are (the graph only details the number of events for each bin, not the WOE values). Is there some output that tells me these values?

Ksharp · Posted 10-05-2017 04:07 AM

My code can't give you "this same type of information within Enterprise Miner".

But my code can give you the better WOE by Genetic Algorithm under SAS/IML .

I don't know if you need my code . Browse it if you would like to better WOE.

My code take the famous German Credit Card data as an example.

/*
proc import datafile='c:\temp\1--German Credit.xlsx' dbms=xlsx out=have replace;
run;
*/

%let var=history;
%let group=6 ;
%let n_iter=100;


data temp;
 set have;
 keep &var good_bad ;
run;


proc sql noprint;
 select sum(good_bad='bad'),sum(good_bad='good'),
        floor(min(&var)),ceil(max(&var)) into : n_bad,: n_good,: min,: max
  from temp;
quit;
%put &n_bad &n_good &min &max;
proc sort data=temp;by &var ;run;
proc iml;
use temp(where=(&var is not missing));
read all var {&var good_bad};
close;

start function(x) global(bin,&var ,good_bad,group,woe);
if countunique(x)=group-1 then do;

col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(&var ,cutpoints,'right');

if countunique(b)=group  then do;
do i=1 to group;
 idx=loc(b=i);
 temp=good_bad[idx];
 n_bad=sum(temp='bad');
 n_good=sum(temp='good');
 bad_dist=n_bad/&n_bad ; 
 good_dist=n_good/&n_good ; 
 if Bad_Dist>0.05 & Good_Dist>0.05  then woe[i]=log(Bad_Dist/Good_Dist);
  else woe[i]=.;
end;

if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;

end;
else sse=999999;

end;
else sse=999999;

return (sse);
finish;

group=&group ;  
bin=t(1:group);
woe=j(group,1,.);



encoding=j(2,group-1,&min );
encoding[2,]=&max ;    

id=gasetup(2,group-1,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);


niter =  &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(&var ,cutpoints,'right');

create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;

print value[l = "Min Value:"] ;
call gaend(id);
quit;


data all_group;
 set temp(keep=&var rename=(&var=b) where=(b is missing)) group;
run;
data all;
 merge all_group temp;
 rename b=group;
run;




title "变量: &var" ;
proc sql;
create table woe_&var as
 select group label=' ',
min(&var) as min label='最小值',max(&var) as max label='最大值',count(*) as n label='频数',
calculated n/(select count(*) from all) as per format=percent7.2 label='占比',
sum(good_bad='bad') as n_bad label='bad的个数',sum(good_bad='good') as n_good label='good的个数',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad的占比',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe;

create index group on woe_&var;

select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_&var ;

quit;
title ' ';





data fmt_&var ;
 set cutpoints;
 start=lag(cutpoints);
 end=cutpoints;
 if start=.M then hlo='IL';
 if end=.I then hlo='IH';
 if _n_ ne 1 then do;group+1;output;end;
run;
data fmt_&var(index=(group));
 merge  fmt_&var woe_&var(keep=group woe);
 by group;
 retain fmtname "&var" type 'I';
 keep group fmtname type start end woe hlo;
 rename woe=label;
 label group=' ';
run;
proc format cntlin=fmt_&var library=z;
run;


/*
proc print data=woe_&var noobs label;run;
proc sgplot data=woe_&var;
reg y=group x=woe/degree=2 cli clm jitter;
run;
*/
proc sgplot data=woe_&var noautolegend;
 vbar group/response=woe nostatlabel missing;
 vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled 
  size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
  FILLEDOUTLINEDMARKERS;
run;

ods select fitplot;
proc reg data=woe_&var;
model group=woe/ cli clm ;
quit;




proc copy in=work out=z;
 select woe_: fmt_: ;
run;

Ksharp · Posted 10-05-2017 04:13 AM

The following is for category variable for maxmizing the IV value.

/*
proc import datafile='c:\temp\1--German Credit.xlsx' dbms=xlsx out=have replace;
run;
*/


%let var=marital   ;
%let group=2 ;
%let n_iter=100;


data temp;
 set have;
 keep &var good_bad ;
run;


proc sql noprint;
 select sum(good_bad='bad'),sum(good_bad='good') into : n_bad,: n_good
  from temp;
 create index &var on temp;
quit;
%put &n_bad &n_good ;

proc iml;
use temp ;
read all var {&var good_bad};
close;

start function(x) global(bin,&var ,good_bad,group,level,woe);
if countunique(x)=group then do;

do i=1 to group;
 idx=loc(x=i);
 levels=level[idx];
 index=loc(element(&var,levels));
 temp=good_bad[index];
 n_bad=sum(temp='bad');
 n_good=sum(temp='good');
 bad_dist=n_bad/&n_bad ; 
 good_dist=n_good/&n_good ; 
 if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=(Bad_Dist-Good_Dist)#log(Bad_Dist/Good_Dist);
  else woe[i]=.;
end;

if countmiss(woe)=0 then iv=sum(woe) ;


else iv=-999999;

end;
else iv=-999999;

return (iv);
finish;

group=&group ; 
bin=t(1:group);
woe=j(group,1,.);


level=unique(&var);
n_level=countunique(&var);

encoding=j(2,n_level,1 );
encoding[2,]=group ;    

id=gasetup(2,n_level,123456789);
call gasetobj(id,1,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);


niter = &n_iter ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

col_mem=t(mem);
create group var{col_mem level};
append;
close;

print value[l = "IV Max Value:"] ;
call gaend(id);
quit;

data all;
 merge temp group(rename=(level=&var col_mem=group)) ;
 by &var;
run;



title "变量: &var" ;
proc sql;
create table woe_&var as
 select group,
sum(good_bad='bad') as n_bad label='bad的个数',sum(good_bad='good') as n_good label='good的个数',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='bad的占比',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='good的占比',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
   group by group
    order by woe;

create index group on woe_&var;
create index col_mem on group;


select *,sum(  (Bad_Dist-Good_Dist)*woe  ) as iv
 from woe_&var ;

quit;
title ' ';

data fmt_&var(index=(group)) ;
 merge group(rename=(col_mem=group)) woe_&var(keep=group woe);
 by group;
 retain fmtname "&var" type 'I';
 rename level=start woe=label;
run;

proc format cntlin=fmt_&var library=z;
run;



proc sgplot data=woe_&var;
 series x=woe y=group / markers;
run;


/*
proc sgplot data=woe_&var noautolegend;
 vbar group/response=woe nostatlabel missing;
 vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled 
  size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
  FILLEDOUTLINEDMARKERS;
run;

ods select fitplot;
proc reg data=woe_&var;
model group=woe/ cli clm ;
quit;
*/





proc copy in=work out=z;
 select woe_: fmt_: ;
run;

starzhou · Posted 07-04-2018 06:39 AM

We dont buy iml modure. How to use this sas Code?

DougWielenga · Posted 07-26-2018 01:15 PM

We dont buy iml modure. How to use this sas Code?

Since the core of that code relied on the use of SAS/IML, I don't see how you could use it without having SAS/IML licensed, and I don't know of any other tools available in Base SAS that would make that functionality readily accessible.

Hope this helps!

Cordially,

Doug

Optimal Binning and WOE for binary target variable

Re: Optimal Binning and WOE for binary target variable

Re: Optimal Binning and WOE for binary target variable

Re: Optimal Binning and WOE for binary target variable

Re: Optimal Binning and WOE for binary target variable

Re: Optimal Binning and WOE for binary target variable

Re: Optimal Binning and WOE for binary target variable

Re: Optimal Binning and WOE for binary target variable

Registration is open