Hi @Ksharp ! I ran an experiment with the German Credit data to illustrate my point about linearity between target and predictor versus visual linearity of bin WoE values against bin sequence numbers.

The attachment imports the full 1,000-record German Credit data set, sorted in increasing order of Credit_Amount (the original interval-valued predictor), with five added variables:

1. idnum is just a sequence number for the original order of the observations.
2. group_ks gives bin sequence numbers for the binning in your paper:
   - group_ks = 1 for the first 699 records, up to Credit_Amount = 3578
   - group_ks = 2 for the next 139 records, up to Credit_Amount = 5743
   - group_ks = 3 for the next 60 records, up to Credit_Amount = 7127
   - group_ks = 4 for the last 102 records, up to Credit_Amount = 18424
3. woe_ks gives the corresponding WoE value for each of the bins from your paper:
   - woe_ks = 0.167231311070365 when group_ks = 1
   - woe_ks = -0.0441497846129298 when group_ks = 2
   - woe_ks = -0.371874163672129 when group_ks = 3
   - woe_ks = -0.72951482473082 when group_ks = 4
4. group_sd gives bin sequence numbers for the maximum information value binning in which all differences between bin event rates are significant with 95% confidence, subject to a 5% minimum distribution requirement and a group size of at least 60:
   - group_sd = 1 for the first 739 records, up to Credit_Amount = 3905
   - group_sd = 2 for the next 183 records, up to Credit_Amount = 7758
   - group_sd = 3 for the last 78 records, up to Credit_Amount = 18424
5. woe_sd gives the corresponding WoE value for each of the group_sd bins:
   - woe_sd = 0.2208734028 when group_sd = 1
   - woe_sd = -0.345205917 when group_sd = 2
   - woe_sd = -1.00144854 when group_sd = 3

Then I ran six logistic regressions of Creditability, the binary target, against, respectively: group_ks as a numeric variable, woe_ks as a numeric variable, group_ks as a class variable, group_sd as a numeric variable, woe_sd as a numeric variable, and group_sd as a class variable.
In each case, I had PROC LOGISTIC run the Hosmer-Lemeshow test (the LACKFIT option). It is computed on the development data only, because there is no separate test data.

/* Run 01: group_ks as a numeric predictor */
%let runid = 01 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_ks ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;

/* Run 02: woe_ks as a numeric predictor */
%let runid = 02 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = woe_ks ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;

/* Run 03: group_ks as a class predictor */
%let runid = 03 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_ks ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. (class) DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
class &&bin_predictor_&runid.. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;

/* Run 04: group_sd as a numeric predictor */
%let runid = 04 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_sd ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;

/* Run 05: woe_sd as a numeric predictor */
%let runid = 05 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = woe_sd ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;

/* Run 06: group_sd as a class predictor */
%let runid = 06 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_sd ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. (class) DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
class &&bin_predictor_&runid.. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;

The first model uses your bin sequence numbers as a single numeric variable, group_ks. Because you built the bins to look linear, the Hosmer-Lemeshow test shows pretty close agreement between observed and expected: chi-square = 0.2012 and p value = 0.9043.

But if you run the second model, using woe_ks (the WoE values for the same set of bins) as a single numeric variable, you get perfect agreement between observed and expected in Hosmer-Lemeshow: chi-square = 0 and p value = 1. You get the same perfect result from the third model, where group_ks is a class variable.

group_sd has similar results, although it does slightly better on model fit and classification / rank ordering than group_ks. The only requirements for group_sd were that the event rate differences between neighboring groups be significant (with event rates monotonically decreasing) and that the minimum group size be 60 (along with the 5% minimum distribution requirement). group_sd made no attempt to show visual linearity.

So, if you just use WoE values to represent the bins, or make the bins class variables, you get the best linear response between predictor and target; there is no need to try to obtain visual linearity.
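
To spell out why the WoE and class codings fit perfectly, here is a short derivation. The notation is my own for illustration: g_k and b_k are the counts of Creditability = 1 and Creditability = 0 records in bin k, and G and B are the corresponding totals. It assumes WoE is defined as the log of the share of goods over the share of bads; with the opposite sign convention the slope below becomes -1 instead of 1.

```latex
\begin{align*}
\mathrm{WoE}_k &= \ln\frac{g_k/G}{b_k/B}
  \;\Longrightarrow\;
  \ln\frac{g_k}{b_k} = \ln\frac{G}{B} + \mathrm{WoE}_k, \\
\operatorname{logit}\,\hat p_k &= \alpha + \beta\,\mathrm{WoE}_k
  \quad\text{with}\quad \alpha = \ln\frac{G}{B},\ \beta = 1
  \;\Longrightarrow\;
  \hat p_k = \frac{g_k}{g_k + b_k}.
\end{align*}
```

In words: the observed log odds in each bin are exactly linear in WoE, so a single WoE variable can reproduce every bin's observed event rate. That is the saturated fit, which maximizes the likelihood, so expected counts equal observed counts in any grouping of whole bins and the Hosmer-Lemeshow chi-square is 0. A class variable with one level per bin is saturated in the same way. The bin sequence number as a numeric variable instead forces the bin log odds to be equally spaced, which holds exactly only when the WoE values happen to be exactly linear in the bin number.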