
Weight of Evidence, Dummy Variables and Degrees of Freedom

Started ‎03-23-2021 by
Modified ‎08-30-2021 by
Paper 1022-2021
Authors: Bruce Lund, Statistical Consultant and Trainer, Novi Michigan

Abstract

Models with a binary target are often fitted by logistic regression. An important step in logistic regression is transforming predictors before the model fitting stage. In credit risk modeling, predictors are often transformed by weight of evidence (WOE) coding. An alternative to WOE coding is dummy variable coding. Let C be a discrete predictor and C_woe be its WOE coding. A model where C_woe is the only predictor gives the same probabilities as the model with C appearing in CLASS C as the only predictor. Hence, the degrees of freedom (d.f.) of C_woe are L-1 where C has L levels. But if additional predictors are in a model, it is unclear how to assign d.f. to C_woe when considering the entry of C_woe into the model. Does C_woe have 1 d.f., L-1 d.f., or something in between? This ambiguity affects usage of predictor selection methods based on p-values, AIC, or BIC. This presentation discusses how a model with predictor C_woe and other predictors <X> can be viewed as nested within the model with CLASS C, predictor C, and the same other predictors <X>. This nesting property suggests a process to assign d.f. to C_woe when entering it into a model. This process enables forward selection to select predictors for entry where the d.f. for WOE predictors are adjusted (not simply given 1 d.f.). An algorithm is provided for forward selection with adjusted d.f. for WOEs to choose the logistic model with minimum AIC. A SAS macro is provided.

Watch the presentation

Watch Weight of Evidence, Dummy Variables and Degrees of Freedom on the SAS Users YouTube channel.

 

INTRODUCTION

Predictive models with a binary target are often fitted by logistic regression.[1] An important step in using logistic regression is transforming predictors before the model fitting stage. In credit risk modeling and direct marketing modeling, predictors are often transformed by weight of evidence (WOE) coding.

 

The books by Siddiqi (2017), Finlay (2010), and Thomas (2009) show the usage of weight of evidence coding for credit risk modeling.

 

This WOE approach applies to discrete predictors (i.e. having only a few levels), whether nominal, ordinal, or numeric. It also applies to continuous numeric predictors after the predictor has been reduced to discrete ranges (“fine classing”).
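As a rough illustration of fine classing, the sketch below (in Python, with a hypothetical helper name) cuts a continuous predictor into equal-count quantile ranges; the resulting bin labels can then be WOE-coded like any other discrete predictor. This is only a minimal sketch of the idea, not the binning method of the paper (see Lund 2017 for the author's binning macros).

```python
def fine_class(values, n_bins=5):
    """Assign each value to one of n_bins equal-count quantile ranges.
    A minimal 'fine classing' sketch; real binning tools do much more."""
    s = sorted(values)
    # interior bin edges at equal-count quantiles
    edges = [s[int(len(s) * i / n_bins)] for i in range(1, n_bins)]

    def bin_of(x):
        for i, edge in enumerate(edges):
            if x < edge:
                return i
        return len(edges)   # top bin

    return [bin_of(v) for v in values]
```

The discrete bin labels (0, 1, ..., n_bins-1) then play the role of the levels c1, ..., cL of a discrete predictor C.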

 

A widely used alternative to WOE coding is dummy variable coding. Let C be a discrete predictor and C_woe be its WOE coding. A model where C_woe is the only predictor has the same probabilities as the model with C appearing in CLASS C as the only predictor. Hence, the degrees of freedom (d.f.) of C_woe are L-1 where C has L levels.

 

But if there are additional predictors in the model, it is unclear how to assign d.f. to C_woe when considering the entry of C_woe into the model. Does C_woe have 1 degree of freedom, L-1 degrees of freedom, or something in between? This ambiguity affects the usage of predictor selection methods based on p-value significance, AIC, or BIC.[2]

In this paper it is shown that a model with predictor C_woe and other predictors <X> can be thought of as “nested” within the model with CLASS C, predictor C, and the same other predictors <X>. It will be this nesting property which suggests a process to assign d.f. to C_woe when entering a model.

 

This process enables forward selection to choose, at each step, the entering predictor that minimizes AIC, where the d.f. for WOE predictors are adjusted rather than simply assigned 1 d.f.

A SAS macro to implement this process is available from the author.

 

-----

[1] In this paper it is assumed that a logistic model does not have complete or quasi-complete separation. After models with separation are excluded, the logistic model has a unique maximum likelihood estimate. For discussion of separation, see Allison (2012, Ch. 3).

[2] AIC = -2*Log(L) + 2*K where K = 1 + d.f. of predictors in model. BIC = -2*Log(L) + log(N)*K with sample size N.
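For concreteness, the footnote's formulas applied to made-up fit statistics (all numbers here are hypothetical, chosen only to show the arithmetic):

```python
import math

m2ll = 1395.2   # -2*Log(L) of a fitted model (assumed value)
K = 1 + 4       # intercept plus 4 d.f. of predictors (assumed)
N = 1000        # sample size (assumed)

AIC = m2ll + 2 * K              # = 1405.2
BIC = m2ll + math.log(N) * K    # log(N) is the natural log
```

Note that K, and therefore both criteria, depend directly on the d.f. assigned to each predictor, which is why the d.f. ambiguity for WOE predictors matters.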

 

FULL DETAILS

See the attached paper "lund_1022.pdf" for full details; the attached paper is the statement of record. The MAIN HEADINGS below give only short summaries.

 

WEIGHT OF EVIDENCE (WOE) AND DUMMY VARIABLES

In Table 1 the weight of evidence coding (or transformation) of predictor C is illustrated. Predictor C can be numeric or character, with or without an ordering of its levels. The rightmost column gives the value WOE(cj) of the weight of evidence transformation of C = cj.

 

C    | Y=0 (Bj) | Y=1 (Gj) | Col % Y=0 (bj) | Col % Y=1 (gj) | WOE(cj) = Log(gj/bj)
-----|----------|----------|----------------|----------------|---------------------
c1   | 2        | 1        | B1/B = 0.250   | G1/G = 0.125   | -0.69315
c2   | 1        | 1        | B2/B = 0.125   | G2/G = 0.125   |  0.00000
c3   | 5        | 6        | B3/B = 0.625   | G3/G = 0.750   |  0.18232
SUM  | B = 8    | G = 8    |                |                |

Table 1. Weight of Evidence Transformation of C
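The WOE column of Table 1 can be reproduced directly from the counts. A minimal sketch in Python (purely illustrative; the SAS macro mentioned later is the author's actual tool):

```python
import math

# Counts by level of C from Table 1: (B_j = count of Y=0, G_j = count of Y=1)
counts = {"c1": (2, 1), "c2": (1, 1), "c3": (5, 6)}
B = sum(b for b, g in counts.values())   # total Y=0 count: 8
G = sum(g for b, g in counts.values())   # total Y=1 count: 8

# WOE(c_j) = Log(g_j / b_j), where g_j = G_j/G and b_j = B_j/B
woe = {c: math.log((g / G) / (b / B)) for c, (b, g) in counts.items()}
```

Running this reproduces the WOE column: roughly -0.69315 for c1, 0 for c2, and 0.18232 for c3.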

“NESTING” OF WOE MODEL WITHIN THE CLASS MODEL

Suppose C has L > 2 levels and C appears in a CLASS statement in a logistic model:

 

PROC LOGISTIC descending;
   CLASS C (PARAM=ref);
   MODEL Y = C <other predictors>;

 

This will be called the "CLASS model". 

 

The statement CLASS C with (PARAM=ref) has the effect of creating dummy variables for the lowest L-1 levels (in natural sort order) of C. The effect of PARAM=ref is to set to zero the implied coefficient of the dummy variable for the reference level C = cL.

 

Weight of evidence recoding of C is an alternative to using dummy variable coding for entering the predictor C into a logistic model. With C_woe, the SAS logistic model statement is:

 

PROC LOGISTIC descending;
   MODEL Y = C_woe <other predictors>;

 

This will be called the "WOE model".

 

In the simple case where the “WOE” model and “CLASS” model have no <other predictors>, these models are actually the same model. That is, they give the same probabilities. However, the CLASS and WOE models are not the same if <other predictors> are included.
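This equivalence can be checked numerically with the Table 1 counts. A known property of WOE coding (assumed here, not derived) is that the one-predictor WOE model has MLE intercept log(G/B) and slope 1, so its fitted probability at each level equals the observed event rate G_j/(G_j + B_j), which is exactly what the saturated CLASS model fits:

```python
import math

# Counts from Table 1
B = {"c1": 2, "c2": 1, "c3": 5}   # Y = 0 counts
G = {"c1": 1, "c2": 1, "c3": 6}   # Y = 1 counts
Bsum, Gsum = sum(B.values()), sum(G.values())

woe = {c: math.log((G[c] / Gsum) / (B[c] / Bsum)) for c in B}

def p_woe(c):
    # One-predictor WOE model at its MLE: intercept log(G/B), slope 1
    xb = math.log(Gsum / Bsum) + 1.0 * woe[c]
    return 1.0 / (1.0 + math.exp(-xb))

def p_class(c):
    # The saturated CLASS model fits the observed event rate per level
    return G[c] / (G[c] + B[c])
```

For every level c1, c2, c3 the two functions agree, confirming that with no <other predictors> the WOE and CLASS models give the same probabilities.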

 

But the model with WOE predictors as well as <other predictors> is “nested” within the model with the corresponding CLASS predictors and the same <other predictors>. The term “nested” is used here in a non-standard manner. Here is the usage:

 

For the CLASS model there exist values for the coefficients that give the same probabilities as the probabilities from the maximum likelihood estimator (MLE) solution for the WOE model. These coefficients for the CLASS model are not its MLE’s.

 

WHY IS WEIGHT OF EVIDENCE CONSIDERED?

The WOE coding of a predictor C gives the modeler more control over how this predictor impacts the predictions of the logistic model. See the paper.

 

PROBLEMS WITH WOE FOR PREDICTOR SELECTION

Let C have L > 2 levels. A disadvantage of WOE coding is that the degrees of freedom for C_woe, when being entered into a logistic model, are unknown. Recall that the coding of C_woe makes heavy usage of information about the target variable Y. There should be a degrees of freedom penalty for this usage.

 

This d.f. assignment is important when considering the use of p-values in predictor selection methods (e.g. stepwise p-value based) and also for predictor selection methods based on minimum BIC and AIC (as are provided by PROC HPLOGISTIC).

 

None of the SAS procedures allow for d.f. adjustment for WOE predictors. Of course, more fundamentally, it is unclear how to make such an adjustment.

 

I have not seen a discussion of how to assign degrees of freedom for WOE predictors in modeling applications. I assume that, in practice, C_woe is simply regarded as having 1 d.f. in conformance with its usage in PROC LOGISTIC and PROC HPLOGISTIC. In situations with many predictors, is the use of the 1 d.f. assignment a reasonable simplifying assumption? Research on the question is needed. Some insights are given in discussions which follow.

 

THE MODEL COMPARISON TEST - A REVIEW

The model comparison test requires truly nested models where each model has degrees of freedom equal to its number of parameters. The test statistic T is the difference of the “‑2*Log(L)” from the models:

 

T = (-2*Log(L))_restricted - (-2*Log(L))_full

 

For large samples the distribution of T is a chi-square with degrees of freedom given by

d.f._full - d.f._restricted.

 

Let t be the value of T from a sample and specify α in 0 < α < 1 (e.g. α = 0.05). If P(T ≥ t) > α, then the restricted and full models are deemed statistically equal.
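A small numeric illustration of the test (the -2*Log(L) values below are hypothetical; 2 d.f. is chosen because a chi-square with 2 d.f. has the closed form P(T > t) = exp(-t/2), so no statistics library is needed):

```python
import math

# Hypothetical fit statistics for nested models differing by 2 d.f.
m2ll_restricted = 1402.6   # -2*Log(L) of the restricted model (assumed)
m2ll_full = 1395.2         # -2*Log(L) of the full model (assumed)

t = m2ll_restricted - m2ll_full    # test statistic T = 7.4
p_value = math.exp(-t / 2.0)       # P(T > t) for 2 d.f., about 0.025
models_equal = p_value > 0.05      # False: the models are not deemed equal
```

Here p_value falls below α = 0.05, so the full model is retained.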

 

D.F. FOR C_WOE IN SELECTION=FORWARD

In this section an algorithm is presented for adjusting the degrees of freedom for a weight of evidence predictor as it enters a model by forward selection. This adjustment enables the selection of a predictor for entry that gives the minimum AIC (BIC) at each forward step. The algorithm utilizes the ideas from the model comparison test.

 

For this purpose a new “model comparison test” is proposed. Admittedly, this new test does abuse the standard model comparison test. Here, the WOE and CLASS models (having the same predictors already in the models) will be regarded as nested and the test statistic T is treated as a chi-square. Let C have L levels:

 

T = (-2*Log(L))_woe - (-2*Log(L))_class

          with d.f._T = d.f._class - d.f._woe.

 

While d.f._class is known, the value of d.f._T is not, due to the uncertain d.f. for C_woe. If, by some means, d.f._T could be assigned, then d.f._woe would be known. The next definition assigns this value of d.f._T. First, let t be the sample value of T.

 

Definition: Given α (0 < α < 1), d.f._T is declared to be the d.f. value such that

 

P(T > t | d.f._T) = α, with fractional values of d.f._T being allowed.[1]

 

These are the d.f. that just barely make the WOE and CLASS models statistically equal (for the given α).

 

This leads to the final formula for the degrees of freedom for C_woe when entering a model with other predictors already having been selected:

d.f. of C_woe = (L-1) - d.f._T
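The definition amounts to root-finding on the chi-square survival function in its d.f. argument. A sketch (in Python; the paper's footnote uses SAS's cdf('CHISQ', ...) instead, so the series-based CDF below is just a stdlib stand-in that also accepts fractional d.f.):

```python
import math

def chi2_cdf(t, k):
    """P(T <= t) for a chi-square with k d.f. (k may be fractional),
    via the regularized lower incomplete gamma series."""
    a, x = k / 2.0, t / 2.0
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def df_for_alpha(t, alpha, df_max):
    """Find d.f. k with P(T > t | k) = alpha by bisection.
    P(T > t | k) increases with k, so the root (if it lies in
    (0, df_max]) is unique."""
    lo, hi = 1e-9, df_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1.0 - chi2_cdf(t, mid) < alpha:
            lo = mid     # survival probability too small: need more d.f.
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: t = 3.8415 with alpha = 0.05 recovers k of about 1,
# the familiar 1 d.f. critical value. In the paper's setting,
# d.f. of C_woe = (L-1) - d.f._T would then follow.
df_T = df_for_alpha(3.8415, 0.05, df_max=10.0)
```

For a WOE predictor with L levels, df_T computed from the observed t between the WOE and CLASS models would then give the adjusted d.f. (L-1) - df_T.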

 

-----

[1] The SAS function cdf('CHISQ', t, k) gives the cumulative chi-square probability P(T < t) for k d.f., for any k > 0.

 

FSAA: FORWARD SELECTION WITH ADJUSTED AIC

Here the algorithm is specified step by step.

 

AN EXAMPLE OF FSAA

An example is given.

 

SAS MACRO AVAILABLE

The author has a SAS macro, with examples and documentation, for the FSAA process, available upon request. Documentation and the macro SAS code are also attached to this posting.

 

CONCLUSION

This paper proposes a solution to the “degrees of freedom problem” for weight of evidence coded predictors when fitting a logistic model using forward selection where the selected predictor, at each step, minimizes AIC among the candidates for entry.

 

References

Allison, P.D. (2012). Logistic Regression Using SAS: Theory and Application, 2nd Ed., Cary, NC: SAS Institute Inc.

Finlay, S. (2010). Credit Scoring, Response Modelling and Insurance Rating, London, UK: Palgrave Macmillan.

Hosmer, D., Lemeshow, S., Sturdivant, R. (2013). Applied Logistic Regression, 3rd Ed., Hoboken, NJ: John Wiley & Sons.

Lund, B. (2017). SAS® Macros for Binning Predictors with a Binary Target, Proceedings of the SAS Global Forum 2017.

Siddiqi, N. (2017). Intelligent Credit Scoring, 2nd Ed., Hoboken, NJ: John Wiley & Sons.

Thomas, L. (2009). Consumer Credit Models, Oxford, UK: Oxford University Press.

Comments

My macro is not suitable if your model is too big or if you want to fit your model using procedures not based on FORWARD with AIC. In this case, follow these steps:

  1. Use classification variables instead of their weight of evidence versions and fit a model. This is the CLASS model.
  2. After final predictors are selected, then refit with weight of evidence versions of these CLASS predictors. This is the WOE model.
  3. Compare performance of the two models, ideally, on a validation sample. If WOE performance is similar to CLASS, then good. Use the WOE model.
  4. If WOE model is distinctly inferior, then remove the WOE variable with smallest Wald chi-square statistic value and replace with its classification version and refit. Now compare CLASS model to this new model.

  5. Iterate this procedure until the CLASS model and the modified WOE model are comparable.

