Logistic Regression: Quasi-Complete Separation and Smoothed Weight of Evidence Coding

Do you frequently use binary logistic regression? If you do, you may have encountered a problem called quasi-complete separation. This problem is often associated with categorical variables with many levels and it can cause the parameter estimates and p-values for your model to be untrustworthy. In this post, I’ll explain the problem and my favorite approach to dealing with it: converting the categorical predictor to a continuous predictor using smoothed weight of evidence coding.

So, what is quasi-complete separation? Before describing it, it may be helpful to understand complete separation in logistic regression. For this example, I’m using the AmesHousing3 data set from the SAS course Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression (these data are a modified subset of the data from De Cock 2011). In these data, each observation is a house that was sold in Ames, Iowa between 2006 and 2010. The predictor variables are descriptors of the houses, such as the basement area in square feet (Basement_Area) and the number of bedrooms above ground (Bedroom_AbvGr). The response variable Bonus is binary: the value 1 indicates that the realtors selling the house were eligible to receive a bonus for the sale, and 0 indicates that they were not.

Using PROC LOGISTIC to fit the model for Bonus with the predictors Basement_Area and SalePrice results in the following warnings in the log and convergence status message in the results:

proc logistic data=STAT1.AmesHousing3;
model Bonus(event='1')= Basement_Area SalePrice;
run;

WARNING: There is a complete separation of data points. The maximum likelihood estimate does not exist.

WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

te_1_taelna-blog1.05.png

This is accompanied by unusual p-values for the overall model. Assuming alpha=0.05, the likelihood ratio p-value below says the model is highly significant (p<0.0001), but the Wald chi-squared p-value (p=0.0504) is at least 500 times larger and falls just short of significance.

te_2_taelna-blog1.03.png

Have you ever seen this complete separation problem come up in your own logistic regression analyses? Possibly not, because this is actually a rare problem. The related problem “quasi-complete separation” is much more common. With complete separation, SAS is telling us that the maximum likelihood estimate does not exist. What caused this?

Well, I caused it. I was able to create the complete separation problem by fitting a binary logistic regression model with a “perfect” predictor. In the AmesHousing3 data set, Bonus=1 when the price of a house (SalePrice) is greater than $175,000; otherwise, Bonus=0. So, SalePrice completely separates the observations into the two possible outcomes for the response: bonus eligible and bonus ineligible. Complete separation means that there is some linear combination of the explanatory variables that perfectly predicts the dependent variable. When this happens, the maximum likelihood algorithm does not converge and the results of the model cannot be trusted.
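One quick way to see this kind of separation before fitting a model is to compare the range of the predictor within each response group. The following is only a sketch, not code from the original analysis; it assumes the same STAT1 library and variable names used in the PROC LOGISTIC step above.

```sas
/* Sketch: compare the SalePrice range within each Bonus group.       */
/* If the maximum for Bonus=0 is below the minimum for Bonus=1, then  */
/* SalePrice alone completely separates the two outcomes.             */
proc means data=STAT1.AmesHousing3 min max maxdec=0;
   class Bonus;
   var SalePrice;
run;
```

Given the $175,000 rule described above, the two ranges should not overlap.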

Quasi-complete separation

Quasi-complete separation is very similar to complete separation. When a continuous predictor nearly completely separates the binary response into different categories, it can produce quasi-complete separation. But the more common cause is a categorical predictor with one or more levels in which every observation is an event or every observation is a non-event. Let’s see an example of this.

Now I’m changing my logistic regression model to have Bedroom_AbvGr (bedrooms above ground) as the sole predictor. This categorical variable has 5 levels, corresponding to 0–4 bedrooms.

proc logistic data=STAT1.ameshousing3;
class Bedroom_AbvGr/param=ref ref=first;
model Bonus(event='1')= Bedroom_AbvGr;
run;

Here is the warning from the log and the model status message in the results:

WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood estimate may not exist.

WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

te_3_taelna-blog1.05.png

Partial output:

te_4_taelna-blog1.06.png

Bedroom_AbvGr does not appear to be a significant predictor, but given the quasi-complete separation warning, the parameter estimates, odds ratios, and p-values can’t be trusted. Even before running the model, we could have anticipated a problem by looking at a cross-tabulation of Bonus and Bedroom_AbvGr:

te_5_taelna-blog1.05bedrooms_table.png
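A cross-tabulation like this can be produced with PROC FREQ. The step below is a minimal sketch and not necessarily the code used to create the table shown; the options that suppress percentages are my own choice to keep the output to raw counts.

```sas
/* Sketch: count events and non-events in each level of Bedroom_AbvGr. */
/* Any cell with a count of 0 signals possible quasi-complete          */
/* separation for that level.                                          */
proc freq data=STAT1.AmesHousing3;
   tables Bedroom_AbvGr*Bonus / norow nocol nopercent;
run;
```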

Of the houses with 0 or 4 bedrooms above ground, none were bonus eligible. This is problematic for logistic regression because the logit for each level is calculated as the natural logarithm of (# bonus eligible houses/# bonus ineligible houses), and the natural log of zero is undefined. This can make the parameter estimates invalid for the levels with zero events (i.e., no bonus eligible houses). Even worse, if one of these levels is the reference level for the parameter estimates (the last level, 4, is the default, but I used the first level, 0), then the parameter estimates for all levels are affected. The same problem would occur if a level had no bonus ineligible houses, because calculating the logit would involve dividing by zero, which is also undefined.

Are the levels with zero events really perfect predictors? No. A small sample size in a level can result in zero events or non-events by chance. Perhaps with a larger sample of, say, 4-bedroom houses, some would be bonus eligible. This quasi-complete separation problem can come up with small sample sizes or when categorical predictors have many levels, since there are more chances to have zero events or non-events. How can we avoid this problem?

One way to remove the quasi-complete separation problem from your logistic regression is to convert the categorical predictor into a continuous predictor. The continuous predictor can be calculated using smoothed weight of evidence (SWOE) coding. SWOE is similar to averaging the log odds in a particular level with the log odds of the overall sample. Let’s see the formula for calculating smoothed weight of evidence.

If we replaced the levels of bedrooms with the natural log of the odds of being bonus eligible (i.e., the logit), we would be using “weight of evidence” (WOE) coding:

te_6_taelna-blog1.07.png
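In case the formula image does not display, WOE for level i can be written out from the description above. The subscripted names are my own notation: events_i is the number of bonus eligible houses in level i and nonevents_i is the number of bonus ineligible houses in that level.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{\text{events}_i}{\text{nonevents}_i}\right)
```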

WOE coding replaces each level of the categorical predictor with the logit, a continuous variable. But the logit is undefined when there are zero events or non-events. So we need to modify WOE to get rid of zeros in the numerator and denominator. One approach is to add a small number to both the numerator and the denominator (say 0.5) to remove any zeros. Another approach is to calculate smoothed weight of evidence (SWOE):

te_7_taelna-blog1.08-1.png

where ρ₁= the proportion of events in the entire sample and c is a smoothing parameter. Since ρ₁/(1-ρ₁) is the odds of the event for the whole sample, the equation for SWOE is like a weighted average of the logit for each level and the logit for the overall sample. Higher values of c reduce the variability of the logits for each level and make them more similar to the logit for the overall sample.
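In case the formula image does not display, the expression below is reconstructed from the DATA step code later in this post, so treat the notation (events_i and nonevents_i for the counts in level i) as my own rather than a copy of the image.

```latex
\mathrm{SWOE}_i = \ln\!\left(\frac{\text{events}_i + c\,\rho_1}{\text{nonevents}_i + c\,(1-\rho_1)}\right)
```

The two limiting cases show the weighted-average behavior. The first limit assumes both counts in level i are nonzero.

```latex
\lim_{c \to 0} \mathrm{SWOE}_i = \ln\!\left(\frac{\text{events}_i}{\text{nonevents}_i}\right),
\qquad
\lim_{c \to \infty} \mathrm{SWOE}_i = \ln\!\left(\frac{\rho_1}{1-\rho_1}\right)
```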

What’s a good choice for the smoothing parameter c? This is an empirical question that depends on your specific data set. Several values of c can be tested to see which produces the SWOE that has the strongest relationship with the response variable. Ideally, c could be found using a hold-out validation data set to reduce overfitting.

I tried a few values of the smoothing parameter (1, 5, 10, 15, 25) and found that c=1 produced the best SWOE. Bear in mind that AmesHousing3 is a small data set (n=300) and I did not use a hold-out validation data set to find c. Here’s the code I used to calculate SWOE and add it to the data. The PROC MEANS step creates a data set (Counts) that has the number of bonus eligible houses (events) and the total number of houses (_FREQ_) for each level of Bedroom_AbvGr. In the DATA step, I calculate SWOE (bedrooms_swoe) using ρ₁=0.176 and the smoothing parameter c=1. Note that 0.176 is 45/255, the ratio of bonus eligible to bonus ineligible houses; the proportion of events in the full sample of 300, which is how ρ₁ is defined above, would be 45/300=0.15.

proc means data=STAT1.AmesHousing3 sum nway;
class Bedroom_AbvGr;
var Bonus;
output out=counts sum=events;
run;

%let rho1= 0.176;   *45 bonus eligible/255 bonus ineligible houses, which is the sample odds (the sample proportion is 45/300 = 0.15);
%let c=1;

data counts;
set counts;
nonevents=_FREQ_-events;
bedrooms_swoe=log((events+&c*&rho1)/(nonevents+&c*(1-&rho1)));
run;

proc print data=counts noobs;
var Bedroom_AbvGr events nonevents bedrooms_swoe;
run;

The PROC PRINT output:

te_8_taelna-blog1.09.png
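The candidate values of c mentioned earlier (1, 5, 10, 15, 25) could be compared with a short loop. This is a hedged sketch rather than the code I actually ran: the data set names level_counts, swoe_grid, and ames_grid are made up for illustration, and like the analysis above it fits and compares on the same data with no hold-out set.

```sas
/* Sketch: SWOE for several candidate smoothing parameters.             */
/* level_counts, swoe_grid, and ames_grid are illustrative names.       */
proc means data=STAT1.AmesHousing3 noprint nway;
   class Bedroom_AbvGr;
   var Bonus;
   output out=level_counts sum=events;
run;

%let rho1 = 0.176;   /* same value as in the code above */

/* One row per level of Bedroom_AbvGr and candidate value of c */
data swoe_grid;
   set level_counts;
   nonevents = _FREQ_ - events;
   do c = 1, 5, 10, 15, 25;
      bedrooms_swoe = log((events + c*&rho1) / (nonevents + c*(1 - &rho1)));
      output;
   end;
   keep Bedroom_AbvGr c events nonevents bedrooms_swoe;
run;

/* Attach each candidate SWOE to the house-level data, sorted by c */
proc sql;
   create table ames_grid as
   select a.Bonus, g.c, g.bedrooms_swoe
   from STAT1.AmesHousing3 as a
        inner join swoe_grid as g
        on a.Bedroom_AbvGr = g.Bedroom_AbvGr
   order by g.c;
quit;

/* Fit one model per value of c; compare the fit statistics by group */
proc logistic data=ames_grid;
   by c;
   model Bonus(event='1') = bedrooms_swoe;
run;
```

Comparing the model fit statistics across the BY groups gives a rough ranking of the candidates. With a validation data set, the SWOE values would be computed on the training data and scored on the hold-out data instead.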

Then to add SWOE to the data and fit the model for Bonus with the continuous predictor, I used the following code:

data AmesHousing3;
set Stat1.AmesHousing3;
/* Assign each level the SWOE value from the PROC PRINT output */
if Bedroom_AbvGr=0 then bedrooms_swoe=-2.772588722;
else if Bedroom_AbvGr=1 then bedrooms_swoe=-1.393311934;
else if Bedroom_AbvGr=2 then bedrooms_swoe=-1.676328118;
else if Bedroom_AbvGr=3 then bedrooms_swoe=-1.747823684;
else if Bedroom_AbvGr=4 then bedrooms_swoe=-3.912023005;
run;

proc logistic data=AmesHousing3;
model Bonus(event='1')=bedrooms_swoe;
run;

Partial output:

te_9_taelna-blog1.10.png

te_10_taelna-blog1.11.png

The model converged and bedrooms_swoe has more reasonable parameter estimates, odds ratios, and p-values.
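Typing the SWOE values by hand works for five levels but would not scale to a predictor with many levels. One alternative is to join the Counts data set back onto the original data. This is a minimal sketch, assuming Counts still contains Bedroom_AbvGr and bedrooms_swoe as created above; the output name AmesHousing3_swoe is made up for illustration.

```sas
/* Sketch: attach the SWOE value for each level by joining on the level */
/* instead of hard-coding it. AmesHousing3_swoe is an illustrative name.*/
proc sql;
   create table AmesHousing3_swoe as
   select a.*, b.bedrooms_swoe
   from STAT1.AmesHousing3 as a
        left join counts as b
        on a.Bedroom_AbvGr = b.Bedroom_AbvGr;
quit;

proc logistic data=AmesHousing3_swoe;
   model Bonus(event='1') = bedrooms_swoe;
run;
```

The left join keeps every house even if its level were missing from Counts, in which case bedrooms_swoe would be missing and that observation would be dropped from the model fit.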

Not only does SWOE coding remove the quasi-complete separation problem, it also reduces the number of parameters being estimated in the model. In these models, 4 parameters are estimated for Bedroom_AbvGr but only 1 is estimated for bedrooms_swoe. What if, instead of a 5-level categorical predictor, we had a 1,000-level categorical predictor? SWOE coding would convert this to a continuous predictor and still require only 1 parameter estimate instead of 999.

Are you interested in learning more about SWOE and other methods for dealing with quasi-complete separation? Then consider taking the SAS class Predictive Modeling Using Logistic Regression. Not only will you learn about other approaches to dealing with the quasi-complete separation problem, but the class is also great preparation for attaining the SAS Statistical Business Analyst credential.

References

De Cock, D. 2011. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project.” Journal of Statistics Education 19(3).