Quasi-complete separation in logistic regression

Posted 08-09-2012 08:35 AM
(7383 views)

Hello there,

I am running a logistic regression with my original data and it works fine. But I have also grouped the variables into classes, for example:

age1: 18-25 years old

age2: 25-30 years old

age3: 30+ years old

Now I want to run the model with the grouped variables, but when I do I receive the message about quasi-complete separation. The problem is that I don't have a clue which variable(s) are causing it. How should I proceed to fix it?

Thanks!

11 Replies


Are you using a CLASS statement with a single age variable that has three levels, or did you create three binary age indicator variables and include them in the model somehow? If the latter is the case, you may have problems doing what I am going to ask next.

If there are three levels of a single variable and the response is binary, it is easy to run PROC FREQ with something like the following:

proc freq;

tables agecat*response;

run;

Take a look at the table and see if all of the observations fall into either the 1 or 0 response category for any of the agecat levels. If so, that is quasi-separation--for one (or more) of the age groups, all of the responses are identical.

How to solve: collapse the categories so that the group in which all of the responses are identical is merged with another group. Make sure that it makes good sense to combine the categories (i.e., don't combine the highest with the lowest when they are ordered).
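One way to collapse categories without rewriting the data is a user-defined format. The following is only a minimal sketch: the dataset name `mydata`, the numeric variable `age`, the binary variable `response`, and the format name `agegrp` are all hypothetical, and it assumes the two youngest groups from the original question are the ones that make sense to merge.

```sas
/* Hypothetical names: mydata, age, response, agegrp. */
proc format;
  value agegrp low -< 30 = 'under 30'   /* merges the 18-25 and 25-30 groups */
               30 - high = '30+';
run;

/* Re-check the two-way table using the collapsed grouping. */
proc freq data=mydata;
  tables age*response;
  format age agegrp.;
run;
```

If every row of the new table has observations in both response columns, this variable no longer separates the response on its own.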

Good luck.

Steve Denham


Hello Steve, thank you for your answer.

I created the variables like this:

proc format;

value AVG_DELAY_L24Mgroup low-0.5 = 'delay1'

0.5-high = 'delay2';

run;

proc freq data=score2;

tables AVG_DELAY_L24M*bad_client;

format AVG_DELAY_L24M AVG_DELAY_L24Mgroup.;

run;

then I receive the table:

| Category | Good | Bad |
|---|---|---|
| delay1 | 437 | 376 |
| delay2 | 40 | 239 |

another variable:

proc format;

value DUR_CUST_RELgroup low-500 = 'dur_cust_rel1'

500 - 1000 = 'dur_cust_rel2'

1000 - 3000 = 'dur_cust_rel3'

3000 - 5000 = 'dur_cust_rel4'

5000 - high = 'dur_cust_rel5';

run;

table:

| Category | Good | Bad |
|---|---|---|
| dur_cust_rel1 | 61 | 66 |
| dur_cust_rel2 | 51 | 108 |
| dur_cust_rel3 | 142 | 205 |
| dur_cust_rel4 | 118 | 143 |
| dur_cust_rel5 | 105 | 93 |

For the model, I first wrote the new variables to the dataset:

data test;

set score;

retain AVG_DELAY_L24M_cat DUR_CUST_REL_cat;

AVG_DELAY_L24M_cat = put(AVG_DELAY_L24M, AVG_DELAY_L24Mgroup.);

DUR_CUST_REL_cat = put(DUR_CUST_REL, DUR_CUST_RELgroup.);

run;

and then fit the model:

proc logistic data=test descending;

class AVG_DELAY_L24M_cat DUR_CUST_REL_cat;

model bad_client = AVG_DELAY_L24M_cat DUR_CUST_REL_cat;

run;

For all the other variables the frequency table looks just like these two; I don't have any class without observations. I created the classes based on WOE and chose the variables for the model using the IV. I did look at the frequencies, but I didn't see anything suspicious.

Could it be something else?

Thank you.


While it might be something else, you do have both variables in the model, and even without the interaction term there may be a combination of levels that is empty, so I would check that as well. Things may be complete for each variable alone, but together you may be missing something.

It's Friday, and my brain might be fried, but this is the only thing I can think of happening to give quasi-separation right now.

Steve Denham


Just a comment: You can save yourself a DATA step if you use the formats in PROC LOGISTIC like you did in PROC FREQ. There is no need to explicitly construct the character variables by using the PUT function.
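As a rough sketch of that suggestion, the model could be fit on the original numeric variables with the formats attached inside the procedure. This assumes the two formats defined earlier in the thread are available in the session and that `score` (the input to the DATA step above) is the dataset to model; CLASS variables are grouped by their formatted values.

```sas
/* Assumes formats AVG_DELAY_L24Mgroup. and DUR_CUST_RELgroup. already exist */
/* and that score holds the original numeric variables.                      */
proc logistic data=score descending;
  class AVG_DELAY_L24M DUR_CUST_REL;          /* levels come from the formatted values */
  model bad_client = AVG_DELAY_L24M DUR_CUST_REL;
  format AVG_DELAY_L24M AVG_DELAY_L24Mgroup.
         DUR_CUST_REL   DUR_CUST_RELgroup.;
run;
```

Changing the grouping then only requires editing the PROC FORMAT step, with no derived dataset to rebuild.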

This is an instance where dynamically linked graphics are helpful. If you have access to JMP or SAS/IML Studio, load the three variables into the product. Create a bar chart of BAD_CLIENT and a mosaic plot of the other two variables. Click on the "Bad" group of clients. Those same observations will be highlighted in the mosaic plot, which will enable you to see which joint levels of classification variables are responsible for the quasi-separation.


Separation occurs if there is a zero count for one of the response levels in any population, where a population is defined as a unique combination of levels of the predictors. So, you would need to do this to check for zero counts:

proc freq;

tables AVG_DELAY_L24M_cat * DUR_CUST_REL_cat * bad_client;

run;
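A possible variation on this check, sketched under the assumption that the categorized variables live in the `test` dataset built earlier in the thread: the SPARSE option writes every possible combination of levels to the OUT= dataset, including those that never occur, so the empty cells can be listed directly. The dataset name `cells` is hypothetical.

```sas
/* Write all level combinations, including empty ones, to a dataset. */
proc freq data=test noprint;
  tables AVG_DELAY_L24M_cat*DUR_CUST_REL_cat*bad_client / sparse out=cells;
run;

/* List only the combinations with no observations. */
proc print data=cells;
  where count = 0;
run;
```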

And then combine populations as needed to remove the zeros. Alternatively, you may find that just adding the FIRTH option provides an adequate analysis based on a penalized likelihood method. See this note on the separation issue:


Hey StatDave@sas, thank you for your suggestions. Unfortunately, none of them helped. I tried everything: FIRTH, forward and stepwise selection, exact logistic regression, inspecting the estimates and standard deviations, and so on. There is no zero count in any of my categories.

I then built the model by adding the variables one by one, and now I know which variables are causing the problem (15 variables), but I don't understand why. I can't just leave all these variables out of my model without a justification.

Can you imagine another solution?

Thanks


In all combinations of the 15 variables, there is not a single combination with zero count? And not just at the two-way level, but all the way up to the fifteen-way level (assuming that the variables are all categorical)? At some point, you are going to have to combine populations/levels in a critical variable to get rid of the zero count cell. At least that way, you will still have all variables accounted for.

Steve Denham


Hey Steve,

Well, I didn't look at all combinations, because when I try to create a table with all the variables SAS tells me "ERROR: The requested table is too large to process." Maybe I don't know how to ask SAS for this. How should I do it?


Try:

proc freq;

/* NOPRINT on the PROC statement also disables ODS OUTPUT, so write the counts with OUT= instead */

tables a*b*c*d*e*f*g*h*i*j*k*l*m*n*o / noprint out=list;

run;

where a through o are the variable names. This would output the results to a dataset ('list') that you can sort on. If this is what you tried, then move to PROC MEANS or SUMMARY. I know the syntax better for MEANS, so it is what follows.

proc means noprint nway;

class a b c d e f g h i j k l m n o;

var (response variable goes in here);

output out=datasetname n=/autoname;

run;

Then search datasetname for zeroes.
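A more explicit version of that search might look like the sketch below. It rests on several assumptions that are not stated in the thread: the response `bad_client` is stored as numeric 0/1, the data are in `test`, and `a b c` stand in for the real predictor names. The output names `cells`, `suspect`, `n_obs`, and `n_bad` are hypothetical. Because NWAY only returns combinations that actually occur, the sketch flags cells where every response is the same rather than looking for a literal zero row.

```sas
/* One row per observed combination of predictor levels. */
proc means data=test noprint nway;
  class a b c;                          /* placeholder predictor names */
  var bad_client;                       /* assumed numeric, coded 0/1 */
  output out=cells n=n_obs sum=n_bad;
run;

/* Keep the cells in which all responses are 0 or all are 1. */
data suspect;
  set cells;
  if n_bad = 0 or n_bad = n_obs;
run;

proc print data=suspect;
run;
```

Any row in `suspect` is a combination of levels with only one kind of response, which makes it a candidate for merging with a neighboring level.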

Hope this made sense.

Steve Denham


Hey Steve!

I found it!!! And now, everything is working very well.

Many thanks.


Hi,

Suppose that we are building a logit model using a few independent variables. The event variable can be dichotomous or multinomial.

Sometimes SAS reports complete separation or quasi-complete separation of the data points.

Complete separation means that some predictor, or some linear combination of predictors, perfectly predicts the response: every observation on one side of a cutoff has one outcome and every observation on the other side has the other outcome. Quasi-complete separation is the same situation except that observations lying exactly on the cutoff show both outcomes.

**Example 1**

| Event | Age |
|---|---|
| 1 | 25 |
| 1 | 45 |
| 1 | 30 |
| 0 | 18 |
| 0 | 24 |
| 0 | 9 |
| 0 | 21 |
| 1 | 40 |
| 0 | 16 |
| 1 | 60 |

This shows complete separation because Event = 1 exactly when Age > 24.
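To see how PROC LOGISTIC reacts to data like this, Example 1 can be entered and fit directly. This is only a small sketch; the dataset name `example1` is made up for illustration.

```sas
/* Example 1 data: Event is 1 exactly when Age > 24. */
data example1;
  input event age;
  datalines;
1 25
1 45
1 30
0 18
0 24
0 9
0 21
1 40
0 16
1 60
;
run;

/* The fit should be flagged as separated; the estimate for age is not reliable. */
proc logistic data=example1 descending;
  model event = age;
run;
```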

**Example 2**

| Event | Age | Height |
|---|---|---|
| 0 | 16 | 160 |
| 0 | 18 | 130 |
| 0 | 24 | 178 |
| 0 | 21 | 145 |
| 0 | 45 | 120 |
| 1 | 38 | 160 |
| 1 | 55 | 150 |
| 1 | 30 | 169 |
| 1 | 40 | 170 |
| 1 | 25 | 170 |

Here neither variable separates the outcomes on its own, but the two together do: Event = 1 exactly when Age > 24 and Height > 149.

To remove complete or quasi-complete separation: 1) find the variables that are responsible for it (check the collinearity matrix and the classification matrix between the dependent variable and the independent variables), then 2) remove those variables or regroup their levels so that the separation no longer exists.

Hopefully this will help you.

Thanks & Regards

Ambrish
