Solved: Re: A question about quasi-complete separation

LiHZ · Posted 03-02-2021 05:09 AM

I encountered quasi-complete separation when fitting a logistic regression.

I have read the definition of quasi-complete separation and the suggested solutions, but two points still confuse me.

1: One of the suggested solutions is the FIRTH method. However, when I tried it I got "ERROR: Selection methods are not currently available with the FIRTH option".

2: I can't find out which independent variable, or which combination of independent variables, caused the separation.

Here is the program:

proc logistic data=maanshan320;
class season(ref='summer') treatment result season parity;
model result(event='pregnant')=season | treatment | parity | BCS / firth
selection=B sls=0.11 lackfit orpvalue ;
oddsratio treatment ;
run;

Hope you can help me.

SteveDenham · Posted 03-02-2021 02:55 PM

While I like @StatDave 's response to look at HPGENSELECT, I would suggest a couple of things before you start doing variable selection. Season, treatment, parity and body condition score (BCS) seem to me to be 3 design factors and a continuous covariate, and that covariate (BCS) is well known to have an effect on pregnancy rate in mammals. So in truth you have just four variables, with possible interaction, and no real need to employ variable selection. Try the following MODEL statement:

model result(event='pregnant')=season*treatment*parity BCS BCS*season BCS*treatment BCS*parity;

This fits a fully saturated model for the design factors, with possible different slopes for the BCS relationship. Work through this to eliminate the interaction terms where the slopes do not differ. Once you have stabilized your selection of appropriate slope terms, you could then fit an effects model, with the relevant covariate/covariate by effect interaction terms in the model. This approach is covered in Milliken and Johnson's Analysis of Messy Data, vol.3: Analysis of Covariance, or in SAS for Mixed Models (any of the editions 1 to 3) in the chapter on analysis of covariance.
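As a rough sketch of that workflow (not a tested program for this data): the first step fits the unequal-slopes model, and the second refits after dropping slope terms. Which slope terms get dropped is purely hypothetical here, and the CLASS list and PARAM=GLM are assumptions. PARAM=GLM is used so that the three-way term on its own spans every cell of the design; with the default effect coding, I would expect the lower-order terms to be needed as well.

```sas
/* Step 1: one cell per season x treatment x parity combination,
   plus separate BCS slopes for each design factor */
proc logistic data=maanshan320;
  class season(ref='summer') treatment parity / param=glm;  /* assumed CLASS list */
  model result(event='pregnant') = season*treatment*parity
        BCS BCS*season BCS*treatment BCS*parity;
run;

/* Step 2 (hypothetical outcome): suppose only the BCS*treatment
   slopes differed, so the other slope terms are dropped */
proc logistic data=maanshan320;
  class season(ref='summer') treatment parity / param=glm;
  model result(event='pregnant') = season*treatment*parity
        BCS BCS*treatment;
run;
```

If the separation warning still appears at step 1, the crosstabulation and covariate checks are the place to look before simplifying further.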

Also, look at the following crosstabulation:

PROC FREQ data=maanshan320;
tables parity*season*treatment*result;
run;

That should give 8 tables (one for each parity-by-season combination) that are Nx2, where N is the number of treatments and 2 is the number of levels for result. From those 8 tables, you should readily be able to identify where the separation is occurring, if anywhere, for the design factors: look for cells with a zero count. Also, you may want to look at the results of PROC GLM, with BCS as the dependent variable, and the design factors crossed with the result variable as the independent variables. In the LSMEANS statement, see how BCS separates as a result of the factors.
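One possible reading of the PROC GLM suggestion is sketched below. Using a single four-way cell term is an assumption; a full factorial with the bar operator would be another reasonable reading. Cells with no observations will show non-estimable least squares means.

```sas
/* BCS as the response; design factors crossed with the outcome */
proc glm data=maanshan320;
  class season treatment parity result;
  model BCS = season*treatment*parity*result;
  /* Compare mean BCS between the two result levels within each cell */
  lsmeans season*treatment*parity*result;
run;
quit;
```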

I think the root cause of the separation issue is the inclusion of high-order interactions with BCS. For some combination or combinations of the design factors and the response, there are likely to be full separations of the covariate. Additionally, fourth order interactions do a great job of modeling noise, especially when one is a continuous variable, which brings us back to @StatDave 's comment regarding fitting the data perfectly. So think carefully about the biological question at hand (which looks like it might be related to feeding dairy cows and seeing what the resulting pregnancy rate is) and formulate a model that addresses those questions.
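A quick way to check for separation in the covariate, offered as a sketch rather than a definitive diagnostic: list the range of BCS for each result level within each design cell. If, inside a cell, the BCS ranges of the two result levels do not overlap (or meet only at a single value), BCS separates the outcome there.

```sas
/* NWAY keeps only the full season x treatment x parity x result breakdown */
proc means data=maanshan320 n min max nway;
  class season treatment parity result;
  var BCS;
run;
```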

SteveDenham


Rick_SAS · Posted 03-02-2021 06:25 AM

I notice that you have SEASON listed twice in the CLASS statement, once with a REF= option and again without the option. I don't know what that does, but I suggest deleting the second SEASON. Also, delete the RESULT variable on the CLASS statement if you are not using it. These issues might be affecting the computation.
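For illustration, the original step with only those two changes made would look roughly like this. FIRTH is left out here only because of the error reported when it is combined with SELECTION=; everything else is taken from the posted program.

```sas
proc logistic data=maanshan320;
  /* SEASON listed once; RESULT is the response, so it is not in CLASS */
  class season(ref='summer') treatment parity;
  model result(event='pregnant') = season | treatment | parity | BCS
        / selection=backward sls=0.11 lackfit orpvalue;
  oddsratio treatment;
run;
```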

Since you have so many categorical variables, you might need to collapse some of the levels to prevent quasi-separation. Lu (2016) shows how to use Greenacre's method to choose which levels to combine, although I confess I do not know how to apply it to the case of backward selection of variables. Do you have any information about which model was being evaluated when the procedure stopped processing?

Ksharp · Posted 03-02-2021 06:48 AM

It seems that Greenacre's method is an unsupervised method (it does not take Y into account; it is a cluster analysis).

Why not use the ScoreCard method to bin the levels of these CLASS variables, e.g. so that the IV or chi-square value is maximized?

The attachment is the code I use for ScoreCard.

LiHZ · Posted 03-02-2021 07:35 AM

Thank you for your reply.
I'm new to SAS, so your suggestion is too academic for me to understand.
It seems that using the ScoreCard method to categorize the CLASS variables is a good way.
I'll try to use it.

Ksharp · Posted 03-02-2021 08:07 AM

The attachment is a test data set. You can use it to test my code.

After running it, open the table GROUP; that is what you need.

LiHZ · Posted 03-02-2021 07:30 AM

Thank you for your reply.
Yes, I repeated SEASON by accident; I have deleted the second one.
RESULT is the response variable used in the MODEL statement; the computation seems no different when I delete RESULT from the CLASS statement.
The procedure runs, but only with these warnings: "WARNING: There is possibly a quasicomplete separation of data points in step 0. The maximum likelihood estimate may not exist. WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable."

StatDave · Posted 03-02-2021 10:44 AM

The separation problem is precisely why I advise against using backward selection. Unlike in ordinary regression, a logistic model that fits perfectly, or nearly perfectly, has some parameters that are infinite as a result of separation, as discussed in this note. Obviously, the model that begins the backward selection is a model with all of the possible effects included, and this is the model that is most likely to fit perfectly or nearly perfectly and will therefore result in the separation problem. So, instead of backward selection, use stepwise selection. Note that even with this method, it might eventually build up to a model that has separation. It might be that the model that is successfully fit at the step prior to the step causing separation is a good model for your purposes. If not, you can try fitting the model at the step with separation by fitting it in a separate PROC LOGISTIC run using the FIRTH option and without the SELECTION= option. Another thing to consider is a more modern selection method - Lasso - which is available in PROC HPGENSELECT.
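A minimal sketch of those three options, under stated assumptions: the entry and stay levels simply reuse the 0.11 from the original program, the effect list in the FIRTH refit is a placeholder rather than a model selected from this data, and the HPGENSELECT step assumes a release in which METHOD=LASSO is available.

```sas
/* 1. Stepwise instead of backward selection (FIRTH cannot be combined with SELECTION=) */
proc logistic data=maanshan320;
  class season(ref='summer') treatment parity;
  model result(event='pregnant') = season | treatment | parity | BCS
        / selection=stepwise slentry=0.11 slstay=0.11;
run;

/* 2. Refit one chosen model with Firth's penalized likelihood, no SELECTION=.
      The effects listed here are hypothetical placeholders. */
proc logistic data=maanshan320;
  class season(ref='summer') treatment parity;
  model result(event='pregnant') = season treatment parity BCS / firth;
run;

/* 3. Lasso selection as an alternative */
proc hpgenselect data=maanshan320;
  class season treatment parity;
  model result(event='pregnant') = season | treatment | parity | BCS
        / distribution=binary;
  selection method=lasso;
run;
```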

LiHZ · Posted 03-02-2021 10:51 PM

Thank you for your reply.
Your suggestions and @SteveDenham's were very professional and useful.

LiHZ · Posted 03-02-2021 10:56 PM

Thank you for your reply.
Your suggestions and @StatDave's were professional and useful.
It will take me some time to digest your comment, but I think yours is a good solution.
