I encountered quasi-complete separation when fitting a logistic regression.
I have read the definition of quasi-complete separation and the suggested solutions, but two points still confuse me.
1: One of the solutions is the FIRTH method. However, when I tried it, I got "ERROR: Selection methods are not currently available with the FIRTH option".
2: I can't find out which independent variable, or which combination of independent variables, caused the separation.
Here is the program:
proc logistic data=maanshan320;
class season(ref='summer') treatment result season parity;
model result(event='pregnant')=season | treatment | parity | BCS / firth
selection=B sls=0.11 lackfit orpvalue ;
oddsratio treatment ;
run;
Hope you can help me.
While I like @StatDave's suggestion to look at HPGENSELECT, I would suggest a couple of things before you start doing variable selection. Season, treatment, parity and body condition score (BCS) seem to me to be three design factors and a continuous covariate, and that covariate (BCS) is well known to have an effect on pregnancy rate in mammals. So in truth you have just four variables, with possible interactions, and no real need to employ variable selection. Try the following MODEL statement:
model result(event='pregnant')=season*treatment*parity BCS BCS*season BCS*treatment BCS*parity;
This fits a fully saturated model for the design factors, with possible different slopes for the BCS relationship. Work through this to eliminate the interaction terms where the slopes do not differ. Once you have stabilized your selection of appropriate slope terms, you could then fit an effects model, with the relevant covariate/covariate by effect interaction terms in the model. This approach is covered in Milliken and Johnson's Analysis of Messy Data, vol.3: Analysis of Covariance, or in SAS for Mixed Models (any of the editions 1 to 3) in the chapter on analysis of covariance.
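As a rough sketch of one step in that process: suppose, purely for illustration, that the Type 3 tests from the full fit suggested the BCS slope does not differ by treatment or parity. The reduced model might then look something like the following. Which terms actually drop out depends on your data, and the CLASS statement here is my assumption of a cleaned-up version of yours.

```sas
/* Hypothetical reduced model: a common BCS slope across treatment and parity,
   with separate slopes by season. The retained terms are illustrative only. */
proc logistic data=maanshan320;
   class season(ref='summer') treatment parity;
   model result(event='pregnant') = season*treatment*parity BCS BCS*season;
run;
```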
Also, look at the following crosstabulation:
PROC FREQ data=maanshan320; tables parity*season*treatment*result; run;
That should give 8 tables that are Nx2, where N is the number of treatments and 2 is the number of levels for result. From those 8 tables, you should readily be able to identify where the separation is occurring, if anywhere, for the design factors. Also, you may want to look at the results of PROC GLM, with BCS as the dependent variable, and the design factors crossed with the result variable as the independent variables. In the LSMEANS statement, see how BCS separates as a result of the factors.
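A minimal sketch of that PROC GLM run might look like this. I am assuming the variable names from your program; if some factor-by-result cells are empty, their means cannot be estimated, which is itself informative.

```sas
/* BCS as the response; design factors crossed with the pregnancy result */
proc glm data=maanshan320;
   class season treatment parity result;
   model BCS = season*treatment*parity*result;
   /* Cell means of BCS: look for cells where pregnant and non-pregnant
      animals have clearly different BCS ranges */
   lsmeans season*treatment*parity*result;
run;
quit;
```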
I think the root cause of the separation issue is the inclusion of high-order interactions with BCS. For some combination or combinations of the design factors and the response, there are likely to be full separations of the covariate. Additionally, fourth-order interactions do a great job of modeling noise, especially when one of the variables is continuous, which brings us back to @StatDave's comment regarding fitting the data perfectly. So think carefully about the biological question at hand (which looks like it might be related to feeding dairy cows and seeing what the resulting pregnancy rate is) and formulate a model that addresses that question.
SteveDenham
I notice that you have SEASON listed twice in the CLASS statement, once with a REF= option and again without the option. I don't know what that does, but I suggest deleting the second SEASON. Also, delete the RESULT variable on the CLASS statement if you are not using it. These issues might be affecting the computation.
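Something like the following CLASS statement is what I have in mind, assuming RESULT is only needed as the response in the MODEL statement:

```sas
/* SEASON listed once, with its reference level; RESULT removed */
class season(ref='summer') treatment parity;
```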
Since you have so many categorical variables, you might need to collapse some of the levels to prevent quasi-separation. Lu (2016) shows how to use Greenacre's method to choose which levels to combine, although I confess I do not know how to apply it to the case of backward selection of variables. Do you have any information about which model was being evaluated when the procedure stopped processing?
It seems that Greenacre's method is an unsupervised method: like cluster analysis, it does not take Y into account.
Why not use the scorecard method to group the levels of these CLASS variables, e.g. choose the grouping that maximizes the IV (information value) or the chi-square value?
The first attachment is the code I used for the scorecard.
The other attachment is a test data set. You can use it to test my code.
After running it, open the table GROUP; that is what you need.
The separation problem is precisely why I advise against using backward selection. Unlike in ordinary regression, a logistic model that fits perfectly, or nearly perfectly, has some parameters that are infinite as a result of separation, as discussed in this note. Backward selection begins with the model that includes all of the possible effects, and this is the model most likely to fit perfectly or nearly perfectly and therefore to run into the separation problem. So, instead of backward selection, use stepwise selection. Note that even this method might eventually build up to a model that has separation. It might be that the model successfully fit at the step just before the one causing separation is a good model for your purposes. If not, you can try fitting the model from the step with separation in a separate PROC LOGISTIC run, using the FIRTH option and without the SELECTION= option. Another thing to consider is a more modern selection method, the Lasso, which is available in PROC HPGENSELECT.
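A sketch of that two-step approach, reusing the data set and variable names from the original program. The effects in the second MODEL statement are placeholders for whatever the stepwise run selects, and SLENTRY= and SLSTAY= simply reuse your 0.11 threshold as an assumption.

```sas
/* Step 1: stepwise selection without FIRTH, since the error message
   shows the two cannot be combined in this release */
proc logistic data=maanshan320;
   class season(ref='summer') treatment parity;
   model result(event='pregnant') = season | treatment | parity | BCS
         / selection=stepwise slentry=0.11 slstay=0.11;
run;

/* Step 2: refit the model from the step of interest with FIRTH and
   no SELECTION= option. The effects listed are placeholders only. */
proc logistic data=maanshan320;
   class season(ref='summer') treatment parity;
   model result(event='pregnant') = season treatment parity BCS / firth;
   oddsratio treatment;
run;
```

If you want to try the Lasso instead, a starting point might be the following, assuming your SAS/STAT release supports METHOD=LASSO in PROC HPGENSELECT:

```sas
proc hpgenselect data=maanshan320;
   class season treatment parity;
   model result(event='pregnant') = season | treatment | parity | BCS
         / dist=binary;
   selection method=lasso;
run;
```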