NMB82
Obsidian | Level 7

I'm trying to run a logistic regression with a random intercept in PROC GLIMMIX. I'm predicting emergency room restraint use (yes/no) from various categorical predictors, with PatientID as the random intercept to account for patients having several admissions, which will be correlated. I initially went with "distribution=binary" because the outcome is binary, but this yielded poor results: an inflated ROC AUC (.98), very high residuals, underdispersion, etc. I then changed the distribution to "negative binomial" with "link=logit", and the results are much more reasonable.

I wonder whether this is the right distribution. The outcome is binary, but most patients have more than one hospital admission (the number of admissions is not fixed), and the zeros for the outcome (not restrained) are inflated, with only 8% having a restraint. So it's a rare event, with most patients not being restrained. Am I correct to use the negative binomial distribution? I've read that an assumption of that distribution is that the number of "trials" is fixed, but this is not the case for my data. An outcome-by-admissions plot is below to give a better sense of this. Does the negative binomial distribution with a logit link seem reasonable for this model?

[Image: Dist.jpg]

6 REPLIES
SteveDenham
Jade | Level 19

I don't know if you can use a Tweedie distribution in a GEE model in PROC GENMOD (heads up to @StatDave), but if so, that sounds like a possible way to go after this. The negative binomial and generalized Poisson are also good for overdispersed count data, and you have had good luck so far with the negative binomial. Either the Tweedie or the generalized Poisson might be an improvement over the negative binomial, but that is definitely data dependent.

SteveDenham

NMB82
Obsidian | Level 7

Thank you, Steve. I appreciate the response. I've learned a new distribution today! I'll look into that further... it looks like it's available in GENMOD, so I'll give it a try.

This is an additional model at the request of a reviewer, so I'm trying to anticipate comments they may have. Would my data justify a negative binomial distribution with a logit link? As I said, the outcome is binary (ED_RestraintFLG in the image), but it's measured over several hospital admissions, so it technically becomes a "count" over admissions and looks very much like a negative binomial distribution. I didn't initially use this distribution because I thought the outcome itself needed to be a count, not binary. I just want to be clear on this before I lock in any distribution other than "binary."

StatDave
SAS Super FREQ

It isn't clear whether you want to model the individual, repeated, binary responses from subjects or the total number of admissions by each subject. If the former, then you probably need to take into account the correlations among the repeated measures. This can be done with a GEE model (REPEATED statement in PROC GEE) or a random effects model (RANDOM statement in PROC GLIMMIX) using the binomial distribution. Note that the GEE method usually takes pretty good care of overdispersion if it exists. If it's the latter, then there are no repeated measures or correlations to account for, and you can model the count of admissions using the Poisson distribution, or the negative binomial distribution if there is evidence of overdispersion. I'll just note that there is a zero-inflated binomial model that can deal with an overabundance of zeros. It can be fit in PROC FMM as shown in this note, but it does not deal with repeated measures. Another model for overdispersed binary data is the binomial cluster model, which can also be fit in PROC FMM. See the discussion of overdispersion in this note.
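
As a rough illustration of the two approaches for repeated binary responses mentioned above, here is a minimal sketch rather than a definitive specification. PatientID and ED_RestraintFLG are the names used in this thread; the data set name admissions and the predictors triage_level and age_group are hypothetical placeholders, the outcome is assumed to be coded 0/1, and the exchangeable working correlation is just one possible choice.

```sas
/* Hypothetical names: data set ADMISSIONS, predictors TRIAGE_LEVEL, AGE_GROUP. */
/* Assumes ED_RestraintFLG is coded 0/1 with 1 = restrained.                    */

/* Random-intercept logistic model (subject-specific effects) */
proc glimmix data=admissions;
   class PatientID triage_level age_group;
   model ED_RestraintFLG(event='1') = triage_level age_group
         / dist=binary link=logit solution;
   random intercept / subject=PatientID;
run;

/* GEE logistic model (population-averaged effects) */
proc gee data=admissions;
   class PatientID triage_level age_group;
   model ED_RestraintFLG(event='1') = triage_level age_group
         / dist=binomial link=logit;
   repeated subject=PatientID / type=exch;   /* exchangeable working correlation */
run;
```

The two models answer slightly different questions: the GLIMMIX estimates are conditional on the patient-level intercept, while the GEE estimates are averaged over patients, so the coefficients are not expected to match exactly.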

NMB82
Obsidian | Level 7

Thanks! It's the former... I want to model the repeated binary outcomes while accounting for correlations between the outcomes within a patient who has several hospital admissions. I'm currently using PROC GLIMMIX with the binomial distribution, a logit link, and a RANDOM statement with subject=patientID. The results seem reasonable (estimates and model fit) except that the AUC seems inflated at .97, though I read this is common with rare event data (my outcome has a ~8% event rate). Additionally, the "Pearson Chi Sq/DF" is "0.40", which indicates under-dispersion. Most analysts' concerns seem to be with over-dispersion, so I wonder if the "0.40" result is ignorable. I re-ran this model with Dist=NegBinomial and link=logit, and the AUC and Pearson/DF numbers appeared more reasonable.

At this point, I assume that, based on my data, I should use "dist=binomial" only? I can state that there is no sign of over-dispersion, since "Pearson Chi Sq/DF" is not >1.0, and maybe explore a different measure of predictive ability... like AUC-Precision/Recall? Am I correct in this line of thought?

StatDave
SAS Super FREQ

It would be difficult to know if the AUC is "inflated"... perhaps the model simply fits the data quite well, which is ultimately what is important. It is said that the ROC curve might be problematic when the event probability is extreme, and the precision-recall curve is suggested in such cases. That curve could be obtained by saving the predicted probabilities from your GLIMMIX model and then using them in PROC LOGISTIC to obtain the classification table. You can then plot the PPV (precision) vs. the sensitivity (recall). For the GLIMMIX example in this note, these statements produce the curve.

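/* Overall event rate: the baseline precision of an uninformative classifier */
proc sql noprint;
   select sum(sideeffect)/sum(n) into :EventRate from glmmout;
   quit;
/* Refit on the saved predicted probabilities to get the classification table */
proc logistic data=glmmout plots(only)=roc;
   model sideeffect/n = predprob / ctable;
   ods output classification=ctable;
   run;
/* Keep cutpoints where PPV is defined; convert percentages to proportions */
data ctable;
   set ctable;
   if ppv ne .;
   ppv=ppv/100; sensitivity=sensitivity/100;
   run;
/* Plot precision against recall, with the event rate as a reference line */
proc sgplot data=ctable noautolegend aspect=1;
   xaxis values=(0 to 1 by .25) grid
         offsetmin=.05 offsetmax=.05 label="Recall / Sensitivity";
   yaxis values=(0 to 1 by .25) grid
         offsetmin=.05 offsetmax=.05 label="Precision / PPV";
   refline &EventRate;
   series y=ppv x=sensitivity;
   title "Precision-Recall Curve";
   run;
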
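
The statements above use the events/trials data from the note. For single-trial binary data like that described in this thread, the following is a hedged sketch of how the predicted probabilities might be saved from GLIMMIX and passed to PROC LOGISTIC. The data set name admissions and the predictor triage_level are hypothetical placeholders, predmarg is an illustrative variable name, and ED_RestraintFLG is assumed to be a numeric 0/1 flag.

```sas
/* Hypothetical names: data set ADMISSIONS, predictor TRIAGE_LEVEL.        */
/* Assumes ED_RestraintFLG is numeric, coded 0/1 with 1 = restrained.      */
proc glimmix data=admissions;
   class PatientID triage_level;
   model ED_RestraintFLG(event='1') = triage_level / dist=binary link=logit;
   random intercept / subject=PatientID;
   output out=glmmout
          pred(ilink)=predprob          /* uses the predicted patient effects */
          pred(noblup ilink)=predmarg;  /* fixed effects only                 */
run;

/* Event rate for the reference line: the mean of a 0/1 flag */
proc sql noprint;
   select mean(ED_RestraintFLG) into :EventRate from glmmout;
quit;

/* Single-trial syntax in place of events/trials */
proc logistic data=glmmout;
   model ED_RestraintFLG(event='1') = predprob / ctable;
   ods output classification=ctable;
run;
```

The DATA step and PROC SGPLOT statements shown above can then be reused as they are. Swapping predprob for predmarg in the PROC LOGISTIC step might also be worth a look, since it would indicate how much of the apparent discrimination comes from the patient-level random intercepts rather than from the fixed effects.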
NMB82
Obsidian | Level 7

Thank you, this helps a lot. The AUC-PR curve looks pretty good, but certainly not as crazy high as the AUC-ROC. I appreciate the help!
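
For a single-number summary of the precision-recall curve, a rough trapezoidal approximation could be computed from the ctable data set created by the statements earlier in the thread, after PPV and sensitivity have been rescaled to proportions. This is only a sketch. It assumes the rows are still in the increasing-cutpoint order in which PROC LOGISTIC writes them (so sensitivity is non-increasing), and it covers only the range of cutpoints in the classification table.

```sas
/* Trapezoidal approximation to the area under the precision-recall curve. */
/* Assumes CTABLE holds PPV and SENSITIVITY as proportions, in cutpoint    */
/* order, with missing PPV rows already removed.                           */
data _null_;
   set ctable end=last;
   retain prev_x prev_y;
   /* Sum statement: AREA starts at 0 and is retained across rows */
   if _n_ > 1 then area + (prev_x - sensitivity) * (ppv + prev_y) / 2;
   prev_x = sensitivity;
   prev_y = ppv;
   if last then put "Approximate area under the PR curve: " area 6.4;
run;
```

The result is written to the SAS log. As with the curve itself, the event rate (about 8% here) is a natural benchmark, since that is the precision expected from an uninformative classifier.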
