I have a dataset of admissions to emergency departments over a 2-year period. The outcome is binary and "rare-event-ish" (10% of total). We're mainly interested in whether certain patient characteristics predict the outcome. We have 10 variables of interest (all categorical/binary), ~20,000 patients, and ~32,000 ED admissions. Some patients have >1 admission in this time period, so a basic logistic regression at the encounter level would violate the assumption of independence. Though this wasn't a major interest of ours initially, there is a site variable (11 different emergency departments) that could be used to look at between-site and within-site differences.
Would logistic regression with robust standard errors work in this case? This assumes there is some correlation within clusters (within patients) and adjusts for that.
Would multilevel logistic regression with patient admissions clustered within ED sites be more appropriate and/or rigorous? Sample size and event rate vary considerably between sites (0.2%-40% of the total sample and 0.3%-58% for event rate).
Another model not mentioned?
I've done some MLM, though it has been years since grad school. Any tips or papers that might help are appreciated!
This would typically be done using either a random effects logistic model, as can be fit in PROC GLIMMIX using its RANDOM statement, or with a Generalized Estimating Equations (GEE) model, as can be fit in PROC GEE (or GENMOD) using its REPEATED statement. The first is a subject-specific model and the second is a population-averaged model. So, the choice depends on what you want the model for - if it's to predict the result for individuals, then you probably want the GLIMMIX approach. If you want to make general conclusions about the population, this suggests the GEE approach. You can find examples in the Getting Started sections of the documentation of both procedures.
... and yes, you can have unbalanced numbers in the clusters or even just one observation in some clusters.
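As a rough illustration of the two approaches described above, here is a minimal sketch with encounters clustered within patient. The dataset name `ed_encounters` and the variable names (`patient_id`, `restrained`, `race`, `sex`, `substance_use`, `mh_dx`, `housing_unstable`) are hypothetical placeholders, not taken from the original data, and the exchangeable working correlation and Laplace estimation method are just one reasonable starting choice.

```sas
/* Hypothetical dataset and variable names; adjust to your own data. */

/* Population-averaged model (GEE): encounters clustered within patient. */
proc gee data=ed_encounters;
   class patient_id race sex;
   model restrained(event='1') = race sex substance_use mh_dx housing_unstable
         / dist=binomial link=logit;
   /* Exchangeable working correlation among a patient's encounters */
   repeated subject=patient_id / type=exch;
run;

/* Subject-specific model: random intercept for each patient. */
proc glimmix data=ed_encounters method=laplace;
   class patient_id race sex;
   model restrained(event='1') = race sex substance_use mh_dx housing_unstable
         / dist=binary link=logit solution;
   random intercept / subject=patient_id;
run;
```

PROC GEE reports empirical (robust) standard errors by default, so it is close in spirit to "logistic regression with robust standard errors" clustered on patient. The GLIMMIX coefficients are conditional on the patient-level random intercept, so the two sets of odds ratios are not expected to match exactly.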
Thanks so much for the response. I'm looking through the documentation right now. Seems GLIMMIX might be best. To clarify a bit, I'm trying to predict the outcome ('0'/'1') based on patient characteristics, such as race, language, sex, age, substance use(0/1), mental health diagnoses(0/1), housing status(0/1). I've already run a basic logistic regression with patient-level data (whether or not a particular patient was restrained at least once over that 2yr period) while controlling for number of admissions. Since some patients were restrained multiple times over the date range, I'd also like to explore the model on the encounter-level, but this can't be done with basic logistic regression because a patient can have numerous encounters, so those observations will correlate within patient. I'm less interested in including the hospital site as a variable, but not opposed...it's an interesting secondary question.
-The patient-level research question was, "do certain patient characteristics predict restraint use in the ED?"
-The encounter-level question would be, "On encounters when a patient has these characteristics, do these characteristics predict restraint use?"
1. Am I able to cluster encounters within the patient? Or would it only be advisable to cluster within a different variable, such as site?
2. In the past I've used SPSS for MLM and remember having to switch the data into wide format. However, it looks like SAS can use long format data with these models?
You use long format. If your patients within a site are essentially independent and therefore uncorrelated, then you can just specify that the multiple observations within a patient form a cluster with SUBJECT=PATIENT.
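To make the long-format layout concrete, here is a small sketch with one row per ED encounter and the patient identifier repeated on each of that patient's rows. The dataset, variable names, and values are made up purely for illustration; the cluster variable here, `patient_id`, is what would be named in the SUBJECT= option.

```sas
/* Hypothetical long-format layout: one row per ED encounter. */
data ed_encounters;
   input patient_id site restrained sex $ housing_unstable;
   datalines;
101 1 0 F 0
101 1 1 F 0
101 1 0 F 0
102 2 0 M 1
103 2 1 M 0
103 2 0 M 0
;
run;

/* Count encounters per patient to see the cluster sizes. */
proc sql;
   select patient_id, count(*) as n_encounters
   from ed_encounters
   group by patient_id;
quit;
```

In this toy data, patient 102 has a single encounter, which is fine: clusters of size one are allowed. No reshaping to wide format is needed before fitting the model.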