11-28-2016 08:47 AM
I would like to ask for your assistance with setting up a model using GLIMMIX.
My data involves births in a hospital. The variable of interest is whether the mother stayed in the hospital motel or not (1/0). My data contains around 40,000 observations, from which almost 30,000 did not stay in the motel and the rest did. Each line in the dataset represents a birth (with 1 baby or more). Some rows represents births of the same mother, therefore I have a variable called MotherID, and this is the reason why I need a random effect logistic regression here. The independent variable list if quite long, with approximately 25 possible variables for me to choose from - some continuous and some categorical. I am now filtering some out due to high correlations between them. Some variables are birth related and some are mother related, for example: Socioeconomical status of the mother (numeric), number of babies in the current birth (1 to 4), birth number in this hospital (first, second, ...), did the mother take a birth preparation course (yes/no), was the mother in emergency room prior the birth (yes/no), pregnancy week in birth time (numerical), number of days in emergency room (numerical), etc,....
For the sake of this discussion, let's call the categorical variables C1, C2, .... and the numeric ones X1, X2, ...
I wish to set up a model using GLIMMIX, a random effect logistic regression. I am not sure how to write it ( will give my attemp below ).
In addition, the sample size is fairly large. This means that small effects will give significanct P-Values. I need to ask SAS to give me some measure of effect size, so I can know if a significant P-Value is actually interesting or not. I also want to produce probabilities of going to the motel, based on the variables in the final model.
Regarding a final model, should I enter variables manually, or is there an automatic way (stepwise, etc...)? I am not too keen to count on automatic ways...
My initial code is:
proc glimmix data = motel method = quad; class C1 C2 C3; model Motel = X1 X2 X3 C1 C2 C3; random int / subject = MotherID; run;
Is this code correct? How do I add effect sizes, predicted probabilities, confidence intervals, etc?
Thank you in advance !
12-01-2016 08:46 AM
That skeleton code will get you started, once you add /dist=binary to the MODEL statement. You may need to consider adding interaction terms to get at your questions of interest.
Make sure your input dataset is sorted by MotherID, and that MotherID is strictly numeric, as you are incorporating it as a subject effect, and it is not included in the CLASS statement.
As far as the other things:
Effect sizes in mixed models are not easily calculated. Google is your friend here--there are some articles on lexjansen.com and elsewhere on the web that might help, but there is really no consensus on what the optimum method might be. And whichever one you choose, almost certainly at least one reviewer will disagree
As far as predicted values and confidence intervals: Do you want CLs for the model parameter estimates? if so add CL after dist=binary. If you want predicted values at various levels of the continuous variables, for each level of the categoricals, look into the documentation for the LSMEANS statement. In a sense, these are NOT predicted values, but rather marginal means. Use of the STORE statement in MIXED and then the SCORE statement in PROC PLM would give you true predicted values for a dataset.
But that can be searched here in this forum. Check posts by @lvm for advice on this.
12-02-2016 03:27 PM
Thank you Steve.
I tried running the code, and initially I got a log error about not having enough memory. My guess is, that I have too many subject ID's.
I looked on Google and found that I needed to add nloptions tech=nrridg; and ddfm=bw, I admit that I don't understand what it does, but it did run.
When I used method = quad, for a one variable model, I got an OR of 11. When I used method = laplace I got an OR of 4. Without specifying a method, I got OR of 3.2. If I run a logistic regression without considering the random effect, or if I use a two by two table, I get an OR of 3.2. Since a vast majority of my subjects do have only 1 observation, I find it hard to believe that 11 is the correct OR. Do you have an explanation what cause quad and laplace to get the OR wrong? I have to say, when seeing this result, I tried running the model with another software, and surprise surprise, I got OR = 11 again, which is incorrect if you ask me.
12-09-2016 09:14 AM
Those values for the OR are what you would expect given:
No method gives REML values which are biased toward 1, as these are marginal values.
Method=laplace uses a single point of support when calculating the integrals for the maximum likelihood, so in a sense, it is intermediate between the marginal REML values and the best approximated conditional values obtained from method=quad.
Of your values, I would trust the OR=11 as the best description of your data.
I would say that there is something unusual about the subjects that result in multiple observations as compared to the single observations. I might even go so far as trying to model a factor that separates subjects into these two groups, and seeing what the behavior of the model and the resulting OR values might be.