BlueNose
Quartz | Level 8

Hello,

I am trying to fit a binary mixed model using PROC GLIMMIX. I have a binary response variable taking the values 0 or 1: the value 0 indicates that a surgical procedure took less than 1 minute, and the value 1 means that it took 1 minute or more.

I have a treatment group with 3 levels: treatments 1 and 2 are controls, and treatment 3 is the new treatment being examined. For each subject (an animal in this case), several surgical procedures were performed.

There are two types of procedures for this surgery; however, each animal received only one type, so there is a perfect correlation between the procedure type and the animal.

This is the code I used:

proc glimmix data=Animaldata method=quad;
  class Animal_ID Treatment;
  model Y = Treatment / dist=binomial link=logit oddsratio solution;
  random intercept / subject=Animal_ID;
  estimate "treatment 1 vs 3" treatment 1 0 -1 / or;
  estimate "treatment 2 vs 3" treatment 0 1 -1 / or;
  estimate "treatment 3 vs 1" treatment -1 0 1 / or;
  estimate "treatment 3 vs 2" treatment 0 -1 1 / or;
run;

It ran fine; however, the odds ratios had enormous confidence intervals, and I am not sure what to do with them now. I'll provide some more details. From a descriptive statistics point of view, these are the frequencies (counts of Y=0 and Y=1 by treatment):

Treatment     0     1
    1        11     4
    2         5    11
    3         2    87

Clearly treatment 3 is superior.

The output SAS gave me included the following:

1. "Convergence criterion satisfied"

2. Covariance parameter estimates for animal ID: 0.326 with S.E of 0.98

3. Type III test of fixed effect: F=5.59 P>F=0.0075

4. Estimates:

Label                Estimate   S.E.   DF   t value   Pr > |t|     OR      OR 95% CI
Treatment 1 vs. 3      4.28     1.37   38     3.12     0.0034     72.5     [4.51, 1164.18]
Treatment 2 vs. 3      2.38     0.92   38     2.57     0.0143     10.8     [1.65, 71.44]
Treatment 3 vs. 1     -4.28     1.37   38    -3.12     0.0034      0.013   [0.00085, 0.221]
Treatment 3 vs. 2     -2.38     0.92   38    -2.57     0.0143      0.091   [0.014, 0.603]

On one hand, none of the CIs include 1, which is good; however, they are so wide that it can't be ignored. I need to report some results, and I do not know how to report such a thing.

Did I do something wrong? Is there anything that can be done to shrink the CIs (without breaking any statistical rules)?

One more thing: if my response is binary, how would you interpret the covariance parameter estimate?

Thank you very much for any guidance.

1zmm
Quartz | Level 8

No, you didn't do anything wrong, as far as I can tell.

The 95% confidence intervals are wide because you have small sample sizes, especially in your two control groups. The third and fourth estimates in your output show that the odds of the modeled event (by default, for a 0/1 response, PROC GLIMMIX models the probability of the lower ordered value, Y=0) are much lower for treatment 3 than for either treatment 1 or treatment 2. The only way to narrow these confidence intervals is to increase the sample sizes in your study.

The covariance estimate for the random effect of animal is actually the (same) variance estimate for each animal, because your data contain only one observation for each animal. Although the standard error of this estimate is not the best measure of its variability, the fact that this standard error exceeds the estimate itself suggests that accounting for the random effect of each animal probably does not much affect the (fixed) treatment effects on the outcome. If you wish, you can compare the output for the current model with that for a model that omits the random effect of animal (a fixed-effects-only model), as sketched below. In summary, the relatively small variance estimate for each animal indicates a relatively negligible effect on the treatment comparisons.
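For example, a minimal version of that comparison model (reusing your data set and variable names and simply dropping the RANDOM statement) would be:

     proc glimmix data=Animaldata;
        class Treatment;
        /* same fixed effects as before, but no random animal intercept */
        model Y = Treatment / dist=binomial link=logit oddsratio solution;
     run;

If the fixed-effect estimates and odds ratios barely change between the two models, the random effect is doing little work.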

BlueNose
Quartz | Level 8

Thank you 1zmm,

Regarding the odds ratio, this is exactly the answer I was hoping for! :-)

Regarding the random effect, perhaps I did not understand you correctly. There are several observations per animal; the number of observations varies from 2 to 4. My difficulty is understanding the meaning of the variance. If the response were continuous or even ordinal, the variance would have some "practical" meaning, but intuitively I find it difficult to understand the meaning of the within-animal variance when the values are only 0 and 1.

I get what you say about the standard error being larger than the estimate. I tried running a regular logistic regression; these are some of the results:

The OR for T1 vs. T3 was 0.018 [0.002, 0.16], and the OR for T2 vs. T3 was 0.114 [0.03, 0.435].

In the maximum likelihood estimates part there was something "interesting":

T1: estimate = -1.95 (Wald test p-value = 0.007)

T2: estimate = -0.1081 (Wald test p-value = 0.834)

The global Wald test is, of course, very significant. The area under the ROC curve is 0.83.

1zmm
Quartz | Level 8

Sorry, I misread your original description of your experiment and did not realize that each of the animals in your study underwent 2 to 4 surgical procedures. However, your original description also states that you took only one surgical procedure type for each animal, so that there is a "perfect correlation between the procedure type and the animal". I'm now confused about the meaning of "treatment" and the meaning of "surgical procedure type". Are they the same or different? Instead of selecting only one surgical procedure type for each animal, would you find it valuable to model the pattern/order of surgical procedures each animal underwent?

In any case, the variance/covariance estimate displayed in the output from the RANDOM statement is the variability in the outcome associated with Animal_ID (?=procedure type) after accounting for the fixed effect of Treatment. A dichotomous or binary outcome can have variability because you've specified a statistical distribution (the binomial distribution) for this outcome in your MODEL statement. Even though each observation is a single 0/1 outcome, the distribution of outcomes across animals is specified as having a binomial distribution, which does have a mean and a variance.
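One common way to give the animal-level variance a practical meaning in a logit model is the intraclass correlation computed on the latent logistic scale, where the residual variance is fixed at pi**2/3 (about 3.29). This is not part of your GLIMMIX output; the sketch below just reuses the 0.326 estimate from your first post:

     data icc;
        sigma2_animal = 0.326;                  /* covariance parameter estimate from PROC GLIMMIX */
        sigma2_resid  = constant('pi')**2 / 3;  /* latent logistic residual variance, ~3.29 */
        icc = sigma2_animal / (sigma2_animal + sigma2_resid);
        put icc= 5.3;                           /* prints ~0.090 */
     run;

On this reading, roughly 9% of the unexplained variability in the outcome lies between animals rather than within them.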

If you wish, you can also estimate the value of the random effect for each animal by adding the SOLUTION option to the RANDOM statement:

    RANDOM intercept / subject=Animal_ID solution;

You did not provide the code for your PROC LOGISTIC program. Perhaps the discrepancy between the nonsignificant p-value reported for the T2 parameter estimate and the odds ratio for T2 vs. T3, whose 95% confidence interval excludes 1.00 (and should therefore correspond to a statistically significant result), comes from using the default "effect" coding for Treatment in the PROC LOGISTIC CLASS statement: under effect coding, each parameter compares a level with the average effect over all levels, not with a reference level.

BlueNose
Quartz | Level 8

My logistic code was quite straightforward...

proc logistic data=Mydata;
  class Treatment;
  model Y = Treatment;
run;

So regarding the variance: a reader of the results who never studied probability will find it quite difficult to understand. I mean, I know what V(X) = np(1-p) means, but unlike in the continuous case, it's slightly difficult to explain to others.

I tried the SOLUTION option. When you say the random effect for each animal, do you mean within each animal?

Thanks for your help!

1zmm
Quartz | Level 8

The default coding for categorical variables in the CLASS statement (for example, Treatment) in PROC LOGISTIC is effect coding. Preferable is reference coding, which estimates the difference between the effect of each nonreference level and the effect of the reference level. The difficulty with your example data is that either of your two "control" treatment levels (1 or 2) could be chosen as the reference level. What I usually do to force the choice of a reference level is to include a prior PROC FORMAT step whose VALUE statement for the classification variable specifies this reference level:

     proc format;
        value treatfm 1="} Treatment 1 (ref)"
                      2="Treatment 2"
                      3="Treatment 3";
     run;

     proc logistic data=Mydata;
        class Treatment(order=formatted param=reference);
        model Y = Treatment / clodds=both;
        format Treatment treatfm.;
     run;

The right brace ["}"] in the formatted value for Treatment 1 in the VALUE statement of PROC FORMAT sorts this value last (ORDER=FORMATTED) and thus specifies it as the reference level for the categorical variable, Treatment. Reference coding like this should avoid the anomaly you describe, in which a statistically nonsignificant "P-value" occurs alongside a 95% confidence interval for the odds ratio that excludes the null value of 1.00.

Another approach to estimating odds ratios would be to use a CONTRAST statement in PROC LOGISTIC, a statement much like the ESTIMATE statement you used in PROC GLIMMIX. For example:
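Here is a sketch of that approach, assuming reference coding with treatment 3 as the reference level; with this coding, the CONTRAST coefficients refer to the two design variables for treatments 1 and 2:

     proc logistic data=Mydata;
        class Treatment(param=reference ref='3');
        model Y = Treatment;
        /* ESTIMATE=EXP reports the exponentiated contrast, i.e., the odds ratio */
        contrast "Treatment 1 vs 3" Treatment 1 0 / estimate=exp;
        contrast "Treatment 2 vs 3" Treatment 0 1 / estimate=exp;
     run;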

If the mean of a binomial distribution estimates the proportion of Y=1 outcomes across all the observations, then the variance estimates the variability of that proportion across observations and can be used to construct a confidence interval around the mean, which resembles what you do in the continuous case (a mean and a confidence interval around that mean).
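As a rough worked illustration using the treatment 3 counts from your first post (2 zeros and 87 ones, n = 89): p-hat = 87/89 ≈ 0.978, and a simple Wald interval is p-hat ± 1.96*sqrt(p-hat*(1-p-hat)/n) ≈ 0.978 ± 0.031, or about [0.95, 1.00] after truncating at 1. (With counts this extreme, an exact or score interval behaves better than the Wald approximation.)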

The SOLUTION option of the RANDOM statement estimates the random intercept for the observation(s) identified by the SUBJECT= option. If SUBJECT=Animal_ID identifies one or more observations for one animal, this random intercept is a single estimate shared by all of that animal's observations. This value, along with the values estimated for the fixed effects, can be used to compute the linear predictor (the "best linear unbiased predictor") in the OUTPUT statement of PROC GLIMMIX. Although this random estimate is "constant" within each SUBJECT=Animal_ID, differences in the fixed effects among the possibly multiple observations associated with each Animal_ID would yield different best linear unbiased predictors for these different observations:

     best linear unbiased predictor(Y) = b0 + b1*(Treatment=2) + b2*(Treatment=3) + z0i

         where b0 = the fixed-effect intercept, b1 and b2 = the fixed treatment effects, and z0i = the random intercept for SUBJECT=Animal_ID i.
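A sketch of how to request those predicted values from your original model follows; the output data set name PRED and the variable names P_SUBJECT and P_MARGINAL are just illustrative:

     proc glimmix data=Animaldata method=quad;
        class Animal_ID Treatment;
        model Y = Treatment / dist=binomial link=logit solution;
        random intercept / subject=Animal_ID solution;
        /* ILINK converts the linear predictor to a probability;        */
        /* BLUP includes the animal's random intercept, NOBLUP omits it */
        output out=pred pred(blup ilink)=p_subject
                        pred(noblup ilink)=p_marginal;
     run;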

jbaldwin
Obsidian | Level 7

Just to echo what 1zmm suggested: required sample sizes for binomial models tend to be larger than what one considers adequate for a model with a continuous dependent variable. But to get more specific than that fuzzy generality, one should (almost) always generate data from a known set of parameters and run it through PROC GLIMMIX. Usually one would first try a "huge" sample size to see whether the parameter estimates recover the values used to generate the simulated data. Once you're satisfied that the model is working correctly, try a sample size that matches your real data. This is a good practice whether one uses SAS or R for one's mixed models. A minimal sketch of such a simulation follows.
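All parameter values below are illustrative assumptions, not estimates from the real study; the structure (three treatments, 2 to 4 binary observations per animal, a normal random intercept on the logit scale) mirrors the design described above:

     %let n_animals = 600;   /* start "huge", then rerun with the real sample size */

     data sim;
        call streaminit(2024);
        b0 = -1.0;  b1 = 0.5;  b2 = 2.0;  sigma2 = 0.33;  /* assumed true values */
        do Animal_ID = 1 to &n_animals;
           Treatment = mod(Animal_ID, 3) + 1;             /* rotate treatments 1-3 */
           z = rand('normal', 0, sqrt(sigma2));           /* animal random intercept */
           do rep = 1 to 2 + floor(3*rand('uniform'));    /* 2 to 4 observations */
              eta = b0 + b1*(Treatment=2) + b2*(Treatment=3) + z;
              Y = rand('bernoulli', logistic(eta));       /* binary outcome */
              output;
           end;
        end;
        keep Animal_ID Treatment Y;
     run;

     /* fit the same model as in the original post and compare the estimates
        with the true values above */
     proc glimmix data=sim method=quad;
        class Animal_ID Treatment;
        model Y = Treatment / dist=binomial link=logit solution;
        random intercept / subject=Animal_ID;
     run;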

