BookmarkSubscribeRSS Feed
BlueNose
Quartz | Level 8

Hello everyone,

I need some help in setting up a couple of mixed models. I think that in the first one I need PROC MIXED and in the second PROC GLIMMIX, but I am not sure and doesn't know how to set the procedures, I will give you information.

The first case is the easy one. I have a repeated measures analysis, with a continuous dependent variable (let's call it Y), with 1 independent grouping variable with 3 levels (let's call it X), and for each subject there are between 2-4 samples (not related to time).

The data set looks like this:

ID  Y  X

1   y   x

1   y   x

2   y   x

2   y   x

and so on...

The second model I need is slightly more complicated. I have a dependent variable Y that gets values from 0 to 5, integer values. The variable represent time, when lower is better (for the researcher). Again X1 is a grouping variable with 3 levels, I want to prove that one of them is better than the others (has a lower values of Y). In this particular group, there are a lot of cases with a score of 0, so the data is not normally distributed. In addition, I have a variable X2 with 2 levels, which is a random effect (2 levels chosen randomly for the experiment out of many possible options). And in addition to all that, still, every subject has 2-4 observations like before, so it's also repeated measures design.

I use SAS 9.3, I also have SAS enterprise Guide 5.1 and JMP 10, however I think that if I learn how to do it with SAS using code it's better. I would really appreciate your help, I am lost here...

Thank you in advance !

20 REPLIES 20
SteveDenham
Jade | Level 19

The first looks pretty straightforward.  You will probably need to add a variable to the dataset to identify which observation within an id you are using.  The key here is whether the observations on each ID are the same.  What I'm getting at is you say you have 2 to 4 observations for each ID.  Are these aligned, so that you can take advantage of the repeated nature?  For instance, suppose they are observations on the legs of an animal.  There is a structure so that missing values should be included in the dataset.  On the other hand, if they are 2 to 4 observations without some structuring, you should probably consider them as a G-side effect rather than as an R-side effect.

So we have two sets of code:

/* G-side random effect  */

proc mixed data=yourdata;

class x id obslevel: /*where obslevel identifies the observations within an id*/

model y=x;

random intercept obslevel/subject=id type=cs;

run;

/* R-side repeated effect  */

proc mixed data=yourdata;

class x id obslevel: /*where obslevel identifies the observations within an id*/

model y=x obslevel x*obslevel;

repeated obslevel/subject=id type=cs;

run;

Now, I used type=cs, as I would assume that the variance is constant at each obslevel, and they have similar correlations.  If this does not fit your data, you may need to try some other error structures.

I'll hold off on the second analysis until we get the matter of the obslevels ironed out.  It will make a substantial difference in the approach to the analysis of the multinomial response.

Steve Denham

1zmm
Quartz | Level 8

You might try something like the following as a start:

    proc sort data=dataset;

        by id ordered_value_repeated_measure ;

    run;

    proc glimmix data=dataset order=internal order=internal;

        class X1 X2 Y;

        model y(order=internal descending ref=last)=X1 / dist=multinomial link=cumlogit ddfm=kenwardroger oddsratio;

        lsmeans X1 / cl diff adjust=tukey adjdfe=row;

        random X2 / solution cl gcorr;

        random ordered_value_repeated_measure / subject=id type=cs residual;

    run;

    quit;

The PROC SORT sorts the input data by ID and the ordered value of the repeated measure.

This PROC GLIMMIX model assumes an ordinal, integer-valued dependent variable, Y, which is distributed multinomially with a cumulative logit link and a fixed-effect independent variable, X1.  The value of Y=0 is defined as the reference category for the dependent variable.  The denominator degrees of freedom is adjusted by the Kenward-Roger method when both categorical fixed effects and random effects occur in the model.  Odds ratios are calculated as measures of effect in this cumulative logit model.  The independent variables, X1 and X2, and the dependent variable, Y, are categorical classification variables.  

The LSMEANS statement calculates the "adjusted" means of the dependent variable, Y, their differences, and their 95% confidence intervals for the levels of the categorical independent variable, X1; the significance probabilities for these mean differences are adjusted for multiple comparisons using the Tukey pairwise method.

The first RANDOM statement specifies X2 as a random variable across all study subjects.  The second RANDOM statement specifies the value of the ordered value of the repeated measure within each study subject, indexed by the variable, ID, as a repeated measure with the RANDOM statement option, RESIDUAL; the type of the residual variance-covariance structure is assumed to be that of compound symmetry.

BlueNose
Quartz | Level 8

Thank you both for replying.

I understand Steve's question, I checked it further and it's messier than I thought.

The example of animals was good since it is the case actually (a preliminary trial, thus wasn't planned with a statistician's help). Let's for illustration say that the 3 treatments I have (what I called X or X1) are given on different parts of the body (legs, hands,...). For each animal they were given on between 2-4 times. In many cases, 4 treatments were given, 2 with treatment 1 on the leg, and 2 with treatment 2 on the leg. In other cases, 3 were given, 2 with treatment 1 on the leg, and 1 with treatment 2 on the leg. And in some cases treatments were given in 2-3 locations "randomly" with no logic at all.

What I am trying to say is, that unfortunately, if I define a new variable representing the observation within an animal, it's value will be meaningless in compare to time or any other logical variable. For example, if I sort the data set and then give numbers to observations within a subject, an increment of the value (2 in compare to 1) will mean nothing, because 1 for a certain animal can be 2 for the other (do I make any sense?)

The other variable I have (X2) is the technique in which the treatment was given. Only 2 techniques were used out of more possible, thus it is a random effect.

I will try to represent the data more accurately now:

AnimalID Technique Location Treatment Response

    1               1            1            1             y1

    1               1            1            1             y2

    1               1            1            2             y3

    1               1            1            2             y4

    2               2            2            1             y5

    2               2            2            1             y6

    2               2            2            2             y7

    3               1            1            1             y8

    3               1            1            1             y9

    4               1            1            3             y10

    4               1            1            3             y11

and so on...

I have two responses: one is ordinal with 6 levels (what I defined as time earlier) and this is my main response, and another one continuous. In addition, the researcher claims that anatomically there is no difference between the locations and this factor can (should) be ignores.

I will be glad for suggestions...thank you very very much !

1zmm
Quartz | Level 8

Defining a new variable to distinguish observations within an animal is only meant as an arbitrary index for that observation.  The purpose of the new variable is to identify different observations to SAS.  Specifically, the second RANDOM statement in the SAS code that I listed before,

    random ordered_value_repeated_measure / subject=id type=cs residual;

uses the name of that new variable, "ordered_value_repeated_measure", to identify the repeated-measure observations within each animal subject identified by the variable, ID.  Because each repeated-measure observation is not ordered, the name I used for this new variable is not appropriate, and sorting by that variable within each animal in the prior PROC SORT step is no longer relevant.

Nonetheless, providing a name for this repeated-measure allows SAS to calculate the "effects" of the repeated measures in this model.  The use of a variance-covariance structure like compound symmetry (for example, TYPE=CS in this code) or unrestricted (TYPE=UN) does not depend on whether the repeated measure is ordinal or nominal; use of an autoregressive-order-1 structure (TYPE=AR(1)) or a spatial covariance structure (TYPE=SP(etc.)) would be inappropriate when the repeated measure is nominal, as it appears to be here.

In summary, defining a new variable to distinguish repeated-measure observations within an animal is still worthwhile.

SteveDenham
Jade | Level 19

I wish I had a dollar for every time a researcher told me that anatomically there was no difference, so we can just ignore an experimental design factor.

Anyway, I think I would consider location as a random blocking effect, and treat this as a split-plot design.  If you have access to Littell et al.'s SAS System for Mixed Models, 2nd ed., there are some excellent examples there.  Additionally, I don't know if I would consider technique as a random effect, but it won't hurt, except to make the standard errors for the treatment means larger.  So, for the continuous response, how about:

proc glimmix data=yourdata method=quad; /* Start with quad, if it doesn't work, then try laplace, if it still doesn't work, then the default */

class animalid technique location treatment;

model response=treatment;

random intercept/subject=animalid;

random intercept/subject=location*animalid;

random intercept/subject=technique*location*animalid; /* I don't know about this one, as it looks like location and technique are completely confounded in the example data */

lsmeans treatment;

run;

and for the multinomial:

proc glimmix data=yourdata method=quad; /* Start with quad, if it doesn't work, then try laplace, if it still doesn't work, then the default */

class animalid technique location treatment;

model response2=treatment/dist=mult link=clogit;

random intercept/subject=animalid;

random intercept/subject=location*animalid;

random intercept/subject=technique*location*animalid; /* I don't know about this one, as it looks like location and technique are completely confounded in the example data */

lsmeans treatment/ilink;

run;

Make sure that the data are sorted by technique location animalid, so that the containment hierarchy is maintained.

Steve Denham

BlueNose
Quartz | Level 8

Thank you for your help, I put my hand on a copy of the book you recommended, will read it.

I ran the suggested code and I have a few outputs.

First of all, in the ordinal case, SAS tells me I cannot ask for least square means, so I had to give it up. The exact error was:  "Least-squares means are not available for the multinomial distribution"

So I ran this code:

proc glimmix data=Animaldata method=quad;

  class AnimalID Technique Location Treatment;

  model Response=Treatment/dist=mult link=clogit;

  random intercept/subject=AnimalID;

  random intercept/subject=Location*AnimalID;

  random intercept/subject=Technique*Location*AnimalID;

  * lsmeans Treatment/ilink;

run;

The output I got was:

G-Side covariance parameters=3, columns in X=8, Columns in Z per subject = 5, subjects=26, max observations per subject=4 ; parameters in optimization=10, lower boundaries =3, upper boundaries=0, fixed effect=not profiled, quadrature points=5

Convergence criterion (GCONV=1E-8) satisfied

Estimated G matrix is not positive definite.

AIC = 164.3, AICC=167.5

-2 LOG L = 121.07

Covariance parameter estimates:

AnimalID = 0 (no S.E)

AnimalID*Location = 1.11 (S.E 1.41)

AnimalID*Location*Technique = 0.204 (no S.E)

Type III:

F=7.15 P.V=0.003

In order to get LSmeans I ran it again treating the response as Gaussian, and I got different covariance parameters (0,0.21,0.04 respectively), the F value was higher (25.88).  The least square means were 0.3,1.8,3.1 for treatments 1,2 and 3 respectively.

Should I be worried about the not positive definite G matrix ? Does the model look correct ?

If I had to describe the covariance parameter estimates in one sentence, how should it be interpreted ?

Thank you very much in advance, your help is most appreciated !

SteveDenham
Jade | Level 19

I forgot that glimmix does not support lsmeans for a multinomial distribution.  Try this instead:

proc glimmix data=Animaldata method=quad;

  class AnimalID Technique Location Treatment;

  model Response=Treatment/dist=mult link=clogit oddsratio;

  random intercept/subject=AnimalID;

  random intercept/subject=Location*AnimalID;

  random intercept/subject=Technique*Location*AnimalID;

run;

The non-positive G matrix says that at least one of the covariance parameters is zero.  This should not be a problem, but it means that all variabity is "used up" in the latter two parameters.  You could eliminate the first random statement, and output should be unchanged (possibly with the addition of the SE of the three way random effect).  The covariance parameters describe the variability due to technique within location within animal and to location within animal.

Steve Denham

BlueNose
Quartz | Level 8

Thank you again Steve, your help is incredible.

I ran the code you suggested and it kind of answered another questions I had - if the F test is significant and there is a difference between the 3 treatments, how do I know between which pairs ? Well, I guess the odds ratios can help here.

The odds ratios initially didn't do what I needed because it didn't put the group I am interested in as the default, so it wasn't compared to the other two, only to one, so I changed the order by changing the names of the groups, and this is what I got:

Treatment  Treatment OR_Estimate  DF       95% CI

     1               3               0.01          25    0.001,0.08

     2               3               0.04          25    0.011,0.228

Now I find it difficult to interpret the odds ratio in this ordinal case (is it like in nominal case ?)

Treatment 3 is the one I care about, the new treatment, 1 and 2 are the controls. Can I say that getting a lower score of the response is lower by 0.01 in group 1 in compare to group 3 ?

Before I changed the order of the groups, I got:

Treatment  Treatment OR_Estimate  DF       95% CI

     2               1               4.9          25     1.05,23.12

     3               1              100.3        25    21.5,468.48

In the two scenarios I got different F values in the type III analysis, and different P values...

SteveDenham
Jade | Level 19

I feel like there is something missing.  There should be odds ratios for the various scores by treatment.  Something like score=2:score=1, score=3:score=(1 or 2), score=4:score=(1 or 2 or 3), etc. that reflects the cumulative logit.  The odds ratios I see in your post look like the ratios of the LSmeans under the normal distribution.

Try adding solution after oddsratio in the model statement.  I think the odds ratios I am thinking about can be obtained from exponentiating the parameter estimates.  If that doesn't work, then we are going to have to try the ESTIMATE statement, and I know I will have to learn more before offering advice on that as a way to get at the odds ratios.

Steve Denham

1zmm
Quartz | Level 8

Your first pair of odds ratios use Treatment #3 as the reference group for the other two treatments.  In a cumulative logit model as SAS parameterizes it, these odds ratios show the odds of being in the lower category of the dependent variable outcome, Response, for Treatments # 1 and #2 relative to the odds of Treatment #3 being in that lower category.  Since these odds ratios are much less than 1.00, the odds of being in the lower Response category for both Treatments #1 and #2 are much less than the odds of being in the lower Response category for Treatment #3.

Or, as an alternative interpretation that focuses on Treatment #3, those receiving Treatment #3 are much more likely to be in these lower Response categories than those receiving Treatments #1 and #2:  About 100 times [=1/0.01] more likely to be in a lower Response category than Treatment #1 and about 25 times [=1/0.04] more likely to be in a lower Response category than Treatment #2.  You can specify the option, DESCENDING, on the PROC GLIMMMIX statement to determine these "reciprocal" odds ratios for being more likely for Treatments #1 and #2 to be in these upper Response categories than Treatment #3.

BlueNose
Quartz | Level 8

Hello Steve and 1zmm,

I ran Steve's latest idea, adding 'solution' to the code, and this is what I got (in addition to the previous output):

solutions for fixed effects:

EffectResponseTreatmentEstimateS.EDFtp value
Intercept01.630.602.71
Intercept13.260.7304.45
Intercept24.020.8104.92
Intercept34.540.8705.17
Intercept45.120.9505.36
TreatmentControl 1-4.61.0325-4.460.0002
TreatmentControl 2-3.010.74-4.040.0004
TreatmentNew Treatment0....

Odds ratio estimates:

TreatmentTreatmentEstimateDF95% CI lower95% CI upper
Control 1New Treatment0.01250.0010.084
Control 2New Treatment0.049250.0110.228

This output is slightly difficult for interpretation. Steve, is this the output you were looking for ? I also had the feeling that something was missing since I remember that for ordinal logistic regression there was an estimate for every value. By the way, my values are 0-5, and the output is 0-4, so is 5 the reference ?

About how 1zmm interpreted the previous output, this is the kind of conclusion I aim for and expect according to the data I have in hand.

Ane one last thing, Steve, about learning before giving advice, your advice and guidance are wonderful, in the past I had a different software and without mentioning names, the support community wasn't anything like that, so thank you very very much !

Addition (Edit 1): Just to see what happens, I changed the quad option to laplace and the results were "in the research's favor", i.e. smaller variance in the covariance parameters estimates. Is there a justification why quad is better than laplace ? Speaking of variance, the AnimalID*Location variance is 1.25. Is it correct to say the the variance of different animals within different locations is 1.25 ? If the response is from 0 to 5, 1.25 seems slightly high, doesn't it (s.d=1.11)

Addition 2 (Edit 2): I have "played" with the random effects, I changed the Animal_ID*Location and Animal_ID*Location*Technique, instead I tried only Animal_ID, and only Location, only Technique and Animal_ID*Technique, and I compared the AIC, AICC and BIC of all models. The lowest information criteria were found in Animal_ID only and in Animal_ID*Technique, does it mean anything ? (difference of 1.57 from the model we are discussing here)

Addition (Edit 3): I ran the estimate command: estimate 'Control 1-New Treat' Treatment 1 0 -1;

I did it 3 times for every combination, these are the results:

estimate of Control 1-New Treat = -4.6

estimate of Control 2-New Treat = -3.0

In both the new treat was (-1).

The exponentiated Estimates are 0.009 and 0.049 respectively.

SteveDenham
Jade | Level 19

Exponentiating the estimates of the various intercept values will give the odds ratios on the cumulative scale that I mentioned before.  These are averaged over the two treatments.  To get what you want, try (and I do NOT guarantee this will even run):

proc glimmix data=Animaldata method=quad;

  class AnimalID Technique Location Treatment;

  model Response=Treatment/noint dist=mult link=clogit oddsratio;/* Added the noint option trying to get separate "lines" per treatment */

  random intercept/subject=AnimalID;

  random intercept/subject=Location*AnimalID;

  random intercept/subject=Technique*Location*AnimalID;

run;

As far as the various models, I put a lot of stock in method=quad, as it models the random effects more completely than does method=laplace, which in a sense, provides a marginal model.  Also, selecting the smallest IC is probably a good method here.

Steve Denham

BlueNose
Quartz | Level 8

Thank you again Steve.

I tried your suggestion, and you are right, it did not run: "ERROR: An intercept can not be suppressed in models for multinomial data"

So that leaves me with the latest output. If you say that the odds ratios of the intercepts are averages of the treatments, than I can't use them for my conclusion, since my aim is to compare the treatments.

Will it be correct to say that the odds of being in the lower response category (0) for both treatments #1 and #2 are less by 0.01 and 0.04 respectively than the odds of being in the lower response category (0) for treatment #3 ? I know for a fact that treatment 3 has a much lower response mean, so this "direction" sounds logical. However, if I look at the intercepts category (5) is the reference and not (0), so I hope I am not confusing...

Can I also say that the estimated variance within the animals and locations within animals on the cumulative logit scale is 1.25 ?

In addition, the fixed effect of treatment was highly significant in the model (F=8.55, P.V=0.0015)


Edit 1: Let me be more accurate if I may, I hope I am starting to get this, is it correct to say the following:


log(Pr(y<=0)/Pr(y>0)) = 1.6316-4.608*X1-3.012*X2 (+random effect)


for treatment 1:


log(Pr(y<=0)/Pr(y>0))=-3  --> Pr(y<=0)/Pr(y>0)= 0.0497 -->Pr(y<=0)= Pr(y>0)* 0.0497, and thus the probability of having a score of 0 in treatment 1 is lower by 0.0497 than the probability of having a higher score, in compare to treatment 3 ? Probably this can be rephrased better...


Edit 2: Choosing the model with smallest IC is a good method unless AIC and BIC will choose different models... 🙂  A model with the animal_ID the only random effect give covariance parameter estimate with a low s.e of 1.14. A model with Animal_ID*Location gives a s.e of 6 for both random effects

SteveDenham
Jade | Level 19

I am going to avoid the questions of interpretation, and try again to get the model to run and give reasonable results.  What kind of output do you get when you remove the noint statement?  I realize now that there needs to be an intercept for each level, but hopefully this parameterization will also give a slope, and from there we can use the estimate statement to get the odds ratios of interest.

As far as choice of IC for model ranking, it is a largely a matter of opinion.  For small to moderate sized datasets, with small to moderate numbers of covariance parameters, AIC is optimal in my opinion.  I realize others have different opinions on this, and am willing to hear their arguments.

Steve Denham

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 20 replies
  • 4453 views
  • 8 likes
  • 3 in conversation