RyanSimmons
Pyrite | Level 9

Hello all,

I have always had a little bit of trouble with the ESTIMATE and related statements in SAS, because I've always found the syntax non-intuitive, especially since how you must specify things is partly proc-dependent.

In any case, I am about 75% confident I have specified things correctly in my case, but I wanted to double-check to make sure my logic and understanding of what SAS is doing is sound. Essentially, I have a dataset with a binomial outcome for individuals in 3 groups at 4 time points. The specific comparison of interest is to compare group 1 with the average of groups 2 and 3 at the first time point (whether this is theoretically justified is another discussion altogether). Here is some example code:

data test;
   input time group @;
   do i=1 to 6;
      input y @;
      output;
   end;
datalines;
1 1 0 0 0 1 1 1
1 2 0 . 1 . 0 0
1 3 0 1 . 0 0 0
2 1 0 . 0 0 0 1
2 2 . 0 0 0 . 0
2 3 1 0 0 0 1 1
3 1 . . 1 0 . 1
3 2 . 1 1 1 1 1
3 3 0 1 . 1 1 .
4 1 0 . 1 0 1 0
4 2 0 1 1 1 0 0
4 3 0 1 0 1 0 .
;

proc genmod data=test desc;
   class i cohort(ref='3') time(ref='1') / param=ref;
   model y = group time group*time / dist=bin link=logit type3;
   estimate 'G1 vs G2+G3, at T1' group 1 -0.5 / e exp;
run;

Now, the logic here is that the model is parameterized as Y = B0 + B1*Group1 + B2*Group2 + B3*Time2 + ... (etc.).

Therefore, if I am looking to compare group 1 versus the average of groups 2 and 3, this is equivalent to testing:

     Group 1 - 0.5*(Group 2 + Group 3) = 0

     (B0 + B1) - 0.5*((B0 + B2) + B0) = 0

     B0 + B1 - B0 - 0.5*B2 = 0

     B1 - 0.5*B2 = 0

Since I am only looking at time 1, all other coefficients can be set to 0.

Thus, it seems to me that I only need "group 1 -0.5" in the ESTIMATE statement to get the exact contrast I need. Based on the output from the 'e' option, this does appear to be doing what I want, but I am not 100% sure that it is, or that this is the correct route to take.

Thanks in advance for any suggestions/advice.

[Edit: fixed mistake in DATA step]

lvm
Rhodochrosite | Level 12

You don't have the syntax right. I am assuming that cohort in the class statement should be group. Your data step won't work because you forgot the INPUT statement. But to get at your problem:

To understand your needed syntax for ESTIMATE, write out your model for each group at time 1:

mu11 = int + G1 + T1 + G1T1

mu21 = int + G2 + T1 + G2T1

mu31 = int + G3 + T1 + G3T1

You want mu11 - (mu21 + mu31)/2. So, substitute the full equations for the mu values, and do some algebra. You get:

G1 - 0.5*G2 - 0.5*G3 + G1T1 - 0.5*G2T1 - 0.5*G3T1
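Writing out that substitution explicitly (the int and T1 terms cancel in the difference):

mu11 - 0.5*(mu21 + mu31)
   = (int + G1 + T1 + G1T1) - 0.5*[(int + G2 + T1 + G2T1) + (int + G3 + T1 + G3T1)]
   = G1 - 0.5*G2 - 0.5*G3 + G1T1 - 0.5*G2T1 - 0.5*G3T1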

I show below how to do this with positional syntax and nonpositional syntax in ESTIMATE (which works on the model parameters), and how to do it with the LSMESTIMATE statement (which works with the means, so you don't need to write out all the parameter components). Here I used a normal distribution, but that is not relevant. For positional syntax, the order is very important: this code matches the order of terms in my CLASS statement, that is, Group followed by Time (Time is then sorted within each Group).

proc glimmix;
   title2 'test case with three groups and 4 times';
   class G T;
   model y = G|T;
   lsmeans G*T;
   estimate 'G1 vs rest at T1, pos.' G 1 -0.5 -0.5 G*T 1 0 0 0  -0.5 0 0 0  -0.5 0 0 0;
   estimate 'G1 vs rest at T1, nonpos.' G [1,1] [-0.5,2] [-0.5,3] G*T [1,1 1] [-0.5,2 1] [-0.5,3 1];
   lsmestimate G*T 'G1 vs rest at T1, nonpos.' [1,1 1] [-0.5,2 1] [-0.5,3 1];
run;

All three statements give the same result.

RyanSimmons
Pyrite | Level 9

Yes, you are right, "cohort" was a typo, that should be "group".

Not sure what you mean about the DATA step, because I do have an INPUT statement in there.

Now, your solution is actually a different model from mine; you've re-parameterized it completely. The ref= options in my GENMOD step have the effect of absorbing the group 3, time 1 effect into the model intercept. My model, to borrow your notation, is:

Y = int + G1 + G2 + T2 + T3 + T4 + G1*T2 + G1*T3 + G1*T4 ...

Which is different from your model of:

Y = int + G1 + G2 + G3 + T1 + T2 + T3 + T4 + G1*T1 + G1*T2 ...

While your algebra is correct, it doesn't contradict mine, since my model lacks both the group 3 effect coefficient and all group*time 1 effect coefficients. Eliminating those, your test reduces to G1-0.5*G2, which is exactly what I had initially.

Now, I suppose we could have a discussion on which is a better way to frame the model, but they are still fundamentally different models with different interpretations for the parameters. Your model measures the parameters as deviations from the overall mean, whereas my model measures the parameters as deviations from the mean of group 3 at time 1. I personally prefer the latter paradigm when dealing with control and intervention groups (as I am here; what the groups are is a long story and beside the point), since it is more natural to be implicitly comparing the intervention group with the control group directly, rather than comparing both against a grand mean as in your model. That said, I am open to any other suggestions as to why your model might be preferable.

I appreciate your GLIMMIX example insofar as it does help me understand the ESTIMATE syntax better in general, but it doesn't actually address whether my model was correctly specified.

lvm
Rhodochrosite | Level 12

You need an INPUT statement before time in your data step (in addition to the other INPUT statement). When I copied and pasted your code directly into SAS, it generated an error. I missed that you were using the reference parameterization; I just overlooked that. Sorry. Your coding for the ESTIMATE statement appears correct for this parameterization.

With that said, I personally have a strong dislike of the reference parameterization with factorials, but it may be fine for your purposes. The meaning of tests of main effects (and interactions?) is strained, at best. There have been several discussions of this in the past on this site. Type3 tests are not testing the equality of main-effect means to each other. In fact, expected values may not even be defined with this parameterization. Try putting in an LSMEANS statement with the reference parameterization, and you will get an error message that one needs the GLM parameterization for expected values; the same applies to LSMESTIMATE. The GLM parameterization is actually not comparing the groups to the mean, but to the "last" level (or levels) of the factor (this is because GLM is an overparameterization). Contrasts can give all the other needed comparisons.
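For instance, a minimal sketch of what I mean (using the same G and T class variables as in my code below; I am not reproducing the log here):

proc genmod data=test;
   class G T / param=ref;   /* reference parameterization */
   model y = G|T / dist=bin type3;
   lsmeans G*T;             /* rejected: LSMEANS (and LSMESTIMATE) require GLM-parameterized class effects */
run;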

Of course, many find reference parameterization useful, and it may be fine for your needs.

RyanSimmons
Pyrite | Level 9

Oops, I see what you mean now. My mistake on that DATA step.

Anyway, thank you for your input. I will definitely think more about the parameterization and what it actually means for the analysis. Thanks!

lvm
Rhodochrosite | Level 12

Using your reference parameterization and your contrast statement, one gets the same result as using my GLM parameterization and my contrast statement. This has to occur because the contrast here is basically a so-called simple effect of the interaction means. For these kinds of questions, either parameterization (with the corresponding correct contrasts) will give valid results. My concerns about the reference parameterization with factorials have to do with the meaning of test statistics for main effects, expected values, and so on.

RyanSimmons
Pyrite | Level 9

Hmm, maybe I'm missing something here, but they don't give IDENTICAL results, since the parameters have different interpretations. Using your code, the estimate for group 1 versus the other two groups at time 1 is 0.275, whereas using my code the estimate is 0.776. They do, however, have nearly identical values for their test statistics and p-values, although (as you say) the meaning of these is different.

lvm
Rhodochrosite | Level 12

Actually, the estimate is -1.24 on a logit scale using both approaches. These must agree. You can calculate this by hand by looking at the interaction means with the GLM parameterization. Be careful: my ESTIMATE syntax assumes that the CLASS statement is G T (not T G). The type3 test statistics, however, are very different for the main effects.

proc genmod data=test;
   class G T / param=glm;
   model y = G|T / dist=bin type3;
   lsmeans G*T;
   estimate 'G1 vs rest at T1, pos.' G 1 -0.5 -0.5 G*T 1 0 0 0 -0.5 0 0 0 -0.5 0 0 0;
   lsmestimate G*T 'G1 vs rest at T1, nonpos.' [1,1 1] [-0.5,2 1] [-0.5,3 1] / ilink;
run;

proc genmod data=test;
   class G(ref='3') T(ref='1') / param=ref;
   model y = G|T / dist=bin type3;   *<--sometimes it is useful to look at estimated EFFECTS (alpha, etc.);
   estimate 'G1 vs rest at T1' G 1 -0.5;
run;

RyanSimmons
Pyrite | Level 9

Ah! You're right. My mistake. I accidentally specified "exp" instead of "ilink" for transforming the estimates!
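To spell out the difference I tripped over (a quick sketch, with an illustrative logit-scale value rather than actual output): exp exponentiates the logit-scale estimate, whereas ilink applies the inverse link, here the inverse logit.

data _null_;
   est = -1.24;                     /* a logit-scale contrast estimate, value for illustration only */
   exp_est   = exp(est);            /* what the EXP option reports: the exponentiated estimate */
   ilink_est = 1 / (1 + exp(-est)); /* what the ILINK option reports: the inverse link applied to the estimate */
   put exp_est= ilink_est=;
run;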

Anyway, thank you very much for your help on this. It has given me a lot to think about, and has helped me understand the underlying SAS machinery better.

RyanSimmons
Pyrite | Level 9

So I've been fiddling with both parameterizations more, and still have some questions/qualms about the GLM parameterization.

Since it overparameterizes the model, the GLM parameterization induces linear dependencies in the design matrix. SAS essentially reduces it to a full-rank matrix by setting the last category of each class variable (and thus all of its interaction terms) to 0. In effect, it is simply setting the last category of each variable as the reference category. How is this any different from my specifying the reference parameterization but with G(ref='3') T(ref='4')? From quickly viewing the output, it seems that these give entirely identical results (the two specifications I compared are sketched below).

In that case it seems to me that the reference parameterization is preferable, insofar as it allows you to choose which category is the referent (and thus choose when and where you need to deal with the interaction terms explicitly when using ESTIMATE), whereas the GLM parameterization essentially forces it to always be the last category. Naively, it doesn't appear to change the interpretation of the coefficients except with respect to which level of the class variable you are implicitly comparing the others to, which then becomes a question of which contrast is most meaningful for your study; in that case the reference parameterization would seem better, since it gives you the flexibility to design your analysis around a specific contrast. Thoughts?
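For reference, these are the two specifications I compared (a sketch; the MODEL statement is the same in both):

proc genmod data=test desc;
   class group(ref='3') time(ref='4') / param=ref;   /* reference coding with the last levels as referents */
   model y = group|time / dist=bin type3;
run;

proc genmod data=test desc;
   class group time / param=glm;   /* GLM coding; the "last" levels get the zeroed rows */
   model y = group|time / dist=bin type3;
run;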

lvm
Rhodochrosite | Level 12

I can see why you would have lots of questions about the two kinds of parameterizations. Justifying the GLM parameterization would take lengthy explanations, not suited to a discussion answer (at least I can't think of easy and short explanations -- maybe others can). This all goes back to the 1970s, when Jim Goodnight (founder and still CEO of SAS) was deciding on the parameterization for PROC GLM (and others). Basically, they needed to come up with a general approach that would work for any factorial, including nesting, and so on. He and colleagues made it quite clear why the GLM (over-)parameterization was needed, in general, for factorials. There are some old papers and reports that I don't have available right now. It has to do with estimable functions, a concept that is often difficult to understand.

With the over-parameterization, it looks like those zero parameters are not there, but they really are. SAS does not just find the last levels (or whatever you make the reference) and set them to 0. They are present in the model fitting. But the generalized inverse of the matrix in the model-fit step ends up with 0s for those terms (the constraint used to get the inverse). But they are still there. This may seem to be the same thing, but it's not. One cannot define expected values of main effects (and maybe other effects) unless those zero parameters are there. Likewise, type3 (or other) tests of hypotheses (especially for main effects) can only be interpreted in terms of expected values (means) with the GLM parameterization.

If you are only interested in the highest-level interaction, or if you have only one factor, you don't need to be concerned about this (other than in defining the contrast of interest). Also, if all the variables in your model are factors, then all of these parameterizations give you the same model fit (just with different coefficients). But some things cannot be done with the reference parameterization. By the way, you can control the reference level with the GLM parameterization (I show this below). I can get LSMEANS, etc., with this parameterization. (Be careful: SAS rearranges the parameters. See in the solution table that T1 goes last. This changes the order of things in the ESTIMATE statement.) This looks very much like your output with the reference parameterization, but it has the subtle differences I describe above. There are also not-so-subtle differences (the type3 tests are totally different for the main effects of G and T under the reference and GLM parameterizations).

Finally, only the GLM parameterization is available directly in PROC GLM, MIXED, and GLIMMIX, the flagship procedures in SAS for factorials. This is not an accident. It is the parameterization that works best, and makes the most sense, as a general framework for factorials. For your application, it doesn't matter.

proc genmod data=test;
   class G(ref='3') T(ref='1') / param=glm;
   model y = G|T / dist=bin type3;
   lsmeans G*T;
   estimate 'G1 vs rest at T1, pos.' G 1 -0.5 -0.5 G*T 0 0 0 1  0 0 0 -0.5  0 0 0 -0.5;
   lsmestimate G*T 'G1 vs rest at T1, nonpos.' [1,1 4] [-0.5,2 4] [-0.5,3 4] / ilink;
run;

RyanSimmons
Pyrite | Level 9

Thank you very much. Very thorough response. I am going to try to dig up some of those old papers and reports. You are right that for this specific application it doesn't seem to matter, but I am interested in exploring the issue more broadly for future modeling. If you could point me towards any sources, it would be much appreciated. In all of my master's-level statistics classes, we used the equivalent of the ref parameterization almost exclusively, so I am especially interested in looking into this to improve any inefficiencies in the way I construct these sorts of models.
