BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
natanya
Calcite | Level 5
Hello,

Will running proc glm on ranked data with contrasts/estimates, give me accurate non-parametric pairwise comparisons (Dunn/Bonferroni)? The n is small 15/gp with 5 groups.

The code looks something like this:

proc rank data=a out=ra;
var xx;
ranks rxx;
run;

proc sort data=ra;
by group;
run;

proc glm data=ra order=data;
class group;
model rxx=group;
means group/bon;
lsmeans group /adjust=bon pdiff cl ;
estimate "gp1 vs gp2" group 1 -1 0 0 0 /e;
estimate "gp2 vs gp4" group 0 1 0 -1 0 /e;
estimate "gp3 vs gp5" group 0 0 1 0 -1 /e;
run;
quit;

I found a macro written by Paul Juneau which calculates nonparametric pairwise comparisons using Dunn's method. Here is the link http://www.misug.org/misug_pastproceedings.html - scroll down to june 2004.
When I ran this macro I got different pairs significant (or not significant) than when I used the glm code on ranks.
What is the difference and what is the more correct approach?

Thanks in advance.
1 ACCEPTED SOLUTION

Accepted Solutions
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
It looks like you have a one-way layout, which means you can use the Kruskal-Wallis test (in the NPAR1WAY procedure). I bet the macro you found is using the results from this KW analysis to do the pairwise comparisons. Be very careful in using parametric methods with ranks (as you are doing with GLM). There are several books that tell you to simply perform parametric (e.g., ANOVA) methods on the ranks when there is not a customized nonparametric method available. But there is a lot of good statistical theory from the last 15 years to show that this is very misleading. With a 1-way layout, there is not much of an issue (your situation), but when you get to factorials, there are all kinds of problems with this "rank transformation method". A very good (but technical) description/synthesis of appropriate methods is in Brunner and Puri (2001) in the journal Statistical Papers (vol. 42, pages 1-52). The basic problem is that with ranks, one get unequal variances for the different groups, even if the variances were all equal with the original (unranked) data. Thus, one needs test statistics that take account of the variance heterogeneity. This can be done with the MIXED procedure, with the ANOVAF option in the MIXED statement, and with other options to specify unequal variances. For your example, I would use:

proc mixed data=ra order=data ANOVAF ;
class group;
model rxx=group;
repeated / group=group type=un(1); *<--use this even though you do not have repeated measures;
lsmeans group /adjust=bon pdiff cl ;
estimate "gp1 vs gp2" group 1 -1 0 0 0 ;
estimate "gp2 vs gp4" group 0 1 0 -1 0 ;
estimate "gp3 vs gp5" group 0 0 1 0 -1 ;
run;

Ignore the regular F statistic, but use the test results from the additional table that is generated in the output.

View solution in original post

9 REPLIES 9
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
It looks like you have a one-way layout, which means you can use the Kruskal-Wallis test (in the NPAR1WAY procedure). I bet the macro you found is using the results from this KW analysis to do the pairwise comparisons. Be very careful in using parametric methods with ranks (as you are doing with GLM). There are several books that tell you to simply perform parametric (e.g., ANOVA) methods on the ranks when there is not a customized nonparametric method available. But there is a lot of good statistical theory from the last 15 years to show that this is very misleading. With a 1-way layout, there is not much of an issue (your situation), but when you get to factorials, there are all kinds of problems with this "rank transformation method". A very good (but technical) description/synthesis of appropriate methods is in Brunner and Puri (2001) in the journal Statistical Papers (vol. 42, pages 1-52). The basic problem is that with ranks, one get unequal variances for the different groups, even if the variances were all equal with the original (unranked) data. Thus, one needs test statistics that take account of the variance heterogeneity. This can be done with the MIXED procedure, with the ANOVAF option in the MIXED statement, and with other options to specify unequal variances. For your example, I would use:

proc mixed data=ra order=data ANOVAF ;
class group;
model rxx=group;
repeated / group=group type=un(1); *<--use this even though you do not have repeated measures;
lsmeans group /adjust=bon pdiff cl ;
estimate "gp1 vs gp2" group 1 -1 0 0 0 ;
estimate "gp2 vs gp4" group 0 1 0 -1 0 ;
estimate "gp3 vs gp5" group 0 0 1 0 -1 ;
run;

Ignore the regular F statistic, but use the test results from the additional table that is generated in the output.
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
My sas code got cut off, so here it is again:

proc mixed data=ra order=data ANOVAF ;
class group;
model rxx=group;
repeated / group=group type=un(1);
lsmeans group /adjust=bon pdiff cl ;
estimate "gp1 vs gp2" group 1 -1 0 0 0 ;
estimate "gp2 vs gp4" group 0 1 0 -1 0 ;
estimate "gp3 vs gp5" group 0 0 1 0 -1 ;
run;
Dale
Pyrite | Level 9
lvm,

I was not aware of the ANOVAF option in the MIXED procedure until you pointed it out. When you indicated that the problem with rank-based methods is that the variances are heteroscedastic, my initial reaction was that one could employ the EMPIRICAL option with the MIXED procedure to construct robust estimates of the covariance structure. Since there are deficienies with the classical sandwich variance estimator, further consideration suggests use of the GLIMMIX procedure instead to construct improved sandwich variance estimates (e.g., EMPIRICAL=HC3 or EMPIRICAL=MBN).

I have not had time to study the ANOVAF option and how that would protect against misspecification of the residual variance structure. But I wondered if you have thoughts on the use of the ANOVAF option as opposed to using a robust covariance structure to account for residual variance which is not constant. What do you think of the code:

proc glimmix data=ra order=data empirical=hc3;
  class group;
  model rxx=group;
  random _residual_;
  lsmeans group / adjust=bon pdiff cl;
  estimate "gp1 vs gp2" group 1 -1 0 0 0 ;
  estimate "gp2 vs gp4" group 0 1 0 -1 0 ;
  estimate "gp3 vs gp5" group 0 0 1 0 -1 ;
run;

I am also curious about your REPEATED statement which does not specify an effect nor does it have a subject specification. Is that correct or does that need some modification?
Dale
Pyrite | Level 9
Duplicate post condensed down to this note. Message was edited by: Dale
Dale
Pyrite | Level 9
Duplicate post condensed down to this note. Message was edited by: Dale
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
Dale, this discussion is getting off the subject raised by the original poster, but I will expand a little. Your suggestion is a good way to come up with variance or covariance estimates that are not too dependent on the model chosen. But these variance and covariance (depending on the model) are used directly in the formulas for the Wald (chi-square) or scaled Wald (F) statistics calculated in MIXED and GLIMMIX. The ANOVAF option in MIXED (not in GLIMMIX) does something different: it calculates a different test statistic (a different formula: see the MIXED User's Guide) based on whatever variances and covariances are estimated. That is, one gets the regular table of Wald test statistics, but also another table of the "Anova-Type-Statistics" (ATS) (with estimated df_N and df_D based on the variances and covariances). This test statistic also has an F distribution asymptotically (but it works great at small sample sizes). The ideas go back many years, and one can read some of this in the article by Brunner and Puri I mentioned (and the citations therein), as well as in the excellent book by Brunner et al. (Nonparametric Analysis of Longitudinal Data in Factorial Experiments). A more recent technical paper (on a more complex layout) by A. Bathke et al. (The Amer. Stat., vol 63, pages 239-246 [2009]) may be of interest. A less technical article for biologists is by D. Shah and L. Madden in the scientific journal Phytopathology (vol. 94, pages 33-43 [2004]).

For one-way layouts, there have been several tests for a while. Even in GLM procedure, the Welsh option on the MEANS statement gives one another test statistic that is more customized to unequal variances. The ANOVAF option in MIXED gives a different test, but is in the same spirit. The ANOVAF can deal with a wider range of models.

For the technical details, one does NOT need a subject option on the REPEATED statement for this situation (in MIXED), or for any situation where all the observations are independent. Your use this statement just to get a different variance (using group option) for each group. With correlated data (split plots and repeated measures), the statement is more complicated, and a group and a subject option are needed, with a type=un (usually).

The theory by Brunner et al., going back to the theory by Box (1954) is all moment based. Thus, one is really using the MIXED procedure simply as a crank to work through the calculations. (There are also macros available that do everything in IML,and also R scripts). Although I didn't list it, I should have put method=mivque0 on the MIXED statement to get moment-based estimates. It can blow up without this.

You might rightly point out that the Wald statistics perform pretty well with unequal variances. That is true, but with ordinal data, however, there can be many ties, and 0 variances for some groups. This really causes problems with the Wald statistics (things can go to infinity easily). But the ATS (ANOVAF option) is designed to work just fine under these circumstances (ill conditioned covariance matrix, even a singular covariance matrix).

Regarding your code with GLIMMIX, the empirical option won't give you separate variances for each group. Actually, your random _residual_ statement will result in two residual variances (the regular one and another one fixed at 1). You would need
random _residual_ / group=group subject=group;
to get what you want. And with the empirical option, you do need the subject= option .
Dale
Pyrite | Level 9
lvm,

Thanks. I'll have to read more about the ANOVAF option as time permits. I agree that we have departed in some ways from the OP's question. But I hope that the discussion is informative and of some value to all who follow this thread.

I still have questions about your REPEATED statement because you specify TYPE=UN(1) without a SUBJECT specification. Your response indicated that a SUBJECT specification was not needed when all observations are independent. My understanding of the REPEATED statement is that no SUBJECT specification is needed if the residuals can be assumed IID (within any group structure). When some structure other than IID (within group) is specified, a SUBJECT specification is necessary. The TYPE=UN(1) specification leads me to believe that a diagonal covariance matrix is to be fit with separate residual variance estimates for each observation (which would be an overparameterized model).

I just simulated some data to try your REPEATED statement with and without the TYPE=UN(1) specification. It turns out that results are identical. The TYPE=UN(1) option really does nothing here except relabel the name of the covariance parameter in the table of covariance parameter estimates. The TYPE=UN(1) specification is superfluous - harmless, but to me misleading about the residual covariance structure.

I have to disagree on several points about the GLIMMIX code. When observations are assumed independent, then it is not necessary to specify a SUBJECT= specification when the EMPIRICAL option is specified. This case is comparable to using the ACOV option with PROC REG to get the Huber-White sandwich covariance estimates. There is no SUBJECT specification when the ACOV option is employed with PROC REG because all observations are regarded as separate subjects. The same is true for the GLIMMIX procedure with EMPIRICAL option and the RANDOM _RESIDUAL_ statement which I specified. (You are correct to point out that the RANDOM _RESIDUAL_ statement is not necessary here. The extra "residual variance" term in this case is a multiplicative scale factor with value 1.0. It does not actually affect the results to employ a RANDOM _RESIDUAL_ statement. Just like the TYPE=UN(1) specification for your REPEATED statement in PROC MIXED, the RANDOM _RESIDUAL_ statement here is superfluous - harmless, but misleading about the covariance structure.)

I would note that the EMPRICAL option takes care of misspecification of the residual variance structure - after the fact of model estimation. We specifically make no attempt to model the residual covariance structure, allowing the robust sandwich covariance estimates to take care of the covariance structure misspecification. Whether use of the EMPIRICAL option with the GLIMMIX procedure or the ANOVAF option of the MIXED procedure is the best approach to employ is certainly open to argument and not something that I am really informed on at this point in time.
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
Dale, I actually agree on most of your points. It is true that you don't need UN(1) for this situation. I used the UN(1) specification merely for comparison with more complex situations (with correlations), where type=UN (or some other choice) would be necessary.

You definitely do not need a subject option with REPEATED for this simple situation. Check out the examples on pages 350 and 352 (sections 9.2.2, 9.2.3) in SAS for Mixed Models, 2nd edition (Littell et al. [2006]). You definitely need a subject option with correlated data, of course, as in true repeated measures, split-plots, and so on.

I actually agree with your general statements about the use of EMPIRICAL options in GLIMMIX. I just didn't write enough to fully explain myself, whether dealing with GLM mode vs. GLMM mode, and so on.

As I wrote before, there is a fundamental difference in philosophy (and approach) with the use of the ANOVAF option versus the robust estimation of the variance-covariance matrix, with the former specifying a different test statistic for factor effects.Both have many advantages.
natanya
Calcite | Level 5
Thank you very much Dale and lvm. I am happy to see that my questions has sparked such an ongoing discussion of great minds. As to my original question, since my data is one-way and strait forward, the procedures suggested including my original suggestion have generated similar results and so I am comfortable with my conclusions. Your discussion will require me to do much research - which, as time permits, I will get to. In the meanwhile, thanks for letting us all "listen in" and learn new things.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 10606 views
  • 0 likes
  • 3 in conversation