Interpreting the Lines or Connecting Letters Report

I had an analyst contact me the other day. He was conducting what seemed to be a straightforward one-way analysis of variance (ANOVA) and ran into a problem. He requested the LINESTABLE option and saw this result:

These results concerned him, and he was unsure how best to explain them. In this post, I will discuss some ways to understand and explain these results.

The analyst in this situation had a small dataset. Although I do not have the data, we can make a small example that will yield similar results. This program is one simple way to illustrate the issue:

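/* Three groups with three observations each */
data example;
   input group $ response;
   datalines;
Level_1 46
Level_1 50
Level_1 54
Level_2 56
Level_2 60
Level_2 64
Level_3 66
Level_3 70
Level_3 74
;
run;
proc print data=example;
run;
/* One-way ANOVA with Tukey pairwise comparisons */
proc anova data=example;
   class group;
   model response=group;
   /* LINES: graphical display; LINESTABLE: connecting letters table */
   means group / tukey lines linestable;
run;
quit;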

Notice that I have added the LINES and LINESTABLE options to the MEANS statement. The LINES option provides a graphical representation of the results, while the LINESTABLE option provides the results in tabular form.

Having an example similar to this, the analyst correctly noted that there was a statistically significant difference between the means of the three groups of data, as indicated by the ANOVA table.

To determine where the differences were, a Tukey multiple comparison test was conducted by using the MEANS statement. This led to the connecting letters report from the LINESTABLE option shown earlier and repeated here.

I also added the LINES option, which shows this information in a graphical format, since some people prefer a picture to a table with letters.

The analyst said that he could clearly see that Level_3 was different from Level_1. But Level_2 was causing some concern. As he stated: "If I had only studied Level_1 and Level_2 (or Level_3 and Level_2), I would conclude that no differences were found and that Level_2 observations are indistinguishable from Level_1 (or Level_3) observations. The only differences are due to noise. Without testing all three levels, I would have concluded that the mean of Level_2 is the same as the mean of Level_1 or Level_3. But what if I did not have enough resources to test all three levels? Would I have been misled? Does this imply that you should always test every possible level of a grouping variable in order to achieve the correct results?"

These are very good questions, but there is a flaw in the logic buried in them. The Tukey multiple comparison test, like most statistical tests, has a null hypothesis of equality. For this situation, the Tukey procedure performs three tests. The null hypotheses are:

Mean of Level_1 = Mean of Level_2,

Mean of Level_1 = Mean of Level_3, and

Mean of Level_2 = Mean of Level_3.

Remember that for statistical hypothesis tests, the null hypothesis is assumed to be true until the data tell us otherwise. The fact that the data do not tell us otherwise does NOT make the null hypothesis true. In other words, we do not want to conclude that the null hypothesis is true. This test can only declare groups to have different means. It does not declare groups to have the same mean.

Going back to the questions that he raised, this point was made: “…I would have concluded that the mean of Level_2 is the same as the mean of Level_1 or Level_3.” That statement is the flaw. It treats the null hypothesis as true, and our statistical test does not allow you to draw that conclusion.

This test allows you to say that there is a statistically significant difference between Level_1 and Level_3 at the 95% confidence level. However, we are not able to see a statistically significant difference between Level_1 and Level_2 or between Level_2 and Level_3. There may be a difference, but we could not detect it at 95% confidence. Perhaps a larger sample size is needed to better assess not only the means of each group, but also the variances of the groups. Finally, one of the best ways to see that we cannot conclude that the means are equal is to create confidence intervals for the group differences. This is done with the CLDIFF option on the MEANS statement. So, the MEANS statement would look like this:

means group/tukey cldiff lines linestable;

Adding this to the code will result in a table with every pairwise mean difference.

This table shows that even for Level_1 versus Level_2, the 95% confidence interval for the difference in means runs from -20.021 to 0.021. Although the interval contains 0, it also contains large negative values, so the true difference could be far from zero. Remember that when interpreting results, it is very easy to fall into the trap of accepting the null hypothesis. So be on the lookout for that when forming your own conclusions from your analyses.
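
For readers who want to see where those limits come from, the intervals can be reproduced by hand. The following is a minimal sketch, not the code that PROC ANOVA runs internally. It assumes the balanced example data from earlier in this post, so the group means (50, 60, and 70) and the error mean square (96/6 = 16) are typed in directly. The data set name tukey_check and the variable names are made up for illustration. The PROBMC function with the 'RANGE' distribution returns the studentized range quantile.

```sas
/* Hand check of the Tukey intervals: a sketch under the assumptions above */
data tukey_check;
   input pair :$20. mean_a mean_b;
   n   = 3;                  /* observations per group (balanced design) */
   k   = 3;                  /* number of groups                         */
   dfe = k*(n - 1);          /* error degrees of freedom: 6              */
   mse = 16;                 /* error mean square: 96/6, typed in        */
   /* 95th percentile of the studentized range for k groups, dfe df      */
   q   = probmc('RANGE', ., 0.95, dfe, k);
   msd = q*sqrt(mse/n);      /* minimum significant difference           */
   diff  = mean_a - mean_b;  /* observed difference in group means       */
   lower = diff - msd;       /* simultaneous 95% confidence limits       */
   upper = diff + msd;
   datalines;
Level_1-Level_2 50 60
Level_1-Level_3 50 70
Level_2-Level_3 60 70
;
run;

proc print data=tukey_check noobs;
   var pair diff msd lower upper;
run;
```

For the Level_1 versus Level_2 row, the minimum significant difference is about 10.021 against an observed difference of -10, which reproduces the limits of -20.021 and 0.021. The observed difference misses the significance threshold by a very small margin, which is a long way from evidence that the two means are equal.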
