Re: Just curious how others present MEAN differences. What do YOU do?

deleted_user · Posted 03-18-2010 01:41 PM

Let’s say you want to report differences among group means following ANOVA, and you want to inform the reader about the ‘spread’ of the data points around the mean of each group.

If no random variable exists (such as a blocking effect) and the data is balanced, I’d present arithmetic means with standard deviations obtained from PROC GLM with the MEANS statement.

But the dataset I deal with contains a random variable as well as occasional imbalance among groups as a result of extreme outliers. In such a case, I prefer to use PROC MIXED with the LSMEANS statement.

However, in such a case, I am not sure what information to present with the lsmean values to indicate the ‘spread’ of the data points for each group (because the LSMEANS statement does not provide a separate standard error or standard deviation for each group being compared). Is there a procedure that avoids this issue?

Thanks always for letting me pick your brain.

Paige · Posted 03-18-2010 02:59 PM

> If no random variable exists (such as a blocking
> effect) and the data is balanced, I’d present
> arithmetic means with standard deviations obtained
> from PROC GLM with the MEANS statement.

I would hope you don't really mean this. If you have no random variable, then you have no statistical analysis. Just about any data you collect in the real world is a random variable.

> But the dataset I deal with contains a random
> variable as well as occasional imbalance among groups
> as a result of extreme outliers. In such a case, I
> prefer to use PROC MIXED with the LSMEANS statement.

This is rather unclear as well. Imbalance, in the statistical sense, refers to a difference in sample size between groups, and it cannot result from outliers. So, do you mean you have an unbalanced design (different sample sizes in groups) along with outliers? Or not? Or both?

In any event, the method for dealing with outliers is not LSMEANS. LSMEANS can be used to deal with unbalanced designs (unequal sample sizes in each cell) in two-way or higher designs, but again, LSMEANS offers you no protection against outliers. (In a simple one-way design with unbalanced sample sizes, the means are the same as the LSMEANS.)
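A minimal sketch of that last point, in Python rather than SAS. The data and the names `cells`, `arithmetic_mean`, and `ls_mean` are hypothetical. It assumes a fixed-effects two-way model with interaction and no empty cells, where the least squares mean for a level of A is the equal-weight average of its cell means; a mixed model with random effects can give different values.

```python
from statistics import mean

# Hypothetical unbalanced two-way data: (level of A, level of B) -> observations.
cells = {
    ("A1", "B1"): [10.0, 12.0, 11.0, 11.0],  # n = 4
    ("A1", "B2"): [20.0],                    # n = 1
    ("A2", "B1"): [14.0],                    # n = 1
    ("A2", "B2"): [24.0, 26.0, 25.0, 25.0],  # n = 4
}


def arithmetic_mean(level):
    """Pool every observation at this level of A, ignoring B."""
    pooled = [y for (a, _), ys in cells.items() if a == level for y in ys]
    return mean(pooled)


def ls_mean(level):
    """Average the cell means, giving each level of B equal weight."""
    cell_means = [mean(ys) for (a, _), ys in cells.items() if a == level]
    return mean(cell_means)


for level in ("A1", "A2"):
    print(level, "arithmetic mean:", arithmetic_mean(level), "lsmean:", ls_mean(level))
```

The arithmetic mean is pulled toward whichever cell holds the most observations, while the equal-weight average is not. With a single factor there is only one cell per group, so the two calculations coincide.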

There are several ways to deal with outliers (regardless of balance or imbalance in the design). One is to use a robust estimation procedure (PROC ROBUSTREG). Another is to fit a model, eliminate the outliers, and then re-fit. I'm sure there are other methods people use as well.

deleted_user · Posted 03-18-2010 04:02 PM

Thanks, Paige, for your reply! Oops, I meant “random EFFECT”, not “random variable”. You are right, having no random variable doesn’t make much sense. As for the outliers, what I meant was that I have extreme outliers that I do not include in my analysis. I appreciate your suggestion of PROC ROBUSTREG and the other options. I have tried PROC ROBUSTREG, but I found the concept rather difficult to understand. I might have to give it another try if everything else fails… Sorry for the confusion.

Let me re-state my question with corrections:

Let’s say you want to report differences among group means following ANOVA, and you want to inform the reader about the ‘spread’ of the data points around the mean of each group.

If no random effect exists (such as a blocking effect) and the data is balanced, I’d present arithmetic means with standard deviations obtained from PROC GLM with the MEANS statement.

But the dataset I deal with contains a random effect as well as occasional imbalance among groups as a result of extreme outliers (that I do not include in my analysis). In such a case, I prefer to use PROC MIXED with the LSMEANS statement.

However, under such circumstances, I am not sure what information to present with the lsmean values to indicate the ‘spread’ of the data points for each group (because the LSMEANS statement does not provide a separate standard error or standard deviation for each group being compared). Is there a procedure that avoids this issue?

Thanks for your time!

SteveDenham · Posted 03-19-2010 07:34 AM

What is wrong with presenting the arithmetic mean and standard deviation, if your objective is to talk about the spread of the data in each group? There is that subtle but important difference between the distribution of the data and the distribution of the mean of the data. The least squares means and associated standard errors tell us something about the expected values of the population of means--what we would expect to see if we repeated the experiment over and over again. The lsmean is the estimate of location, and the associated standard error is the estimate of the variability in all of those means, given the model that is fit. It doesn't really reflect the variability of the data in the single realization of the experiment that you have at hand--for that the standard deviation is as useful as anything. But there are other estimates for that single realization that might be more robust, with the interquartile range popping into my mind.
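As a rough illustration of why the interquartile range can be the more robust summary, here is a small Python sketch (not SAS). The data and the helper `iqr` are hypothetical, and `statistics.quantiles` uses its default "exclusive" method, which may not match the quartile definition a given SAS procedure uses.

```python
from statistics import mean, quantiles, stdev

# Hypothetical group with one extreme value, and the same group without it.
with_outlier = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 60.0]
without_outlier = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0]


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


for label, values in (("with outlier", with_outlier),
                      ("without outlier", without_outlier)):
    print(f"{label}: mean={mean(values):.2f} "
          f"sd={stdev(values):.2f} iqr={iqr(values):.2f}")
```

A single extreme value inflates the standard deviation many times over, while the interquartile range barely moves.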

Here is another way of thinking about this: Suppose you repeated the experiment a thousand times. Each time you would get somewhat different values of the arithmetic means and standard deviations. Now look at the distribution of the means that came out of all this. The variability of this new distribution is estimated with the "standard error of the mean". If we want to compare means to see if they are "statistically different", these are the distributions of interest--not the distribution of the data.
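The thought experiment above can be simulated directly. This is only a sketch in Python (not SAS): the population mean, population standard deviation, and sample size are made-up values, and the data are assumed to be normally distributed.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# All four values are hypothetical choices for the illustration.
POP_MEAN, POP_SD, N, REPEATS = 400.0, 150.0, 8, 1000


def run_experiment():
    """Draw one sample of size N and return (sample mean, sample SD)."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    return mean(sample), stdev(sample)


results = [run_experiment() for _ in range(REPEATS)]
sample_means = [m for m, _ in results]
sample_sds = [s for _, s in results]

spread_of_data = mean(sample_sds)      # typical SD within one experiment
spread_of_means = stdev(sample_means)  # empirical standard error of the mean
theoretical_se = POP_SD / N ** 0.5     # sigma / sqrt(n)

print(f"average within-sample SD: {spread_of_data:.1f}")
print(f"SD of the {REPEATS} sample means: {spread_of_means:.1f}")
print(f"sigma / sqrt(n): {theoretical_se:.1f}")
```

The spread of the simulated means should land close to sigma / sqrt(n) and well below the spread of the data inside any one experiment, which is the distinction between a standard error and a standard deviation.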

And all this depends on taking a frequentist approach to analysis, rather than a Bayesian approach. Distributional assumptions are important.

Just my opinion.

SteveDenham

deleted_user · Posted 03-19-2010 12:22 PM

Hi Steve, thanks so much for your detailed explanation. I appreciate your patience on this matter as I can be slow to catch on to statistical things.

This time around, I see better the distinction between LSMEANS (best estimate of the population mean) and MEANS (arithmetic mean of a given sample at hand). Also, your explanation really helped me to distinguish “lsmean +/- standard error” vs. “arithmetic mean +/- standard deviation”.

As you reminded me, when I compare groups, my interest is not to compare the means of these particular groups (made up of random samples), but to make an inference to the population means and see if the samples were drawn from populations with statistically different population means.

I realize that different disciplines have their own conventions, but going by what I’m learning here, I begin to feel that it makes the most sense to report “lsmean +/- standard error” when conducting inferential statistics (I wonder what others think of this).

At the same time, this brings me back to my dilemma and the reason for starting this thread:

I’m beginning to like the idea of reporting “lsmean +/- standard error”, but how informative is it for the readers of my report to see lsmeans with identical standard errors (see below)? Thanks for this learning opportunity Steve.

e.g.

Group    Mean estimate (lsmean)    Standard error
1        558.75                    64.67
2        406.63                    64.67
3        190.18                    64.67

Paige · Posted 03-19-2010 03:28 PM

You have to decide what the purpose is of providing such a table.

If you want to enable your audience to determine whether or not the LSMEANS are statistically different, you need to show the standard errors. Alternatively, you could present confidence intervals for the LSMEANS.
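A rough sketch of the confidence-interval option, in Python rather than SAS, using the lsmeans and standard error posted earlier in the thread. One loud assumption: the error degrees of freedom are not given in the thread, so a normal critical value stands in for the t quantile that the model would actually use. The real intervals would therefore be somewhat wider.

```python
from statistics import NormalDist

# LSMEANS and the shared standard error as posted in this thread.
lsmeans = {"1": 558.75, "2": 406.63, "3": 190.18}
standard_error = 64.67

# ASSUMPTION: normal approximation, because the degrees of freedom are
# unknown here. A t quantile with the model's error df would be larger.
critical = NormalDist().inv_cdf(0.975)
half_width = critical * standard_error

intervals = {
    group: (estimate - half_width, estimate + half_width)
    for group, estimate in lsmeans.items()
}

for group, (lower, upper) in intervals.items():
    print(f"group {group}: lsmean={lsmeans[group]:.2f} "
          f"approx. 95% CI=({lower:.2f}, {upper:.2f})")
```

Because the standard error is the same for every group, the intervals all have the same width and differ only in where they are centered.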

If you want to enable your audience to have some understanding of the variability of the raw data, you might want to show the standard deviation of the raw data. However, I think a much better idea is to show the standard deviation of the residuals instead of the standard deviation of the raw data.
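To make the difference concrete, here is a small Python sketch (not SAS) with hypothetical data. It assumes a plain one-way fixed-effects model, so the residual is simply each observation minus its group mean; with a random block effect the residuals would come from the fitted mixed model instead.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical balanced one-way data, four observations per group.
groups = {
    "a": [52.0, 58.0, 49.0, 61.0],
    "b": [40.0, 44.0, 37.0, 47.0],
    "c": [20.0, 26.0, 18.0, 24.0],
}

all_values = [y for ys in groups.values() for y in ys]

# SD of the raw data, ignoring groups: includes between-group differences.
raw_sd = stdev(all_values)

# Residuals from the one-way model: observation minus its own group mean.
residuals = [y - mean(ys) for ys in groups.values() for y in ys]
df_error = len(all_values) - len(groups)
residual_sd = sqrt(sum(r * r for r in residuals) / df_error)

# Per-group SDs, for comparison with the pooled residual SD.
per_group_sd = {g: stdev(ys) for g, ys in groups.items()}

print(f"raw SD (all data): {raw_sd:.2f}")
print(f"residual SD (pooled within groups): {residual_sd:.2f}")
print("per-group SDs:", {g: round(s, 2) for g, s in per_group_sd.items()})
```

The raw standard deviation mixes the group differences into the spread, whereas the residual standard deviation describes the scatter left after the model has accounted for them.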

These are not mutually exclusive.
