01-26-2014 05:28 AM
I have a variable which theoretically should have got discrete values between 0 and 6. This variable is not a Likert scale based, but I think we can think about it as such.
Practically in my sample, I had 22 independent subjects with 37 observations, i.e., some of them gave two observations while some only one. All observations apart from one were either 0 or 1. Among those subjects with two observations, apart from one case, all values matched (i.e., 0 twice or 1 twice).
I need to calculate the mean of the variable, the standard deviation (or variance) and the 95% CI.
I did several things, first out of curiosity, I have calculated the descriptive statistics as if I had 37 independent observations. I got a mean of 0.513 with SD of 1.044. I did it like this (and also manually):
proc means data = data;
In the next step I averaged two observations within a subject, what gave me a sample of 22 independent values. I have calculated a weighted mean and weighted SD using weights of either 1 or 2, depending on the number of observations per subject. I got a mean of 0.5135 with SD of 1.358. I did it like this:
proc means data = Avged_data;
As a last attempt, I did this:
proc surveymeans data = data;
And I got a mean of 0.5135 with SD of 1.13
What bothers me, is that I don't understand, or more precisely know, how SAS calculated the variance in the last case. I tried looking at the help documents, and saw some complicated variance formulas (Taylor and others), however I did not see any formula specific for the cluster statement case.
I wanted your advice, first for the correctness of calculating a mean and SD when a vast majority of an ordinal variable values are either 0 or 1, and second, how would you do it ? Would you choose the weighted approach, or the PROC SURVEYMEANS with the ready cluster statement that to my understanding is exactly what I need. Usually I do not like using a method I do not understand, and my fear is that at the current moment PROC SURVEYMEANS with cluster statement is a black box, does anyone knows what is the rational behind the calculation. If I'll know some details maybe I can find the article in the literature which is the source of the calculations...
Thank you in advance !
I just tried one more option, I ran a GEE model with an explanatory variable of 1's, i.e. a vector (1,1,1,1,1,...,1) taking an idea Steve gave me with similar problem some time ago with a binary outcome. I did
proc genmod data = data;
model Y = D1;
repeated subject=SubjectID / corr=cs corrw;
it gave me a mean of 0.5271 with SD of 1.23 (it gave me S.E of 0.2029 and I have extracted the SD from there). I tried the same with PROC MIXED with no success (didn't give me an intercept "line" at all in the fixed effect table). However, I see no additional value in having another pair of (mean,SD) to choose from, it is confusing enough now.
The bottom line, this doesn't change my problem, how do you choose the correct answer, now that I have one more candidate of possible values....?
01-26-2014 12:37 PM
I've pondered the exact same situation. I do not have a definitive answer, but I have always attributed the differences to the different methods of calculating variance from clustered data--proc means uses an exact formula, while surveymeans uses a Taylor series approximation.
Your first example assumes each observation is independent, but they aren't. The SD is underestimated.
The second and third examples should yield the same result, but they don't, and I have always attributed this to the different methods of calculating variance.
Your fourth example includes an explanatory variable, which affects the results. Did you try it without an explanatory variable, e.g., model y = ; ?
I am very interested to hear from others on this topic.
01-27-2014 09:53 AM
Removing the explanatory variable makes it very difficult to model the repeated nature of the data--some times you have to come up with work-arounds, and this one has worked (well, at least given results) in the past. Note that the marginal MLE of the mean may not match any of the other methods, due to imbablance.
01-27-2014 12:48 PM
Mark, thank you for the reply.
I have removed the explanatory variable and I got the same result as I got with it, since it's a vector of 1's. I have to admit that this idea of Steve is one of the best tips I got for a long time, I like it.
If I knew that the second way, of averaging within a subject and using weighted statistics of the averages (mean, SD, median, ...) is correct, I would choose it, since the Taylor series was is an approximation and my sample size is not very large. But is it the right way?
01-27-2014 04:03 PM
After giving it more thought, I would go with the Taylor series estimate. The method of using weighted averages with proc means does not account for the intraclass correlation of a cluster (i.e., multiple measures from the same subject are not independent).
01-28-2014 04:20 AM
By averaging and weighting I thought I have created an independent sample, since my data is now one row for each subject. Or perhaps the weights are destroying that.
I would also choose the Taylor series estimate, however I am not feeling comfortable to use a method I know so little on. The one thing I am not so sure about, is how SAS calculates the variance when I define
The modelling approach is also valid, however the estimate of the mean is slightly inaccurate and as far as I understand the correlation should affect the mean, just the variance.
01-28-2014 08:32 AM
The difference in the LSmean and any raw mean is due to data imbalance, and not necessarily to the correlation. See the LSMEANS statement in the Shared Concepts and Topics chapter of the documentation. In particular, the paragraph:
The LSMEANS statement computes least squares means (LS-means) of fixed effects. In the GLM, MIXED, and GLIMMIX procedures, LS-means are predicted population margins—that is, they estimate the marginal means over a balanced population. In a sense, LS-means are to unbalanced designs as class and subclass arithmetic means are to balanced designs.
This applies to GENMOD as well. If I were responsible for this analysis, these values of the measure of central tendency (note pedantic definition) would be what I would report. For me, it comes down to: I trust LSmeans because I'm familiar with them and what they represent, and don't trust other methods of calculating means so much.
01-28-2014 12:11 PM
A very interesting discussion!
My only concern with reporting Lsmeans is they won't match the arithmetic means with unbalanced data. It depends on your audience, but it could be difficult to explain the difference.
01-29-2014 01:57 AM
Pardon me for taking it a step backwards, but aren't Lsmeans used when I have an explanatory variable (and random effects) and I want the mean adjusted for the effect of the various stratification ?
How do I obtain the Lsmeans with a single sample ? Let's say I choose the GENMOD way, if I remember correctly, SAS doesn't let you use the lsmeans statement unless a categorical variable follows ?
01-29-2014 10:56 AM
Mark, it could be difficult, but it is what needs to be done if you are interested in the inferential value of the estimates, e.g., what would you expect the value to be if you repeated the experiment. Raw means from unbalanced datasets will provide a biased estimate of the expected value.
BlueNose, for a single sample, there is no issue of imbalance. Thus, the lsmean IS the raw mean. However, for the examples in the earlier posts, there are influential factors--clustering in particular. Now, if every cluster is identical in size, with the same correlation between and within, then the raw mean and the lsmean will be equal. But that is so seldom the case with real world data that I would consider it "pathological data" if it occurred.
01-30-2014 04:30 AM
Steve, you are right, the clusters are indeed not identical in size, thus the lsmean should not be equal to the raw mean. This leads me back to the initial dilemma of choosing either the result of the proc surveymeans (which I have found a book with a reference to) or to go for proc genmod with a "continuous dummy" as explanatory. The results are not identical, however very very close.
01-30-2014 09:07 AM
And the key comes down to design, I would think. If it is truly a survey, where the sampling units and the population size are known, then SURVEYMEANS seems appropriate. If it were a designed study, or if the relationship within a cluster (covariance structure) may be of importance, then GENMOD, and if there is a need to infer to an unknown population, then GLIMMIX to obtain broad inference space standard errors.
01-30-2014 09:33 AM
Actually, the data doesn't come from a survey, it comes from a clinical trial (designed study - single arm study). Some patients contributed a single observation while the other two. These patients are clusters. The population is clearly infinite.
I find it hard to say what is the covariance structure, so I assumed "compound symmetry", seemed most reasonable, or shall I say, I don't know better or other.
Would you recommend trying GLIMMIX with the identity link function ? Which model (GLMM vs. GEE) is more robust to violation of normality of the analyzed variable (I would say dependent but I do not have an explanatory) ?
01-30-2014 10:10 AM
I would try GLIMMIX with the default identity, and model the clusters as G side. If you do it R-side, it is a GEE model, and the results should be identical to what you have seen in GENMOD.
With a maximum of two obs per ID, all covariance structures (except VC) are equivalent. CS is a completely logical selection. You may be able to fit a no fixed effect model (never tried it myself), with
model Y= /solution;
if you fit with D1 as in genmod, then use
model y=D1/noint solution;
02-02-2014 02:40 AM
Thank you Steve, I ran this code:
proc glimmix data = data;
model Y= /solution cl;
random intercept / subject = SubjectID;
and I got the widest CI of all other methods. Actually, my Y variable can't be negative, and my CI is [-0.003,1.13]. On one hand a wide CI is conservative, and my intuition say this is the way I should choose, on the other hand, this particular CI is not very informative (well, the upper limit is...)
Tank you for your help, things are much cleared now
One more thing, and I apologize for mixing things here. I had two centers in this study, which is a rather small study. I just checked if the data is "poolable". I ran a non parametric test (because the data is not really normal) and it was significant, however it didn't take into account the repeated measures nature. Then I ran a GEE which was also significant, however GLIMMIX was not (p=0.06). My question is, what rationale could I take for using a mixed model over GEE ? Are there any advantages, like better performance in small samples, or better robustness to normally assumption violation ? I am looking for a rational since my data is small and it will make things complicated if I can't pool it.