12-28-2015 04:12 PM
This is more of a modeling question than a SAS question. I have an experiment I need to analyze where there are two stratification variables: type of tissue and level of viral challenge. The type of tissue is either ectocervical or endocervical. The level of viral challenge is either 50 or 500. There should be 4 combinations: ecto 50, ecto 500, endo 50, and endo 500. The problem is that the lab didn't do one of the combinations: endo 50. Here's a table summary of what data exist.
In a model of outcome = tissue|virus, given that I am missing one type of data completely, I cannot test for the overall interaction, but I can get LSMeans for the three pairwise differences of the existing data. I can somewhat assume there is an interaction if the adjusted pairwise differences show significance somewhere.
As it turns out, this model gives the same LSMeans results as if I just combined the two variables into one variable with 3 levels: ecto500, ecto050, and endo500. The model would then just be outcome = tissueVirus. The math to get to the results is slightly different, but the end results are exactly the same estimates, differences, variances, test statistics, and p-values.
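One way to see why the two fits agree is to check that both codings span the same column space. Below is a minimal sketch in Python rather than SAS, under some loud assumptions: the two observations per cell are a hypothetical layout (not the real sample sizes), and `factorial_row` and `combined_row` are illustrative helpers that mimic overparameterized dummy coding. They are not output from any SAS procedure.

```python
from fractions import Fraction

# Hypothetical layout: two observations in each of the three observed cells.
# The endo/50 cell is empty, as in the experiment described above.
cells = [("ecto", 50), ("ecto", 500), ("endo", 500)]
obs = [cell for cell in cells for _ in range(2)]

def rank(matrix):
    """Rank of a matrix by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot available in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def factorial_row(tissue, virus):
    """Overparameterized tissue|virus coding: intercept, main effects, interaction."""
    t = [int(tissue == "ecto"), int(tissue == "endo")]
    v = [int(virus == 50), int(virus == 500)]
    tv = [a * b for a in t for b in v]  # four interaction columns; endo/50 is always zero
    return [1] + t + v + tv

def combined_row(tissue, virus):
    """One-way coding: intercept plus one indicator per observed cell."""
    return [1] + [int((tissue, virus) == c) for c in cells]

X_factorial = [factorial_row(t, v) for t, v in obs]
X_combined = [combined_row(t, v) for t, v in obs]
# Placing the two matrices side by side adds no rank if the column spaces coincide.
X_both = [a + b for a, b in zip(X_factorial, X_combined)]

rank_factorial = rank(X_factorial)
rank_combined = rank(X_combined)
rank_both = rank(X_both)
print(rank_factorial, rank_combined, rank_both)
```

All three ranks come out as 3, so the two codings describe the same three-parameter model. The fitted values are the three cell means either way, which is why the estimates, differences, and tests match.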
The plot thickens, because these two factors are not actually the variables of clinical importance. Instead, there are a whole host of other measurements that need to be tested. This is an exploratory analysis. For each measurement, I'll want to stratify by tissue and virus level. The model would either look like 1) outcome = measurement|tissue|virus or 2) outcome = measurement|tissueVirus.
My question is: should I use 1 or 2? 2 seems simpler, has fewer things to estimate, and doesn't display a ton of non-estimable results. However, 1 is actually closer to the conceptual design of the experiment. It seems, though, that because of the missing cell, there may be no conceptual difference between a 2x2 with one cell missing and a 1x3.
Thanks in advance!
12-30-2015 02:46 PM
In Milliken and Johnson's classic text Analysis of Messy Data, they cover this type of thing extensively. Rather than recoding to a single variable, they analyze a "means model", which in this case would look like outcome = tissue*virus, with both factors kept as classification variables.
Note that there are no main effects in this--it is a one way ANOVA, so any comparisons have to be done with ESTIMATE or (even better) LSMESTIMATE statements. This has the advantages of retaining the original design and not having to write code to convert two variables to one in a DATA step.
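For reference, the means model can be written out as below. This is a sketch in my own notation, not a quotation from the book: the subscripts i (tissue), j (virus level), and k (replicate) are assumptions made for illustration.

```latex
% Cell means model: one mean per observed tissue-by-virus cell
y_{ijk} = \mu_{ij} + \varepsilon_{ijk},
\qquad (i,j) \in \{(\text{ecto},50),\ (\text{ecto},500),\ (\text{endo},500)\}

% The three estimable pairwise comparisons
\mu_{\text{ecto},50} - \mu_{\text{ecto},500}, \qquad
\mu_{\text{ecto},500} - \mu_{\text{endo},500}, \qquad
\mu_{\text{ecto},50} - \mu_{\text{endo},500}

% The interaction contrast needs the unobserved cell mean, so it is not estimable
\mu_{\text{ecto},50} - \mu_{\text{ecto},500} - \mu_{\text{endo},50} + \mu_{\text{endo},500}
```

The first comparison is the virus effect within ecto tissue, the second is the tissue effect at the 500 challenge level, and the third mixes the two. Each one is a contrast among the observed cell means, which is what an ESTIMATE or LSMESTIMATE statement would specify.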
12-31-2015 08:08 AM
Your intuition is right on the nose. The advantage to the means model approach is that I don't have to come up with code and level labels, which is big for me because I am really lazy.
I think means model approaches were touted by Milliken and Johnson because the Type IV hypotheses (designed for missing cell analyses) in PROC GLM were not unique, whereas the means model resulted in unique hypothesis tests. Now that the LSMESTIMATE statement is available, I can see a lot more analyses going this route.