Kastchei
Pyrite | Level 9

Hello,

I see that some help has been given to others with G-side and R-side specifications, so perhaps someone could help me.  I'm not having many problems with convergence, just with specifying a model that makes the most sense for this pilot study.  There are:

  • 9 animals (animalID);
  • 2 treatments (treatment): 5 animals get placebo and 4 study drug;
  • 2 tissue types (tissueType): vaginal and cervical;
  • 2 time points (time): before and after treatment; before = day 0, after = day 14.
  • Each animal has a tissue sample biopsied from both types of tissue both before and after: 9 * 2 * 2 = 36 clusters (20 placebo, 16 study drug).
  • Each biopsy is separated into 3 to 6 equal sized sub-biopsies, infected with a microbe, washed, cultured, and analyzed for the presence and amount of the viral protein.
  • I have no data on which sub-biopsies were contiguous to which in the original biopsy.
  • It is not the case that the same sub-biopsy is tested before and after treatment in a paired fashion; the two biopsies (and their sub-biopsies) are different tissue, and there can be a different number of sub-biopsies per cluster even from the same animal and tissue type (e.g. animal X, vaginal tissue, before cluster has 6 sub-biopsies, after cluster has 4).
  • There is no missing data.

E.g.

treatment  animalID  tissueType  time    replicate  Infected  AUC
Placebo    P1        vagina      before      1         0             0
                                             2         1          2681.342
                                             3         0             0
                                             4         1          1695.816
                                 after       1         0             0
                                             2         0             0
                                             3         1         14965.546
                                             4         1           987.683
                                             5         0             0
                     cervix      before      1         0             0
                                            ...       ...           ...
                                             5        etc.          etc.
                                 after       1
                                            ...
                                             3
           P2        vagina      before      1
                                            ...
                                             6
                                 after       1
                                            ...
                                             5
                     cervix      before      1
                                            ...
                                             4
                                 after       1
                                            ...
                                             6
etc.

Thus, we have two time points for each tissue type, but each time point is a cluster of values, not just one measurement.  We want to analyze the data both as binomial (presence or absence of protein) and log-normal (amount of protein).  The goal is to see whether the treatment decreases the rate of infection (> 0 protein) and the level of protein, and whether the two tissue types are affected differently.  One immediate problem is that the log-normal part is actually zero-inflated.  I'll get to that later.

To model the binary portion, I am struggling to figure out the appropriate RANDOM statements.  Because I have a different number of sub-biopsies per cluster, I don't think this can be modeled R-side at all.  If I only had one measurement per time point and only one tissue type, I'm pretty sure the model would be R-side:

proc gLiMMix data = c order = internal plots = all;
    class animalID treatment time;
    model infected(event = '1') = treatment|time / dist = binary link = logit solution cl oddsRatio;
    random time / subject = animalID(treatment) type = cs residual;
run;


If I add back in the fact that I have clusters of measurements instead of just one, I think it gets forced into the G-side and becomes:

    random intercept / subject = time*animalID(treatment) type = cs;

This seems to take care of the clustering by time and animal, but maybe ignores that the two times within the same animal are correlated?
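Spelled out in full, that G-side version would look like this (a sketch that just combines the MODEL statement above with the new RANDOM statement):

proc gLiMMix data = c order = internal plots = all;
    class animalID treatment time;
    model infected(event = '1') = treatment|time / dist = binary link = logit solution cl oddsRatio;
    random intercept / subject = time*animalID(treatment) type = cs;  /* G-side: no RESIDUAL keyword */
run;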

If I add back in the tissue type, do I just further interact my subject=?

proc gLiMMix data = c order = internal plots = all;
    class animalID treatment tissueType time;
    model infected(event = '1') = treatment|tissueType|time / dist = binary link = logit solution cl oddsRatio;
    random intercept / subject = time*tissueType*animalID(treatment) solution cl v type = cs;
run;

Does this ignore that the two tissue types within the same animal should be correlated?

Now, the second question concerns the log-normal portion.  The researchers want to know whether the study drug reduces the amount of viral protein.  If the 0s are ignored, the data are log-normal.  Would it be appropriate to analyze only those samples with > 0 protein, to see whether the study drug lowers the amount of protein for those that got infected at all?  Clearly this answers a slightly different question (if infected, are viral protein levels lower?) as opposed to (are viral protein levels lower overall?).  If I can just model the > 0 portion, then I imagine I would set up my RANDOM statement the same way as for the binary portion, just switching over to lognormal.

proc gLiMMix data = c order = internal plots = all;
    class animalID treatment tissueType time;
    model AUC = treatment|tissueType|time / dist = lognormal solution cl;
    random intercept / subject = time*tissueType*animalID(treatment) solution cl v type = cs;
run;
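
(If I do subset to the > 0 samples, I assume the simplest route is a WHERE statement inside the procedure; a sketch:)

proc gLiMMix data = c order = internal plots = all;
    where AUC > 0;  /* analyze only sub-biopsies with detectable viral protein */
    class animalID treatment tissueType time;
    model AUC = treatment|tissueType|time / dist = lognormal solution cl;
    random intercept / subject = time*tissueType*animalID(treatment) solution cl v type = cs;
run;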

Thanks!


5 REPLIES
SteveDenham
Jade | Level 19

I think your final code for the binary variable is just what you need: it gives conditional estimates and accommodates all of the correlations.  I would still check some additional structures, as the compound symmetry assumption is pretty strong.  There is no really good block diagonal structure available in GLIMMIX, and an unstructured/Cholesky covariance probably has too many parameters to fit with this small dataset.

As for the latter analysis, what about setting the zeroes to some small value?  Judging by the size of the non-zero values, setting the AUC to 1 would probably give you the analysis that you need.  A sketch follows.
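
A minimal DATA step sketch, assuming the analysis dataset is named c as in your code:

data c_adj;
    set c;
    if AUC = 0 then AUC = 1;  /* replace zeroes with a small positive value before the lognormal fit */
run;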

Steve Denham

Kastchei
Pyrite | Level 9

Steve,

Thanks for your help.  I've really valued your contributions to a lot of these threads.  A couple followup questions.

I actually had two analyses: one with two time points (before and after), and another with only one time point, which was a bit simpler.  For the one-time-point analysis, I initially modeled as we discussed above but with an unstructured covariance, which results in a block diagonal variance structure where each block is a particular subject-tissue combination.

random intercept / subject = tissueType*animalID(treatment) type = un solution cl g v;

Estimated V Matrix for anima*tissue(treatm) X cervical Study Drug

Row     Col1     Col2     Col3
  1   6.3871   2.3454   2.3454
  2   2.3454   6.3871   2.3454
  3   2.3454   2.3454   6.3871

The problem I had with this was that there were no correlations within the same animal across tissue types.  I would assume that there is still some correlation between vaginal and cervical tissue within the same animal, even if it is weaker than within the same tissue type.  So I remodeled as below, which gave me a separate variance for each tissue type, a separate covariance among sub-biopsies within each tissue type, and a covariance between the two tissue types.


random tissueType / subject = animalID(treatment) type = un solution cl g v;

Estimated V Matrix for animalID(treatment) X Study Drug

Row     Col1     Col2     Col3     Col4     Col5     Col6     Col7     Col8     Col9
  1   6.2503   2.2501   2.2501   2.2501   2.2501   2.2501   0.8225   0.8225   0.8225
  2   2.2501   6.2503   2.2501   2.2501   2.2501   2.2501   0.8225   0.8225   0.8225
  3   2.2501   2.2501   6.2503   2.2501   2.2501   2.2501   0.8225   0.8225   0.8225
  4   2.2501   2.2501   2.2501   6.2503   2.2501   2.2501   0.8225   0.8225   0.8225
  5   2.2501   2.2501   2.2501   2.2501   6.2503   2.2501   0.8225   0.8225   0.8225
  6   2.2501   2.2501   2.2501   2.2501   2.2501   6.2503   0.8225   0.8225   0.8225
  7   0.8225   0.8225   0.8225   0.8225   0.8225   0.8225   6.6233   2.5727   2.5727
  8   0.8225   0.8225   0.8225   0.8225   0.8225   0.8225   2.5727   6.6233   2.5727
  9   0.8225   0.8225   0.8225   0.8225   0.8225   0.8225   2.5727   2.5727   6.6233

First, what do you think of this structure instead?  Does this make sense to do so?  Or does your warning about too many parameters for the small sample size come into play here?

Second, I had also modelled as this:

random animalID(treatment) /                               type = vc solution cl g v;

random tissueType          / subject = animalID(treatment) type = un solution cl g v;


which, as it turns out, gives an identical V matrix; the former estimates only 3 covariance parameters (vaginal variance, cervical variance, and vaginal-cervical covariance), whereas the latter breaks an animal part out of those 3 parameters.  This makes sense intuitively.  I get the same V matrix, the same parameter estimates (the random parameter estimates are the animal part plus the animal-tissue part), and the same standard errors.  However, the degrees of freedom are halved: written on one line, I get 14 DF for all my tests (t and denominator F), but broken out into two lines, I get only 7 DF.  I assume this is because the model is now forced to estimate an intercept for each animal (splitting the previous estimates into two parts).  Is that essentially correct?  If so, then there is really no reason to use the two-line method unless I were actually interested in the parameter estimates for each animal.

SteveDenham
Jade | Level 19

Those are the block diagonal covariance matrices I was looking for.  Any notes in the log or output about the Hessian, or the like?  If not, then I think you have what you need, with maybe one or two tweaks.  If you stay with the default pseudo-likelihood fit, consider using the Kenward-Roger adjustment for the degrees of freedom (in SAS/STAT 12.1 GLIMMIX, use ddfm=kr2).  This really applies to repeated measures and small datasets, such as this one.

If you move to marginal likelihood methods, which are possible since you are modeling everything G-side, the Kenward-Roger adjustment is not available.  You might want to consider using the EMPIRICAL option to get sandwich estimators and standard errors.  The whole point here is that, with small samples and a binary response, and in my opinion (note the emphasis), these provide at least some bias control.
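
For example, attaching the adjustment to the binary model from your first post (a sketch; DDFM= goes on the MODEL statement):

proc glimmix data = c order = internal;
    class animalID treatment tissueType time;
    model infected(event = '1') = treatment|tissueType|time
          / dist = binary link = logit solution cl oddsRatio ddfm = kr2;  /* Kenward-Roger (KR2) DF adjustment */
    random intercept / subject = time*tissueType*animalID(treatment) type = cs;
run;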

Steve Denham

Kastchei
Pyrite | Level 9

Thanks again, Steve!  When I use the Kenward-Roger adjustment, the DF for the one-line and two-line specifications are equal.  The options you have suggested are unfamiliar to me.  Could you give an explanation, even a brief one, of when the KR option should be used?  And when would I want to switch from a PL to an ML method?  Is the EMPIRICAL option only for ML or for both methods, and what are sandwich estimators? :-)

Thanks,

Michael

SteveDenham
Jade | Level 19

When to use KR adjustment:  For all repeated measures designs that have a covariance structure other than compound symmetry with equal observations per subject, and for all repeated measures designs that have unequal observations per subject.  Check the reference in the documentation:

Kenward, M. G., and Roger, J. H. (2009), "An Improved Approximation to the Precision of Fixed Effects from Restricted Maximum Likelihood," Computational Statistics and Data Analysis, 53, 2583–2595.

The EMPIRICAL option is available for all methods and generates what are called sandwich estimators.  They "are useful for obtaining inferences that are not sensitive to the choice of the covariance model" (from the documentation).  Overcoming variance heterogeneity is one major use.  Note that some of the variants are residual based and so are not available with the LAPLACE or QUAD methods.  Read the Empirical Covariance ("Sandwich") Estimators section under Details for the GLIMMIX Procedure for an extensive discussion.
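
As a sketch of where the option goes (EMPIRICAL sits on the PROC GLIMMIX statement itself; I am assuming the fixed effects for your one-time-point model, which you didn't post):

proc glimmix data = c order = internal empirical;  /* request classical sandwich standard errors */
    class animalID treatment tissueType;
    model infected(event = '1') = treatment|tissueType / dist = binary link = logit solution cl;
    random tissueType / subject = animalID(treatment) type = un;
run;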

Steve Denham

