05-22-2017 03:59 PM
I'm looking for advice/feedback about analyzing data from a treatment study.
The outcome was a frequency count of conduct problems collected each day for 40 days.
A within-persons design was used, with one treatment implemented for the first 20 days and
a second treatment implemented the last 20 days (order was counterbalanced across participants).
The primary test of interest is a personality x treatment interaction, where personality is
a continuous measure collected prior to the onset of treatment. Time is not really a variable
of interest, but since data were collected over time I think it needs to be taken into account
in the analyses.
Here are some of my other thoughts about the statistical model/code:
1. The outcome is a count so use glimmix to fit a negative binomial model
2. Time consists of 40 repeated measures (day 1 to 40) so time should be a continuous measure
3. There seemed to be a lot of variance across participants and across days, so include a random intercept and slope.
With those thoughts in mind, here is the syntax I've come up with:
proc glimmix data=work.temp;
   title1 'Conduct Problems outcome';
   class ID treatment;
   model CondProb_sum = time personality|treatment / solution dist=negbin link=log ddfm=bw;
   random intercept time / subject=id type=chol;
   nloptions maxiter=200 tech=nrridg;
   lsmeans treatment / ilink;
run;
In the above code:
Treatment = 0 (standard treatment) vs. 1 (modified treatment)
Personality = continuous score collected at baseline and centered at the sample mean
Time = day in treatment ranging from -39 (first day of treatment) to 0 (last day of treatment)
I'm not a statistician and I'm relatively new to SAS so any advice, thoughts, feedback, etc. on any of this is greatly appreciated.
Thanks in advance.
05-22-2017 06:48 PM
Your current model assumes (1) that the relationship between personality and CondProb_sum on the link (log) scale is linear for both levels of treatment; (2) that the relationship between time and CondProb_sum on the link (log) scale is linear; and (3) that the slope of the relationship between time and CondProb_sum is the same regardless of personality and/or treatment (i.e., that there are no interactions between time and personality and/or treatment). Are these assumptions valid?
Your design assumes that order of treatment does not matter. Is this assumption valid?
Is personality measured once for each participant, or twice (once before each treatment)? (Once, I think, but it's not entirely clear.)
Sampling units assigned to treatment levels (i.e., the two sets of 40-days) are nested within participants; the current model does not incorporate this design element but should (as random treatment*id).
If you have no real interest in how the response varies over time, then you could combine (e.g., sum) values of CondProb_sum over the 40 days, and use the combined statistic as the response in a much simpler statistical model.
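To make that concrete, here is one sketch of the combine-then-model approach, using the variable and dataset names from the original post (the PROC MEANS step and the summed-variable name CondProb_total are assumptions about how the data are laid out):

/* Sum the daily counts within each id-by-treatment set of 20 days */
proc means data=work.temp noprint nway;
   class id treatment;
   id personality;                 /* carry the baseline score along */
   var CondProb_sum;
   output out=work.summed sum=CondProb_total;
run;

/* Much simpler model: one observation per id per treatment, no time */
proc glimmix data=work.summed;
   class id treatment;
   model CondProb_total = personality|treatment / solution dist=negbin link=log;
   random intercept / subject=id;
   lsmeans treatment / ilink;
run;

Whether negative binomial is still the right distribution for the summed counts is something to check against the data.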
If you haven't done so already, plot CondProb_sum versus time for each treatment for each participant (feasibility depending on the number of participants), and see how that informs your statistical analysis. You could also work up plots of CondProb_sum versus personality, although that takes more thought; maybe use the sum of CondProb_sum to get time out of the way, or do a plot for each level of time.
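One way to get those per-participant panels (SGPANEL is just one option; variable names as in the original post):

proc sgpanel data=work.temp;
   panelby id / columns=4;                      /* one panel per participant */
   series x=time y=CondProb_sum / group=treatment markers;
run;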
05-23-2017 12:41 PM - edited 05-23-2017 12:49 PM
Thank you for these helpful thoughts. Some thoughts in response:
1. I do think the linearity assumptions are valid, and my preliminary analyses suggest that there are main effects of time but no interactions with time.
2. Order was counterbalanced, but I think including it in the models is a good idea.
3. Personality was measured one time only -- at baseline, before the start of treatment.
4. Can you give me more detail about the suggestion to include treatment nested within person? Would this essentially mean adding a second RANDOM statement along the lines of: random treatment(id) / subject=id type=chol; ?
5. Thank you for suggesting the plots. I created the suggested time x DV plot and it raised an important question -- in running these types of models, do I need to worry about outlier values of the DV (conduct problems)? If so, can you point me toward information on how to detect and handle them?
Thanks again for your helpful thoughts.
05-25-2017 12:15 AM
1. Linear is so much easier, as long as it's appropriate. If there truly (which, of course, we don't know for sure) are no interactions with time, I would give serious consideration to dropping time from the analysis by combining (in this case, summing) the response over the multiple times for these reasons: (1) In your original post, you said that time was not of interest. (2) Outliers/influential observations are quite problematic when you are regressing on time. (3) The statistical model would be much simpler. (4) The distribution of the sum of all those counts might be less problematic.
2. Adding order to the model is possible; the design is known as a "crossover design" (of which a Latin square is a special case). A big issue in crossover designs is carryover. As stated here
"Contrary to what is sometimes believed, counterbalancing does not eliminate bias caused by carryover effects, regardless of the number of treatments...." It's a big topic, I'll quit here.
4. To specify experimental units for treatments that are nested within id, the syntax could be
random treatment / subject=id;
The default CS covariance structure is probably adequate, unless variances are unequal for the two treatments. Note the distinction between treatment, which is a fixed-effects factor with two levels, and the experimental unit (a random-effects factor) for treatment, which is a set of 20 days with 2 x (number of subjects) levels. "id*treatment" tells SAS to estimate a variance for (number of id) x (number of treatment) units; it's a syntax shortcut. Alternatively, you could provide a unique id value for each set.
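The "unique id value for each set" alternative could look like this (the variable name unit is hypothetical, and the model line is abbreviated from the original post):

data work.temp2;
   set work.temp;
   length unit $ 20;
   unit = catx('_', id, treatment);   /* one value per 20-day set */
run;

proc glimmix data=work.temp2;
   class id treatment unit;
   model CondProb_sum = personality|treatment / solution dist=negbin link=log;
   random unit;                       /* same variance component as random treatment / subject=id */
run;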
5. Outliers/influential values are a problem for this model just as they are for any regression model. The solutions (or lack thereof) are the same regardless, and are often context-specific. Give some thought to my suggestion in point 1.
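As a starting point for detection, GLIMMIX's OUTPUT statement can write out residuals to screen (this is a sketch, not a prescription; the |rstud| > 3 cutoff is a common rule of thumb, not a hard rule):

proc glimmix data=work.temp;
   class ID treatment;
   model CondProb_sum = time personality|treatment / solution dist=negbin link=log;
   random intercept time / subject=id type=chol;
   output out=work.diag pred=p resid=r student=rstud;   /* studentized residuals */
run;

/* Flag observations with large studentized residuals */
proc print data=work.diag;
   where abs(rstud) > 3;
run;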
You're welcome, hope they help.
05-23-2017 08:58 AM
If it is a repeated-measures design, you should use an R-side (residual) random effect, not a G-side one.
proc glimmix data=work.temp;
   title1 'Conduct Problems outcome';
   class ID treatment;
   model CondProb_sum = time personality|treatment / solution dist=negbin link=log ddfm=bw;
   random intercept / subject=id;
   random time / subject=id residual type=ar(1);
   nloptions maxiter=200 tech=nrridg;
   lsmeans treatment / ilink;
run;
05-23-2017 12:46 PM - edited 05-23-2017 12:48 PM
Thanks for this suggestion. Because of the number of repeated measures -- the DV (conduct problems) was measured 20 times within each treatment condition for a total of 40 measurements -- I have treated time as a continuous variable. As such, I get an error when I try to include "_residual_" in the random statement. The error says something about not being able to include a continuous measure.
Any suggestions for addressing this or handling time differently?
Thanks for your help.
05-25-2017 12:22 AM
Without implying that your proposed model is valid (I have my doubts), if you want to fit an AR(1) structure for repeated measures associated with a factor that is continuous in the MODEL statement, then make a copy of it that is not included in the CLASS statement. For example (but untested)
data have2;
   set have;
   xtime = time;    /* continuous copy of time, kept out of CLASS */
run;

proc glimmix data=have2;
   class id time;
   model y = xtime;                                /* xtime as continuous because it is not in CLASS */
   random time / subject=id type=ar(1) residual;   /* time as classification */
run;
05-25-2017 05:28 PM
Thanks for this idea. Please tell me more about your doubts regarding the
proposed model (if it's not too much hassle to do so in this format).
I welcome any and all feedback and suggestions.
05-27-2017 05:04 PM - edited 05-27-2017 05:05 PM
The attached file illustrates various models that I would consider for your design (presuming, of course, that I understand the design correctly). The dataset that is created for the illustration has a response that follows the normal distribution; your data are counts, so normal might not be an appropriate choice. For the most part, the code would not change much, but GLMMs are persnickety and code often has to be adjusted to accommodate data characteristics.
For time as a classification variable, the biggest adjustment is that you probably will not have much success with an explicit repeated measures model
RANDOM ... / TYPE= ... RESIDUAL;
for two-parameter distributions (like the negative binomial); as Walt Stroup notes in his text, there apparently is an inherent conflict between estimation of the scale parameter and the covariance structure (at least at the time the book was written). That's about all I know about that, other than that, in practice, the model doesn't work.
I'd still think about combining data over days and ditching time as a variable, especially if you have issues with outliers, etc., that make regression problematic.