Dear All,
Clinical trials with multiple dose groups and a placebo group sometimes call for an estimate of the combined dose-group effect versus placebo at a specified endpoint (e.g., Week 8). Essentially, I am wondering whether it is better to pool the dose groups before running the model or to pool them in the CONTRAST statement itself. I have provided example code below. I cannot find documentation on how the two methods differ or when each is appropriate. I am working in SAS v9.4.
data test;
call streaminit(33445);
do id=1 to 20;
rid=rand('normal');
trt=ceil(rand('uniform')*3);
if trt in (2,3) then trt2=2;
else trt2=trt;
do time=1 to 2;
y=trt + trt*time + rand('normal') + rid;
output;
end;
end;
run;
proc mixed data=test;
class id trt time;
model y=trt time trt*time / e;
repeated time / subject=id(trt) type=cs;
contrast 'placebo vs active at timepoint 2' trt -1 .5 .5 trt*time 0 -1 0 .5 0 .5;
estimate 'placebo vs active at timepoint 2' trt -1 .5 .5 trt*time 0 -1 0 .5 0 .5;
lsmeans trt*time / diff;
run;
proc mixed data=test;
class id trt2 time;
model y=trt2 time trt2*time;
repeated time / subject=id(trt2) type=cs;
lsmeans trt2*time / diff;
estimate 'placebo vs active at timepoint 2' trt2 -1 1 trt2*time 0 -1 0 1;
run;
Here are the results using trt in model:
                                              Standard
Label                               Estimate     Error    DF   t Value   Pr > |t|
placebo vs active at timepoint 2      5.7302    0.7368    17      7.78     <.0001
Here are the results using trt2 in model:
                                              Standard
Label                               Estimate     Error    DF   t Value   Pr > |t|
placebo vs active at timepoint 2      5.9023    1.2494    18      4.72     0.0002
Many thanks in advance!!
A well-posed question 🙂
First, create a balanced data set so that you aren't trying to juggle the impacts of unbalanced data while you sort out syntax.
data newtest;
call streaminit(33445);
do id=1 to 10;
rid=rand('normal'); *random effect for subject=id;
do trt= 1 to 3;
if trt in (2,3) then trt2=2;
else trt2=trt;
do time=1 to 2;
y=trt + trt*time + rand('normal') + rid;
output;
end;
end;
end;
run;
proc tabulate data=newtest;
class trt trt2;
table trt, trt2;
run;
Then run your two models. Note that the estimates of the difference now match, but SEs and DFs do not.
The fundamental difference in the two models lies in the REPEATED statement. The first model using
repeated time / subject=id(trt) type=cs;
identifies 30 subjects (10 IDs for each of 3 TRTs). But the REPEATED statement in the second model using
repeated time / subject=id(trt2) type=cs;
identifies only 20 subjects (10 IDs for each of 2 TRT2s). Consequently SEs and DFs differ.
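One quick way to see this for yourself is to count the distinct subjects each SUBJECT= specification implies. This is only a sketch using PROC SQL against the newtest data set above; the column aliases n_subjects_trt and n_subjects_trt2 are illustrative names I made up.

```sas
proc sql;
  /* subjects as identified by SUBJECT=id(trt): distinct (id, trt) pairs */
  select count(*) as n_subjects_trt
    from (select distinct id, trt from newtest);

  /* subjects as identified by SUBJECT=id(trt2): distinct (id, trt2) pairs */
  select count(*) as n_subjects_trt2
    from (select distinct id, trt2 from newtest);
quit;
```

With IDs 1 to 10 reused in every treatment, the first query should return 30 and the second 20.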
If my experiment randomly assigned 3 treatments to 10 subjects per treatment so that I actually had 30 subjects in total, I would use the first model rather than the second because the first model preserves the experimental design; the second makes up a new one.
Hi All,
Thank you for the quick response.
I don't think I previously included the hypothesis of interest: Is there a significant difference between combined groups 2 and 3 versus group 1?
I realized that the same IDs were reused in every treatment group, so when I combined two treatment groups the model assumed that certain subjects had multiple assessments at each time point (i.e., that there were only 10 subjects in the newly created treatment group and therefore only 20 subjects total). I have updated my code to make the subject IDs unique. This experimental design assumes 10 subjects are randomized to each of 3 treatment groups (i.e., 30 subjects total). If I am interested in comparing two pooled groups versus one group, how does the interpretation differ between the two models below? The LSMD estimate is the same, but the SEs differ, and I would like to understand why.
My gut is to use the ESTIMATE statement because that follows the experimental design, but I am wondering whether there is another reason beyond that, or whether I should use the pooled treatment group variable instead.
data newtest;
call streaminit(33445);
do id=1 to 10;
rid=rand('normal'); *random effect for subject=id;
do trt= 1 to 3;
if trt in (2,3) then trt2=2;
else trt2=trt;
do time=1 to 2;
y=trt + trt*time + rand('normal') + rid;
output;
end;
end;
end;
run;
data newtest;
set newtest;
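id = id + 10*(trt - 1); *unique IDs: 1-10 for trt 1, 11-20 for trt 2, 21-30 for trt 3 (id*trt + 11*trt repeats 36 and 42 across trt 2 and 3);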
run;
proc mixed data=newtest method=reml;
class id trt time;
model y = trt time trt*time/ s ddfm=kr covb;
repeated time/ type=un subject=id(trt);
lsmeans trt*time / diff;
estimate 'test1' trt 1 -0.5 -0.5
trt * time 0 1
0 -.5
0 -0.5 /e;
run;
proc mixed data=newtest method=reml;
class id trt2 time;
model y = trt2 time trt2*time/ s ddfm=kr covb;
repeated time/ type=un subject=id(trt2);
lsmeans trt2*time / diff e;
run;
The results I get follow:
The first model
Estimates
                    Standard
Label     Estimate     Error    DF   t Value   Pr > |t|
test1      -4.7338    0.5992    27     -7.90     <.0001
The second model
Differences of Least Squares Means
                                                 Standard
Effect       TRT2  TIME  _TRT2  _TIME  Estimate     Error    DF   t Value   Pr > |t|
TRT2*TIME       1     1      1      2   -1.0561    0.4529    28     -2.33     0.0271
TRT2*TIME       1     1      2      1   -3.7173    0.7552    28     -4.92     <.0001
TRT2*TIME       1     1      2      2   -5.7899    0.7797    35     -7.43     <.0001
TRT2*TIME       1     2      2      1   -2.6612    0.8034    34     -3.31     0.0022
TRT2*TIME       1     2      2      2   -4.7338    0.8265    28     -5.73     <.0001
TRT2*TIME       2     1      2      2   -2.0725    0.3203    28     -6.47     <.0001
A related question: what if you want to perform pairwise comparisons as an exploratory analysis? Would you use ESTIMATE or CONTRAST statements to obtain those LSMDs, or would you run the model using only the subjects in the treatment groups of interest? Here again, one gets the same LSMD estimate, but the SE and DF are different.
proc mixed data=newtest method=reml;
class id trt time;
model y = trt time trt*time/ s ddfm=kr ;
repeated time/ type=un subject=id(trt);
lsmeans trt*time / diff;
estimate 'test2' trt 0 1 -1
trt * time 0 0
0 1
0 -1 /e;
run;
proc mixed data=newtest method=reml;
where trt in (2 3);
class id trt time;
model y = trt time trt*time/ s ddfm=kr ;
repeated time/ type=un subject=id(trt);
lsmeans trt*time / diff;
run;
The output from the estimate statement (model 1):
Estimates
                    Standard
Label     Estimate     Error    DF   t Value   Pr > |t|
test2      -3.5463    0.6919    27     -5.13     <.0001
The output from the subset model (model 2):
Differences of Least Squares Means
                                                 Standard
Effect        TRT  TIME   _TRT  _TIME  Estimate     Error    DF   t Value   Pr > |t|
TRT*TIME        2     1      2      2   -1.5477    0.3855    18     -4.02     0.0008
TRT*TIME        2     1      3      1   -2.4967    0.6730    18     -3.71     0.0016
TRT*TIME        2     1      3      2   -5.0940    0.7234  23.5     -7.04     <.0001
TRT*TIME        2     2      3      1   -0.9489    0.7234  23.5     -1.31     0.2023
TRT*TIME        2     2      3      2   -3.5463    0.7705    18     -4.60     0.0002
TRT*TIME        3     1      3      2   -2.5973    0.3855    18     -6.74     <.0001
I greatly appreciate everyone's insight.
1. Use the ESTIMATE statement. The LSMESTIMATE statement is a great feature that makes writing contrasts even easier; check it out in the documentation or see the paper "CONTRAST and ESTIMATE Statements Made Easy: The LSMESTIMATE Statement".
2. Use ESTIMATE, CONTRAST, or LSMESTIMATE.
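You could also take advantage of the SLICE= option on the LSMEANS statement, which tests simple effects (for example, the treatment effect within each time point) and saves you the effort of writing contrasts. The GLIMMIX procedure offers the SLICEDIFF= option, which gives the pairwise differences within each slice; check it out.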
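As a sketch of what these suggestions could look like for the three-group model in this thread (same newtest data, TYPE=UN and DDFM=KR as in the posted code; the labels are just illustrative), the LSMESTIMATE coefficients are written directly on the trt*time cell means, so there is no need to carry the main-effect coefficients along:

```sas
proc mixed data=newtest method=reml;
  class id trt time;
  model y = trt time trt*time / ddfm=kr;
  repeated time / type=un subject=id(trt);
  /* coefficients follow the trt*time cell order:
     (1,1) (1,2) (2,1) (2,2) (3,1) (3,2) */
  lsmestimate trt*time 'trt 1 vs mean of trt 2 and 3 at time 2'
              0 1 0 -0.5 0 -0.5 / e cl;
  lsmestimate trt*time 'trt 2 vs trt 3 at time 2'
              0 0 0 1 0 -1 / e cl;
  /* F tests of the treatment effect within each time point */
  lsmeans trt*time / slice=time;
run;

/* GLIMMIX version: SLICEDIFF= gives pairwise differences within each time */
proc glimmix data=newtest;
  class id trt time;
  model y = trt time trt*time / ddfm=kr;
  random time / subject=id(trt) type=un residual;
  lsmeans trt*time / slicediff=time;
run;
```

The first LSMESTIMATE row targets the same quantity as the 'test1' ESTIMATE statement and the second the same as 'test2'; the E option prints the coefficients so you can confirm they land on the cells you intended.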
Thank you for the quick response.
In addition to the ways the means can be estimated, I am wondering what the interpretive difference is between a model with a three-level treatment variable plus a contrast that 'averages the cell means' and a model with a two-level treatment variable. I understand that the point estimates are the same, but the SEs and DFs are different, so I am trying to understand the difference between these two methods. Which model is best suited to answer my question, "Is there a difference between groups 2 and 3 combined versus group 1?"
Your experimental design involved subjects assigned to three treatment groups, not subjects assigned to two treatment groups. The experimental design determines the statistical model. Post-hoc redefinition of experimental treatments is hardly ever (even never?) a good idea.
In my opinion, the appropriate model specifies three treatment groups with a contrast to compare the mean of groups 2 and 3 to the mean of group 1.
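A rough sketch of where the extra standard error in the two-group model comes from, assuming balanced, complete data with n subjects per arm and looking only at the time-2 variance (which TYPE=UN estimates separately). The symbols n, \hat\sigma^2, \tilde\sigma^2, and d are my own notation, not SAS output names. Both models multiply a variance estimate by the same factor, 3/(2n); the pooled model's variance estimate also absorbs the difference between groups 2 and 3.

```latex
\begin{align*}
% three-group model: contrast of the time-2 cell means
\operatorname{Var}\!\left(\bar y_1 - \tfrac12(\bar y_2 + \bar y_3)\right)
  &= \hat\sigma^2\left(\frac{1}{n} + \frac{1}{4n} + \frac{1}{4n}\right)
   = \frac{3}{2n}\,\hat\sigma^2 \\[4pt]
% two-group model: same multiplier, different variance estimate
\operatorname{Var}\!\left(\bar y_1 - \bar y_{23}\right)
  &= \tilde\sigma^2\left(\frac{1}{n} + \frac{1}{2n}\right)
   = \frac{3}{2n}\,\tilde\sigma^2,
\qquad
\tilde\sigma^2 = \frac{(3n-3)\,\hat\sigma^2 + \tfrac{n}{2}\,d^2}{3n-2},
\quad d = \bar y_2 - \bar y_3 \\[4pt]
% check against the posted output (n = 10, d = -3.5463 from 'test2')
\hat\sigma^2 &\approx \frac{0.5992^2}{0.15} \approx 2.394, \qquad
\tilde\sigma^2 \approx \frac{27(2.394) + 5(12.576)}{28} \approx 4.554, \qquad
\sqrt{0.15 \times 4.554} \approx 0.8265
\end{align*}
```

The last line reproduces the 0.8265 reported for the time-2 difference in the two-group model, so the larger SE there is the group 2 versus group 3 difference being counted as subject-to-subject noise. The subset analysis for the pairwise comparison is a milder version of the same thing: it keeps the same point estimate but re-estimates the variance from two arms only (18 DF instead of 27), which is why its SE and DF change.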
Thank you for your response. This was my thinking as well.