Hi,
I'm having trouble working out contrast statements in PROC MIXED. I'm fitting a difference-in-difference model to compare proportions between Hispanic/Latino and non-Hispanic/Latino groups around an event. I want to compare the gap between Hispanic and non-Hispanic pre-intervention with the gap post-intervention. I believe the best way to do this is with a CONTRAST statement; is that right?
Figure: what I'm trying to compare (i.e., the gap between the orange and blue lines pre-intervention vs. the gap between the orange and blue lines post-intervention).
data hisp1;
input Hispanic time COVID percent_perf;
datalines;
1 0 0 33.3333
1 1 0 36.0172
1 2 0 34.9497
1 3 0 38.1514
1 4 0 35.0831
1 5 1 38.587
1 6 1 39.7946
1 7 1 37.7932
1 8 1 39.0023
1 9 1 35.4019
0 0 0 31.5364
0 1 0 31.0515
0 2 0 30.2232
0 3 0 32.5556
0 4 0 32.3446
0 5 1 33.6667
0 6 1 33.3706
0 7 1 29.8397
0 8 1 32.5166
0 9 1 29.5235
;
run;
PROC MIXED DATA = hisp1 METHOD=ML
PLOTS(MAXPOINTS=60000)=(RESIDUALPANEL(UNPACK) VCIRYPANEL(UNPACK));
CLASS Hispanic (ref="0") COVID (ref="0") ;
MODEL percent_perf = time COVID COVID*time Hispanic Hispanic*time Hispanic*COVID Hispanic*COVID*time/ S ;
REPEATED intercept / TYPE = UN R;
RUN;
Thank you!
What variable gives you the pre- or post-intervention information?
How is this repeated measures data? Do you have two subjects in the data, each with 10 measurements?
I can help you with an ESTIMATE/CONTRAST statement to test the difference in difference, but you might want to first make sure your model is reasonable for your data.
Thanks,
Jill
Hi Jill! Thanks for the response. I actually simplified the data/model specification to post here since I wasn't quite sure how to best share a sample of the data set with all the subjects included.
PROC MIXED DATA = new3 METHOD=ML PLOTS(MAXPOINTS=60000)=(RESIDUALPANEL(UNPACK) VCIRYPANEL(UNPACK));
CLASS hospital_number race (ref="2") COVID (ref="0");
MODEL percent_perf = time COVID COVID*time race race*time race*COVID race*COVID*time/ S ;
REPEATED / subject = hospital_number TYPE = ar(1) R;
RUN;
My data are actually at the hospital level, so I have observations for each hospital at quarterly intervals. The "intervention" is the variable COVID (0 denotes a period before the pandemic, 1 a period after).
Hopefully that clarifies it. 🙂
Thanks for the info!
So you are fitting an ANCOVA model. Do you want the DID for the intercept or the slope? I will provide both below --
PROC MIXED DATA = new3 METHOD=ML PLOTS(MAXPOINTS=60000)=(RESIDUALPANEL(UNPACK) VCIRYPANEL(UNPACK));
CLASS hospital_number race (ref="2") COVID (ref="0");
MODEL percent_perf = time COVID COVID*time race race*time race*COVID race*COVID*time/ S ;
REPEATED / subject = hospital_number TYPE = ar(1) R;
estimate 'DID for race*covid when time=0' race*covid 1 -1 -1 1;
estimate 'DID for the slopes between race*covid' race*covid*time 1 -1 -1 1;
RUN;
Hope this helps,
Jill
Thanks Jill!
I'm not sure that's exactly what I'm trying to estimate.
estimate 'DID for race*covid when time=0' race*covid 1 -1 -1 1;
If I understand this correctly, this is the estimated mean "jump" in the outcome at the interruption, compared between races: essentially Beta 6 (race*covid) in the model output. That checks out when I compare the estimate with the model output.
estimate 'DID for the slopes between race*covid' race*covid*time 1 -1 -1 1;
Similarly, isn't this the estimated difference between races in the slopes after the interruption, i.e., B7 (time*race*covid)?
Perhaps I'm being silly and that is already explained by the model output. But how would I go about comparing the interruption periods, i.e., is the gap between races significantly wider or narrower after the interruption than before it?
Because your model has the covariate TIME, you essentially are fitting a regression model for different groups. For your DID request, you need to specify the TIME value. At what TIME value do you want this DID?
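To see why the TIME value matters, it can help to write out the fixed-effects part of the model. The sketch below assumes race and COVID each have exactly two levels, uses 0/1 indicators R (non-reference race) and C (post-intervention), and numbers the coefficients in MODEL-statement order, so that β6 is race*COVID and β7 is race*COVID*time as in the post above. The symbols R, C, t, gap and DID are introduced here for illustration only.

```latex
\begin{aligned}
E[y \mid R, C, t] &= \beta_0 + \beta_1 t + \beta_2 C + \beta_3 C t
                    + \beta_4 R + \beta_5 R t + \beta_6 R C + \beta_7 R C t \\
\mathrm{gap}(C, t) &= E[y \mid R{=}1, C, t] - E[y \mid R{=}0, C, t]
                    = \beta_4 + \beta_5 t + \beta_6 C + \beta_7 C t \\
\mathrm{DID}(t) &= \mathrm{gap}(1, t) - \mathrm{gap}(0, t) = \beta_6 + \beta_7 t
\end{aligned}
```

Under these assumptions the difference in gaps is not a single number: it equals β6 at time = 0 and changes by β7 for each unit of time.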
The TIME variable is the part of the ESTIMATE statement that I find most confusing. If I wanted the estimate at each TIME, would I just add one to the last coefficient each time?
estimate 'DID for the slopes between race*covid time = 0' race*covid*time 1 -1 -1 1;
estimate 'DID for the slopes between race*covid time = 1' race*covid*time 1 -1 -1 2;
estimate 'DID for the slopes between race*covid time = 2' race*covid*time 1 -1 -1 3;
.
.
.
estimate 'DID for the slopes between race*covid time = N' race*covid*time 1 -1 -1 N;
Thanks for your help!
It does not make sense to compare slopes for a specific time point. You might want to compare the expected response value between different groups at a certain time point. Below are some ESTIMATE statements you might find helpful --
estimate 'DID for race*covid at time = 0' race*covid 1 -1 -1 1;
estimate 'DID for race*covid at time = 1' race*covid 1 -1 -1 1 race*covid*time 1 -1 -1 1;
estimate 'DID for race*covid at time = 2' race*covid 1 -1 -1 1 race*covid*time 2 -2 -2 2;
However, I am not sure if this makes practical sense -- do you have measurements at time 1 for post intervention? Or is it always times 0 to 4 for pre and times 5-9 for post? If so, you might want to reconsider your model specifications, considering what your analysis goal is.
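As a sketch only: if the post period in the real data starts at time 5, as it does in the simplified sample data earlier in the thread (an assumption), the same pattern extends to that time point. The race*covid*time coefficients are the race*covid coefficients multiplied by the chosen time value. The statement goes inside the PROC MIXED step before RUN and assumes race and COVID each have exactly two levels. The CL option, which requests confidence limits, is an optional addition not used in the statements above.

```sas
/* DID in expected response at time = 5:                         */
/* race*covid coefficients, plus the same coefficients times 5   */
/* on race*covid*time. Time = 5 is an assumed first post quarter. */
estimate 'DID for race*covid at time = 5'
         race*covid      1 -1 -1  1
         race*covid*time 5 -5 -5  5 / cl;
```

At this time point the pre-intervention gap is an extrapolation of the pre-period trend, so the estimate should be read with that in mind.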
Jill
The OP might also wish to consider using a generalized linear (mixed) model, since the response variable is a proportion. GENMOD or GLIMMIX seem more appropriate, depending on the need for marginal or conditional means/errors and on the inference space to be used.
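A minimal sketch of what a GLIMMIX version could look like, under several assumptions: the response is rescaled to a proportion strictly between 0 and 1, a beta distribution with a logit link is used, and a random hospital intercept stands in for the AR(1) REPEATED structure from the PROC MIXED code. The names new3_prop and prop_perf are hypothetical. If the underlying counts are available, an events/trials response with a binomial distribution would be another option.

```sas
/* Hypothetical step: rescale the percentage to a proportion.     */
/* new3_prop and prop_perf are illustrative names only.           */
data new3_prop;
  set new3;
  prop_perf = percent_perf / 100;  /* beta needs 0 < value < 1 */
run;

/* Conditional model: beta response, logit link, and a random     */
/* intercept per hospital (swapped in for the AR(1) structure).   */
proc glimmix data=new3_prop;
  class hospital_number race (ref="2") COVID (ref="0");
  model prop_perf = time COVID COVID*time race race*time
                    race*COVID race*COVID*time
                    / dist=beta link=logit solution;
  random intercept / subject=hospital_number;
  /* Same contrast as in PROC MIXED, now on the logit scale. */
  estimate 'DID at time = 0 (logit scale)' race*COVID 1 -1 -1 1;
run;
```

With a logit link the difference in difference is a difference on the log-odds scale, not in percentage points, so it is not directly comparable to the PROC MIXED estimate.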
SteveDenham