Re: proc mixed contrast statements difference in difference approach

Shad · Posted 09-20-2021 04:06 PM

Hi,

I'm having trouble working out contrast statements in PROC MIXED. I'm fitting a difference-in-differences model to compare proportions between Hispanic/Latino and non-Hispanic/Latino groups around an event. I want to compare the gap between Hispanic and non-Hispanic pre-intervention with the gap post-intervention. Is a CONTRAST statement the best way to accomplish this?

[Figure: illustrates the comparison I'm after, i.e. the gap between the orange and blue lines pre-intervention versus the gap between the orange and blue lines post-intervention.]

data hisp1; 
	input Hispanic time COVID percent_perf;
	datalines;
1 0 0 33.3333
1 1 0 36.0172
1 2 0 34.9497
1 3 0 38.1514
1 4 0 35.0831
1 5 1 38.587
1 6 1 39.7946
1 7 1 37.7932
1 8 1 39.0023
1 9 1 35.4019
0 0 0 31.5364
0 1 0 31.0515
0 2 0 30.2232
0 3 0 32.5556
0 4 0 32.3446
0 5 1 33.6667
0 6 1 33.3706
0 7 1 29.8397
0 8 1 32.5166
0 9 1 29.5235
;
run;

  PROC MIXED DATA = hisp1 METHOD=ML
 PLOTS(MAXPOINTS=60000)=(RESIDUALPANEL(UNPACK) VCIRYPANEL(UNPACK));
 CLASS  Hispanic (ref="0") COVID (ref="0") ;
 MODEL percent_perf = time COVID COVID*time Hispanic Hispanic*time Hispanic*COVID Hispanic*COVID*time/ S ;
 REPEATED intercept / TYPE = UN R;

 RUN;

Thank you!

jiltao · Posted 09-21-2021 01:25 PM

What variable gives you the pre- or post-intervention information?

How is this repeated measures data? Do you have two subjects in the data, each with 10 measurements?

I can help you with an ESTIMATE/CONTRAST statement to test the difference in differences, but you might first want to make sure your model is reasonable for your data.

Thanks,

Jill

Shad · Posted 09-21-2021 01:33 PM

Hi Jill! Thanks for the response. I actually simplified the data/model specification to post here since I wasn't quite sure how to best share a sample of the data set with all the subjects included.

 PROC MIXED DATA = new3 METHOD=ML PLOTS(MAXPOINTS=60000)=(RESIDUALPANEL(UNPACK) VCIRYPANEL(UNPACK));
 	CLASS  hospital_number race (ref="2") COVID (ref="0");
 	MODEL percent_perf = time COVID COVID*time race race*time race*COVID race*COVID*time/ S ;
 	REPEATED  / subject = hospital_number TYPE = ar(1) R;
 RUN;

My data is actually at the hospital level, so I have observations for each hospital at quarterly intervals. The "intervention" is the variable COVID (0 denotes a time period before the pandemic, 1 denotes a period after).

Hopefully that clarifies it. 🙂

jiltao · Posted 09-21-2021 03:47 PM

Thanks for the info!

So you are fitting an ANCOVA model. Do you want the DID for the intercept or the slope? I will provide both below --

PROC MIXED DATA = new3 METHOD=ML PLOTS(MAXPOINTS=60000)=(RESIDUALPANEL(UNPACK) VCIRYPANEL(UNPACK));
 	CLASS  hospital_number race (ref="2") COVID (ref="0");
 	MODEL percent_perf = time COVID COVID*time race race*time race*COVID race*COVID*time/ S ;
 	REPEATED  / subject = hospital_number TYPE = ar(1) R;
    estimate 'DID for race*covid when time=0' race*covid 1 -1 -1 1;
    estimate 'DID for the slopes between race*covid' race*covid*time 1 -1 -1 1;
 RUN;

Hope this helps,

Jill

Shad · Posted 09-21-2021 06:19 PM

Thanks Jill!

I'm not sure that's exactly what I'm trying to estimate.

    estimate 'DID for race*covid when time=0' race*covid 1 -1 -1 1;

If I understand this correctly, this would be the estimated mean "jump" in the outcome at the interruption between races, essentially Beta 6 in the model output (race*covid). That checks out when I compare the estimate with the model output.

    estimate 'DID for the slopes between race*covid' race*covid*time 1 -1 -1 1;

Similarly, isn't this the estimated difference in slopes after the interruption between races, i.e. Beta 7 in the model output (race*covid*time)?

Perhaps I'm being silly and that is already explained by the model output. But how would I go about comparing the interruption periods, i.e. is the gap between races pre-interruption significantly wider or narrower than the gap after the interruption?

jiltao · Posted 09-22-2021 09:46 AM

Because your model has the covariate TIME, you essentially are fitting a regression model for different groups. For your DID request, you need to specify the TIME value. At what TIME value do you want this DID?
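
To make the dependence on TIME explicit, here is a short derivation written in terms of the non-reference parameters that the S option prints. The indicators H (1 for the non-reference race level) and C (1 for the post period) are notation introduced here for illustration, the derivation assumes race has two levels, and the numbering of the betas simply follows the order of the terms in the MODEL statement:

```latex
\begin{align*}
E[y] &= \beta_0 + \beta_1 t + \beta_2 C + \beta_3 C t
        + \beta_4 H + \beta_5 H t + \beta_6 H C + \beta_7 H C t \\
\mathrm{gap}(t, C) &= E[y \mid H = 1] - E[y \mid H = 0]
        = \beta_4 + \beta_5 t + \beta_6 C + \beta_7 C t \\
\mathrm{DID}(t) &= \mathrm{gap}(t, 1) - \mathrm{gap}(t, 0)
        = \beta_6 + \beta_7 t
\end{align*}
```

Under these assumptions the DID is a line in t rather than a single number: Beta 6 is its value at time = 0, and Beta 7 is how much it changes per unit of time.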

Shad · Posted 09-22-2021 12:22 PM

That TIME variable is the piece of the estimate statement that I find the most confusing. If I wanted to get the estimate for each TIME, would I just add one to the last coefficient each time?

estimate 'DID for the slopes between race*covid time = 0' race*covid*time 1 -1 -1 1;
estimate 'DID for the slopes between race*covid time = 1' race*covid*time 1 -1 -1 2;
estimate 'DID for the slopes between race*covid time = 2' race*covid*time 1 -1 -1 3;
.
.
.
estimate 'DID for the slopes between race*covid time = N' race*covid*time 1 -1 -1 N;

Thanks for your help!

jiltao · Posted 09-22-2021 12:36 PM

It does not make sense to compare slopes at a specific time point, because each group's slope is the same at every time point in this model. You might instead want to compare the expected response value between different groups at a certain time point. Below are some ESTIMATE statements you might find helpful --

estimate 'DID for race*covid at time = 0' race*covid 1 -1 -1 1;
estimate 'DID for race*covid at time = 1' race*covid 1 -1 -1 1 race*covid*time 1 -1 -1 1;
estimate 'DID for race*covid at time = 2' race*covid 1 -1 -1 1 race*covid*time 2 -2 -2 2;

However, I am not sure if this makes practical sense -- do you have measurements at time 1 for post intervention? Or is it always times 0 to 4 for pre and times 5-9 for post? If so, you might want to reconsider your model specifications, considering what your analysis goal is.
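
As a sketch of how the pattern above extends to time points that are actually observed post-intervention, the step below refits the simplified model on the hisp1 example data from the first post. It requests the DID at time = 5 and time = 9, plus the difference between the average post-period gap and the average pre-period gap. Several things here are assumptions rather than anything confirmed in the thread: the REPEATED statement is dropped because hisp1 has no subject variable, times 0-4 are taken as pre and 5-9 as post (so the period mean times are 2 and 7), and the REF= level is assumed to be ordered last within each CLASS variable. The E option prints the coefficients so that the ordering can be checked.

```sas
/* Sketch on the toy data: no REPEATED statement, since hisp1 has no subject variable */
proc mixed data=hisp1 method=ml;
   class Hispanic (ref="0") COVID (ref="0");
   model percent_perf = time COVID COVID*time
                        Hispanic Hispanic*time
                        Hispanic*COVID Hispanic*COVID*time / s;

   /* DID(t) = (Hispanic*COVID contrast) + t * (Hispanic*COVID*time contrast) */
   estimate 'DID at time = 5'
            Hispanic*COVID 1 -1 -1 1
            Hispanic*COVID*time 5 -5 -5 5 / e cl;
   estimate 'DID at time = 9'
            Hispanic*COVID 1 -1 -1 1
            Hispanic*COVID*time 9 -9 -9 9 / e cl;

   /* Post gap at the mean post time (7) minus pre gap at the mean pre time (2). */
   /* Assumed cell order: (H=1,C=1) (H=1,C=0) (H=0,C=1) (H=0,C=0).               */
   estimate 'Mean gap post - mean gap pre'
            Hispanic*time 5 -5
            Hispanic*COVID 1 -1 -1 1
            Hispanic*COVID*time 7 -2 -7 2 / e cl;
run;
```

If the printed coefficients show a different level order, the signs need to be rearranged to match. The same statements should carry over to the hospital-level model by replacing Hispanic with race, provided race has two levels and the mean pre and post times are adjusted to that data.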

Jill

SteveDenham · Posted 09-24-2021 01:01 PM

The OP might also wish to consider using a generalized linear (mixed) model, since the response variable is a proportion. GENMOD or GLIMMIX seem more appropriate, depending on the need for marginal or conditional means/errors and on the inference space to be used.
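
A minimal sketch of what that could look like in GLIMMIX, assuming the new3 data set and variable names from earlier in the thread. The beta distribution, the derived variable prop, the data set name new3p, and the hospital-level random intercept are all illustrative assumptions rather than choices made in this thread. The beta distribution also requires every proportion to lie strictly between 0 and 1.

```sas
/* Hypothetical derived data set: rescale the percentage to a proportion */
data new3p;
   set new3;
   prop = percent_perf / 100;
run;

proc glimmix data=new3p;
   class hospital_number race (ref="2") COVID (ref="0");
   model prop = time COVID COVID*time race race*time
                race*COVID race*COVID*time / dist=beta link=logit solution;
   /* Conditional (G-side) model: one random intercept per hospital */
   random intercept / subject=hospital_number;
   /* Same DID contrast as in PROC MIXED, now estimated on the logit scale */
   estimate 'DID for race*covid at time = 0' race*covid 1 -1 -1 1 / cl;
run;
```

Because the estimate is on the logit scale, it is a difference in differences of log-odds rather than of percentage points, so it is not directly comparable to the PROC MIXED estimate.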

SteveDenham