Hello,
I am relatively new to SAS and am trying to conduct a difference-in-differences (DID) analysis to determine how rates of 3 different health insurance policies (ins = 0, 1, or 2) changed from time t0 to t1 between states that implemented a policy (s=1) and states that did not (s=0).
I don't have individual data, but instead have counts/percentages of insurance rates based on state and time.
I apologize if this is quite basic, but most of the examples I have found either involve differences in means or individual level data.
My data set is below:
ins s t count percent
0 0 0 281 5.3
0 0 1 97 5.0
0 1 0 841 3.4
0 1 1 154 1.8
1 0 0 410 7.7
1 0 1 159 8.3
1 1 0 2488 10.1
1 1 1 1193 14.1
2 0 0 4602 86.9
2 0 1 1671 86.7
2 1 0 21350 86.5
2 1 1 7137 84.1
Appreciate any help or hints!
Edit: In case the data set formatting gets messed up when this posts, I've attached a txt file as well
As always, you should search the SAS Notes and Samples at http://support.sas.com/notes and the list of Frequently-Asked-for Statistics at http://support.sas.com/kb/30333 for relevant notes and sample programs. A search there will find this note on estimating and testing the so-called "difference in difference". The second section of that note discusses and illustrates how this is done for a binary response. In your case with aggregated binary data, you need to obtain the denominators of the percentages and then use the events/trials syntax to model the aggregated data in PROC LOGISTIC. You can then proceed as shown there. The only wrinkle here is that I suspect you want separate estimates for the 3 policies. In that case, you need to include INS in the model and interact it with S and T to allow the policies to have differing DIDs. Note in the results from the LSMEANS statement below that the observed percentages are the "Mean" values, since this is a saturated model. You can then specify the DID contrast within each policy as shown in the code below. See the documentation of the NLMeans macro for details and many examples of its use.
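data x;
  input ins s t count percent;
  /* Recover the denominator (trials) from the count and its percentage */
  n=round(count/(percent/100));
  datalines;
0 0 0 281 5.3
0 0 1 97 5.0
0 1 0 841 3.4
0 1 1 154 1.8
1 0 0 410 7.7
1 0 1 159 8.3
1 1 0 2488 10.1
1 1 1 1193 14.1
2 0 0 4602 86.9
2 0 1 1671 86.7
2 1 0 21350 86.5
2 1 1 7137 84.1
;
/* Saturated logistic model on the aggregated data (events/trials syntax) */
proc logistic data=x;
  class ins s t / param=glm ref=first;
  model count/n = ins|s|t;
  /* E displays the LS-means coefficients; ILINK adds estimates on the probability scale */
  lsmeans ins*s*t / e ilink;
  ods output coef=coeffs;  /* save the LS-means coefficients for NLMeans */
  store log;               /* save the fitted model for NLMeans */
run;
/* One contrast row per policy over the 12 LS-means (4 cells per policy): */
/* (s=0,t=0) - (s=0,t=1) - (s=1,t=0) + (s=1,t=1)                          */
/* SET identifies the set of LS-means the rows apply to; there is one here */
data difdif;
  input k1-k12;
  set=1;
  datalines;
1 -1 -1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 -1 -1 1 0 0 0 0
0 0 0 0 0 0 0 0 1 -1 -1 1
;
%NLMeans(instore=log, coef=coeffs, link=logit, contrasts=difdif,
         title=Difference in Difference of Means)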
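As a rough sanity check on the contrasts above, the same arithmetic can be done by hand on the posted percentages. This is only a sketch: the notation p_{st} (the percentage for a policy in state group s at time t) is introduced here for illustration, and the hand values are in percentage points and rely on the rounded percentages in the data. Because the model is saturated, the NLMeans estimates should be close to these values on the proportion scale, but they will not match exactly.

```latex
\begin{align*}
\mathrm{DID} &= p_{00} - p_{01} - p_{10} + p_{11}
              = (p_{11} - p_{10}) - (p_{01} - p_{00}) \\
\mathrm{DID}_{ins=0} &= 5.3 - 5.0 - 3.4 + 1.8 = -1.3 \\
\mathrm{DID}_{ins=1} &= 7.7 - 8.3 - 10.1 + 14.1 = 3.4 \\
\mathrm{DID}_{ins=2} &= 86.9 - 86.7 - 86.5 + 84.1 = -2.2
\end{align*}
```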
Hey StatDave_sas,
Nice response. For this example, is there a good way to add covariates like 'Age' to aggregated data to get an adjusted DID? It seems straightforward to add these covariates with case-level data, but with aggregated data, I imagine you'd have to create Age 'groups' (e.g., 0-30y = 0, 31-60y = 1, 61+ = 2) to turn it into categorical data, then recalculate the counts and percentages with the new Age column added. Do you have a better way to do this?
This is addressed in the note I referred to earlier and can be done by including the AT and OM options in the LSMEANS statement to fix the additional covariates at desired values. But, as mentioned in the note, a simpler approach is with the Margins macro. While that macro cannot accept data aggregated into events/trials form (as can be used in PROC LOGISTIC), it can be used with data aggregated so that there are separate observations with counts of events and nonevents (as can be used with the FREQ statement in PROC LOGISTIC). If the data are in events/trials form (that is, one observation per population with separate variables containing counts of events and trials), then it is a simple matter to use a DATA step to split each observation into two observations containing a count of events in one and a count of nonevents in the other for each population. Then you can specify freq= in the Margins macro to specify the variable containing the counts. Otherwise, the approach is as discussed in the note.
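The DATA step split described above might look something like the following. It is a minimal sketch, assuming the data set X from earlier in the thread (with variables ins, s, t, count, and n); the names x_long, y, and cnt are made up for illustration. The PROC LOGISTIC step is only there to check that the reshaped data reproduce the events/trials fit.

```sas
/* Assumes data set X from earlier in the thread (ins, s, t, count, n). */
/* x_long, y, and cnt are illustrative names, not from the note.        */
data x_long;
  set x;
  y = 1; cnt = count;     output;  /* one row holding the count of events    */
  y = 0; cnt = n - count; output;  /* one row holding the count of nonevents */
  keep ins s t y cnt;
run;

/* Same saturated model as before, now using a FREQ statement */
proc logistic data=x_long;
  class ins s t / param=glm ref=first;
  freq cnt;
  model y(event='1') = ins|s|t;
run;
```

With the data in this form, cnt is the variable that would be named in freq= when calling the Margins macro.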
Thanks! I'm probably approaching this the wrong way, but if your data set already contains the additional covariates you want to control for, as well as the counts and percentages for each group, I thought it would be a simple matter of incorporating them into your model statement, since the proportions of the covariates are already part of your data set.
For example, based off the original code in this thread, if the data set also included Age (0, 1, 2) and Race (0, 1) (I didn't write out all the datalines), you would add Age and Race to your class and model statements, but then I'm not sure what else is needed.
data x;
input ins age race s t count percent;
n=round(count/(percent/100));
datalines;
0 0 0 0 0 281 5.3
0 0 0 0 1 97 5.0
0 0 0 1 0 841 3.4
0 0 0 1 1 154 1.8
0 0 1 0 0 410 7.7
0 0 1 0 1 159 8.3
etc...
;
proc logistic data=x;
class ins age race s t / param=glm ref=first;
model count/n = ins|s|t age race;
lsmeans ins*s*t / e ilink;
ods output coef=coeffs;
store log;
run;
data difdif;
input k1-k12;
set=1;
datalines;
1 -1 -1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 -1 -1 1 0 0 0 0
0 0 0 0 0 0 0 0 1 -1 -1 1
;
%NLMeans(instore=log, coef=coeffs, link=logit, contrasts=difdif,
title=Difference in Difference of Means)
Indeed, nothing else is required in this case with categorical covariates if you are content with the default way that LSMEANS computes its estimates, averaged over balanced levels of those covariates. Additionally, if there is a continuous covariate, the AT option is only needed if you want to fix its value at something other than the mean, which is the default for LSMEANS.
That makes sense! I'm playing around with the original data set but I must still be doing something wrong ---
I created a variable called Age with levels (0,1) without changing the group totals from the original data set. So the model data set I created has twice as many rows (since each "ins s t" combination is now split into Age = 0 and Age =1).
I adjusted for this new variable Age as shown in the code below. It ran just fine. But I must be missing something because when I removed Age as a covariate by simply removing it from the model statement, I expected these results to be identical to the original code with the original dataset (from your post on 8/26, which did not have any Age data at all), but they did not match. Shouldn't removing Age as a covariate cause the Age=0 and Age=1 rows for a given (ins s t) combo to be treated as one group; thus the two datasets (with and without Age) should be handled in the same way? What am I missing here?
(Finally just wanted to express gratitude to @StatDave for helping self-taught SAS users like me find some clarity in the fog of countless hours of SAS notes and tutorials and youtube videos!)
data x;
input ins age s t count percent;
n=round(count/(percent/100));
datalines;
0 0 0 0 46 0.9
0 0 0 1 18 0.9
0 0 1 0 172 0.7
0 0 1 1 33 0.4
0 1 0 0 235 4.4
0 1 0 1 79 4.1
0 1 1 0 669 2.7
0 1 1 1 121 1.4
1 0 0 0 60 1.1
1 0 0 1 29 1.5
1 0 1 0 442 1.8
1 0 1 1 222 2.6
1 1 0 0 350 6.6
1 1 0 1 130 6.7
1 1 1 0 2046 8.3
1 1 1 1 971 11.4
2 0 0 0 1019 19.3
2 0 0 1 367 19.0
2 0 1 0 4947 20.0
2 0 1 1 1665 19.6
2 1 0 0 3583 67.7
2 1 0 1 1304 67.7
2 1 1 0 16403 66.5
2 1 1 1 5472 64.5
;
proc logistic data=x;
class ins age s t / param=glm ref=first;
model count/n = ins|s|t age;
lsmeans ins*s*t / e ilink;
ods output coef=coeffs;
store log;
run;
data difdif;
input k1-k12;
set=1;
datalines;
1 -1 -1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 -1 -1 1 0 0 0 0
0 0 0 0 0 0 0 0 1 -1 -1 1
;
%NLMeans(instore=log, coef=coeffs, link=logit, contrasts=difdif,
title=Difference in Difference of Means - Adjusted for age)
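One thing that can be checked directly from the posted rows is the denominator n. This is offered as a possible explanation for the mismatch, not a confirmed answer from the thread. The age-split percentages appear to still be percentages of the whole (s, t) population: for ins=0, s=0, t=0 they add up to the original value, 0.9 + 4.4 = 5.3. If that is the case, n=round(count/(percent/100)) returns roughly the full population for each Age row, so dropping Age from the model pools about twice the trials for the same number of events.

```latex
\begin{align*}
\text{Original row } (ins=0,\, s=0,\, t=0):\quad
  n &= \mathrm{round}(281 / 0.053) = 5302 \\
\text{Age}=0:\quad n &= \mathrm{round}(46 / 0.009) = 5111 \\
\text{Age}=1:\quad n &= \mathrm{round}(235 / 0.044) = 5341 \\
\text{Pooled over Age}:\quad
  \frac{46 + 235}{5111 + 5341} &= \frac{281}{10452} \approx 2.7\%
  \quad \text{(versus } 5.3\% \text{ in the original data)}
\end{align*}
```

For the two data sets to agree once Age is removed, percent in each Age row would need to be the percentage within that Age stratum, or the stratum totals would need to be supplied directly as n.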