sasuser2222
Calcite | Level 5

I am trying to teach myself how to adjust for categorical covariates in difference-in-difference analysis. 

 

I am playing around with a data set (posted below) previously posted in a SAS community question. It examines how the rates of 3 different health insurance policies (ins = 0, 1, or 2) changed from time t0 to t1 in states that implemented a policy (s=1) versus states that did not (s=0).

data x;
input ins s t count percent;
n=round(count/(percent/100));
datalines;
0 0 0 281 5.3
0 0 1 97 5.0
0 1 0 841 3.4
0 1 1 154 1.8
1 0 0 410 7.7
1 0 1 159 8.3
1 1 0 2488 10.1
1 1 1 1193 14.1
2 0 0 4602 86.9
2 0 1 1671 86.7
2 1 0 21350 86.5
2 1 1 7137 84.1
;
      proc logistic data=x;
        class ins s t / param=glm ref=first;
        model count/n = ins|s|t;
        lsmeans ins*s*t / e ilink;
        ods output coef=coeffs;
        store log;
        run;
      data difdif;
        input k1-k12;
        set=1;
        datalines;
        1 -1 -1 1   0 0 0 0     0 0 0 0
        0 0 0 0     1 -1 -1 1   0 0 0 0
        0 0 0 0     0 0 0 0     1 -1 -1 1   
        ;
      %NLMeans(instore=log, coef=coeffs, link=logit, contrasts=difdif,
               title=Difference in Difference of Means)

The code above does not adjust for covariates, but gives an accurate D-I-D analysis.

 

Now, I want to adjust for a variable Age (two categories: Age=0 or Age=1), which I added to the data set below, with updated counts and percentages for each row.

 

I adjusted for this new variable Age as shown in the code below. It ran just fine. But I must be missing something because when I removed Age as a covariate by simply removing it from the model statement, I expected these results to be identical to the original code with the original dataset (from your post on 8/26, which did not have any Age data at all), but they did not match. Shouldn't removing Age as a covariate cause the Age=0 and Age=1 rows for a given (ins s t) combo to be treated as one group; thus the two datasets (with and without Age) should be handled in the same way? What am I missing here?

 

data x;
input ins age s t count percent;
n=round(count/(percent/100));
datalines;
0 0 0 0 46 0.9
0 0 0 1 18 0.9
0 0 1 0 172 0.7
0 0 1 1 33 0.4
0 1 0 0 235 4.4
0 1 0 1 79 4.1
0 1 1 0 669 2.7
0 1 1 1 121 1.4
1 0 0 0 60 1.1
1 0 0 1 29 1.5
1 0 1 0 442 1.8
1 0 1 1 222 2.6
1 1 0 0 350 6.6
1 1 0 1 130 6.7
1 1 1 0 2046 8.3
1 1 1 1 971 11.4
2 0 0 0 1019 19.3
2 0 0 1 367 19.0
2 0 1 0 4947 20.0
2 0 1 1 1665 19.6
2 1 0 0 3583 67.7
2 1 0 1 1304 67.7
2 1 1 0 16403 66.5
2 1 1 1 5472 64.5
;
      proc logistic data=x;
        class ins age s t / param=glm ref=first;
        model count/n = ins|s|t age;
        lsmeans ins*s*t / e ilink;
        ods output coef=coeffs;
        store log;
        run;
      data difdif;
        input k1-k12;
        set=1;
        datalines;
        1 -1 -1 1   0 0 0 0     0 0 0 0
        0 0 0 0     1 -1 -1 1   0 0 0 0
        0 0 0 0     0 0 0 0     1 -1 -1 1   
        ;
      %NLMeans(instore=log, coef=coeffs, link=logit, contrasts=difdif,
               title=Difference in Difference of Means - Adjusted for age)

 

1 ACCEPTED SOLUTION

Accepted Solutions
StatDave
SAS Super FREQ

No. N is calculated in each observation and then summed. And Yes, the total N is the sum of the two observations in each INS/S/T population. The PROC MEANS code I showed does exactly that - sum the event counts and the total N in the two observations in each of those populations. That shows that the total N in each INS/S/T population is NOT the same as in the original data without AGE. The PROC SUMMARY code I showed creates the summarized data set in this way (NOT using the percents). If you run your analysis without AGE on both data sets as I mentioned, the results are the same. I also showed this with my earlier post that expanded the original data to maintain the totals. 


8 REPLIES 8
SteveDenham
Jade | Level 19

In the code presented, Age is still in the model.  It only has been removed from the lsmeans statement.  Since it is still in the model statement, the reduced lsmeans give equal weight to each of the two Age groups, so that the results are different.

 

If the code presented is not what you ran, then there are other issues to address.

 

SteveDenham

StatDave
SAS Super FREQ

If you examine your second data set that includes AGE, the total N is not the same in each INS/S/T population. Try it with data that properly splits each of those populations into two AGE groups. For example:

data x2;
input ins s t count percent;
n=round(count/(percent/100));
prop=count/n;
n1=round(ranuni(969)*n);
n2=n-n1;
c1=round(prop*n1);
c2=count-c1;
if c1>n1 or c2>n2 then put 'bad';
keep ins s t age c nn count n;
age=0; c=c1; nn=n1; output;
age=1; c=c2; nn=n2; output;
datalines;
0 0 0 281 5.3
0 0 1 97 5.0
0 1 0 841 3.4
0 1 1 154 1.8
1 0 0 410 7.7
1 0 1 159 8.3
1 1 0 2488 10.1
1 1 1 1193 14.1
2 0 0 4602 86.9
2 0 1 1671 86.7
2 1 0 21350 86.5
2 1 1 7137 84.1
;
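
A quick check on this split could look like the sketch below. It assumes the data set x2 created by the step above, and the names chk, totc, and totn are made up for illustration. Within each INS/S/T population, the summed C and NN from the two AGE rows should reproduce the original COUNT and N.

```sas
/* Sketch: confirm the AGE split preserves each INS/S/T total.       */
/* Assumes data set X2 from the step above; CHK, TOTC, TOTN are      */
/* hypothetical names.                                               */
proc summary data=x2 nway;
  class ins s t;
  var c nn count n;
  /* sum the split values; COUNT and N repeat on both AGE rows, so take MAX */
  output out=chk sum(c nn)=totc totn max(count n)=count n;
run;

proc print data=chk noobs;
  var ins s t totc count totn n;
run;
```

If the split is done properly, TOTC should equal COUNT and TOTN should equal N in every row of the printed output.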
sasuser2222
Calcite | Level 5

Ah, you're right! I must have had a typo splitting the data into Age categories. This dataset below PROPERLY splits the original INS/S/T groups from the original dataset into Age=0 and Age=1. However, even with this corrected dataset, I'm still running into the same issue:

 

-The code below shows how I am adjusting for Age as a covariate

-Next, when I modify the code below to REMOVE Age as a covariate, I do this by removing it from the model statement (model count/n = ins|s|t). I expect this to give the same results as the original code where INS/S/T were NOT split into Age groups at all (see first post in this thread where the input data does not even contain an Age column), but the DID outputs do not match and I'm stumped!

 

data x;
input ins age s t count percent;
n=round(count/(percent/100));
datalines;
0 0 0 0 183 3.5
0 0 0 1 69 3.6
0 0 1 0 527 2.1
0 0 1 1 104 1.2
0 1 0 0 98 1.9
0 1 0 1 28 1.5
0 1 1 0 314 1.3
0 1 1 1 50 0.6
1 0 0 0 278 5.3
1 0 0 1 94 4.9
1 0 1 0 1646 6.7
1 0 1 1 779 9.2
1 1 0 0 132 2.5
1 1 0 1 65 3.4
1 1 1 0 842 3.4
1 1 1 1 414 4.9
2 0 0 0 2687 50.8
2 0 0 1 960 49.8
2 0 1 0 12322 49.9
2 0 1 1 4066 47.9
2 1 0 0 1915 36.2
2 1 0 1 711 36.9
2 1 1 0 9028 36.6
2 1 1 1 3071 36.2
;
      proc logistic data=x;
        class ins age s t / param=glm ref=first;
        model count/n = ins|s|t age;
        lsmeans ins*s*t / e ilink;
        ods output coef=coeffs;
        store log;
        run;
      data difdif;
        input k1-k12;
        set=1;
        datalines;
        1 -1 -1 1   0 0 0 0     0 0 0 0
        0 0 0 0     1 -1 -1 1   0 0 0 0
        0 0 0 0     0 0 0 0     1 -1 -1 1   
        ;
      %NLMeans(instore=log, coef=coeffs, link=logit, contrasts=difdif,
               title=Difference in Difference of Means - Adjusted for age)

 

SteveDenham
Jade | Level 19

Well, perhaps this will help.  I calculated the means of the numerator and denominator that you are fitting for the class variable Age.

 

I got this:

 

SteveDenham_0-1634150290942.png

While this ignores the other factors in the model, it demonstrates that simply removing Age from the model (_type_ = 0) results in a very different segmentation of the data.  Without Age in the model, count = 1682.625 and n = 10074.54, so the estimated probabilities, and the consequent differences, are based on the averaged Age response of 1682.625/10074.54 = 0.167.  With Age in the model, Age=0 bases the estimates on an average of 1976.25/10125.3333 = 0.195, while Age=1 is based on 1389/10023.75 = 0.139.  If everything were linear, the removal would make no difference.  However, the logistic model works on the logit scale, and logit(0.167) does not equal the mean of logit(0.195) and logit(0.139).  So it should not be surprising that the difference-in-differences estimates end up slightly different, as the two Age categories do not receive equal weight.
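
The nonlinearity point can be illustrated with a small DATA step. This is only a sketch that uses the averages quoted above; the variable names are made up for illustration, and the logit is written out as log(p/(1-p)) rather than relying on any particular built-in function.

```sas
/* Sketch: the logit of the pooled proportion versus the mean of the */
/* two Age-specific logits, using the averages quoted above.         */
data _null_;
  p_all  = 1682.625 / 10074.54;     /* Age ignored */
  p_age0 = 1976.25  / 10125.3333;   /* Age=0 */
  p_age1 = 1389     / 10023.75;     /* Age=1 */

  logit_all  = log(p_all / (1 - p_all));
  logit_mean = mean(log(p_age0 / (1 - p_age0)),
                    log(p_age1 / (1 - p_age1)));

  put p_all= 6.3 p_age0= 6.3 p_age1= 6.3;
  put logit_all= 8.4 logit_mean= 8.4;
run;
```

The two logit values written to the log should be close but not equal, which is the reason pooling over Age and adjusting for Age need not agree.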

 

This is, perhaps, an oversimplified way of looking at this.  Actually all of the logit estimates for the other terms in the model are changed to accommodate the difference due to Age, which is ignored when it is excluded from the model.

SteveDenham

StatDave
SAS Super FREQ

The data is still not right as you can see by comparing the original data without AGE to the summarization of your data:

proc means data=x sum;
class ins s t; var count n;
run;

The total Ns are not the same in the two data sets.

You're getting hung up on how to split the data. Forget about that. Start with your expanded data set containing AGE (however you create it), then summarize it. Then fit the model without AGE to your expanded data and to the summarized data. This creates the summarized version of your data:

proc summary data=x nway;
class ins s t; var count n;
output out=out sum=totcount totn;
run;

Now do the analysis on each.
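
As a sketch of that comparison, assuming the data set x (with AGE) and the summarized data set out from the PROC SUMMARY step above, both fits omit AGE and use the same events/trials syntax as the earlier code:

```sas
/* Sketch: fit the model without AGE to the expanded data ...        */
proc logistic data=x;
  class ins s t / param=glm ref=first;
  model count/n = ins|s|t;          /* AGE omitted from the model */
  lsmeans ins*s*t / ilink;
run;

/* ... and to the summarized data (TOTCOUNT and TOTN from PROC SUMMARY) */
proc logistic data=out;
  class ins s t / param=glm ref=first;
  model totcount/totn = ins|s|t;
  lsmeans ins*s*t / ilink;
run;
```

The parameter estimates and LS-means from the two runs can then be compared side by side.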

sasuser2222
Calcite | Level 5

@StatDave with your last comment I think I've discovered that we're calculating N differently:

 

When I look at the summary output from your code, the TOTN values are very high, and I'm not sure how they are calculated.

 

data x;
input ins age s t count percent;
n=round(count/(percent/100));
datalines;
0 0 0 0 183 3.5
0 0 0 1 69 3.6
0 0 1 0 527 2.1
0 0 1 1 104 1.2
0 1 0 0 98 1.9
0 1 0 1 28 1.5
0 1 1 0 314 1.3
0 1 1 1 50 0.6
1 0 0 0 278 5.3
1 0 0 1 94 4.9
1 0 1 0 1646 6.7
1 0 1 1 779 9.2
1 1 0 0 132 2.5
1 1 0 1 65 3.4
1 1 1 0 842 3.4
1 1 1 1 414 4.9
2 0 0 0 2687 50.8
2 0 0 1 960 49.8
2 0 1 0 12322 49.9
2 0 1 1 4066 47.9
2 1 0 0 1915 36.2
2 1 0 1 711 36.9
2 1 1 0 9028 36.6
2 1 1 1 3071 36.2
;

proc summary data=x nway;
class ins s t; var count n;
output out=out sum=totcount totn;
run;

 

 Capture1.PNG

 

I'm under the impression that TOTN should be calculated as follows: summarizing the data should combine the AGE=0 and AGE=1 rows for a given INS/S/T. The counts are additive, as are the percents, because they are percents of the same denominator (a given S/T combination). For example, for INS/S/T=0/0/0, adding together the AGE=0 and AGE=1 rows gives: total count = 183+98 = 281, total percent = 3.5+1.9 = 5.4. Total N can then be calculated from total count and total percent.

 

So I ran the code below, where I summarized the data (which gives me TOTCOUNT and TOTPERCENT), and then I calculated TOTN from those variables. Modeling TOTCOUNT/TOTN in my DID code gives me the same output from the DID analysis of the original dataset not split into Age.

 

Doesn't this mean that the dataset is not the issue? Because I thought that taking Age out of the model statement in my DID code (from my initial post) essentially causes Age = 0 and Age =1 for a given INS/S/T to be added together in the same way I summarized the data below (which worked!). But clearly I'm still missing something, because these two actions (taking age out of model statement vs summarizing) do not yield the same result, and thus I know that my adjustment for the Age covariable is incorrect. Ultimately, I still can't figure out how to adjust for Age correctly.

 

data x;
input ins age s t count percent;
datalines;
0 0 0 0 183 3.5
0 0 0 1 69 3.6
0 0 1 0 527 2.1
0 0 1 1 104 1.2
0 1 0 0 98 1.9
0 1 0 1 28 1.5
0 1 1 0 314 1.3
0 1 1 1 50 0.6
etc.
;

proc summary data=x nway;
class ins s t; var count percent;
output out=out sum=totcount totpercent;
run;

data xNoAge;
	set out;
	totn=round(totcount/(totpercent/100));
run;

proc print data=xNoAge;
run;

proc logistic data=xNoAge;
        class ins s t / param=glm ref=first;
        model totcount/totn = ins|s|t;
        lsmeans ins*s*t / e ilink;
        ods output coef=coeffs;
        store log;
        run;
      data difdif;
        input k1-k12;
        set=1;
        datalines;
        1 -1 -1 1   0 0 0 0     0 0 0 0
        0 0 0 0     1 -1 -1 1   0 0 0 0
        0 0 0 0     0 0 0 0     1 -1 -1 1   
        ;
      %NLMeans(instore=log, coef=coeffs, link=logit, contrasts=difdif,
               title=Difference in Difference of Means)

Capture2.PNG

Capture3.PNG

StatDave
SAS Super FREQ

No. N is calculated in each observation and then summed. And Yes, the total N is the sum of the two observations in each INS/S/T population. The PROC MEANS code I showed does exactly that - sum the event counts and the total N in the two observations in each of those populations. That shows that the total N in each INS/S/T population is NOT the same as in the original data without AGE. The PROC SUMMARY code I showed creates the summarized data set in this way (NOT using the percents). If you run your analysis without AGE on both data sets as I mentioned, the results are the same. I also showed this with my earlier post that expanded the original data to maintain the totals. 
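
To make this concrete, here is a minimal sketch for a single population (INS/S/T = 0/0/0), using the two AGE rows from the data posted above. The data set names cell000 and cell000_sum are made up for illustration.

```sas
/* Sketch: N is computed per observation, then summed within INS/S/T. */
data cell000;
  input ins age s t count percent;
  n = round(count/(percent/100));   /* same formula as in the posted code */
  datalines;
0 0 0 0 183 3.5
0 1 0 0 98 1.9
;

proc summary data=cell000 nway;
  class ins s t;
  var count n;
  output out=cell000_sum sum=totcount totn;
run;

proc print data=cell000_sum noobs;
  var ins s t totcount totn;
run;
```

The per-observation N values are 5229 and 5158, so TOTN is 10387, while the original data without AGE has N = round(281/0.053) = 5302 for this population. TOTCOUNT matches (281), but TOTN is roughly doubled. This appears to happen because each percent is taken against the whole S/T population, so every AGE row reconstructs approximately the full population as its own N.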

sasuser2222
Calcite | Level 5

Thanks for all the help. I need to play around with it some more because I'm still getting the same count totals compared to the original data when I'm adding up the Age=0 and Age=1 counts for a given INS/S/T group, which implies to me that the total N should be the same. Anyway, I'm sure it's something I'm screwing up, but I really appreciate all the input.
