Statistical Procedures

Programming the statistical procedures from SAS
BookmarkSubscribeRSS Feed
jgreenberg321
Fluorite | Level 6

Hello All,

 

I have created a negative binomial model using generalized estimating equations in order to estimate volume of a surgical procedure. I have been trying to write estimate functions for specific effects but am a bit confused regarding the correct parameter specifications. In the model below, "exposure" and "period" are binary (1/0) variables reflecting policy exposure and treatment period. "confounder1" is a continuous confounder variable and volume is a continuous/count variable for outcome. "hospst" refers to state, which is where the observations are clustered. The basic formula is:

 

proc genmod data=have;

class  exposure (ref='0') period (ref='0')  hospst/;

model volume= confounder1 exposure period exposure*period/ dist=negbin;

repeated subject= hospst/type=ind;

estimate 'period' period 1 -1/exp;

estimate 'exposure' exposure 1 -1/exp;

estimate 'interaction' period*exposure 1 -1 -1 1/exp;

run;

 

Based on the variable specifications and parameter information in the output (attached), I would have thought that the parameters should be specified as 0 and 1. Through trial and error, it seems that it is actually specified as 1/-1. Why is that?

 

Also, some articles online (e.g. https://support.sas.com/kb/24/447.html) describe estimating interaction terms by specifying the constituent term parameters and the intercept as part of the same estimate statement (along with the interaction parameters). That seemed to give me an error, so I just multiplied the (1, -1) parameters for the two terms. Does anyone if additional specification of other terms is needed? Thank you!

 

 

23 REPLIES 23
PaigeMiller
Diamond | Level 26

An estimate is linear combination of the means of levels. If you want to know the difference between two levels, you multiple one level by +1 and one level by -1 and then add.

 

In pseudo math, you want

+1 * (mean of level A) + -1 * (mean of level B)

 

In general, you would not use an ESTIMATE statement for the difference of two means. You would not use an ESTIMATE statement when the coefficients are -1 or +1. These means are easily compared by the LSMEANS statement. Make your life easier and use the LSMEANS statement.

 

You would use an estimate statement when you want coefficients that are not all +1 or -1, such as

0.5 * (mean of level A) + 0.5 * (mean of level B) + -1 * (mean of level C)

 

 

--
Paige Miller
jgreenberg321
Fluorite | Level 6
Thank you very much for the reply. I think I’m this instance, the main rationale for the Estimate statement was to exponentiate the model coefficients, though I can also do that by hand. LSMeans would not help me with that, right?

More generally, if my categorical variables had >= 3 levels, then I assume I would need the estimate statement to specify custom between-group comparisons?

Thank you!
Jacob
PaigeMiller
Diamond | Level 26

Thank you very much for the reply. I think I’m this instance, the main rationale for the Estimate statement was to exponentiate the model coefficients, though I can also do that by hand. LSMeans would not help me with that, right?

You can use the EXP option in the LSMEANS statement in PROC GENMOD.

 

if my categorical variables had >= 3 levels, then I assume I would need the estimate statement to specify custom between-group comparisons?

It depends. If you want to compare level 2 to level 6, LSMEANS will do that for you. If you want to compare the least squares means of levels 2 and 3 combined to the means of levels 4, 5 and 6 combined, then this is a perfect situation for the LSMESTIMATE statement (not the ESTIMATE statement).

--
Paige Miller
jgreenberg321
Fluorite | Level 6

Thank you. I think LSEstimate is in fact what I need. I have been trying to build up toward tackling the more complicated analyses. The more complicated problem I am trying to address is a triple interaction. The model setup is essentially the same. Here the outcome is medicaid insurance status, and the model components are the same, except adding a triple interaction for non-white race to evaluate the triple interaction between non-white race (R), policy exposure (E), and post-expansion period (P). White is the reference category. Ignoring confounders, the basic regression equation for this should look like:

 

y = b(0) + b1*P + b2*E + b3*R + b4*E*P + b5*E*R + b6*P*R + b7*P*E*R.

 

Doing a bit of math, the effect of expansion on non-white races (R=1) compared to white races (R=0) should be: (b6 + b7). How can I use LSEstimate to calculate the mean of these these coefficients (b6 + b7) relative to the reference category of 0, and test whether this comparison is significant? In case is helpful, snapshots of the model output and specification are attached. Thank you very much!

jgreenberg321
Fluorite | Level 6

To be more specific, I am looking to test the statistical significance of the following statement:

 

'exposure:white + period:exposure:white = 0'

SteveDenham
Jade | Level 19

Let's take this in steps.  First, what does your code look like?  Second, what are the LSMeans for your categorical variables?  From the last, we can construct the comparisons of interest in an LSMESTIMATE statement.  The results will be on the log-odds scale.  Finally from that, we can exponentiate (or use the ILINK option, depending on your code) and get the odds ratios.

 

SteveDenham

jgreenberg321
Fluorite | Level 6

Thank you both (@PaigeMiller and @ SteveDenham) for the replies. To rephrase my question in model terms: I am interested in looking at the differential effect of exposure on white vs. non-white patients. In the summary equation below, E=exposure, R=Race, and P=period.

 

y = b(0) + b1*P + b2*E + b3R + b4*E*P + b5*P*R + b6*E*R + b7*P*E*R.

 

The effect on whites should be (b2 + b4), for non-whites should be (b2 + b4 + b6 + b7). Therefore, the differential effect should be (b6 + b7). In my model, this would correspond to the coefficients for exposure*race + period*exposure*race. My SAS model code is below, setup as a linear regression so that model coefficients should provide a direct estimate of the proportion of patients with Medicaid insurance in this particular model. 

 

proc genmod data=have;

class year (ref='2011')  White (ref='1')  exposure (ref='0') period (ref='0') confounder1 hospst/;

model medicaid=   White exposure period exposure*period white*period white*exposure white*exposure*period

confounder1;

repeated subject= hospst/type=ind;

lsmestimate white*exposure*period 'estimate' 1 -1 -1 1 -1 1 1 -1;

run;

 

I may be mistaken, but it seemed  that if I could specify the correct contrast that incorporated both model terms, then I could obtain the p-value for both statements being true (i.e. b6 + b7), relative to 0. For example:

 

contrast 'period:expansion + period:expansion:race = 0'  expansion*period*race

1 -1 -1 1 -1 1 1 -1,  expansion*period 1 -1 -1 1;

 

I obtained the lsmestimate statement above for the triple interaction by (1, -1)*(1,-1)*(1,-1), and I have found that this lsmestimate statement gives the correct p-value for the interaction specified, as reflected in the main model output. I would have thought that the exposure white*exposure could be specified with (1 -1 -1 1). However, when I test that as an independent statement, it does not give the correct p-value as compared to the model coefficient. 

 

StatDave
SAS Super FREQ

It is really always best to avoid using statements where you need to figure out the right coefficients and the right order of them to get the answers you want. That means avoiding the ESTIMATE, CONTRAST, and even LSMESTIMATE statement if possible. When you say your goal is to assess the "differential effect of exposure on white vs. non-white patients", this suggests you just want to see if the exposure/no exposure difference changes for white vs non-white patients. This can be done with a SLICE statement. For example:  

slice e*r / sliceby=r means;

PaigeMiller
Diamond | Level 26

"It is really always best to avoid using statements where you need to figure out the right coefficients and the right order of them to get the answers you want. That means avoiding the ESTIMATE, CONTRAST, and even LSMESTIMATE statement if possible."


Good advice. I think I said something like that earlier in this thread. It should be a maxim.

--
Paige Miller
PaigeMiller
Diamond | Level 26

I have to admit that I have never actually had to do a test like the one you are describing.

 

Nevertheless, it seems as if the SLICE command in PROC GENMOD will allow you to compare each interaction P*R and P*E*R for R=0 versus R=1. But as I understand it, the SLICE command works on a single interaction at a time (it will work on P*R and then of course you can do it in a different SLICE command for P*E*R). Nevertheless, that may give you enough information to come to a conclusion. (Or maybe not).

 

I am also wondering if you can re-parameterize the model such that P*R and P*E*R are now one term in the model so that one SLICE statement is all that is needed, but I need to think about that some more.

--
Paige Miller
jgreenberg321
Fluorite | Level 6

That is an interesting idea. I was thinking of something using the Contrast statement, as I laid out above, though am not sure if that will work exactly as I imagine. 

 

As far as re-parameterizing the model, I'm not quite sure how I could do that, though am certainly open to any innovative thoughts!

SteveDenham
Jade | Level 19

Try running with an LSMEANS statement and the /e option if you are going down the CONTRAST or ESTIMATE road.  

 

If you want to stay with the LSMESTIMATE method, consider that white*exposure can be obtained by finding the marginal mean over period from the white*exposure*period means.  That should look something like

lsmestimate white*exposure*period 'difference in exposure' 1 1 -1 -1 1 1 -1 -1/divisor=2;

lsmestimate white*exposure*period 'difference in white' 1 1 1 1 -1 -1 -1 -1/divisor=2;

and then the combination of both would be the sum of the coefficients to get

lsmestimate white*exposure*period 'difference in exposure and white' 1 1 0 0 0 0 -1 -1;

 

Try these and see if the p values match what is in the solution vector.  If you want to do CONTRAST or ESTIMATE statements, multiply the coefficients from the /e option (the vectors) by the coefficient relative to each LSmean in the LSMESTIMATE statement.  This way you will include all of the two way, one way and intercept coefficients.  Luckily, many will cancel out.

 

SteveDenham

jgreenberg321
Fluorite | Level 6

Thank you for replying. I am honestly happy to go with either the lsmestimate or the Contrast approach. That said, I'm having trouble fully following what you proposed. Probably most important to me following, how did you obtain the lsmestimate specifications for the comparisons you showed (e.g. multiplying different specifications?)? 

 

That said, I tried entering the summed statement you gave (lsmestimate white*exposure*period 'difference in exposure and white' 1 1 0 0 0 0 -1 -1;) into my model, but got a different p-value from the one obtained in the model coefficients. Is this specification supposed to test whether white*exposure is significantly different from 0, as the model coefficient does? I can see just from the LSMeans table that the p-value provided for White=1, Exposure=1 is not the same as the coefficient in the model output. 

 

Adding insult to injury, out of nowhere I am getting the following error message from lsmestimate, even though the model runs fine with this statement removed 😞

 

ERROR: Variable 'period 'n not found.

ERROR 22-322: Syntax error, expecting one of the following: a numeric constant, a datetime constant, +, -, [. 

ERROR 76-322: Syntax error, statement will be ignored.

SteveDenham
Jade | Level 19

Oops, I must apologize - I must have had  period as the odd spelling.

 

So, what does lsmestimate white*exposure*period 'difference in exposure and white' 1 1 0 0 0 0 -1 -1; really do?  It adds the lsmean of white =1, exposure=1, period=1 to white=1, exposure=1 period=2, and then subtracts the sum of white=2, exposure=2, period=1 and white=2, exposure=2, period=2.  thus averaging over both periods and looking at the diagonals of the 2x2 for white and exposure.  That is different to the term for the three way interaction in the solution table. That is the additional bit due to what is in the first cell of a 2x2x2 matrix as compared to the reference values for the two-way and one-way variables. Here is some code that might help make this clearer (I know this is GLIMMIX, but all of the principal ideas are the same)

 

data have;
input ANML_NO	GRP_NO	SEX	MEAS_CD	time	day	t	value;
datalines;
1501	1	1	7004	0	1	0	34.28
1503	1	1	7004	0	1	0	34.18
1504	1	1	7004	0	1	0	41.72
1505	1	1	7004	0	1	0	24.25
3502	1	1	7004	0	1	0	38.09
1501	1	1	7004	4	1	4	34.52
1503	1	1	7004	4	1	4	28.54
1504	1	1	7004	4	1	4	32.47
1505	1	1	7004	4	1	4	23.59
3502	1	1	7004	4	1	4	31.15
2501	2	1	7004	0	1	0	82.96
2502	2	1	7004	0	1	0	23.02
2503	2	1	7004	0	1	0	25.94
2504	2	1	7004	0	1	0	22.56
2505	2	1	7004	0	1	0	.
2501	2	1	7004	4	1	4	77.06
2502	2	1	7004	4	1	4	29.31
2503	2	1	7004	4	1	4	33.21
2504	2	1	7004	4	1	4	23.81
2505	2	1	7004	4	1	4	11.65
1501	1	1	7004	0	36	0	46.17
1503	1	1	7004	0	36	0	53.65
1504	1	1	7004	0	36	0	60.3
1505	1	1	7004	0	36	0	36.42
3502	1	1	7004	0	36	0	40.71
1501	1	1	7004	4	36	4	47.41
1503	1	1	7004	4	36	4	55.05
1504	1	1	7004	4	36	4	61.76
1505	1	1	7004	4	36	4	41.54
3502	1	1	7004	4	36	4	40.3
2501	2	1	7004	0	36	0	71.6
2502	2	1	7004	0	36	0	33.09
2503	2	1	7004	0	36	0	56.24
2504	2	1	7004	0	36	0	37.56
2505	2	1	7004	0	36	0	22.77
2501	2	1	7004	4	36	4	52.77
2502	2	1	7004	4	36	4	25.44
2503	2	1	7004	4	36	4	61.52
2504	2	1	7004	4	36	4	42.75
2505	2	1	7004	4	36	4	19.55
;

proc glimmix data=have;* initglm noprofile;
nloptions tech=quanew;
class grp_no day time anml_no;
model value = grp_no|day|time/dist=lognormal ddfm=bw s;
random day/subject=anml_no type=chol;
random time/subject=anml_no*day type=sp(pow)(t) residual;
lsmeans grp_no*day*time/e;
lsmeans grp_no*day/diff;
lsmestimate grp_no*day*time 'Like yours' 1 1 0 0 0 0 -1 -1/divisor=2 e;
run;

If you run this and look at the table titled 

Coefficients for GRP_NO*day*time Least Squares Means

you will see a bunch of 1's and empty cells.  If you plug in the values from the solution for fixed effects wherever there is a 1, and sum the columns, you will get the least squares means.  Note carefully that the p values in the solution vector do not match the type 3 test p values, except for the highest level interaction.  This is because the effects in the type 3 tests are constructed by adding the appropriate solution values together, thus rendering the parameter p values different to the effect p values.  So, I would not use the p values in the solution for much.

 

Now, on to your original question - 

y = b(0) + b1*P + b2*E + b3R + b4*E*P + b5*P*R + b6*E*R + b7*P*E*R.

 

The effect on whites should be (b2 + b4), for non-whites should be (b2 + b4 + b6 + b7). Therefore, the differential effect should be (b6 + b7). In my model, this would correspond to the coefficients for exposure*race + period*exposure*race. My SAS model code is below, setup as a linear regression so that model coefficients should provide a direct estimate of the proportion of patients with Medicaid insurance in this particular model.

 

OK, translating this to the stuff I put in P=grp_no, E=day, R=time, I get b2*E + b4*E*P or -.2398 - .2483 and b2*E + b4*E*P + b6*E*R +b7*P*E*R or -.2398 - .2483 -.2126 +.3826, finally getting -.2126 +..3826 = .17.  But none of the P*E means have this difference, and the marginal difference between E means is -.3746.  No matter how much I juggle numbers, I can't see how these are related.

 

But look at the lsmeans.  Here I get all of the means for the 2x2x2 arrangement and for the 4 marginal means for the 2x2 combination of grp_no and day (equivalent to exposure and race in your example).  These means in my case are logs of the value on the original scale, so I could get geometric means by exponentiating.  In your case, you would get counts by exponentiating, and ratios of counts by exponentiating the difference between the lsmeans.    So, long story to get here, but the lsmeans should set you on the trail to answering your question.

 

Edited: the stuff in italics in the above paragraph to reflect your use of a negative binomial distribution.

SteveDenham

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 23 replies
  • 8723 views
  • 5 likes
  • 4 in conversation