We would like to use PROC GENMOD to analyze non-normal data in the same way that we use PROC GLM (and other linear procedures like PROC MIXED, etc. ) for normal data. In GLM and other similar procedures, the least squares means (LSM) differences output is expressed in the same units as used in the model: for example, if we want to analyze the impact of a new drug or procedure on days of hospitalization of dollars of healthcare cost, the differences in days or dollars appears in the LSM difference output. But in GENMOD, the LSM differences are only expressed in odds ratios--the original units and scaling are lost. Is there a feature of GENMOD, a workaround or custom code available to restore the original units to this output?
RonLev
Ron,
It shouldn't change much--the tinv value doesn't change much once you are this far out. It also explains why I got the 1.96 when I calculated from your output. Given this turn of events, I would now calculate the SE of the diff as:
stddev1 = sqrt(4213) * 3474.49 = 225520.9
stddev2 = sqrt(446134) * 1759.75 = 1175394.4
pooled SEdiff = 3894.71
lower95% = 12759 - 7633.64 = 5125.36
upper95% = 12759 + 7633.64 = 20392.64
which is a little tighter, but remember that this is the confidence interval on the difference, and you are well away from including zero in that interval.
Steve Denham
I doubt very much that you can get the differences or standard errors of differences on the original scale as direct output. Using the ILINK option will give LSmeans on the original scale, and by post-processing, you could calculate differences and approximate asymptotic standard errors of the difference, using the confidence bounds on the back-transformed means to get values to plug through various approximations. For instance, setting ALPHA=0.308 should give approximately 2 SEM as the bounds. Granted these are asymmetric, but an approximate value can be obtained as StdErr/2, which then can be used for each mean to calculate an approximate standard error of the difference. Not pretty, and if you have more than three or four comparisons, not necessarily easy to program.
Steve Denham
Hi Steve,
Oddly, the LSM in PROC GENMOD is already in the original scale, but not the LSM difference, you can get what you are looking for with just
lsmeans classvar/pdiff cl alpha=0.308;
I'm still in the dark about the next step. You now have 2 means (in the original scale) and 2 standard errors (I'm not sure what scale they are in). How do you get a single difference in means and a single confidence interval out of these 2 sets of statistics?
Ron L.
If you have means and standard errors, you can calculate differences and standard errors of the difference. Things work out more easily if you have a balanced design, so that degrees of freedom are equal, but if not:
SE diff = sqrt( 1/ n1 + 1/n2) * s
where n1 is df for mean 1+1, n2 is df for mean 2 +1, and s is the common standard error, which you can get from an elementary stat text (too many square roots, divisions and exponents to do in plain text).
I would change the lsmeans statement to include the ILINK option, so that the bounds you get are on the original scale.
Steve Denham
Steve,
Every "elementary stat text" I've checked uses standard deviations to compute the common (pooled) Standard Error, not standard errors. I tried back-calculating the standard deviation from the 2 standard errors we got from GENMOD ( SD = (N^0.5) * SE) and plugged them into one of the conventional formulas to get the SEM diff but my result is much too small. Can you give me a reference (either on the net, or in text, though indeed it might have "too many square roots, divisions and exponents.......) I'd much appreciate it.
RonLev
Ron,
Point well taken.
OK, I'm still battling some brain congestion, so let's get some more detail. First, what distribution and link are you using in GENMOD? Second, could you post some of the output from the LSMeans part. I think if I can do one specific calculation, it should be a lot easier to generalize.
For instance, for a lognormal distribution, I use the following code (dataset is output from PROC GLIMMIX):
data lsmeans_logn_bt;
set lsmeans_logn;
omega=exp(stderr*stderr*df);
btmean=exp(estimate)*sqrt(omega)-1;
btvar=exp(2*estimate)*omega*(omega-1);
btsem=sqrt(btvar/df);
run;
Note that here I don't calculate from the confidence bounds.
If I had to, say for some other distribution, it would look something like:
/* UNTESTED CODE */
stderr = (upper - lower) / (2 * tinv (alpha, df));
stddev = sqrt (df + 1) * stderr;
and then plug the stddev for each mean into:
sddiff = sqrt ( (stddev1 * stddev1 / (df1 + 1) ) + (stddev2 * stddev2 / (df2 + 1)));
in all this upper and lower are confidence bounds with alpha selected to make them one standard error above and below, and then I indexed the individual calculations. If this isn't working, then I think I really need to see more of what you are getting in your output.
Steve Denham
Steve,
Ah, the inevitable can of worms has been opened.
To get back to first principles, I'm working on factors responsible for healthcare costs, in particular infections (SSI ) on additional costs after hip and knee surgery, adjusting for age, gender and region of the US.. (HC costs are a classic example of a non-normal distribution with a LONG right tail--in the "old" days we used GLM or MIXED with a log transformation of the costs, but retransformation bias caused multible headaches; to avoid that we are transitioning to GENMOD with a dist=gamma and link=log is used. Thanks for your INLINK suggestion--it clears some of the clouds away from the obtuse LSM diff answer SAS gives, but we are still lacking an SEM diff (and the appropriate confidence interval).
The model looks like this:
proc genmod data=hipknee2;
class ssi age gender geo;
model pat_cost=ssi /dist=gamma link=log;
lsmeans SSI/pdiff cl ilink;
run;
The output you are interested in is as follows:
ssi Least Squares Means
Standard
Standard Error of Lower Upper
ssi Estimate Error z Value Pr > |z| Alpha Lower Upper Mean Mean Mean Mean
infected 10.1633 0.1340 75.86 <.0001 0.05 9.9007 10.4259 25933 3474.49 19944 33721
noninfec 9.4860 0.1336 71.01 <.0001 0.05 9.2242 9.7478 13174 1759.75 10139 17117
Differences of ssi Least Squares Means
Standard
ssi _ssi Estimate Error z Value Pr > |z| Alpha Lower Upper
infected noninfec 0.6773 0.01042 65.01 <.0001 0.05 0.6569 0.6977
Getting the mean difference between infected and noninfected patients is easy now that we have ILINK output; So what we need now is (hopefully) find a pooled SEM so we calculate the mean diff confidence intervals.
RonLev
Message was edited by: Ronald Levine Steve, I've copied the above output text to the attachment since it got compressed in the column above.
One other idea, the LSMESTIMATE went out the window, as it looks like the inverse link is applied to the difference, so you would get something like an odds ratio.
So, with an alpha=.05, it looks like tinv=1.96 within rounding error, which suggests df for each mean of infinity. That won't work, so I am going to pick a large n for each group, say n=100.
Thus upper minus lower should be about 3.92 "standard errors".
Infected: 33721 - 19944 = 13777
13777 / 3.92 = 3514.54 (as a standard error)
Stddev1 = sqrt(100) * 3514.54 = 35145.4
Non-infected: 17117 - 10139 = 6978
6978 / 3.92 = 1780.10 (as a standard error)
Stddev2 = sqrt(100) * 1780.10 = 17801.0
Pooled standard error of the difference = sqrt ( (35145.4)*(35145.4)/100 + (17801)*(17801) / 100) = 3939.64 (formula from page 81 of Steel and Torrie, Principles and Procedures of Statistics, 1960)
Thus an approximate 95% confidence interval on the difference of 25933 - 13174 = 12759 would be
Lower 12759 - 3939.64 * 1.972 = 12759-7769.04 = 4989.96
Upper: 12759 + 3939.64 * 1.972 = 12759 + 7769.04 = 20528.04, where 1.972 is the t value for a 2 tailed distribution with 198 degrees of freedom.
I note that the standard errors I came up with from the asymmetric confidence bounds are not much different from what is reported, so you could probably use an ODS statement to capture all of the needed values in a single dataset, never bother with the upper minus lower business, and get what you need fairly easily.
I would recommend moving from GENMOD to GLIMMIX, in case you want to regard any of your factors as random in a hierarchical sense, and to get the degrees of freedom for the estimates in the dataset (as it looks as though GENMOD doesn't print them).
Good luck,
Steve Denham
Steve,
Thanks for your yeoman's job on this. (I wonder if we should charge SAS for the time we waste tryng to make their s/w report what it should have been designed to do from the outset!)
It looks like your imputed confidence intervals (4989.96, 20528.04) are quite wide. I was wondering: you assumed a sample size of N=100 (for each infect and noninfect cohort); the actual sample was N=4213 for the infected cohort and N=446134 for the noninfected cohort. Will these sizes change your calculations?
RonLev
Ron,
It shouldn't change much--the tinv value doesn't change much once you are this far out. It also explains why I got the 1.96 when I calculated from your output. Given this turn of events, I would now calculate the SE of the diff as:
stddev1 = sqrt(4213) * 3474.49 = 225520.9
stddev2 = sqrt(446134) * 1759.75 = 1175394.4
pooled SEdiff = 3894.71
lower95% = 12759 - 7633.64 = 5125.36
upper95% = 12759 + 7633.64 = 20392.64
which is a little tighter, but remember that this is the confidence interval on the difference, and you are well away from including zero in that interval.
Steve Denham
Steve,
You've been a wonderful help on this (rather long) problem. Refreshingly practical, without retreating into staticalese. I hope that SAS will extend the ILINK option to include recomputing the LSM diff as well at some point in the future, but until then I will use your highly useful workaround.
Thanks again,
Ron Levine
You should be able to use ESTIMATE statements via the NLMIXED procedure to obtain contrasts of interest on the original scale. Can you provide us with more details? Is this a binary logistic regression model from which you would like to test for differences in proportions?
Ryan
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.