BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
RonLev
Calcite | Level 5

We would like to use PROC GENMOD to analyze non-normal data in the same way that we use PROC GLM (and other linear procedures like PROC MIXED, etc. ) for normal data.  In GLM and other similar procedures, the least squares means (LSM) differences output is expressed in the same units as used in the model:    for example, if we want to analyze the impact of  a new drug or procedure on days of hospitalization of dollars of healthcare cost, the differences in days or dollars appears in the LSM difference output.   But in GENMOD, the LSM differences are only expressed in odds ratios--the original units and scaling are lost.   Is there a feature of GENMOD, a workaround or custom  code available to restore the original units to this output?

RonLev

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

Ron,

It shouldn't change much--the tinv value doesn't change much once you are this far out.  It also explains why I got the 1.96 when I calculated from your output.  Given this turn of events, I would now calculate the SE of the diff as:

stddev1 = sqrt(4213) * 3474.49 = 225520.9

stddev2 = sqrt(446134) * 1759.75 = 1175394.4

pooled SEdiff = 3894.71

lower95% = 12759 - 7633.64 = 5125.36

upper95% = 12759 + 7633.64 = 20392.64

which is a little tighter, but remember that this is the confidence interval on the difference, and you are well away from including zero in that interval.

Steve Denham

View solution in original post

11 REPLIES 11
SteveDenham
Jade | Level 19

I doubt very much that you can get the differences or standard errors of differences on the original scale as direct output.  Using the ILINK option will give LSmeans on the original scale, and by post-processing, you could calculate differences and approximate asymptotic standard errors of the difference, using the confidence bounds on the back-transformed means to get values to plug through various approximations.  For instance, setting ALPHA=0.308 should give approximately 2 SEM as the bounds.  Granted these are asymmetric, but an approximate value can be obtained as StdErr/2, which then can be used for each mean to calculate an approximate standard error of the difference.  Not pretty, and if you have more than three or four comparisons, not necessarily easy to program.

Steve Denham

RonLev
Calcite | Level 5

Hi Steve,

Oddly, the LSM in PROC GENMOD is already in the original scale, but not the  LSM difference,  you can get what you are looking for with just

lsmeans classvar/pdiff cl alpha=0.308;

I'm still in the dark about the next step.   You now  have 2 means (in the original scale) and  2 standard errors (I'm not sure what scale they are in).   How do you get a single difference in means and a single confidence interval out of these 2 sets of statistics?

Ron L.

SteveDenham
Jade | Level 19

If you have means and standard errors, you can calculate differences and standard errors of the difference.  Things work out more easily if you have a balanced design, so that degrees of freedom are equal, but if not:

SE diff = sqrt( 1/ n1 + 1/n2) * s

where n1 is df for mean 1+1, n2 is df for mean 2 +1, and s is the common standard error, which you can get from an elementary stat text (too many square roots, divisions and exponents to do in plain text).

I would change the lsmeans statement to include the ILINK option, so that the bounds you get are on the original scale.

Steve Denham

RonLev
Calcite | Level 5

Steve,

Every "elementary stat text" I've checked uses standard deviations  to compute the common (pooled) Standard Error, not standard errors.  I tried back-calculating  the standard deviation from the 2 standard errors we got from GENMOD ( SD = (N^0.5) * SE)  and plugged them into one of the conventional formulas to get the SEM diff but my result is much too small.    Can you give me a reference (either on the net, or in text, though indeed it might have "too many square roots, divisions and exponents.......)  I'd much appreciate it.

RonLev

SteveDenham
Jade | Level 19

Ron,

Point well taken. 

OK, I'm still battling some brain congestion, so let's get some more detail.  First, what distribution and link are you using in GENMOD?  Second, could you post some of the output from the LSMeans part.  I think if I can do one specific calculation, it should be a lot easier to generalize.

For instance, for a lognormal distribution, I use the following code (dataset is output from PROC GLIMMIX):

data lsmeans_logn_bt;

set lsmeans_logn;

omega=exp(stderr*stderr*df);

btmean=exp(estimate)*sqrt(omega)-1;

btvar=exp(2*estimate)*omega*(omega-1);

btsem=sqrt(btvar/df);

run;

Note that here I don't calculate from the confidence bounds. 

If I had to, say for some other distribution, it would look something like:

/* UNTESTED CODE */

stderr = (upper - lower) / (2 * tinv (alpha, df));

stddev = sqrt (df + 1) * stderr;

and then plug the stddev for each mean into:

sddiff = sqrt ( (stddev1 * stddev1 / (df1 + 1) ) + (stddev2 * stddev2 / (df2 + 1)));

in all this upper and lower are confidence bounds with alpha selected to make them one standard error above and below, and then I indexed the individual calculations.  If this isn't working, then I think I really need to see more of what you are getting in your output.

Steve Denham

RonLev
Calcite | Level 5

Steve,

Ah, the inevitable can of worms has been opened. 

To get back to first principles, I'm working on factors responsible  for healthcare costs, in particular infections (SSI ) on additional costs after hip and knee surgery, adjusting for age, gender and region of the US.. (HC costs are a classic  example of a  non-normal distribution with a LONG right tail--in the "old" days  we used  GLM or MIXED  with a log transformation of the costs, but retransformation bias caused multible headaches;  to avoid that we are transitioning to GENMOD with a dist=gamma and link=log is used.   Thanks  for your INLINK suggestion--it clears some of the clouds away from the obtuse LSM diff answer SAS gives, but we are still lacking an SEM diff (and the appropriate confidence interval).  

The model looks like this:

            proc genmod data=hipknee2;

            class  ssi age gender geo;

            model pat_cost=ssi /dist=gamma link=log;

            lsmeans SSI/pdiff  cl ilink;

            run;

The output you are interested in is as follows:

                                      ssi Least Squares Means

                                                                                         Standard

                      Standard                                                           Error of     Lower     Upper

  ssi       Estimate     Error  z Value  Pr > |z|   Alpha     Lower     Upper      Mean      Mean      Mean      Mean

  infected   10.1633    0.1340    75.86    <.0001    0.05    9.9007   10.4259     25933   3474.49     19944     33721

  noninfec    9.4860    0.1336    71.01    <.0001    0.05    9.2242    9.7478     13174   1759.75     10139     17117

                                        Differences of ssi Least Squares Means

                                             Standard

         ssi         _ssi        Estimate       Error    z Value    Pr > |z|     Alpha       Lower       Upper

         infected    noninfec      0.6773     0.01042      65.01      <.0001      0.05      0.6569      0.6977

Getting the mean difference between infected and noninfected patients is easy now that we have ILINK output;  So what we need now is (hopefully) find a pooled SEM so we calculate the mean diff confidence intervals.

RonLev

Message was edited by: Ronald Levine Steve,  I've copied the above output text to the attachment since it got compressed in the column above.

SteveDenham
Jade | Level 19

One other idea, the LSMESTIMATE went out the window, as it looks like the inverse link is applied to the difference, so you would get something like an odds ratio.

So, with an alpha=.05, it looks like tinv=1.96 within rounding error, which suggests df for each mean of infinity.  That won't work, so I am going to pick a large n for each group, say n=100.

Thus upper minus lower should be about 3.92 "standard errors".

Infected: 33721 - 19944 = 13777

13777 /  3.92 = 3514.54 (as a standard error)

Stddev1 = sqrt(100) * 3514.54 = 35145.4

Non-infected: 17117 - 10139 = 6978

6978 / 3.92 = 1780.10 (as a standard error)

Stddev2 = sqrt(100) * 1780.10 = 17801.0

Pooled standard error of the difference = sqrt ( (35145.4)*(35145.4)/100 + (17801)*(17801) / 100) = 3939.64 (formula from page 81 of Steel and Torrie, Principles and Procedures of Statistics, 1960)

Thus an approximate 95% confidence interval on the difference of 25933 - 13174 = 12759 would be

     Lower 12759 - 3939.64 * 1.972 = 12759-7769.04 = 4989.96

     Upper: 12759 + 3939.64 * 1.972 = 12759 + 7769.04 = 20528.04, where 1.972 is the t value for a 2 tailed distribution with 198 degrees of freedom.

I note that the standard errors I came up with from the asymmetric confidence bounds are not much different from what is reported, so you could probably use an ODS statement to capture all of the needed values in a single dataset, never bother with the upper minus lower business, and get what you need fairly easily.

I would recommend moving from GENMOD to GLIMMIX, in case you want to regard any of your factors as random in a hierarchical sense, and to get the degrees of freedom for the estimates in the dataset (as it looks as though GENMOD doesn't print them).

Good luck,

Steve Denham

RonLev
Calcite | Level 5

Steve,

Thanks for your yeoman's job on this.  (I wonder if we should charge SAS for the time we waste tryng to make their s/w report what it should have been designed to do from the outset!)

It looks like your imputed confidence intervals (4989.96, 20528.04) are quite wide.   I was wondering:  you assumed a sample  size of N=100 (for each infect and noninfect cohort);  the actual sample was N=4213  for the infected cohort and N=446134  for the noninfected cohort.   Will  these sizes change your calculations?

RonLev

SteveDenham
Jade | Level 19

Ron,

It shouldn't change much--the tinv value doesn't change much once you are this far out.  It also explains why I got the 1.96 when I calculated from your output.  Given this turn of events, I would now calculate the SE of the diff as:

stddev1 = sqrt(4213) * 3474.49 = 225520.9

stddev2 = sqrt(446134) * 1759.75 = 1175394.4

pooled SEdiff = 3894.71

lower95% = 12759 - 7633.64 = 5125.36

upper95% = 12759 + 7633.64 = 20392.64

which is a little tighter, but remember that this is the confidence interval on the difference, and you are well away from including zero in that interval.

Steve Denham

RonLev
Calcite | Level 5

Steve,

You've been a wonderful help on this  (rather long) problem.  Refreshingly practical, without retreating into staticalese.   I hope that SAS will extend the ILINK option to include recomputing the LSM diff as well at some point in the future, but until then I will use your highly useful workaround.

Thanks again,

Ron Levine

Ryan
Calcite | Level 5

You should be able to use ESTIMATE statements via the NLMIXED procedure to obtain contrasts of interest on the original scale. Can you provide us with more details? Is this a binary logistic regression model from which you would like to test for differences in proportions?

Ryan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 3723 views
  • 0 likes
  • 3 in conversation