I am looking for input on the most responsible way to report results. I regularly run models in Proc Glimmix to look at factors that effect exposure to particulate matter. Nearly without exception, running these models with the dist=logn is the most successful approach in terms of convergence and normal distribution of residuals. I struggle with the best way to represent the results. My audience and I are most interested in understanding the results in terms of the original scale: How does exposure differ between groups? What exposure might be expected for each group? I would ideally like to display mean and 95% confidence intervals. Is the best approach to achieve back transformed lsmeans (and back transformed standard errors in order to calculate 95% confidence intervals) as described by SteveDenham on these community boards? Is it okay to calculate the upper and lower confidence limits as I have coded? Or is the best way to convey findings to perform exp(estimates) and actually display the geometric mean, especially if I am including the confidence intervals in any figures and tables? My audience is usually not well versed with the intricacies of generalized linear models.
Sample code below:
Proc Glimmix data=extended plots=all;
class Hay Measurement;
model maxPM2_5=hay/ dist=logn solution;
random horse;
lsmeans hay/ cl plot=meanplot (cl);
ods output lsmeans=lsmeans;
run;
data btlsmeans;
set lsmeans;
omega=exp(stderr*stderr);
btlsmean=exp(estimate)*sqrt(omega);
btvar=exp(2*estimate)*omega*(omega-1);
btsem=sqrt(btvar);
up=btlsmean+(1.96*btsem);
low=btlsmean-(1.96*btsem);
run;
Thank you for any insight that you can provide.
I would favor reporting the geometric mean (exp(estimate)) and the confidence bounds obtained by exponentiating the bounds. Those describe the data in hand. However, if you want the expected value and expected standard error of a population, the back-transformation given will yield that. Rather than using 1.96 as the factor, you might want to use this function:
tvalue=quantile('T',0.975,df);
This will give the upper 97.5th quantile of a t distribution with df degrees of freedom. The degrees of freedom are one of the outputs in the lsmeans dataset, so you have ready access to that value. There is one issue that jumps out at me though, and that is the confidence bound calculated this way are symmetric around the expected value, which is not what you would expect for a lognormal distribution. Per Wikipedia, the quantile function for the lognormal distribution is:
exp(estimate + (sqrt(2 * stderr*stderr) * erf^-1(2*p - 1)), where erf^-1 is the inverse error function (NOT 1 divided by the error function). Break this into parts, use the Maclaurin series approximation for the inverse error function out to about (2*p - 1) to the 7th or 9th power term, and plug that back into the equation. Plug in p=0.025 for the lower bound and p=0.975 for the upper bound.
PLEASE NOTE THAT EVERYTHING IN ITALICS NEEDS TO BE TESTED IN DATA STEP CODE BEFORE TRUSTING THE RESULTS.
SteveDenham
Thank you for your quick response, SteveDenham. That is very helpful!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.