Statistical Procedures

palolix

Dear SAS Community,

I am using a gamma distribution to analyze my data in proc genmod but I am not so sure if I should use this dist if my outcome variable is a percentage. I compared AIC values between normal and gamma and those for gamma were just slightly lower.

When I plotted my data this is how it looks like:

proc univariate data=one normal;
where Season=2022;
var PercDM;
histogram PercDM;
run;
quit;

This is the code I am using:

proc genmod data=one;
where Season=2022;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=gamma link= log;
slice Harvest*Variety/sliceby=Harvest sliceby=Variety adjust=simulate(seed=1);
run;

I would greatly appreciate your feedback!

Thanks

Caroline

StatDave

You can still use GLIMMIX if you don't have random effects - just don't include a RANDOM statement. Again, see the usage note I referred to. For your continuous proportions, you can fit the fractional logistic model using either GLIMMIX or LOGISTIC as shown there. In the example there, the RANDOM statement in GLIMMIX is used just to estimate a reasonable scale parameter (it does not add a random effect) in the same way as the SCALE=PEARSON option is used in PROC LOGISTIC in that same note. Both use the binomial distribution. But as I've said, if the data in each separate population exhibits a distribution that is reasonably normal, then an analysis using that distribution might be reasonable. You might want to try both and see what trade-offs there.

View solution in original post

StatDave

The gamma distribution is not bounded at 1 as is a proportion, so it theoretically does not apply. Is your percentage a ratio of two known counts, a numerator count and a total count, like the number of events that occurred out of some possible total? If so, then the response is actually binomial and can be modeled using the events/trial syntax in PROC LOGISTIC. If the percentage is inherently continuous, like a proportion of a chemical in a mixture, then you could consider the types of models discussed in this note using the LOGISTIC or GLIMMIX procedures. However, if the histogram that you show summarizes data in a single population (one setting of both Variety and Harvest), then it appears to be reasonably normal with its mean large enough and its variance small enough to be reasonably symmetric. The penalty in that case with choosing the wrong distribution is small, affecting primarily the size of the standard errors which will, in turn, affect the significance of the tests.

palolix

Thank you so much for your reply StatDave. My outcome var is continuos (percentage of dry matter). I am using proc genmod instead of glimmix because I only have fixed effects. In that case I guess I would use a normal distribution as you suggested. In the case of a continuos outcome var that is an average (fruit weight), should I also use normal instead of gamma?
Thank you StatDave!

StatDave

You can still use GLIMMIX if you don't have random effects - just don't include a RANDOM statement. Again, see the usage note I referred to. For your continuous proportions, you can fit the fractional logistic model using either GLIMMIX or LOGISTIC as shown there. In the example there, the RANDOM statement in GLIMMIX is used just to estimate a reasonable scale parameter (it does not add a random effect) in the same way as the SCALE=PEARSON option is used in PROC LOGISTIC in that same note. Both use the binomial distribution. But as I've said, if the data in each separate population exhibits a distribution that is reasonably normal, then an analysis using that distribution might be reasonable. You might want to try both and see what trade-offs there.

Season

You can try beta regression if your dependent variable lies in [0,1]. As you have said, the dependent variable of your model is a percentage, so it should lie in this interval. Beta regression is particularly suitable for handling heteroscedasticity in the model. As 57480 - Modeling continuous proportions: Normal and Beta Regression Models says, beta regression is supported in the GLIMMIX procedure. Alternatively, see 335-2011: Modeling Percentage Outcomes: The %Beta_Regression Macro for a macro of beta regression.

Ksharp

You could use dist=binomial and link=identity to directly build a model for a ratio/percent, but you need to restructure your data into Y=0 1:
https://support.sas.com/kb/37/228.html
proc genmod data=test descending;
class a;
model y = a / dist=binomial link=identity;
lsmeans a / diff cl;
run;

Or try Possion Regression as StatDave said.
But you also need to change your data structure.
https://support.sas.com/kb/24/188.html

proc genmod data=insure;
class car age;
model c = car age / dist=poisson link=log offset=ln;
run;

palolix

Thanks Ksharp,

So what if it is a continuous percentage like 45.76 and dont want to convert my data, then I guess I will use normal dist in proc genmod or glimmix as StatDave suggested

Ksharp

No. You can't expect percent is conforming to normal distribution when true percent is near zero or one, @Rick_SAS talk this topic at other session.
But from the graph you posted, it is near 0.5 and have bell shape , so I think use Normal or Lognormal dist is decent.

Season

@palolix wrote:
Thanks Ksharp,

So what if it is a continuous percentage like 45.76 and dont want to convert my data, then I guess I will use normal dist in proc genmod or glimmix as StatDave suggested

I think it might be better to take a second look at what @StatDave said. It seems that @StatDave was not telling you to build normal regression models with PROC GLIMMIX. Instead, @StatDave was referring you to PROC GLIMMIX if you chose to build fractional logistic regression models.

Your goal of keeping the dependent variable as it is (i.e., no transformation) might not be possible if you followed @StatDave's suggestion and used the fractional logistic regression model or followed mine and used the beta regression model, as both models involve the logit transformation of the dependent variable.

It is up to you to decide which model to use. But now that you have raised your preference on whether or not transform the dependent variable, I think you may take a look at a statistical field receiving relatively less attention- the regression on the area under curve of the receiver operating characteristic curve (AUC). The AUC is a well-known measure of diagnostic and predictive accuracy and, similar to the dependent variable of your model, lies in the interval [0,1]. Regression models of AUC (i.e., with the AUC as the dependent variable and variables that correlate with the diagnostic or predictive accuracy as the independent variables) have been developed. See, for instance, REGRESSION ANALYSIS FOR THE PARTIAL AREA UNDER THE ROC CURVE. You may take a look on this field to find out if there is a method that is not only capable of modeling doubly-bounded data that lie in [0,1] but also does not need transformation of the dependent variable, namely the goal you proposed.

SteveDenham

Look up leaf area index (LAI) and see what methods have been used for analyzing that endpoint. LAI is the proportion (or percentage/100) of the area of a given plot or transect that is covered by at least one leaf when viewed perpendicular to the ground. It is defined on the interval (0,1), bounded away from zero and one. When I last looked at the analyses that various folks used, there were a lot of options. Some have been mentioned here (beta regression, fractional logistic regression), but I am going to throw my support to some sort of resampling with replacement. Judging from the histogram you have a lot of observations, so taking samples of a relative size of (for example) total plots/20 and generating 5000 samples should not be difficult. From that, you can appeal to the central limit theorem to get means and confidence intervals. This might be more appropriate for your long right tail and non-unimodal data, which really looks like a mixture of two distributions to me.

No guarantees, no warranty implied.

SteveDenham

palolix

Thank you very much for your feedback Steve and the rest of the folks helping me in this post!

I analyzed the data with genmod using a normal distribution and also in glimmix using a beta dist and got almost identical adjusted p-values for the differences between varieties within each harvest. I am just surprised that when comparing the percentage of dry matter between varieties I get so many significant differences at many harvest points even with adjusted p-values (see graph).

proc genmod data=one;
where Season=2021;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link= log;
slice Harvest*Variety/sliceby=Harvest adjust=simulate(seed=1);
run;

proc glimmix data=one;
where Season=2021;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
run;

Season

@palolix wrote:

Thank you very much for your feedback Steve and the rest of the folks helping me in this post!

I analyzed the data with genmod using a normal distribution and also in glimmix using a beta dist and got almost identical adjusted p-values for the differences between varieties within each harvest. I am just surprised that when comparing the percentage of dry matter between varieties I get so many significant differences at many harvest points even with adjusted p-values (see graph).

proc genmod data=one;
where Season=2021;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link= log;
slice Harvest*Variety/sliceby=Harvest adjust=simulate(seed=1);
run;

Interesting discovery. But I am afraid that you are building a lognormal model instead of a normal one with this code as there is a "link=log" option specified in the MODEL statement. The "link=identity" option keeps the dependent variable as it is and build normal regression models.

You can also try out other options we suggested in this post. In addition to this, now that you have yielded similar P values in the lognormal and beta model, you can work on their goodness-of-fit and statistical diagnostics, two more advanced aspects of statistical modeling. Goodness-of-fit can be measured by goodness-of-fit statistics like Akaike information criterion (AIC) and Bayesian information criterion (BIC), which are rotuinely output in most SAS statistical procedures, including PROC GLIMMIX and PROC GENMOD.

Statistical diagnostics revolves around testing whether the underlying assumption of the statistical model built is tenable. To start with, you can first implement residual diagnostics, which has been studied to such an extent in logistic and beta regression that there is plenty literature to reference.

As I have said, statistical diagnostics is a relatively advanced topic in regression modeling. So it is generally not deemed as a must for the time being, even in the non-statistical academic setting. For instance, take a look at medical research papers and you will find out that few of them stated that the researchers had conducted statistical diagnostics in the modeling process. However, if you do conduct it, the robustness of your results is greatly enhanced.

palolix

Thank you so much for your reply Season! Thanks for pointing out the link mistake. I changed to link=identity but still getting very similar p-values and still significant. When comparing the AIC and BIC values for the genmod and glimmix models I am getting huge differences:

proc genmod data=one;
where Season=2022;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link=identity;
slice Harvest*Variety/sliceby=Harvest sliceby=Variety adjust=simulate(seed=1);
run;
/*AIC 3984, BIC 4287*/

proc glimmix data=one;
where Season=2022;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
run;
/*AIC -4342, BIC -4040*/

Do I have to use something like this to check the residuals?

proc reg data=one;
where Season=2022;
model PercDM=Harvest/dw clb;
run;

Thank you very much

Caroline

Season

@palolix wrote:

Thank you so much for your reply Season! Thanks for pointing out the link mistake. I changed to link=identity but still getting very similar p-values and still significant.

Thank you for sharing your interesting discovery! So it seems what you observe is somewhat different from what @StatDave thought would happen. @StatDave said choosing the normal model might lead to altered standard errors and P values. So now the P values are hardly affected by model misspecification. Do you mind sharing what impressions the standard errors of parameter estimates leave on you? You do not need to conduct formal hypothesis testing on the equivalence of standard errors. Simply sharing how you feel about this issue suffices.

@palolix wrote:
proc genmod data=one;
where Season=2022;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link=identity;
slice Harvest*Variety/sliceby=Harvest sliceby=Variety adjust=simulate(seed=1);
run;
/*AIC 3984, BIC 4287*/

proc glimmix data=one;
where Season=2022;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
run;
/*AIC -4342, BIC -4040*/

I wonder if you had mistakenly added in the AIC and BIC values of the model produced by PROC GLIMMIX? I have never seen negative AIC and BIC values. I just ran a PROC GLIMMIX code with my data to confirm my suspicion, and its results echoed it- my AIC and BIC values are postitive.

@palolix wrote:
Do I have to use something like this to check the residuals?

proc reg data=one;
where Season=2022;
model PercDM=Harvest/dw clb;
run;

Thank you very much

Caroline

You can use the REG procedure to let SAS compute the residuals, but you do not have to. You can stay in PROC GENMOD and add

output out=xxx(dataset name) RESRAW=r;

to your code. This produces a dataset named xxx where raw residuals are stored in the variable named r. Other kinds of residuals (e.g., deviance residuals) can also be output from PROC GENMOD.

In PROC GLIMMIX, add

output out=xxx resid=r;

to your code. This produces a dataset named xxx where residuals are stored in the variable named r. Again, other kinds of residuals can also be output. Take a look at SAS Help to find out more.

StatDave

My earlier statement:
"The penalty in that case with choosing the wrong distribution is small, affecting primarily the size of the standard errors which will, in turn, affect the significance of the tests."
does not suggest that using the normal distribution will result in larger standard errors. It merely notes that a different distribution will change the standard errors, but that change could be very small or larger. The nature and size of the change depends on the model and data. I later noted that using the normal distribution might provide a reasonable analysis and encouraged trying more than one approach.

Statistical Procedures

Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Re: Distribution for percentages in proc genmod

Proc genmod with normal distribution, log link, unstructured variance-...

genmod distribution

PROC GENMOD Error Interpretation

Built-in probability distribution in PROC GENMOD

PROC GENMOD statistics

Follow Us

What is...

Statistical Procedures

Our biggest data and AI event of the year.

Follow Us

What is...