topic Re: Distribution for percentages in proc genmod in Statistical Procedures

Distribution for percentages in proc genmod

palolix — Wed, 16 Apr 2025 00:30:07 GMT

Dear SAS Community,

I am using a gamma distribution to analyze my data in proc genmod but I am not so sure if I should use this dist if my outcome variable is a percentage. I compared AIC values between normal and gamma and those for gamma were just slightly lower.

When I plotted my data this is how it looks like:

proc univariate data=one normal;
where Season=2022;
var PercDM;
histogram PercDM;
run;
quit;

This is the code I am using:

proc genmod data=one;
where Season=2022;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=gamma link= log;
slice Harvest*Variety/sliceby=Harvest sliceby=Variety adjust=simulate(seed=1);
run;

I would greatly appreciate your feedback!

Thanks

Caroline

Re: Distribution for percentages in proc genmod

StatDave — Wed, 16 Apr 2025 00:45:44 GMT

The gamma distribution is not bounded at 1 as is a proportion, so it theoretically does not apply. Is your percentage a ratio of two known counts, a numerator count and a total count, like the number of events that occurred out of some possible total? If so, then the response is actually binomial and can be modeled using the events/trial syntax in PROC LOGISTIC. If the percentage is inherently continuous, like a proportion of a chemical in a mixture, then you could consider the types of models discussed in this note using the LOGISTIC or GLIMMIX procedures. However, if the histogram that you show summarizes data in a single population (one setting of both Variety and Harvest), then it appears to be reasonably normal with its mean large enough and its variance small enough to be reasonably symmetric. The penalty in that case with choosing the wrong distribution is small, affecting primarily the size of the standard errors which will, in turn, affect the significance of the tests.

Re: Distribution for percentages in proc genmod

palolix — Wed, 16 Apr 2025 01:40:23 GMT

Thank you so much for your reply StatDave. My outcome var is continuos (percentage of dry matter). I am using proc genmod instead of glimmix because I only have fixed effects. In that case I guess I would use a normal distribution as you suggested. In the case of a continuos outcome var that is an average (fruit weight), should I also use normal instead of gamma?
Thank you StatDave!

Re: Distribution for percentages in proc genmod

StatDave — Wed, 16 Apr 2025 01:49:49 GMT

You can still use GLIMMIX if you don't have random effects - just don't include a RANDOM statement. Again, see the usage note I referred to. For your continuous proportions, you can fit the fractional logistic model using either GLIMMIX or LOGISTIC as shown there. In the example there, the RANDOM statement in GLIMMIX is used just to estimate a reasonable scale parameter (it does not add a random effect) in the same way as the SCALE=PEARSON option is used in PROC LOGISTIC in that same note. Both use the binomial distribution. But as I've said, if the data in each separate population exhibits a distribution that is reasonably normal, then an analysis using that distribution might be reasonable. You might want to try both and see what trade-offs there.

Re: Distribution for percentages in proc genmod

Season — Wed, 16 Apr 2025 05:10:00 GMT

You can try beta regression if your dependent variable lies in [0,1]. As you have said, the dependent variable of your model is a percentage, so it should lie in this interval. Beta regression is particularly suitable for handling heteroscedasticity in the model. As 57480 - Modeling continuous proportions: Normal and Beta Regression Models says, beta regression is supported in the GLIMMIX procedure. Alternatively, see 335-2011: Modeling Percentage Outcomes: The %Beta_Regression Macro for a macro of beta regression.

Re: Distribution for percentages in proc genmod

Ksharp — Wed, 16 Apr 2025 02:43:15 GMT

You could use dist=binomial and link=identity to directly build a model for a ratio/percent, but you need to restructure your data into Y=0 1:
https://support.sas.com/kb/37/228.html
proc genmod data=test descending;
class a;
model y = a / dist=binomial link=identity;
lsmeans a / diff cl;
run;

Or try Possion Regression as StatDave said.
But you also need to change your data structure.
https://support.sas.com/kb/24/188.html

proc genmod data=insure;
class car age;
model c = car age / dist=poisson link=log offset=ln;
run;

Re: Distribution for percentages in proc genmod

palolix — Wed, 16 Apr 2025 03:40:22 GMT

Thanks Ksharp,

So what if it is a continuous percentage like 45.76 and dont want to convert my data, then I guess I will use normal dist in proc genmod or glimmix as StatDave suggested

Re: Distribution for percentages in proc genmod

Ksharp — Wed, 16 Apr 2025 06:35:59 GMT

No. You can't expect percent is conforming to normal distribution when true percent is near zero or one, @Rick_SAS talk this topic at other session.
But from the graph you posted, it is near 0.5 and have bell shape , so I think use Normal or Lognormal dist is decent.

Re: Distribution for percentages in proc genmod

Season — Wed, 16 Apr 2025 07:40:07 GMT

@palolix wrote:
Thanks Ksharp,

So what if it is a continuous percentage like 45.76 and dont want to convert my data, then I guess I will use normal dist in proc genmod or glimmix as StatDave suggested

I think it might be better to take a second look at what @StatDave said. It seems that @StatDave was not telling you to build normal regression models with PROC GLIMMIX. Instead, @StatDave was referring you to PROC GLIMMIX if you chose to build fractional logistic regression models.

Your goal of keeping the dependent variable as it is (i.e., no transformation) might not be possible if you followed @StatDave's suggestion and used the fractional logistic regression model or followed mine and used the beta regression model, as both models involve the logit transformation of the dependent variable.

It is up to you to decide which model to use. But now that you have raised your preference on whether or not transform the dependent variable, I think you may take a look at a statistical field receiving relatively less attention- the regression on the area under curve of the receiver operating characteristic curve (AUC). The AUC is a well-known measure of diagnostic and predictive accuracy and, similar to the dependent variable of your model, lies in the interval [0,1]. Regression models of AUC (i.e., with the AUC as the dependent variable and variables that correlate with the diagnostic or predictive accuracy as the independent variables) have been developed. See, for instance, REGRESSION ANALYSIS FOR THE PARTIAL AREA UNDER THE ROC CURVE. You may take a look on this field to find out if there is a method that is not only capable of modeling doubly-bounded data that lie in [0,1] but also does not need transformation of the dependent variable, namely the goal you proposed.

Re: Distribution for percentages in proc genmod

SteveDenham — Wed, 16 Apr 2025 17:27:18 GMT

Look up leaf area index (LAI) and see what methods have been used for analyzing that endpoint. LAI is the proportion (or percentage/100) of the area of a given plot or transect that is covered by at least one leaf when viewed perpendicular to the ground. It is defined on the interval (0,1), bounded away from zero and one. When I last looked at the analyses that various folks used, there were a lot of options. Some have been mentioned here (beta regression, fractional logistic regression), but I am going to throw my support to some sort of resampling with replacement. Judging from the histogram you have a lot of observations, so taking samples of a relative size of (for example) total plots/20 and generating 5000 samples should not be difficult. From that, you can appeal to the central limit theorem to get means and confidence intervals. This might be more appropriate for your long right tail and non-unimodal data, which really looks like a mixture of two distributions to me.

No guarantees, no warranty implied.

SteveDenham

Re: Distribution for percentages in proc genmod

palolix — Wed, 16 Apr 2025 18:16:01 GMT

Thank you very much for your feedback Steve and the rest of the folks helping me in this post!

I analyzed the data with genmod using a normal distribution and also in glimmix using a beta dist and got almost identical adjusted p-values for the differences between varieties within each harvest. I am just surprised that when comparing the percentage of dry matter between varieties I get so many significant differences at many harvest points even with adjusted p-values (see graph).

proc genmod data=one;
where Season=2021;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link= log;
slice Harvest*Variety/sliceby=Harvest adjust=simulate(seed=1);
run;

proc glimmix data=one;
where Season=2021;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
run;

Re: Distribution for percentages in proc genmod

Season — Thu, 17 Apr 2025 02:16:18 GMT

@palolix wrote:

Thank you very much for your feedback Steve and the rest of the folks helping me in this post!

I analyzed the data with genmod using a normal distribution and also in glimmix using a beta dist and got almost identical adjusted p-values for the differences between varieties within each harvest. I am just surprised that when comparing the percentage of dry matter between varieties I get so many significant differences at many harvest points even with adjusted p-values (see graph).

proc genmod data=one;
where Season=2021;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link= log;
slice Harvest*Variety/sliceby=Harvest adjust=simulate(seed=1);
run;

Interesting discovery. But I am afraid that you are building a lognormal model instead of a normal one with this code as there is a "link=log" option specified in the MODEL statement. The "link=identity" option keeps the dependent variable as it is and build normal regression models.

You can also try out other options we suggested in this post. In addition to this, now that you have yielded similar P values in the lognormal and beta model, you can work on their goodness-of-fit and statistical diagnostics, two more advanced aspects of statistical modeling. Goodness-of-fit can be measured by goodness-of-fit statistics like Akaike information criterion (AIC) and Bayesian information criterion (BIC), which are rotuinely output in most SAS statistical procedures, including PROC GLIMMIX and PROC GENMOD.

Statistical diagnostics revolves around testing whether the underlying assumption of the statistical model built is tenable. To start with, you can first implement residual diagnostics, which has been studied to such an extent in logistic and beta regression that there is plenty literature to reference.

As I have said, statistical diagnostics is a relatively advanced topic in regression modeling. So it is generally not deemed as a must for the time being, even in the non-statistical academic setting. For instance, take a look at medical research papers and you will find out that few of them stated that the researchers had conducted statistical diagnostics in the modeling process. However, if you do conduct it, the robustness of your results is greatly enhanced.

Re: Distribution for percentages in proc genmod

SteveDenham — Thu, 17 Apr 2025 17:02:55 GMT

Just for fun, consider a modeling approach that doesn't assume homogeneity of variance. Working from your GLIMMIX code, try something like:

/* Changes are in red  */
proc glimmix data=one;
where Season=2021;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr2;
random _residual_ / group=Variety;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
covtest 'common variance' homogeneity; 
run;

If it turns out that the likelihood ratio test for variance homogeneity for Variety is not significant, try it again grouping by Harvest. I really don't know if either will affect your conclusions, but at least you have dealt with a common assumption (homogeneity) that may not be true for your data.

The other thing to look at is to try a different method than the default RSPL. Consider method=laplace, so that the error variance component is included in the optimization. That may yield standard errors that are more in line with expectations.

SteveDenham

Re: Distribution for percentages in proc genmod

palolix — Thu, 17 Apr 2025 17:13:44 GMT

Thank you so much for your reply Season! Thanks for pointing out the link mistake. I changed to link=identity but still getting very similar p-values and still significant. When comparing the AIC and BIC values for the genmod and glimmix models I am getting huge differences:

proc genmod data=one;
where Season=2022;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link=identity;
slice Harvest*Variety/sliceby=Harvest sliceby=Variety adjust=simulate(seed=1);
run;
/*AIC 3984, BIC 4287*/

proc glimmix data=one;
where Season=2022;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
run;
/*AIC -4342, BIC -4040*/

Do I have to use something like this to check the residuals?

proc reg data=one;
where Season=2022;
model PercDM=Harvest/dw clb;
run;

Thank you very much

Caroline

Re: Distribution for percentages in proc genmod

palolix — Thu, 17 Apr 2025 17:54:08 GMT

Thank you so much Steve for keep helping me. Ok, I tried that code and the likelihood ratio test for variance homogeneity for Variety was significant so I didn't change the group. I am still getting the same significant p-values. I tried laplace but same thing.

I am still surprised that I am getting these significant p-values with whatever model I try (see asteriks in graph, it doesn't look to me that those differences in perc dry matter between the varieties are significant, specially when adjusting the p-values). 1,2 and 4 in 2024 are the harvest months.

Re: Distribution for percentages in proc genmod

Season — Fri, 18 Apr 2025 16:20:24 GMT

@palolix wrote:

Thank you so much for your reply Season! Thanks for pointing out the link mistake. I changed to link=identity but still getting very similar p-values and still significant.

Thank you for sharing your interesting discovery! So it seems what you observe is somewhat different from what @StatDave thought would happen. @StatDave said choosing the normal model might lead to altered standard errors and P values. So now the P values are hardly affected by model misspecification. Do you mind sharing what impressions the standard errors of parameter estimates leave on you? You do not need to conduct formal hypothesis testing on the equivalence of standard errors. Simply sharing how you feel about this issue suffices.

@palolix wrote:
proc genmod data=one;
where Season=2022;
class Harvest Variety;
model PercDM=Harvest*Variety/type3 dist=normal link=identity;
slice Harvest*Variety/sliceby=Harvest sliceby=Variety adjust=simulate(seed=1);
run;
/*AIC 3984, BIC 4287*/

proc glimmix data=one;
where Season=2022;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
run;
/*AIC -4342, BIC -4040*/

I wonder if you had mistakenly added in the AIC and BIC values of the model produced by PROC GLIMMIX? I have never seen negative AIC and BIC values. I just ran a PROC GLIMMIX code with my data to confirm my suspicion, and its results echoed it- my AIC and BIC values are postitive.

@palolix wrote:
Do I have to use something like this to check the residuals?

proc reg data=one;
where Season=2022;
model PercDM=Harvest/dw clb;
run;

Thank you very much

Caroline

You can use the REG procedure to let SAS compute the residuals, but you do not have to. You can stay in PROC GENMOD and add

output out=xxx(dataset name) RESRAW=r;

to your code. This produces a dataset named xxx where raw residuals are stored in the variable named r. Other kinds of residuals (e.g., deviance residuals) can also be output from PROC GENMOD.

In PROC GLIMMIX, add

output out=xxx resid=r;

to your code. This produces a dataset named xxx where residuals are stored in the variable named r. Again, other kinds of residuals can also be output. Take a look at SAS Help to find out more.

Re: Distribution for percentages in proc genmod

StatDave — Fri, 18 Apr 2025 15:53:45 GMT

My earlier statement:
"The penalty in that case with choosing the wrong distribution is small, affecting primarily the size of the standard errors which will, in turn, affect the significance of the tests."
does not suggest that using the normal distribution will result in larger standard errors. It merely notes that a different distribution will change the standard errors, but that change could be very small or larger. The nature and size of the change depends on the model and data. I later noted that using the normal distribution might provide a reasonable analysis and encouraged trying more than one approach.

Re: Distribution for percentages in proc genmod

Season — Fri, 18 Apr 2025 16:21:53 GMT

Thank you for your correction! Yes, I did not remember your message precisely. Sorry for the inconvenience it might cause. I have edited my original post accordingly.

Re: Distribution for percentages in proc genmod

palolix — Fri, 18 Apr 2025 22:14:01 GMT

Thank you for your suggestion. Yes, the AIC and BIC values are correct (negative) . The estimates and SE from the glimmix model using beta dist are much lower than the ones using genmod and normal dist.

I added to the glimmix model

output out=xxx resid=r;

but nothing happened. Maybe I am missing something?

proc glimmix data=one;
where Season=2021;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
output out=xxx resid=r;
run;

Thank you very much Season!

Re: Distribution for percentages in proc genmod

Season — Sat, 19 Apr 2025 07:35:50 GMT

@palolix wrote:

Thank you for your suggestion. Yes, the AIC and BIC values are correct (negative) .

That is interesting. You might have noticed a message read "(smaller is better)" following "AIC" and "BIC" in SAS output. This indicates that models with smaller AIC or BIC values are better-fit models than those with larger values of them. Given that the AIC's and BIC's of beta regression model are both negative while their counterparts for the normal model are positive, it is demonstrated by the two statistics that the beta model fits the data dramatically better than the normal model does.

@palolix wrote:

The estimates and SE from the glimmix model using beta dist are much lower than the ones using genmod and normal dist.

That is yet another interesting discovery. So it is the case that model misspecification leads to enlarged standard errors in your data. Given that the resultant P values are similar, I speculate that the regression coefficient estimates of the normal model are larger than those of the beta model.

@palolix wrote:
I added to the glimmix model
output out=xxx resid=r;
but nothing happened. Maybe I am missing something?

proc glimmix data=one;
where Season=2021;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr;
output out=xxx resid=r;
run;

Thank you very much Season!

The OUTPUT statement writes a dataset to a SAS library where datasets are stored. In response to your code, the dataset named "xxx" is stored in the Work library. You can visit that library to find it.