Statistical Procedures

Programming the statistical procedures from SAS
This topic is solved.
wateas
Obsidian | Level 7

I have data on the severity of weed pressure from an agricultural study with a factorial RCBD.  The weed pressure variable is the area, in square feet, within a plot where weeds were measured to be above a certain height.  It is an atypical metric for weediness, and I have not been able to find anything like it in the literature, so I'm unsure of how to analyze this data.

 

The raw, untransformed variable "Area_Weed", in units of square feet, exhibits right-skewness.  A few of the observations are zero.  The histogram is shown below:

[Image: histogram of Area_Weed]

 

One approach that I've tried is to convert the response variable "Area_Weed", in square feet, to a proportion (weed area divided by total plot area), then apply an arcsine transformation.  The residuals plot exhibits non-constant variance.  The code and residuals are shown below:

proc glimmix data=df plots=studentpanel method=rspl;
    class Trt_Amend_App Trt_CC Block;
    model Area_Weed_PROP_ANG = Trt_Amend_App | Trt_CC / ddfm=kr2;
    random Block;
run;

[Image: residual plot for Area_Weed_PROP_ANG]

 

Considering an alternative distribution for the untransformed response variable "Area_Weed", which is positive and right-skewed, the gamma distribution seems promising, aside from the fact that it can't accommodate zeros in the data.  If I apply a transformation of adding a small constant (+1) then I can get the model to run using PROC GENMOD, with the caveat that I can't include the variable "Block" as a random effect.

proc genmod data=df plots=all;
    class Trt_Amend_App Trt_CC Block;
    model Area_Weed_PLUS = Trt_Amend_App | Trt_CC / dist=gamma link=log;
    ods output ParameterEstimates=pe;
    output out=outmean pred=mu;
run;

My questions amount to the following:

  1. Based on what I've presented here, is the gamma distribution appropriate for my response variable, which is positive, right-skewed, and includes a few zeros?
  2. If I use PROC GENMOD with a gamma distribution, how might I go about evaluating the residual plots?  And what would be a good choice for the PLOTS= option aside from ALL?
  3. If I attempt to incorporate Block as a random effect using PROC GLIMMIX, the model does not converge.  How can I address this issue?

Thank you for reading.  Please let me know if I can provide any other information.

StatDave
SAS Super FREQ

If you know the number of square feet in each plot and you measure the number of square feet with weeds above the required height, then the ratio is just a binomial proportion that you could model with an ordinary logistic model. No need to do any transformation. With a data set containing one observation per plot and with a variable containing the number of affected square feet in the plot and another with the total number of square feet in the plot, you could use the events/trials response syntax in PROC LOGISTIC to fit the model.

model Naffected/Ntotal = ... ;

You could include your BLOCK variable in the model if you have blocks of plots. Or, if you really want to use a random effect, then you could use the same model syntax, with DIST=BINOMIAL, in PROC GLIMMIX.
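A minimal sketch of that GLIMMIX variant, assuming hypothetical variable names Weed_SqFt (affected square feet) and Total_SqFt (total square feet per plot) for the events/trials pair:

```sas
/* Binomial proportion model with Block as a random effect.
   Weed_SqFt and Total_SqFt are assumed variable names:
   affected and total square feet for each plot. */
proc glimmix data=df method=laplace;
    class Trt_Amend_App Trt_CC Block;
    model Weed_SqFt/Total_SqFt = Trt_Amend_App | Trt_CC / link=logit;
    random intercept / subject=Block;
run;
```

With the events/trials syntax the binomial distribution is implied, so no DIST= option is needed on the MODEL statement.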

wateas
Obsidian | Level 7

@StatDave wrote:

If you know the number of square feet in each plot and you measure the number of square feet with weeds above the required height, then the ratio is just a binomial proportion that you could model with an ordinary logistic model. No need to do any transformation. With a data set containing one observation per plot and with a variable containing the number of affected square feet in the plot and another with the total number of square feet in the plot, you could use the events/trials response syntax in PROC LOGISTIC to fit the model.

model Naffected/Ntotal = ... ;

You could include your BLOCK variable in the model if you have blocks of plots. Or, if you really want to use a random effect, then you could use the same model syntax, with DIST=BINOMIAL, in PROC GLIMMIX.



I've only seen the binomial distribution used in the context of binary outcomes (yes/no, live/die, etc.).  It makes sense to me why I could use the binomial distribution, but I'm concerned that a reviewer would take issue with using it for a response variable that is continuous in nature.  If you happen to be aware of any examples or tutorials using a binomial distribution for a continuous response variable, anything you can pass along would be much appreciated!

PaigeMiller
Diamond | Level 26

In one of your outputs, the distribution of the residuals seems to be approximately normal, so you could use a regular old PROC GLM or PROC GLIMMIX with the normal distribution. GLM does not require normally distributed data; the requirement is normally distributed residuals, which you seem to have.

--
Paige Miller
wateas
Obsidian | Level 7
You're not concerned about that fanning in the residuals plot? I'm still getting accustomed to interpreting residuals plots, so it's hard for me to know what constitutes an issue vs. a "moderate" deviation that the model should be robust enough to handle.
Ksharp
Super User
“ my response variable, which is positive, right-skewed, and includes a few zeros?”

If Y is nonnegative and contains some zeros, you could try the Tweedie distribution.
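For reference, a minimal sketch of such a fit, assuming a SAS/STAT release in which PROC GENMOD supports DIST=TWEEDIE (with a power parameter between 1 and 2, the Tweedie distribution places positive probability on exact zeros while keeping a continuous, right-skewed positive part):

```sas
/* Tweedie GLM: accommodates exact zeros plus a skewed positive part,
   so no +1 constant is needed on Area_Weed */
proc genmod data=df plots=all;
    class Trt_Amend_App Trt_CC Block;
    model Area_Weed = Trt_Amend_App | Trt_CC / dist=tweedie link=log;
run;
```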
SteveDenham
Jade | Level 19

I think you are doing well so far. Here are some points to consider, that I can't determine from the presentation:

 

Design-wise, are the plots identical in area? If not, then the binomial distribution referred to later on may not be appropriate. That may require something like a beta distribution.

 

Although the quantile plot you present doesn't seem too bad, your data may be zero-inflated or a hurdle model might be appropriate. Before you go down that path though, you need to think about what process could lead to excess zeroes. Stroup's text (Generalized Linear Mixed Models, 2013) has a section that uses NLMIXED to fit excess zeroes for count data, and the code could be modified to fit distributions other than Poisson or negative binomial.
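As a rough illustration of that kind of modification (this is not Stroup's code; it is an intercept-only sketch of a zero-inflated gamma likelihood, where p0 is the probability of a structural zero, mu and k are the gamma mean and shape, and u is a Block random effect):

```sas
/* Zero-inflated gamma via a general log-likelihood in NLMIXED (sketch).
   Treatment effects are omitted here for brevity; they would enter
   the linear predictors for mu and/or p0. */
proc nlmixed data=df;
  parms b0=0 k=1 g0=0 s2u=1;
  p0 = logistic(g0);                 /* zero-inflation probability */
  mu = exp(b0 + u);                  /* gamma mean on the log scale */
  if Area_Weed = 0 then ll = log(p0);
  else ll = log(1 - p0) + (k - 1)*log(Area_Weed) - Area_Weed*k/mu
            - k*log(mu/k) - lgamma(k);
  model Area_Weed ~ general(ll);
  random u ~ normal(0, s2u) subject=Block;
run;
```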

 

Can you share your GLIMMIX code, and the iteration history? There may be some easy tweaks to enable the model to converge.

 

SteveDenham

wateas
Obsidian | Level 7

Thanks for your reply @SteveDenham.  To answer your first question, the plots are identical in size, or at least very similar. 

 

The code below shows how I've attempted to fit the model using a gamma distribution in GLIMMIX.   I've added a small constant (+1) to avoid zeros (variable named "Area_Weed_PLUS").  I've followed the code with a few screenshots showing the output and the failure to converge.

 

proc glimmix data=df plots=studentpanel method=rspl;
    class Trt_Amend_App Trt_CC ID_S Block;
    model Area_Weed_PLUS = Trt_Amend_App | Trt_CC / dist=gamma link=log;
    random Block;
run;

[Screenshots: GLIMMIX iteration history and output showing the failure to converge]

 

So far, the most promising model that I have is the normal distribution in GLIMMIX with the arcsine-transformed response variable.  That model shows some fanning in the residuals, but so have the other models that I've tried.  The gamma and binomial models successfully converge when Block is not included as a random effect.

Thank you for your time and attention.

SteveDenham
Jade | Level 19 (Accepted Solution)

You need to add an NLOPTIONS statement to your code. The log likelihood is converging, but it is running into the default maximum number of iterations (=19).

 

Try adding:

 

nloptions maxiter=500;

to your GLIMMIX code. Alternatively, you might wish to set the ABSGCONV to 1e-6 (or something like that), as it looks like the max gradient in the iteration history is cycling around at values less than this.

 

Also, with a non-normal distribution, you may want to change from method=rspl to method=quad (or method=laplace if there are issues with the number of points for adaptive quadrature). However, if you do change to method=quad, you will need to change the RANDOM statement to:

 

random intercept/subject=block;

as the quadrature method requires that the data be processed by subject. I think you are right for the rest with the gamma.
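Putting those pieces together, one sketch of the revised call (the statements are standard GLIMMIX syntax; the specific option values are the suggestions above, not requirements):

```sas
proc glimmix data=df plots=studentpanel method=quad;
    class Trt_Amend_App Trt_CC Block;
    model Area_Weed_PLUS = Trt_Amend_App | Trt_CC / dist=gamma link=log;
    random intercept / subject=Block;     /* form required by method=quad */
    nloptions maxiter=500 absgconv=1e-6;  /* per the convergence advice above */
run;
```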

 

Regarding plot size, does that put an upper bound on the dependent variable? If the areas being analyzed are substantially less than that upper bound, there shouldn't be an issue. But if you fit a gamma and a substantial portion of the dependent variable values exceed one-half of the maximum plot size, it could produce artifacts such as an lsmean larger than any of the plots.

 

In that case, consider fitting a four parameter logistic model, with a random effect, using PROC NLMIXED. There are examples out on the interwebs for that approach. Here is something really simple:

 

proc nlmixed data=mydata;
  parms a=0 b=1 c=1 d=0 s2=1 s2u=1; /* replace b=1 with b=<max plot size> */
  pred = a + (b - a)/(1 + exp(c*(x - d))) + u;
  model response ~ normal(pred, s2);
  random u ~ normal(0, s2u) subject=block;
run;

Just something to consider. If there are treatments applied like you have, the code gets a whole boatload more complicated, but there are examples out there on how to incorporate those.

 

SteveDenham

PS - the arcsine-transformed data approximates a logistic model, so that may be why it looks appropriate.

wateas
Obsidian | Level 7

Thanks all for the input.  @SteveDenham setting the iteration limit to a higher value allowed the procedure to execute properly, and the results look good.  The residuals look to be the best that I've seen (minimal fanning and patterns), and the results themselves are congruent with my knowledge of the data.  I don't appear to have issues with the lack of an upper bound, as only a few of the observations get close to the total plot area (~500 sq ft, highest observations are ~350 sq ft).

 

Just a few remaining questions:

  1. Is adding a small constant (+1) to the response variable kosher as a way to avoid losing the zero observations?  From my perspective, most of these plots have at least some small degree of weed pressure, so it makes sense in that regard.  I'm wondering whether there is some statistical subtlety that I may not be realizing.
  2.  Can I use the ilink option to obtain lsmeans in the original units of the response variable as shown below?
	lsmeans Trt_Amend_App / lines adjust=tukey ilink;
	lsmeans Trt_CC / lines ilink;

Thank you all!

Season
Barite | Level 11

@wateas wrote:

Thanks all for the input.  @SteveDenham setting the iteration limit to a higher value allowed the procedure to execute properly, and the results look good.  The residuals look to be the best that I've seen (minimal fanning and patterns), and the results themselves are congruent with my knowledge of the data.  I don't appear to have issues with the lack of an upper bound, as only a few of the observations get close to the total plot area (~500 sq ft, highest observations are ~350 sq ft).

 

Just a few remaining questions:

  1. Is adding a small constant (+1) to the response variable kosher for preventing the loss of data from zeros getting tossed out?  From my perspective, most of these plots have at least some small degree weed pressure, so it makes sense in that regard.  I'm wondering more so if there is some subtlety related to the statistics that I may not be realizing.

So does it mean that, for the time being, all of the observations with the dependent variable equal to zero are excluded? If that is the case, then I do not think your modeling work is done. I personally object to the addition of a small constant in the modeling process, as the choice of that constant is arbitrary. In addition, transforming the dependent variable complicates the issue. When do you add that constant (pre- or post-transformation)? If the former, things are relatively simple: you can say that you are modeling y+1 instead of y. If the latter, the interpretation of the regression coefficients is not that intuitive. That is why I suggested zero-inflated models in my last response.

In addition, I am glad that the residual plot fanning, a phenomenon caused by heteroscedasticity, has been alleviated. Heteroscedasticity is common in zero-inflated data. Moreover, it has been shown that certain models for zero-inflated continuous data (e.g., the Tobit model) are more sensitive to heteroscedasticity than their ordinary counterparts (multiple regression, in the case of the Tobit model).

While transformations like the Box-Cox or the one you have applied are ways to tackle it, an alternative approach is to model the mean and the variance jointly, namely to build joint mean and variance models. After all, the choice of transformation (e.g., arcsine instead of Box-Cox) is somewhat arbitrary.

By the way, statistical tests for the presence of heteroscedasticity have been developed in the context of multiple linear regression (e.g., the Breusch-Pagan test), but I am not sure whether such tests exist for zero-inflated models at the time of this writing. You may need to search the web if you decide to work harder on heteroscedasticity. However, few of SAS's built-in procedures seem to be capable of building joint mean and variance models. If you believe that heteroscedasticity has been resolved satisfactorily, then you may ignore the joint mean and variance modeling issue.

StatDave
SAS Super FREQ

There is no problem using the binomial distribution if your interest is in testing the effect of your predictors, whatever they are, on the probability of weed pressure existing. In that case, the response is fundamentally binary - each plot is either considered to be affected by weed pressure or not based on your height threshold. It is common for investigators to observe some continuously-valued phenomenon but only be able to measure it reliably at a binary or ordinal level and to then use an appropriate model for that categorical measure of the underlying phenomenon. At worst, using the cruder measure represents a loss of information, but this is often considered acceptable if practical interest and decision making lies on the cruder scale.

 

But, sure, if your primary interest is estimating or testing the predictor effects on the degree of weed pressure, not just its presence or absence, then you could model some continuous measure of it. But note that the gamma distribution is not bounded above and your area measure is since it can't exceed the plot area. And as mentioned, zero is not in the support of the gamma distribution.

Season
Barite | Level 11

I took a glance at the histogram of the raw data and found your data somewhat zero-inflated. To confirm this, I suggest overlaying it with a hypothetical distribution (e.g., Gamma distribution, as you have mentioned).

If zero-inflation is true, then taking it into account is necessary. Try removing the zero part first and see if the remaining portion follows a normal distribution. If it does, things are somewhat simple: Tobit models can do the job, and you can use PROC QLIM to build such models.
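A minimal PROC QLIM sketch of that Tobit idea, treating the response as left-censored at zero (assuming a release where QLIM supports the CLASS statement; note QLIM has no RANDOM statement, so Block enters as a fixed effect here):

```sas
/* Tobit model: Area_Weed treated as left-censored at 0 */
proc qlim data=df;
    class Trt_Amend_App Trt_CC Block;
    model Area_Weed = Trt_Amend_App Trt_CC Trt_Amend_App*Trt_CC Block;
    endogenous Area_Weed ~ censored(lb=0);
run;
```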

If that is not the case, then perhaps two-part or hurdle models should be applied. From my perspective, the so-called two-part or hurdle models are in essence finite mixture models and can hence be fit with PROC FMM.
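One way to express that in PROC FMM is a two-component mixture: a point mass at zero plus a gamma component for the positive part. A sketch (the PROBMODEL statement lets the zero probability depend on the treatments):

```sas
/* Zero-inflated gamma as a finite mixture:
   component 1 = gamma for the positive areas,
   component 2 = point mass at 0 (dist=constant) */
proc fmm data=df;
    class Trt_Amend_App Trt_CC Block;
    model Area_Weed = Trt_Amend_App | Trt_CC Block / dist=gamma;
    model + / dist=constant;
    probmodel Trt_Amend_App | Trt_CC;
run;
```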

Another alternative, when the normality assumption is not tenable for the positive portion, is to stick to the latent-variable paradigm of zero-inflation modeling that the Tobit model adopts. However, I am not that familiar with this topic. If you need more information, I suggest reading a concise yet comprehensive, and therefore excellent, review covering this and the issues I briefly mention above: Two-Part Models for Zero-Modified Count and Semicontinuous Data | SpringerLink.

wateas
Obsidian | Level 7

I did tinker around with two-part and zero-inflated models a bit.  What I found (and please take this with a grain of salt, as I'm a n00b) is that the models fell apart with the factorial structure of my study.  I understand that these types of models require substantial data, and when all my experimental units are parsed by the treatments, there are only 4 replicates in a blocked design.  I did have some success with zero-inflated and hurdle models in a simpler setup with a single, continuous predictor (yield).

Season
Barite | Level 11

Yes, models for zero-inflated continuous data are complex indeed. Modeling them requires both sound probability theory and statistical programming knowledge. It is up to you to decide which model to choose. However, I am not very skilled at mixed models. I suggest that you wait for @SteveDenham, who has been answering many questions on mixed models in the community and is also in this thread, to answer the two questions you raised yesterday. Good luck on your project!
