SAS Support Communities

StatDave

There is no problem using the binomial distribution if your interest is in testing the effect of your predictors, whatever they are, on the probability of weed pressure existing. In that case, the response is fundamentally binary: each plot is considered either affected by weed pressure or not, based on your height threshold. It is common for investigators to observe some continuously-valued phenomenon but only be able to measure it reliably at a binary or ordinal level, and then to use an appropriate model for that categorical measure of the underlying phenomenon. At worst, using the cruder measure represents a loss of information, but this is often considered acceptable if practical interest and decision making lie on the cruder scale. But, sure, if your primary interest is estimating or testing the predictor effects on the degree of weed pressure, not just its presence or absence, then you could model some continuous measure of it. Note, however, that the gamma distribution is not bounded above, while your area measure is, since it can't exceed the plot area. And as mentioned, zero is not in the support of the gamma distribution.

StatDave

If you know the number of square feet in each plot and you measure the number of square feet with weeds above the required height, then the ratio is just a binomial proportion that you could model with an ordinary logistic model. No need to do any transformation. With a data set containing one observation per plot, one variable containing the number of affected square feet in the plot, and another containing the total number of square feet in the plot, you could use the events/trials response syntax in PROC LOGISTIC to fit the model: model Naffected/Ntotal = ... ; You could include your BLOCK variable in the model if you have blocks of plots. Or, if you really want to use a random effect, then you could use the same model syntax, with DIST=BINOMIAL, in PROC GLIMMIX.
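
A minimal sketch of both options, assuming a data set named PLOTS with one row per plot and a hypothetical predictor named TREATMENT (neither name comes from the original question; substitute your own predictors):

```sas
/* Fixed-effects version: BLOCK enters as an ordinary predictor.     */
/* PLOTS and TREATMENT are placeholder names.                        */
proc logistic data=plots;
   class block treatment / param=glm;
   model Naffected/Ntotal = block treatment;   /* events/trials response */
run;

/* Random-effects version: a random intercept for each block.        */
proc glimmix data=plots;
   class block treatment;
   model Naffected/Ntotal = treatment / dist=binomial link=logit;
   random intercept / subject=block;
run;
```

The first form treats blocks as fixed nuisance effects; the second estimates a variance component for blocks instead.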

StatDave

Since you have repeated values for each ID, you could consider fitting a Generalized Estimating Equations model using PROC GEE. It is tolerant of the missing values. You will need to decide what distribution your response (GRADE) has. You would need to use the long format of the data, like in your original post, with one observation for each individual grade so that all grades are in a single variable, GRADE. For example, assuming that you use the normal distribution, these statements would fit the model (note that the SUBJECT= variable must also appear in the CLASS statement): proc gee; class id sex; model grade=sex / dist=normal link=identity; repeated subject=id; run; The test for the SEX effect is a test of whether the sexes differ with respect to grades.

StatDave

Also, for the ordinary linear regression model, you can use the HETERO statement in PROC QLIM or PROC HPQLIM in SAS/ETS to model the variance separately from the mean. Various link functions for the variance can be specified. The same capability is available in SAS Viya in PROC CQLIM. See the discussion of heterogeneity in the Details section of the procedure documentation.
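
As a rough sketch of what that could look like in PROC QLIM, assuming a response Y, mean-model predictors X1 and X2, and a variance predictor Z (all placeholder names, as is the data set name MYDATA):

```sas
/* Linear regression with a separately modeled variance.             */
/* All data set and variable names here are hypothetical.            */
proc qlim data=mydata;
   model y = x1 x2;              /* mean model                         */
   hetero y ~ z / link=exp;      /* variance model, exponential link   */
run;
```

Check the HETERO statement documentation for the exact form of the variance function under each LINK= choice before interpreting the variance parameters.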

StatDave

If you consider the plots as the units being measured, not the individual plants, and you are simply counting the number of emergent plants in each plot at each time point, then this sounds like repeated measures count data that could be modeled with, for instance, a GEE model. In PROC GEE you would specify an appropriate response distribution, like Poisson or negative binomial, in the DIST= option in the MODEL statement. In the REPEATED statement, specify the plot identifier in the SUBJECT= option. It never hurts to also specify the time point variable in the WITHIN= option.
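
Something along these lines, assuming a long-format data set named EMERGENCE with variables PLOT, TIME, TREATMENT, and COUNT (all hypothetical names), and an exchangeable working correlation chosen only for illustration:

```sas
/* Repeated-measures count model fit by GEE.                          */
/* Data set name, variable names, and TYPE=EXCH are assumptions.      */
proc gee data=emergence;
   class plot time treatment;
   model count = treatment time / dist=poisson link=log;
   repeated subject=plot / within=time type=exch;
run;
```

Changing DIST=POISSON to DIST=NEGBIN fits the negative binomial alternative if the counts look overdispersed.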

StatDave

Practically speaking, it might be unnecessary to use the exact method. For one thing, the usual rule of thumb that the chi-square test is not valid when many expected (not observed) cell counts are less than 5 is considered by many statisticians to be overly conservative. Further, Stokes, Davis, and Koch (2012, Categorical Data Analysis Using the SAS System, Third Edition) state that the exact method usually produces more conservative results and recommend using the exact method only when sample sizes are small and the p-values from the usual (asymptotic) tests are less than 0.10. If the usual p-values are larger than 0.15, they suggest that the exact results are likely to be about the same. That being said, the decision is yours and the Monte Carlo approximation is at least worth a try.
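
If you do want to try the Monte Carlo approximation, a sketch like the following should do it. The data set and variable names are placeholders, and the number of samples and the seed are arbitrary choices:

```sas
/* Asymptotic chi-square tests plus a Monte Carlo estimate of the     */
/* exact Fisher p-value. MYDATA, ROWVAR, and COLVAR are hypothetical. */
proc freq data=mydata;
   tables rowvar*colvar / chisq expected;   /* EXPECTED shows expected counts */
   exact fisher / mc n=100000 seed=2025;    /* Monte Carlo instead of full enumeration */
run;
```

The EXPECTED option lets you see how many expected counts actually fall below 5 before deciding whether the exact test is needed at all.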

StatDave

The two methods are very different. The Firth method is still an iterative maximum likelihood estimation method, with just a small tweak to add a penalty to the likelihood function. See the "Details: Iterative Algorithms for Model Fitting" section of the LOGISTIC documentation. The exact method uses conditional methods, generating a conditional distribution and likelihood function. See the "Details: Exact Conditional Logistic Regression" section. The exact method can be very computationally intensive and is generally feasible only with smaller, simpler data sets. But practically speaking, yes, they are both methods that are frequently used to deal with the problems occurring with sparse data. And the Firth method is less computationally challenging, so it is often more feasible. However, as with all iterative methods, both of these methods can fail depending on the data and model.
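
For reference, requesting the Firth penalty is just a MODEL statement option. A minimal sketch, assuming a binary response Y coded 0/1 and a predictor TREATMENT in a data set named MYDATA (all placeholder names):

```sas
/* Firth penalized likelihood with profile-likelihood intervals.      */
/* MYDATA, Y, and TREATMENT are hypothetical names.                   */
proc logistic data=mydata;
   class treatment / param=ref;
   model y(event='1') = treatment / firth clparm=pl clodds=pl;
run;
```

Profile-likelihood confidence limits are requested here because Wald limits can behave poorly in the sparse-data situations where the Firth method is typically used.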

StatDave

As discussed in this note, you cannot use the ESTIMATE statement to obtain estimated means from zero-inflated models fit by PROC GENMOD. See the methods shown in the note to estimate means under the model. You can search the SAS Notes and Samples at https://support.sas.com/en/knowledge-base.html and the list of Frequently-Asked-for Statistics at https://support.sas.com/kb/30/333.html .

StatDave

Assuming that your data set is small so that exact methods are feasible, try using the EXACT statement to see if you can get a better estimate of the odds ratio. Add this statement: exact treatment / estimate=both;
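
In context, that could look something like this, assuming your response is a 0/1 variable named Y in a data set named MYDATA (both placeholder names; TREATMENT is the predictor from your model):

```sas
/* Exact conditional analysis of the TREATMENT effect.                */
/* EXACTONLY skips the usual unconditional fit. MYDATA and Y are      */
/* hypothetical names.                                                */
proc logistic data=mydata exactonly;
   class treatment / param=ref;
   model y(event='1') = treatment;
   exact treatment / estimate=both;   /* exact parameter and odds ratio estimates */
run;
```

Drop the EXACTONLY option if you want the ordinary maximum likelihood results alongside the exact ones for comparison.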

StatDave · 02-17-2025

See this note that shows ways the difference in difference (DID) analysis can be done with binary response data.

StatDave · 02-17-2025

Alternate methods for estimating and testing the risk difference in matched pairs data are discussed and illustrated in this note. In addition to the model-based approach using macros with a GEE model without the identity link, the note also discusses a non-model-based approach using the COMMONRISKDIFF option in PROC FREQ. Similarly for non-clustered data, the methods are discussed in this note. The examples there do not use a GEE model, but the NLMeans, NLEST, or Margins macros can all be used with a GEE model as shown in the above note. The NLMIXED procedure could also be used by adding a random effect for the clusters, though this is not a GEE model. Both notes also mention the potential problem of using the identity link.

StatDave · 02-17-2025

Regarding the concern over how the model is fit using the identity link: the GEE estimation algorithm applies regardless of the distribution and link function. This algorithm is shown in the "Details: Generalized Estimating Equations: Fitting Algorithm" section of the GENMOD documentation. There is no requirement to use the canonical link function (which is the logit link for the binomial distribution). However, it is certainly true that the fitting algorithm could fail when the identity link is used with the binomial distribution, because this link function does not ensure that the fitted values are valid probabilities as expected for a binomial response. When using the identity link with the binomial distribution, it is therefore important to examine the fit to be sure that proper convergence was obtained. Even if no errors are issued, there could be signs of improper convergence, such as gradient values not close to zero (use the ITPRINT option) or large parameter standard errors, which should be quite small for binomial models.
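
A sketch of such a fit with the iteration history turned on, assuming a binary response RESPONSE coded 0/1, a predictor TREATMENT, and a cluster identifier ID in a data set named PAIRS (all hypothetical names, as is the exchangeable working correlation):

```sas
/* Binomial GEE with the identity link, so that the TREATMENT         */
/* parameter is a risk difference. All names are placeholders.        */
proc genmod data=pairs;
   class id treatment;
   model response(event='1') = treatment
         / dist=binomial link=identity itprint;   /* ITPRINT shows iteration history */
   repeated subject=id / type=exch;
run;
```

Inspect the iteration history and the parameter standard errors in the output before trusting the estimated risk difference.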

StatDave · 02-17-2025

As shown in the GENMOD documentation of the LSMEANS statement, the standard error estimate is sqrt(LV(β)L') where V is the estimated variance-covariance matrix of the model parameters and L is the hypothesis matrix which is simply the vector (1 -1) in this case. You could obtain the same result using the ESTIMATE statement. The GENMOD documentation of the ESTIMATE statement shows the same form of the standard error estimate and notes that, for a GEE model, V(β) (written there as Σ) is the empirical estimate of the covariance matrix. The formula for the empirical estimator is shown in the GENMOD documentation in the "Details: Generalized Estimating Equations: Parameter Estimate Covariances" section.
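
To spell that out for the difference of two parameters, write the relevant 2x2 block of V as entries v_11, v_12, v_22 (notation introduced here only for illustration, not taken from the documentation). With L = (1 -1), the formula expands to the familiar variance of a difference:

```latex
\[
L = \begin{pmatrix} 1 & -1 \end{pmatrix}, \qquad
V = \begin{pmatrix} v_{11} & v_{12} \\ v_{12} & v_{22} \end{pmatrix}
\]
\[
L V L' = v_{11} - 2\,v_{12} + v_{22}
       = \operatorname{Var}(\hat\beta_1) + \operatorname{Var}(\hat\beta_2)
         - 2\,\operatorname{Cov}(\hat\beta_1, \hat\beta_2)
\]
\[
\operatorname{SE}(\hat\beta_1 - \hat\beta_2) = \sqrt{v_{11} - 2\,v_{12} + v_{22}}
\]
```

For a GEE model, the v entries come from the empirical covariance estimate rather than the model-based one.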

StatDave · 02-14-2025

With the nonpositional syntax, both the reference level and the sorted order of the CLASS variable levels matter. Note that the order of the CLASS levels can be changed with the ORDER= option in the CLASS statement. So, syntax like sex [1,2] says that you want to multiply 1 by the parameter associated with the 2nd ordered level of SEX. If the default ORDER= applies (as it does here), the sorted order of SEX is F then M (or 1 then 2 with your recoding). So the 2nd ordered level of SEX is M (or 2). Since the reference level for SEX is F (or 2), it has a parameter of 0 (the parameter for M (or 1) is the estimated parameter). If you change the ordering of the levels or the reference level, then what gets multiplied by 1 might change.

StatDave · 02-13-2025

Nothing changes for either the positional or the nonpositional syntax as long as you retain the Female level (whatever its coding) as the reference level.
