<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>SteveDenham Tracker</title>
    <link>https://communities.sas.com/kntur85557/tracker</link>
    <description>SteveDenham Tracker</description>
    <pubDate>Wed, 13 May 2026 20:31:57 GMT</pubDate>
    <dc:date>2026-05-13T20:31:57Z</dc:date>
    <item>
      <title>Re: Assessing Variable Redundancy for Mixed Effects Modeling</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Assessing-Variable-Redundancy-for-Mixed-Effects-Modeling/m-p/969564#M48743</link>
      <description>&lt;P&gt;For more on means models, see the first volume of&amp;nbsp;&lt;EM&gt;Analysis of Messy Data&lt;/EM&gt; by Milliken and Johnson. It is just a different way of parameterizing a linear model that is particularly useful for unbalanced datasets. Any of the SAS procedures that allow CLASS statements or implement dummy coding can be used.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As far as a definition for "moderator", a quick web search offers: "&lt;SPAN&gt;In statistics, a moderator variable (or simply a moderator) is a third variable that influences the relationship between two other variables.&lt;/SPAN&gt;" In this case, it appears that weed cover is a moderator of the relationship between yield and treatment. It is a variable that you don't control in the design. The concept works best for continuous variables, but it applies to categorical interactions as well.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;SteveDenham&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 23 Jun 2025 19:24:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Assessing-Variable-Redundancy-for-Mixed-Effects-Modeling/m-p/969564#M48743</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-06-23T19:24:42Z</dc:date>
    </item>
    <item>
      <title>Re: Assessing Variable Redundancy for Mixed Effects Modeling</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Assessing-Variable-Redundancy-for-Mixed-Effects-Modeling/m-p/969330#M48724</link>
      <description>&lt;P&gt;What happens when you fit an interaction as in a means model, rather than trying to fit an effects model?&amp;nbsp; Something like&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;class trt weed_cover;
model yield=trt*weed_cover/solution;
/* RANDOM and REPEATED statements to reflect the study design */
/* LSMESTIMATE statements to get main effects and main effect differences,&lt;BR /&gt; or ESTIMATE statements to accomplish the same ends */&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Your PROC FREQ tables should identify any empty cells. If those are present, you may need to collapse the weed_cover categories to get something that can be fit. Thinking of weed cover as something other than a covariate (in the usual agricultural sense), for example as a moderator, could help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 18:44:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Assessing-Variable-Redundancy-for-Mixed-Effects-Modeling/m-p/969330#M48724</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-06-18T18:44:31Z</dc:date>
    </item>
    <item>
      <title>Re: Randomized block design and meaning of LSMEAN/STDERR</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Randomized-block-design-and-meaning-of-LSMEAN-STDERR/m-p/968164#M48654</link>
      <description>&lt;P&gt;Without going into how to fit an RCBD, I will take a swing at the large standard error that gets reported for treatment A. It comes down to the assumption of homogeneous variances that underlies GLM. GLM uses the pooled root mean squared error to calculate the standard errors, which may not be appropriate in a design where you treat BLOCK as a fixed effect, and the number of observations per block is unequal. Work through the first and third examples for PROC GLM in the SAS documentation. The first is for the analysis of an RCBD, the third is for an unbalanced ANOVA for a two-way design.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jun 2025 16:48:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Randomized-block-design-and-meaning-of-LSMEAN-STDERR/m-p/968164#M48654</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-06-04T16:48:20Z</dc:date>
    </item>
    <item>
      <title>Re: Help with Restricted Cubic Splines : Code Optimization and Graphical Output</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Help-with-Restricted-Cubic-Splines-Code-Optimization-and/m-p/968163#M48653</link>
      <description>&lt;P&gt;I don't know about "correct" but this certainly looks to be "fit for purpose." When you say optimize, I have to assume you mean "reduce how long it takes to generate results." That is going to depend on the amount of input data you have. As far as programming, I am not a good resource. The one thing I might suggest is using high performance versions of LOGISTIC and GLMSELECT (=HPLOGISTIC and =HPGENSELECT) to see if multi-threading speeds things up.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jun 2025 16:37:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Help-with-Restricted-Cubic-Splines-Code-Optimization-and/m-p/968163#M48653</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-06-04T16:37:50Z</dc:date>
    </item>
    <item>
      <title>Re: PROC POWER for Cox regression</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-POWER-for-Cox-regression/m-p/967155#M48601</link>
      <description>&lt;P&gt;One nice feature of PROC POWER is its ability to give power or sample size estimates for a variety of values of the various parameters. In this case, you may want to look at several values for Rsq, to see how sensitive sample size (or power) is to this value. If it turns out that the result is not too sensitive to changes in Rsq, then you can use the standard formula that&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;gave. If it is otherwise, then you need to look at other estimators for Rsq that reduce bias.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Wed, 21 May 2025 18:42:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/PROC-POWER-for-Cox-regression/m-p/967155#M48601</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-05-21T18:42:17Z</dc:date>
    </item>
    <item>
      <title>Re: Repeated measures model executes in MIXED but not in GLIMMIX</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965607#M48472</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/284679"&gt;@JackieJ_SAS&lt;/a&gt;&amp;nbsp;!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The MATLAB website (&lt;A href="https://www.mathworks.com/help/optim/ug/equation-solving-algorithms.html" target="_self"&gt;https://www.mathworks.com/help/optim/ug/equation-solving-algorithms.html&lt;/A&gt;&amp;nbsp;) says:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;The trust-region-dogleg algorithm is efficient because it requires only one linear solve per iteration (for the computation of the Gauss-Newton step). Additionally, the algorithm can be more robust than using the Gauss-Newton method with a line search.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;There is also extensive discussion about trust region methods in general being able to find solutions (e.g., convergence to a minimum) even when the initial starting point isn't near the final solution.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;SteveDenham&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 02 May 2025 14:59:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965607#M48472</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-05-02T14:59:32Z</dc:date>
    </item>
    <item>
      <title>Re: Model heteroscedasticity directly or use log transformation</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Model-heteroscedasticity-directly-or-use-log-transformation/m-p/965605#M48471</link>
      <description>&lt;P&gt;That is what I get from the GLIMMIX code I posted as well. The standard errors of the by-genotype residual estimates are at least an order of magnitude greater than the point estimates and the chi-squared test for homogeneity is nonsignificant (pr&amp;gt;chisq = 0.2591). The F test for genotype is also non-significant (pr &amp;gt; F = 0.1851). My conclusion is that there is insufficient data to come to any frequentist conclusion about yield as a function of genotype.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So I tried a quick look at the same design, but using the Bayesian approach provided in PROC BGLIMM. The 95% HPD intervals for the means of the genotypes all overlapped, as did the 95% HPD intervals for the variances. Same conclusions - insufficient data to come to a conclusion about yield as a function of genotype. My code for this was:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc bglimm data=yield plots=all seed=12321; 
  class genotype rep;
  model yield  = genotype/noint;
  random intercept/subject=rep(genotype) group=genotype;
  run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;No ESTIMATE statements were used to look at pairwise comparisons, as the overlapping intervals already covered all possible comparisons. Also, there is no need for a multiple comparison adjustment with this approach.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 02 May 2025 14:48:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Model-heteroscedasticity-directly-or-use-log-transformation/m-p/965605#M48471</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-05-02T14:48:35Z</dc:date>
    </item>
    <item>
      <title>Re: Model heteroscedasticity directly or use log transformation</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Model-heteroscedasticity-directly-or-use-log-transformation/m-p/965525#M48464</link>
      <description>&lt;P&gt;I would model the heterogeneous variances, at least for yield. The QQ plot is nearly straight, with only a bit of curvature at the low end. That looks to me to be acceptable. Just my opinion. For me, the main issue in this case is the nested nature of the random effects in this model. A question, are all genotypes present in each rep, so that this is an RCB design (3 plots with each having all 7 genotypes present). If that is the case, consider this heterogeneous variance approach:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;title "Consider REP as a block, model heteroscedasticity due to genotype at REP level";
proc glimmix data=yield plots=residualpanel noprofile;
  ods exclude diffplot linesplot;
  nloptions maxiter=1000;
  *lnyield=log(yield); /* if you still want a log transformation, compute lnyield in a DATA step first */
  class genotype rep;
  model yield /*lnyield*/ = genotype;
  random intercept/subject=rep group=genotype;
  lsmeans genotype/adj=simulate(seed=111) diff adjdfe=row;
  covtest homogeneity;
  ods output diffs=ppp lsmeans=mmm;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 15:59:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Model-heteroscedasticity-directly-or-use-log-transformation/m-p/965525#M48464</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-05-01T15:59:06Z</dc:date>
    </item>
    <item>
      <title>Re: Repeated measures model executes in MIXED but not in GLIMMIX</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965505#M48463</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/466903"&gt;@wateas&lt;/a&gt;&amp;nbsp;- Have you done any sensitivity checking for the PROC MIXED runs? What kind of results do you see if you use a grid search in a PARMS statement? Does it look like the convergence is insensitive to starting values, such that the final -2 log likelihood is the same no matter where you start. I respect wanting to&amp;nbsp; reflect the experimental design in the analysis - I also believe in hierarchical interaction inclusion (e.g. a two way interaction requires the main effects are included in the model, a three-way requires the main effects and the 3 two-way interactions. etc.). In any case, the comments by&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/60873"&gt;@jiltao&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/284679"&gt;@JackieJ_SAS&lt;/a&gt;&amp;nbsp;both have some approaches that might help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;/* the following was added later, and I don't know if it would help with convergence issues */&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;One more idea - what about using a spline for the repeated measures variance-covariance structure to handle the irregular spacing? See Example 51.6, "Radial Smoothing of Repeated Measures Data," in the SAS/STAT 15.2 documentation, which looks at body weights of cows taken at unequally spaced time points. Time is treated as a continuous variable in this case. What I don't see in this example is a way to compare LSMEANS at various time points, but that could possibly be handled with an AT= option in LSMEANS or LSMESTIMATE statements.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 15:11:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965505#M48463</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-05-01T15:11:47Z</dc:date>
    </item>
    <item>
      <title>Re: Repeated measures model executes in MIXED but not in GLIMMIX</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965503#M48462</link>
      <description>&lt;P&gt;Hi &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/284679"&gt;@JackieJ_SAS&lt;/a&gt;&amp;nbsp;,&amp;nbsp; inquiring minds want to know where they could find the results to that simulation, please and thank you.&lt;span class="lia-unicode-emoji" title=":thinking_face:"&gt;🤔&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 14:42:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965503#M48462</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-05-01T14:42:00Z</dc:date>
    </item>
    <item>
      <title>Re: Repeated measures model executes in MIXED but not in GLIMMIX</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965454#M48457</link>
      <description>&lt;P&gt;I have a question about your model statement. You include several terms and all of the interactions up to the four-way. Just fitting the main effects would require solving for over 40 parameters (intercept + sum of (levels for each main effect - 1)). I don't want to guess how many are involved for your full model, but I would suspect it is at least 1000. How many observations do you have? The error message implies that you don't have enough data to estimate all of the parameters. I would suggest simplifying the model to include only main effects for a preliminary run, to see if you can get convergence. Then you could add in interactions (the&amp;nbsp;@ option is very handy for this). I am hopeful that you could still get convergence. At the three-way level, you need a lot of observations to have sufficient power to detect differences, and in most cases I have worked with, a four-way interaction is indistinguishable from random noise.&amp;nbsp; So - simplify and see what occurs.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 18:57:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Repeated-measures-model-executes-in-MIXED-but-not-in-GLIMMIX/m-p/965454#M48457</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-04-30T18:57:19Z</dc:date>
    </item>
    <item>
      <title>Re: How to do nested counts</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/How-to-do-nested-counts/m-p/965077#M83963</link>
      <description>&lt;P&gt;I know this has been answered, but wouldn't a PROC MEANS (or SUMMARY) give all of the levels needed for this? It would then just be a sorting task to get something to print out. I have to admit the PROC REPORT approach more quickly yields a more esthetically pleasing output, but if the dataset is really large, with a lot of nesting, and you need to use the results for additional work, it might be worth taking a look.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Thu, 24 Apr 2025 18:52:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/How-to-do-nested-counts/m-p/965077#M83963</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-04-24T18:52:49Z</dc:date>
    </item>
    <item>
      <title>Re: WARNING: Ridging has failed to improve the loglikelihood.</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/WARNING-Ridging-has-failed-to-improve-the-loglikelihood/m-p/965074#M48407</link>
      <description>&lt;P&gt;What does running a proc freq of this sort tell you:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc freq data=have;  /* substitute your dataset name */
tables harvest*wks*variety*PercStemEndRot / cmh;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The CMH option should give you a test for association of variety with&amp;nbsp;&lt;SPAN&gt;PercStemEndRot, after adjusting for harvest and wks. In addition, it should let you know where the zeroes are in your data. Consolidating categories is probably the best way to handle this.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;SteveDenham&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;(I can't believe I am not offering some sort of exact approach to a generalized linear model, but I think this has two advantages - you will know where the zeroes are, and I believe you will still get some useful inferential information).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 24 Apr 2025 18:44:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/WARNING-Ridging-has-failed-to-improve-the-loglikelihood/m-p/965074#M48407</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-04-24T18:44:23Z</dc:date>
    </item>
    <item>
      <title>Re: Distribution for percentages in proc genmod</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Distribution-for-percentages-in-proc-genmod/m-p/964524#M48373</link>
      <description>&lt;P&gt;Just for fun, consider a modeling approach that doesn't assume homogeneity of variance. Working from your GLIMMIX code, try something like:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Changes are in &lt;FONT color="#FF0000"&gt;red&lt;/FONT&gt;  */&lt;BR /&gt;proc glimmix data=one;
where Season=2021;
PercDMp=PercDM/100;
class Harvest Variety;
model PercDMp=Harvest*Variety/ dist=beta ddfm=kr&lt;FONT color="#FF0000"&gt;2&lt;/FONT&gt;;
&lt;FONT color="#FF0000"&gt;random _residual_ / group=Variety;&lt;/FONT&gt;
lsmeans Harvest*Variety/slicediff=Harvest adjust=simulate(seed=1);
&lt;FONT color="#FF0000"&gt;covtest 'common variance' homogeneity;&lt;/FONT&gt; 
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If it turns out that the likelihood ratio test for variance homogeneity for Variety is not significant, try it again grouping by Harvest. I really don't know if either will affect your conclusions, but at least you have dealt with a common assumption (homogeneity) that may not be true for your data.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The other thing to look at is to try a different method than the default RSPL. Consider method=laplace, so that the error variance component is included in the optimization. That may yield standard errors that are more in line with expectations.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Apr 2025 17:02:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Distribution-for-percentages-in-proc-genmod/m-p/964524#M48373</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-04-17T17:02:55Z</dc:date>
    </item>
    <item>
      <title>Re: Distribution for percentages in proc genmod</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Distribution-for-percentages-in-proc-genmod/m-p/964448#M48370</link>
      <description>&lt;P&gt;Look up leaf area index (LAI) and see what methods have been used for analyzing that endpoint. LAI is the proportion (or percentage/100) of the area of a given plot or transect that is covered by at least one leaf when viewed perpendicular to the ground. It is defined on the interval (0,1), bounded away from zero and one. When I last looked at the analyses that various folks used, there were a lot of options. Some have been mentioned here (beta regression, fractional logistic regression), but I am going to throw my support to some sort of resampling with replacement. Judging from the histogram you have a lot of observations, so taking samples of a relative size of (for example) total plots/20 and generating 5000 samples should not be difficult. From that, you can appeal to the central limit theorem to get means and confidence intervals. This might be more appropriate for your long right tail and non-unimodal data, which really looks like a mixture of two distributions to me.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;No guarantees, no warranty implied.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Wed, 16 Apr 2025 17:27:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Distribution-for-percentages-in-proc-genmod/m-p/964448#M48370</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-04-16T17:27:18Z</dc:date>
    </item>
    <item>
      <title>Re: Appropriate model for non-normal distribution</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Appropriate-model-for-non-normal-distribution/m-p/963016#M48289</link>
      <description>&lt;P&gt;You need to add an NLOPTIONS statement to your code. The log likelihood is converging, but it is running into the default maximum number of iterations (=19).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Try adding:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;nloptions maxiter=500;&lt;/PRE&gt;
&lt;P&gt;to your GLIMMIX code. Alternatively, you might wish to set the ABSGCONV to 1e-6 (or something like that), as it looks like the max gradient in the iteration history is cycling around at values less than this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, with a non-normal distribution, you may want to change from method=rspl to method=quad (or method=laplace if there are issues with the number of points for adaptive quadrature). However, if you do change to method=quad, you will need to change the RANDOM statement to:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;random intercept/subject=block;&lt;/PRE&gt;
&lt;P&gt;as the quadrature method requires that the data be processed by subject. I think you are right for the rest with the gamma.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regarding plot size, does that put an upper bound on the dependent variable? If the areas being analyzed are substantially less than that upper bound, then there shouldn't be an issue, but it could possibly result in things like an lsmean being larger than any of the plots if you fit a gamma and a substantial portion of the dependent variable values are greater than one-half of the maximum plot size.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In that case, consider fitting a four parameter logistic model, with a random effect, using PROC NLMIXED. There are examples out on the interwebs for that approach. Here is something really simple:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc nlmixed data=mydata;
  parms a=0.0 b=1.0 c=1.0 d=0.0 s2u=1 s2e=1; /*replace b=1.0 with b=&amp;lt;max plot size&amp;gt; */
  pred = a + (b - a)/(1 + exp(c*(x - d))) + u;  /* u is the random intercept */
  model response ~ normal(pred, s2e);
  random u ~ normal(0, s2u) subject=block;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Just something to consider. If there are treatments applied like you have, the code gets a whole boatload more complicated, but there are examples out there on how to incorporate those.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;
&lt;P&gt;PS - the arcsine transformed data is an approximation of the logistic model, so that may be why it is looking appropriate.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Mar 2025 15:23:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Appropriate-model-for-non-normal-distribution/m-p/963016#M48289</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-03-31T15:23:42Z</dc:date>
    </item>
    <item>
      <title>Re: What test should I use?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/What-test-should-I-use/m-p/963008#M48287</link>
      <description>&lt;P&gt;I would make a couple minor changes to&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;'s PROC MIXED code, in case there is a difference over time for the two sexes:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc mixed data=have;
class sex semester;
model grade=sex semester sex*semester/ddfm=kr2 s;
repeated semester/ subject=id type=ar(1);
lsmeans sex semester/diff e;
lsmeans sex*semester/diff e; /* This should probably be modified to look at the simple effect of sex for each semester, and the simple effect of semester for each sex by using the SLICE option */
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;There is at least one other thing to consider as well: should separate variance-covariance estimates be applied by sex, to handle any differences (non-homogeneity)? If that is the case, you may need to change to PROC GLIMMIX to check on that.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 31 Mar 2025 14:22:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/What-test-should-I-use/m-p/963008#M48287</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-03-31T14:22:22Z</dc:date>
    </item>
    <item>
      <title>Re: Appropriate model for non-normal distribution</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Appropriate-model-for-non-normal-distribution/m-p/963005#M48286</link>
      <description>&lt;P&gt;I think you are doing well so far. Here are some points to consider, that I can't determine from the presentation:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Design-wise, are the plots identical in area? If not, then the binomial distribution referred to later on may not be appropriate. That may require something like a beta distribution.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Although the quantile plot you present doesn't seem too bad, your data may be zero-inflated or a hurdle model might be appropriate. Before you go down that path though, you need to think about what process could lead to excess zeroes. Stroup's text (&lt;EM&gt;Generalized Linear Mixed Models&lt;/EM&gt;, 2013) has a section that uses NLMIXED to fit excess zeroes for count data, and the code could be modified to fit distributions other than Poisson or negative binomial.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Can you share your GLIMMIX code, and the iteration history? There may be some easy tweaks to enable the model to converge.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Mon, 31 Mar 2025 14:09:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Appropriate-model-for-non-normal-distribution/m-p/963005#M48286</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-03-31T14:09:53Z</dc:date>
    </item>
    <item>
      <title>Re: Modeling zero-censored semi-continuous data with PROC SEVERITY</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Modeling-zero-censored-semi-continuous-data-with-PROC-SEVERITY/m-p/962337#M48213</link>
      <description>&lt;P&gt;So given the definition of left censoring that PROC SEVERITY uses, your response value could potentially be negative. Zero and negative values aren't supported by several of the interesting distributions available to you in SEVERITY. Would those values be meaningful, or even observable? (I only ask as I don't know what the response variable is). If the variables are not observable, then consider that the left truncation approach has some appeal. You can set the truncation value at a small non-zero value, and all of the estimates are correctly determined. The issue becomes what is the small value to use. I think a good way to choose would be to see to how many decimal places the response is measured, and then set the truncation at half that value. For example, suppose you measure the response to the nearest thousandth (=Y.YYY). Under this scheme, the truncation value of 0.0005 would guarantee that it is greater than zero, and that all observed values are included.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Or am I still missing the point here?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Thu, 20 Mar 2025 17:27:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Modeling-zero-censored-semi-continuous-data-with-PROC-SEVERITY/m-p/962337#M48213</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-03-20T17:27:07Z</dc:date>
    </item>
    <item>
      <title>Re: Modeling zero-censored semi-continuous data with PROC SEVERITY</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Modeling-zero-censored-semi-continuous-data-with-PROC-SEVERITY/m-p/962248#M48211</link>
      <description>&lt;P&gt;Is your data left censored or left truncated? The way I read the documentation, left truncation means the result is observed only if Y &amp;gt; T where T is the truncation threshold. Then the documentation&amp;nbsp; defines left censoring if it is known that the magnitude is Y&amp;lt;= C. That may have some effect on the CDF estimates. I suspect that the use of a small value for the cutpoint may then have a different effect, especially for the candidate distributions that are not defined for Y=0. I would be tempted to add a small value to all the observations, and then set the cutoff at that value, just to see what happens.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SteveDenham&lt;/P&gt;</description>
      <pubDate>Wed, 19 Mar 2025 15:25:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Modeling-zero-censored-semi-continuous-data-with-PROC-SEVERITY/m-p/962248#M48211</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2025-03-19T15:25:19Z</dc:date>
    </item>
  </channel>
</rss>

