Solved: Assessing Variable Redundancy for Mixed Effects Modeling

wateas · Posted 06-17-2025 12:04 PM

I'm analyzing data from a designed agricultural experiment using linear mixed models. At one point, I collected data for what I thought would serve as a covariate (weed rating) for explaining crop yield, but further analysis revealed overlap between this variable and one of my experimental treatments. A major hint that these variables might be collinear was that, when both variables were included in the model as predictors for crop yield, the treatment variable was no longer significant (it previously had been consistently significant). Note that both variables are categorical: the treatment variable has two levels, while the covariate (weed rating) is an ordered variable with 5 levels (1 to 5).

I would like to determine a standard way to demonstrate redundancy between these variables, so that I can justify not including them simultaneously in models. A chi-squared test (code below) seems to be appropriate. Is this approach adequate for establishing predictor redundancy? I'm not using the term "collinearity" here, because I'm under the impression that collinearity applies more to regression and not ANOVA. I don't think I need to determine VIF here.

title "Chi-squared test for predictor redundancy";
proc freq data=df;
    tables  treatment * other_variable / chisq;
run;

Ultimately, my goal is to establish a causal link between the response (crop yield), the treatment, and weed rating. I've accomplished part of this by treating the covariate (weed rating) variable as the response variable and modeling that with my experimental treatments. That confirmed a high degree of correlation (not sure if "correlation" is technically correct to use here, but you know what I mean) between the treatment in question and the weed rating. Now, I believe that I just need to establish the link between weed rating and yield, and justify not including the redundant treatment in the model. Thank you for reading.
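
A chi-squared test shows whether the two variables are associated, but not how strongly. The sketch below reuses the data set and variable names from the code above and adds strength-of-association statistics. It is only an illustration, and the EXACT statement is an optional extra for tables with sparse cells.

```sas
title "Association between treatment and weed rating";
proc freq data=df;
    /* CHISQ also reports Cramer's V; MEASURES adds ordinal statistics
       such as gamma and Somers' D, which suit the ordered 1-5 rating */
    tables treatment * other_variable / chisq measures;
    /* Exact test, useful when some cells have small expected counts */
    exact fisher;
run;
title;
```

A Cramer's V close to 1 would indicate that knowing the treatment nearly determines the weed rating, which is the kind of redundancy described above.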

PaigeMiller · Posted 06-17-2025 01:23 PM

A discussion of this issue: https://blogs.sas.com/content/iml/2020/01/23/collinearity-regression-collin-option.html

You can convert your categorical variables to dummy variables and then apply the PROC REG COLLIN option as well as the PROC REG VIF option.

--
Paige Miller
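
A minimal sketch of that workflow, assuming the variable names that appear later in this thread (Trt_CC, Weed_Rating, Yield_Total_Mg_ha) and a data set named df. The column numbers are an assumption that holds only if these two factors are the only effects in the MODEL statement, so check the OUTPARM= data set before reusing them.

```sas
/* Build the full design matrix; OUTPARM= maps each ColN to an effect level */
proc glmmod data=df outdesign=df_dummy outparm=df_parm noprint;
   class Trt_CC Weed_Rating;
   model Yield_Total_Mg_ha = Trt_CC Weed_Rating;
run;

/* Check which ColN belongs to which level */
proc print data=df_parm;
run;

/* Assumed layout: Col1 = intercept, Col2-Col3 = Trt_CC,
   Col4-Col8 = Weed_Rating. One column per factor (Col3, Col8)
   is left out, because a full set of dummies for a factor
   sums to the intercept column. */
proc reg data=df_dummy plots=none;
   model Yield_Total_Mg_ha = Col2 Col4-Col7 / vif collin;
quit;
```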

wateas · Posted 06-17-2025 11:24 PM

Thanks for your response. I was able to generate dummy variables for all of my experimental treatments and the weed rating. I'm finding the output to be difficult to interpret though. Might you have any idea of what might be going on from the output?

SteveDenham · Posted 06-18-2025 02:44 PM

What happens when you fit an interaction as in a means model, rather than trying to fit an effects model? Something like

class trt weed_cover;
model yield=trt*weed_cover/solution;
/* RANDOM and REPEATED statements to reflect the study design */
/* LSMESTIMATE statements to get main effects and main effect differences,
 or ESTIMATE statements to accomplish the same ends */

Your PROC FREQ tables should identify any empty cells. If any are present, you may need to collapse the weed_cover categories to get something that works. Thinking of weed cover as something other than a covariate (in the usual agricultural sense), such as a moderator, could help.

SteveDenham

wateas · Posted 06-19-2025 10:31 AM

There is one missing cell, with no weed ratings of "5" for the "no_CC" level of Trt_CC, so I might need to collapse the levels as you said.

I'm not really sure what you mean by a means vs. an effects model. What procedure would I use for a means model?
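
One way to collapse the levels might look like the sketch below. It assumes Weed_Rating is stored as a numeric variable with values 1 to 5. The names df_collapsed and Weed_Rating_c are made up for illustration.

```sas
/* Assumes Weed_Rating is numeric; merges ratings 4 and 5 into one level */
data df_collapsed;
   set df;
   if Weed_Rating >= 5 then Weed_Rating_c = 4;
   else Weed_Rating_c = Weed_Rating;   /* missing values stay missing */
run;

/* Confirm that no Trt_CC x Weed_Rating_c cell is empty any more */
proc freq data=df_collapsed;
   tables Trt_CC * Weed_Rating_c / norow nocol nopercent;
run;
```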

SteveDenham · Posted 06-23-2025 03:23 PM

For more on means models, see the first volume of Analysis of Messy Data by Milliken and Johnson. It is just a different way of parameterizing a linear model that is particularly useful for unbalanced datasets. Any of the SAS procedures that allow CLASS statements or implement dummy coding can be used.

As for a definition of "moderator", Mr. Google offers "In statistics, a moderator variable (or simply a moderator) is a third variable that influences the relationship between two other variables." In this case, it appears that weed cover is a moderator in the relationship between yield and treatment. It is a variable that you don't control in the design. The concept works best for continuous variables, but it applies to categorical interactions as well.

SteveDenham
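
To make the difference between the two parameterizations concrete, here is a hedged sketch using PROC MIXED. The Block variable and the RANDOM statement are placeholders for whatever the real study design requires, and the NOINT option in the second step is an addition that makes the fixed-effect solutions read directly as cell means.

```sas
/* Effects model: intercept + main effects + interaction.
   Overparameterized, so SAS sets some estimates to zero. */
proc mixed data=df;
   class Block Trt_CC Weed_Rating;
   model Yield_Total_Mg_ha = Trt_CC Weed_Rating Trt_CC*Weed_Rating / solution;
   random Block;   /* hypothetical blocking factor */
run;

/* Means model: one parameter per observed Trt_CC x Weed_Rating cell */
proc mixed data=df;
   class Block Trt_CC Weed_Rating;
   model Yield_Total_Mg_ha = Trt_CC*Weed_Rating / noint solution;
   random Block;   /* hypothetical blocking factor */
run;
```

In the means model an empty cell simply has no parameter, which is why this form tends to be easier to work with for unbalanced data.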

PaigeMiller · Posted 06-19-2025 06:51 AM

When you use dummy variables to replace a categorical variable, and there are n levels for this categorical variable, you want to use n-1 dummy variables. So what happens if you re-run the analysis and take col10 and col15 out of this analysis?

--
Paige Miller

wateas · Posted 06-19-2025 10:21 AM

Part of the issue, I think, is the default designation of the reference level. For the factor "Trt_CC", the reference level being "no_CC" is good, but for "Weed_Rating", the reference level probably shouldn't be 5, because that is the rating for the most weedy plots and is exerting the most influence. I'll figure out how to set the reference level, and I think I might already have an idea of how to exclude it from the dummy variable. BRB
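
One possible way to set the reference levels and get n-1 dummy variables per factor in a single step is PROC GLMSELECT with reference coding. This is a sketch rather than the method used in this thread. It assumes the formatted values 'no_CC' and '1' exist in the data, and that the _GLSMOD macro variable created alongside OUTDESIGN= holds the design column names without the intercept.

```sas
/* Reference coding: the REF= level of each factor gets no dummy column */
proc glmselect data=df noprint outdesign(addinputvars)=df_ref;
   class Trt_CC(ref='no_CC') Weed_Rating(ref='1') / param=ref;
   model Yield_Total_Mg_ha = Trt_CC Weed_Rating / selection=none;
run;

/* Inspect the generated dummy variable names */
%put Design columns: &_GLSMOD;

proc reg data=df_ref plots=none;
   model Yield_Total_Mg_ha = &_GLSMOD / vif collin;
quit;
```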

PaigeMiller · Posted 06-19-2025 10:34 AM

@wateas wrote:

Part of the issue, I think, is the default designation of the reference level. For the factor "Trt_CC", the reference level being "no_CC" is good, but for "Weed_Rating", the reference level probably shouldn't be 5, because that is the rating for the most weedy plots and is exerting the most influence. I'll figure out how to set the reference level, and I think I might already have an idea of how to exclude it from the dummy variable. BRB

I disagree completely, reference level has no impact here.

--
Paige Miller

wateas · Posted 06-19-2025 10:54 AM

Ok, so the results look better with the respective reference levels of each variable excluded from the procedure.

/* Remove column 10 and 15 (automatically designated as reference levels by software) */
proc reg data=df_dummy plots=none;
   model Yield_Total_Mg_ha =  col9 col11 col12 col13 col14 / collin;
   ods select ParameterEstimates CollinDiag;
   ods output CollinDiag = CollinReg;
quit;

The results look a lot better. However, I'm somewhat surprised at the low condition index for column 6 (weed rating 4). The reason I believe there is collinearity is that when I include both terms ("Trt_CC" and Weed Rating) in the model, neither is significant, but included separately they are significant.

PaigeMiller · Posted 06-19-2025 02:22 PM

The last eigenvalue indicates the collinearity problem. Most of the variability of the intercept and col12 is explained by the last eigenvalue. This is probably caused by col12 being almost all zeros (or all ones), meaning the two are highly correlated. In addition, there are other columns with values > 0.7 in the last row; for these two other columns, most of the variability is also in the last eigenvalue, but not as much as for col12.

--
Paige Miller
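
Two quick checks can help test this reading; the sketch below reuses the data set and column names from the PROC REG step above. The first step shows how often each dummy column equals 1. The second repeats the diagnostics with the intercept adjusted out, which is an optional extra and not something requested in the posts above.

```sas
/* Share of ones in each dummy column; a mean near 0 or 1
   flags a nearly constant column */
proc means data=df_dummy n mean sum;
   var col9 col11-col14;
run;

/* COLLINOINT repeats the collinearity diagnostics with the
   intercept adjusted out; VIF adds variance inflation factors */
proc reg data=df_dummy plots=none;
   model Yield_Total_Mg_ha = col9 col11 col12 col13 col14 / vif collinoint;
quit;
```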

wateas · Posted 06-20-2025 03:43 PM

I'm having difficulty understanding your explanation and deciphering the output overall. I appreciate the blog link you previously sent with explanation about the Condition Index, but any further documentation to help understand the other metrics / indices would be much appreciated. Thank you.

I can see that the proportion of variability is often > 0.7 in the last row (row 6), but what throws me off is that this is only the case for the intercept column and columns 11-14. Columns 11-14 represent the dummy variables for weed rating scores of 1, 2, 3, and 4 (column 15, or dummy var for weed rating of 5, was removed). What does it mean for one level of a predictor to explain the variation in another level of the same predictor? Also, the proportion of variation is low where row 6 and column 9 intersect, where column 9 is one of the levels of the treatment that I believe is collinear with the weed rating. So, as far as I can understand the output, it would appear that there is collinearity only between levels of the weed rating and not between the two factors. I don't know what to do with that, and I feel like my understanding must be incorrect. I appreciate any clarification that you can provide here.
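
For reference, the quantities in the COLLIN table can be written out as follows. This is the standard Belsley-Kuh-Welsch formulation as I understand it, with notation introduced here only for illustration: $X$ is the design matrix including the intercept, with columns scaled to unit length; $\lambda_1 \ge \dots \ge \lambda_p$ are the eigenvalues of $X^\top X$; and $v_{jk}$ is the $j$-th element of the $k$-th eigenvector.

```latex
% Condition index for row k of the table
\eta_k = \sqrt{\lambda_1 / \lambda_k}

% Variance of the j-th coefficient, split across eigenvalues
\operatorname{Var}(b_j) = \sigma^2 \sum_{k=1}^{p} \frac{v_{jk}^2}{\lambda_k}

% Proportion of Var(b_j) attributed to eigenvalue k (row k, column j)
\pi_{kj} = \frac{v_{jk}^2 / \lambda_k}{\sum_{l=1}^{p} v_{jl}^2 / \lambda_l}
```

Each column of proportions sums to 1 down the rows. A commonly cited rule of thumb flags a row with a large condition index in which two or more coefficients have proportions above about 0.5. Those coefficients are the ones involved in the near-dependency. In this output, that would be the intercept and the weed rating dummies, which sum to something close to the intercept column whenever the omitted rating is rare.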

Ksharp · Posted 06-20-2025 08:47 PM

Whether your Y variable is continuous or binary, you could use CORRB to check the correlation between any two estimated coefficients.

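/* Continuous response: CORRB prints the correlation matrix of the parameter estimates */
proc genmod data=sashelp.heart;
class sex;
model height=weight sex ageatstart/corrb ;
run;

/* Binary response: DIST=BINOMIAL is needed for a character response with EVENT= */
proc genmod data=sashelp.heart;
class sex;
model status(event='Dead')=height weight sex ageatstart/dist=binomial link=logit corrb ;
run;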

And for a mixed model, you can check the covariance matrix with the COVTEST statement of PROC GLIMMIX:

https://support.sas.com/kb/40/724.html
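
The same CORRB idea carries over to a mixed model. The sketch below uses PROC MIXED with the variable names from this thread. The Block variable and the RANDOM statement are placeholders, since the actual study design is not given here.

```sas
proc mixed data=df;
   class Block Trt_CC Weed_Rating;
   /* COVB and CORRB print the covariance and correlation matrices
      of the fixed-effect estimates */
   model Yield_Total_Mg_ha = Trt_CC Weed_Rating / solution covb corrb;
   random Block;   /* hypothetical blocking factor */
run;
```

Large absolute correlations between the Trt_CC estimate and the Weed_Rating estimates would point to the two factors carrying overlapping information.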
