wateas
Obsidian | Level 7

I'm analyzing data from a designed agricultural experiment using linear mixed models.  At one point, I collected data for what I thought would serve as a covariate (weed rating) for explaining crop yield, but further analysis revealed overlap between this variable and one of my experimental treatments.  A major hint that these variables might be collinear was that, when both variables were included in the model as predictors for crop yield, the treatment variable was no longer significant (it previously had been consistently significant).  Note that both variables are categorical: the treatment variable has two levels, while the covariate (weed rating) is an ordered variable with 5 levels (1 to 5).

 

I would like to determine a standard way to demonstrate redundancy between these variables, so that I can justify not including them simultaneously in models.   A chi-squared test (code below) seems to be appropriate.  Is this approach adequate for establishing predictor redundancy?  I'm not using the term "collinearity" here, because I'm under the impression that collinearity applies more to regression and not ANOVA.  I don't think I need to determine VIF here.

title "Chi-squared test for predictor redundancy";
proc freq data=df;
    tables  treatment * other_variable / chisq;
run;

Ultimately, my goal is to establish a causal link between the response (crop yield), the treatment, and weed rating.  I've accomplished part of this by treating the covariate (weed rating) variable as the response variable and modeling that with my experimental treatments.  That confirmed a high degree of correlation (not sure if "correlation" is technically correct to use here, but you know what I mean) between the treatment in question and the weed rating.   Now, I believe that I just need to establish the link between weed rating and yield, and justify not including the redundant treatment in the model.  Thank you for reading.
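
For a 2 x 5 table with an ordered column variable, the PROC FREQ call in the question could be extended along the following lines. This is only a sketch: it assumes the variables are named Trt_CC and Weed_Rating (the names used later in the thread) and that the data set is df. A strong association here shows that the two factors overlap, but it does not by itself settle which one belongs in the yield model.

```sas
title "Association between treatment and weed rating";
proc freq data=df;
   /* Trt_CC has 2 levels; Weed_Rating is ordered 1-5 (assumed names) */
   tables Trt_CC * Weed_Rating
          / chisq      /* Pearson chi-square, plus Cramer's V */
            expected   /* expected counts, to spot sparse cells */
            trend      /* Cochran-Armitage trend test (needs a 2 x C table) */
            measures;  /* ordinal measures such as gamma and Kendall's tau-b */
   exact fisher;       /* exact test, useful when expected counts are small */
run;
title;
```

The EXPECTED output is worth checking first, because the chi-square approximation is unreliable when many cells have small expected counts.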

1 ACCEPTED SOLUTION

PaigeMiller
Diamond | Level 26

A discussion of this issue: https://blogs.sas.com/content/iml/2020/01/23/collinearity-regression-collin-option.html

 

You can convert your categorical variables to dummy variables and then apply the PROC REG COLLIN option as well as the PROC REG VIF option.

--
Paige Miller
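
A minimal sketch of that workflow, under stated assumptions: the data set is df, the factors are Trt_CC and Weed_Rating, and the response is Yield_Total_Mg_ha (names taken from later posts). PROC GLMSELECT with OUTDESIGN= is one way to build the dummy columns; it is not necessarily the method used to create the col9 to col15 variables that appear later in the thread.

```sas
/* Build dummy (design) columns. PARAM=REF drops the reference level, */
/* so a factor with n levels yields n-1 columns.                      */
proc glmselect data=df outdesign(addinputvars)=df_dummy noprint;
   class Trt_CC Weed_Rating / param=ref;
   model Yield_Total_Mg_ha = Trt_CC Weed_Rating / selection=none;
run;

/* GLMSELECT stores the generated column names in the macro variable */
/* _GLSMOD; print it to the log to confirm what was created.         */
%put &=_GLSMOD;

/* Collinearity diagnostics on the dummy columns */
proc reg data=df_dummy plots=none;
   model Yield_Total_Mg_ha = &_GLSMOD / vif collin;
run;
quit;
```

Using reference parameterization here avoids the exact linear dependence that arises when all n dummy columns of a factor are entered together with an intercept.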


12 REPLIES
wateas
Obsidian | Level 7

Thanks for your response.  I was able to generate dummy variables for all of my experimental treatments and the weed rating.  I'm finding the output difficult to interpret, though.  Might you have any idea of what is going on from the output?

20250617 - SAS screenshot.png

SteveDenham
Jade | Level 19

What happens when you fit an interaction as in a means model, rather than trying to fit an effects model?  Something like

class trt weed_cover;
model yield=trt*weed_cover/solution;
/* RANDOM and REPEATED statements to reflect the study design */
/* LSMESTIMATE statements to get main effects and main effect differences,
or ESTIMATE statements to accomplish the same ends */

Your PROC FREQ tables should identify any empty cells. If those are present, you may need to collapse the weed_cover categories to get something that works. Thinking of weed cover as something other than a covariate (in the usual agricultural sense), such as a moderator, could help.

 

SteveDenham
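
The fragment above could be filled out roughly as follows. This is a sketch, not the poster's full code: the choice of PROC GLIMMIX, the data set name df, the block variable, and the RANDOM statement are all hypothetical stand-ins for the real study design, and NOINT is added here so that each parameter is a cell mean.

```sas
proc glimmix data=df;
   class block trt weed_cover;            /* block is a hypothetical design factor */
   /* Means model: only the interaction, no intercept,   */
   /* so there is one parameter per observed cell        */
   model yield = trt*weed_cover / noint solution;
   random intercept / subject=block;      /* placeholder; replace with the real design */
   lsmeans trt*weed_cover / cl;           /* cell means with confidence limits */
run;
```

Main-effect comparisons would then be built from these cell means with LSMESTIMATE or ESTIMATE statements, as the post suggests. Any empty cell has no parameter, so contrasts involving it cannot be estimated.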

wateas
Obsidian | Level 7

There is one missing cell, with no weed ratings of "5" for the "no_CC" level of Trt_CC, so I might need to collapse the levels as you said.

 

I'm not really sure what you mean by a means vs. an effects model.  What procedure would I use for a means model?
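
Collapsing the sparse category could look something like the sketch below. It assumes Weed_Rating is stored as a numeric variable with values 1 to 5 and that merging rating 5 into rating 4 is agronomically sensible; the names df_collapsed and Weed_Rating_c are made up for illustration.

```sas
data df_collapsed;
   set df;
   /* MIN() ignores missing arguments, so handle missing ratings explicitly */
   if missing(Weed_Rating) then Weed_Rating_c = .;
   else Weed_Rating_c = min(Weed_Rating, 4);   /* ratings 4 and 5 become 4 */
run;

/* Confirm that no Trt_CC x rating cell is empty after collapsing */
proc freq data=df_collapsed;
   tables Trt_CC * Weed_Rating_c / norow nocol nopercent;
run;
```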

SteveDenham
Jade | Level 19

For more on means models, see the first volume of Analysis of Messy Data by Milliken and Johnson. It is just a different way of parameterizing a linear model that is particularly useful for unbalanced datasets. Any of the SAS procedures that allow CLASS statements or implement dummy coding can be used.

 

As far as a definition for "moderator", Mr. Google offers: "In statistics, a moderator variable (or simply a moderator) is a third variable that influences the relationship between two other variables." In this case, it appears that weed cover is a moderator in the relationship between yield and treatment. It is a variable that you don't control in the design. The idea really works better for continuous variables, but it applies to categorical interactions as well.

 

SteveDenham

PaigeMiller
Diamond | Level 26

When you use dummy variables to replace a categorical variable, and there are n levels for this categorical variable, you want to use n-1 dummy variables. So what happens if you re-run the analysis and take col10 and col15 out of this analysis?

--
Paige Miller
wateas
Obsidian | Level 7

Part of the issue, I think, is the default designation of the reference level.  For the factor "Trt_CC", the reference level being "no_CC" is good, but for "Weed_Rating", the reference level probably shouldn't be 5, because that is the rating for the most weedy plots and is exerting the most influence.  I'll figure out how to set the reference level, and I think I might already have an idea of how to exclude it from the dummy variable.  BRB

PaigeMiller
Diamond | Level 26

@wateas wrote:

Part of the issue, I think, is the default designation of the reference level.  For the factor "Trt_CC", the reference level being "no_CC" is good, but for "Weed_Rating", the reference level probably shouldn't be 5, because that is the rating for the most weedy plots and is exerting the most influence.  I'll figure out how to set the reference level, and I think I might already have an idea of how to exclude it from the dummy variable.  BRB


I disagree completely; the reference level has no impact here.

--
Paige Miller
wateas
Obsidian | Level 7

Ok, so the results look better with the respective reference levels of each variable excluded from the procedure.

 

/* Remove column 10 and 15 (automatically designated as reference levels by software) */
proc reg data=df_dummy plots=none;
   model Yield_Total_Mg_ha =  col9 col11 col12 col13 col14 / collin;
   ods select ParameterEstimates CollinDiag;
   ods output CollinDiag = CollinReg;
quit;

The results look a lot better.  However, I'm somewhat surprised at the low condition index for column 6 (weed rating 4).  The reason I believe there is collinearity is that when I include both terms ("Trt_CC" and Weed Rating) in the model, neither is significant, but each is significant when included separately.

20250619 - SAS screenshot 2.png

PaigeMiller
Diamond | Level 26

The last eigenvalue indicates the collinearity problem. Most of the variability of the intercept and col12 is explained by the last eigenvalue. This is probably caused by col12 being almost all zeros (or all ones), meaning they are highly correlated. In addition, there are other columns with values > 0.7 in the last row; for these two columns, most of the variability is also in the last eigenvalue, but not as much as for col12.

--
Paige Miller
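
Because the intercept loads heavily on that last eigenvalue, one variation that was not tried in the thread is to request the intercept-adjusted diagnostics with the COLLINOINT option, together with VIF. This is a sketch that reuses the df_dummy data set and column names from the earlier post.

```sas
proc reg data=df_dummy plots=none;
   /* COLLINOINT: collinearity diagnostics with the intercept adjusted out */
   /* VIF: variance inflation factor for each dummy column                 */
   model Yield_Total_Mg_ha = col9 col11 col12 col13 col14 / vif collinoint;
run;
quit;
```

If the large condition index disappears once the intercept is adjusted out, the dependence involves the intercept and the weed-rating dummies rather than the treatment column.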
wateas
Obsidian | Level 7

I'm having difficulty understanding your explanation and deciphering the output overall. I appreciate the blog link you previously sent with explanation about the Condition Index, but any further documentation to help understand the other metrics / indices would be much appreciated. Thank you.

 

I can see that the proportion of variability is often > 0.7 in the last row (row 6), but what throws me off is that this is only the case for the intercept column and columns 11-14.  Columns 11-14 represent the dummy variables for weed rating scores of 1, 2, 3, and 4 (column 15, the dummy variable for a weed rating of 5, was removed).  What does it mean for one level of a predictor to explain the variation in another level of the same predictor?  Also, the proportion of variation is low where row 6 and column 9 intersect, where column 9 is one of the levels of the treatment that I believe is collinear with the weed rating.  So, as far as I can understand the output, it would appear that there is collinearity only between levels of the weed rating and not between the two factors.  I don't know what to do with that, and I feel like my understanding must be incorrect.  I appreciate any clarification that you can provide here.

Ksharp
Super User

Whether your Y variable is continuous or binary, you can use the CORRB option to check the correlation between any two estimated coefficients.

 

/* Continuous response: CORRB prints the correlation matrix of the estimates */
proc genmod data=sashelp.heart;
class sex;
model height=weight sex ageatstart/corrb ;
run;

Ksharp_0-1750466727252.png

 

 


/* Binary response: state the binomial distribution explicitly */
proc genmod data=sashelp.heart;
class sex;
model status(event='Dead')=height weight sex ageatstart/dist=binomial corrb ;
run;

Ksharp_1-1750466776230.png

 

 

And for a mixed model, you can check the covariance matrix with the COVTEST statement of PROC GLIMMIX:

https://support.sas.com/kb/40/724.html
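
The CORRB option shown for PROC GENMOD is also available on the MODEL statement of PROC GLIMMIX, so the same check on the fixed-effect estimates can be sketched for a mixed model. The block variable and the RANDOM statement below are hypothetical placeholders for the actual design; the other names come from earlier posts.

```sas
proc glimmix data=df;
   class block Trt_CC Weed_Rating;        /* block is a hypothetical design factor */
   /* CORRB: correlation matrix of the fixed-effect parameter estimates */
   model Yield_Total_Mg_ha = Trt_CC Weed_Rating / solution corrb;
   random intercept / subject=block;      /* placeholder; replace with the real design */
run;
```

Large correlations between the Trt_CC estimate and the Weed_Rating estimates would support the suspicion that the two factors carry overlapping information.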

 
