Re: CATMOD for ANOVA

palolix · Posted 08-21-2024 05:20 PM

Dear all,

I'm trying to use CATMOD for an ANOVA to investigate the relationship between the dependent variable (PeelColor), which is nominal (scale from 1 to 6), and the independent variables (harvest, variety, weeks). I would like to get the main effects of the 3 independent variables and their interactions. I attached an Excel file with my data.

This is the code I'm running:

proc catmod data=one;

response mean;

model PeelColor=harvest|variety|weeks;

quit;

I have two questions:

1. How can I request a Type 3 table in the model so I can see if the main effects and interactions are significant?

2. Do I need to include a WEIGHT statement in the model? In that case, what would be my weight variable?

I would greatly appreciate your help!

Thank you

Caroline

PaigeMiller · Posted 08-21-2024 05:46 PM

Most of us refuse to download Excel (or any other MS Office) files, as they can be security threats. The proper way to provide data is shown here.

Why are you using CATMOD for ANOVA? Why not use PROC GLM? I don't really know if CATMOD produces Type III analyses, I don't see that in the documentation. Certainly, GLM produces type 3 analyses.

The structure of your input data determines if you need a WEIGHT statement. If each row can represent more than one observation (because it is summary data) then you need a WEIGHT statement.
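
As a rough illustration of the summary-data case, here is a minimal sketch. It assumes a hypothetical data set named one_counts in which each row is one Harvest/Variety/PeelColor combination and a hypothetical variable Count holds the number of fruits in that combination; neither name comes from the original post.

```sas
/* Sketch only: ONE_COUNTS and COUNT are hypothetical names for summarized data */
proc catmod data=one_counts;
   weight Count;                        /* each row stands for Count observations */
   model PeelColor = Harvest Variety;   /* main effects, default generalized logits */
run;
quit;

/* In PROC LOGISTIC, frequency counts go in the FREQ statement instead */
proc logistic data=one_counts;
   freq Count;
   class Harvest Variety / param=glm;
   model PeelColor = Harvest Variety / link=glogit;
run;
```

If the data set has one row per fruit, neither statement is needed.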

--
Paige Miller

palolix · Posted 08-21-2024 07:10 PM

Thanks for your quick reply, Paige. I'm using CATMOD because my data is not normal, and it is nominal (scale from 1 to 6). So if I only have fixed effects, should I use GLM instead?

Thank you!

PaigeMiller · Posted 08-22-2024 07:25 AM

@palolix wrote:
Thanks for your quick reply, Paige. I'm using CATMOD because my data is not normal, and it is nominal (scale from 1 to 6). So if I only have fixed effects, should I use GLM instead?

Thank you!

There is no requirement that the data be normal to perform ANOVA. The requirement is that the errors are normal. Nevertheless, I think @StatDave has provided a good answer.

--
Paige Miller

palolix · Posted 08-25-2024 08:01 PM

Thank you, Paige, for your feedback. I think I will follow StatDave's suggestion.

Thanks

Caroline

StatDave · Posted 08-21-2024 06:06 PM

PROC CATMOD is not the best procedure to use, and in any case, the MEAN response is not the proper response function for modeling a nominal, multinomial response. Use a more modern procedure such as PROC LOGISTIC with the LINK=GLOGIT option in the MODEL statement. A WEIGHT statement is not needed. An example is the one titled "Nominal Response Data: Generalized Logits Model" in the Examples section of the LOGISTIC documentation. For your case, the following statements will attempt to fit the appropriate generalized logit model and provide a table of Type 3 tests of the model effects. The following code assumes that your WEEKS variable is actually a continuous variable while HARVEST and VARIETY are categorical. Note that with a six-level response, five independent generalized logits are modeled simultaneously, so you will see five intercepts and five times the number of other parameters and degrees of freedom that you might normally expect with a continuous or binary response. Because of the large number of parameters in this model, your data might be too sparse to avoid model fitting problems as a result of some parameters being infinite.

proc logistic data=one;
class harvest variety / param=glm;
model PeelColor=harvest|variety|weeks / link=glogit;
run;

palolix · Posted 08-25-2024 08:09 PM

Thank you so much for your great support, StatDave! I liked your suggestion. I just wanted to ask why you consider the variable Weeks to be continuous. I would consider it categorical (ordinal), since it can only take values of 1, 3, or 6 weeks (the number of weeks that the fruits were stored in a cold room).

I tried the code you suggested and I got results for the var PeelColor, although I got this warning: matrix is singular and thus the convergence is questionable.

I would greatly appreciate your feedback on this.

Thank you!

Caroline

StatDave · Posted 08-26-2024 01:20 PM

The data you show actually has 8 response levels and if you try to fit the 3 factor model with all factors categorical and include all the interactions, the number of parameters to be estimated is very large and is way, way more than can be supported by the data. That model makes the data far too sparse and PROC LOGISTIC will report "separation" in a log note indicating that the model cannot properly converge because some parameters are infinite. Yes, you can always treat a predictor as categorical, but for a variable that has a small number of meaningful numeric values, you could also treat it as continuous. Doing so results in only one parameter to be estimated (per logit) rather than several. But doing that alone is not nearly enough to avoid the sparseness and nonconvergence. If you reduce the response down to 3 levels and fit only the main effects model, then that model can be properly fit. For example,

data one; set one; 
y=int(peelcolor); if y=5 then y=4; 
run;
proc logistic data=one;
class harvest variety/param=glm;
model y=harvest variety weeks / link=glogit;
run;
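
Before fitting, it may help to see how thinly the observations are spread across the merged response levels. The sketch below is only a suggestion, not part of the original reply; it assumes the data set ONE already contains the variable Y created by the DATA step above.

```sas
/* Sketch only: assumes ONE already contains the merged response Y */
proc freq data=one;
   tables y;                          /* counts per merged response level */
   tables harvest*variety*y / list;   /* one row per observed combination */
run;
```

Combinations that are missing from the LIST output, or that have only one or two observations, are the likely sources of separation warnings.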

palolix · Posted 08-26-2024 02:17 PM

Thank you so much for your support StatDave.

I tried the new model you suggested:

data one; set one;
PeelColor=int(PeelColor); if PeelColor=5 then PeelColor=4;
run;
proc logistic data=one;
class Harvest Variety/param=glm;
model PeelColor=Harvest Variety Wks/ link=glogit ;
run;

but I get these two warnings (for most of my dep variables):

WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood
estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based
on the last maximum likelihood iteration. Validity of the model fit is questionable.

I have two concerns with this new model:

1. Regarding the statements PeelColor=int(PeelColor); if PeelColor=5 then PeelColor=4; this dependent variable goes from 1 to 6, so I'm not sure about this.

2. I understand the need to simplify the model and thus include only the main effects, but what if I really need to know the effects of the harvest*variety and variety*weeks interactions on the dependent variable?

Thank you very much!

StatDave · Posted 08-26-2024 05:57 PM

Then you are not using the data that you attached in your first post. With those data, there are no warnings with the code you show. In the data you attached, there are 8 distinct response levels, and fitting the full model on all those levels with that small amount of data is not remotely feasible. So, you have to consider what you CAN do. If you combine the response levels down to three levels as done by the DATA step I gave, then you can fit the main effects model and you can even treat WEEK as categorical if you want. And the results show that WEEK is not significant, VARIETY is significant, and HARVEST is marginally significant. If you are willing to drop WEEK from the model, then you can fit the model with VARIETY and HARVEST and their interaction IF you assume that the multiple parameters on each of those model effects on the two logits are the same... that is, that the two HARVEST parameters on the two logits are the same and similarly for VARIETY and the interaction. That can be done using the EQUALSLOPES option. (You can even test that equality assumption by also using the UNEQUALSLOPES= option in turn for each of the model effects, which seems to show that the assumption is reasonable to the extent that the data can detect it.) That model fits without error and shows that the HARVEST*VARIETY interaction is also not significant. So, if you again accept that it has no effect and drop it from the model, that leaves a model with just the HARVEST and VARIETY main effects. (You can play the same game of checking the equal parameters assumption for both effects in that model, and again the assumption seems reasonable.) The equal slopes model on just HARVEST and VARIETY as done below again indicates that VARIETY is significant and HARVEST is marginally significant.

proc logistic data=one;
class harvest variety / param=glm;
model y=harvest variety  / link=glogit equalslopes;
run;
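
For reference, the intermediate step described above (HARVEST, VARIETY, and their interaction under the equal slopes assumption) might be written as follows. This is a sketch inferred from the description rather than code posted in the thread, and it assumes the data set ONE already contains the merged response Y.

```sas
/* Sketch only: assumes ONE already contains the merged response Y */
proc logistic data=one;
   class harvest variety / param=glm;
   /* harvest|variety expands to harvest, variety, and harvest*variety */
   model y = harvest|variety / link=glogit equalslopes;
run;
```

The Type 3 table from this fit is where the HARVEST*VARIETY test would appear.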

Ksharp · Posted 08-26-2024 08:14 PM

StatDave,
If the Y variable is a scale/rank variable, why not use a cumulative logistic model?

model y=harvest|variety / link=clogit ;

palolix · Posted 08-26-2024 08:51 PM

I guess because in my case the variable (color) is not ordinal.

palolix · Posted 08-26-2024 08:42 PM

The steps were very helpful in order to simplify a model, thank you so much for your great support! I apologize that I was using other data for this code, sorry for the confusion. Now I have two questions:

1. By reducing the number of levels from 8 to 3 in the dependent variable, am I not losing information and variation?

2. How do I use the UNEQUALSLOPES option to test the equality assumption?

proc logistic data=one;
class Harvest Variety/param=glm;
model PeelColor=Harvest Variety/link=glogit unequalslopes=??;
run;

Thanks a lot!!

StatDave · Posted 08-27-2024 12:35 PM

Yes, of course you are losing information by merging levels, but without doing so you get no model and no information. So, it's a tradeoff to merge some categories so that you can get a possibly useful model. There, of course, could be other approaches using different tradeoffs and you can decide which one is most useful. One attractive alternative applies if you just want to test the association of each of the predictors with the response, controlling for the other predictors. That can be done using the original data (no need to merge categories) by avoiding modeling altogether and instead using the CMH test in PROC FREQ. The General Association statistic tests the adjusted association of each predictor.

proc freq;
table harvest*weeks*variety*color/ noprint cmh;
table harvest*variety*weeks*color/ noprint cmh;
table variety*weeks*harvest*color/ noprint cmh;
run;

This is an example of using both EQUALSLOPES and UNEQUALSLOPES options to test the equal slopes assumption for Harvest in the submodel with Harvest and Variety on the data with merged categories. Since Harvest has two levels, it has one parameter on each of the two logits that are modeled. The EQUALSLOPES option with UNEQUALSLOPES=HARVEST estimates the parameter on one logit and the difference between that parameter and the parameter on the other logit (labeled with prefix U_).

proc logistic data=one;
class harvest variety / param=glm;
model y=harvest variety / link=glogit equalslopes unequalslopes=harvest;
run;

If the test of that difference parameter is not significant, then you might conclude that the parameters are the same. By removing the UNEQUALSLOPES option, you then get just the estimated common slope for Harvest (and for Variety). That is the model with only the EQUALSLOPES option that I showed earlier.
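
The same check can be repeated for Variety by naming it in the UNEQUALSLOPES= option instead. The sketch below simply mirrors the Harvest example; it assumes the data set ONE contains the merged response Y, and by analogy the difference parameters would be expected to appear with the U_ prefix for Variety.

```sas
/* Sketch only: same model as above, now relaxing equal slopes for VARIETY */
proc logistic data=one;
   class harvest variety / param=glm;
   model y = harvest variety / link=glogit equalslopes unequalslopes=variety;
run;
```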

palolix · Posted 08-27-2024 05:22 PM

Thanks a lot for your reply! It seems like PROC LOGISTIC is getting a bit challenging. I will try PROC FREQ as you suggested. What about using PROC GENMOD, specifying a multinomial distribution? I don't know much, so I'm just asking.

I tried your code for using the unequalslopes test for Harvest and this is what I got:

Type 3 Analysis of Effects
Effect      DF   Wald Chi-Square   Pr > ChiSq
Harvest      0   .                 .
U_Harvest    0   .                 .
Variety      2   14.4212           0.0007

Analysis of Maximum Likelihood Estimates
Parameter   Level      PeelColor   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept              2            1   -2.0236    0.5433           13.8708           0.0002
Intercept              3            1   -1.7612    0.5217           11.3981           0.0007
Harvest     1                       1    0.6247    0.5672            1.2128           0.2708
Harvest     55                      0    0         .                 .                .
U_Harvest   1          2            1    0.5878    0.5562            1.1166           0.2906
U_Harvest   1          3            0    0         .                 .                .
U_Harvest   55         2            0    0         .                 .                .
U_Harvest   55         3            0    0         .                 .                .
Variety     465418_9                1    2.4091    0.6456           13.9251           0.0002
Variety     BL516                   1    1.3044    0.5592            5.4415           0.0197
Variety     Hass                    0    0         .                 .                .

Which p-value do I need to check?

Thank you so much StatDave!!
