Solved: Re: zero-inflated multinomial data

palolix · Posted 10-10-2024 07:15 PM

Dear SAS Community,

I am trying to analyze multinomial dependent variables that have mostly zeroes. When I analyze these variables including the interactions of the factors in the model, I get these warnings:

The negative of the Hessian is not positive definite. The convergence is questionable.

The procedure is continuing but the validity of the model fit is questionable

The specified model did not converge

However, if I only include the main factors in the model but not the interactions, then I no longer get the warnings.

Since I am only getting these warnings with those dependent variables that have too many zeroes, I am assuming it is due to zero-inflated data. It seems like genmod only has the option of zero-inflated data for Poisson or neg bin distributions, but not for multinomial data. I would greatly appreciate your feedback on this.

This is the code I am using (I also attached the data):

proc genmod data=one;
  by Variety;
  class Season Harvest Weeks;
  model Easyofpeeling = Season|Harvest|Weeks / type3 dist=multinomial link=cumlogit;
run;

Thank you very much!

Caroline

StatDave · Posted 10-10-2024 10:43 PM

The problem is not the zero values in the response variable. For a multinomial categorical variable, zero is just another category, and the distribution does not restrict the proportion of zeros the way that, say, the continuous gamma distribution does. The problem here is that you have several response categories and also specify all possible interactions. The result is a complex model with many parameters to be estimated, and that complexity makes the data in each variety too sparse. The consequence, just like for binary logistic models, is that some model parameters are actually infinite. Since computers don't deal in infinities, the practical result is that some parameters are large with even larger standard errors and/or some parameters have zero degrees of freedom and are not estimated. The solution is to simplify the model in any way acceptable to you, such as by combining response levels (which will have the biggest benefit) and/or removing some or all interactions. Because the amount of data varies so much by variety, you should not expect that you can just specify one model, use BY VARIETY, and get a proper fit for each variety unless you find a much simpler model that can be successfully fit in every variety. And by the way, for logistic models like this (binary or multinomial), PROC LOGISTIC is a better procedure to use than PROC GENMOD as it is more specialized for those models.
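
One way to see the sparseness described above is to count observations in each response-by-predictor cell within each variety. This is a minimal sketch, assuming the data set and variable names from the original post; the output data set name one_sorted is made up for illustration.

```sas
/* BY-group processing needs the data sorted by Variety */
proc sort data=one out=one_sorted;
  by Variety;
run;

/* Cell counts only: empty or near-empty cells point to the
   response levels and predictor levels that cause trouble */
proc freq data=one_sorted;
  by Variety;
  tables Season*Easyofpeeling
         Harvest*Easyofpeeling
         Weeks*Easyofpeeling / norow nocol nopercent;
run;
```

Cells with a count of zero, or only one or two observations, are the likely candidates for combining levels.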

SteveDenham · Posted 10-11-2024 09:50 AM

An alternative to removing the interactions might be to fit ONLY the highest level interaction, and then use specific CONTRAST or ESTIMATE statements to calculate your odds ratios. This will also quickly identify the cells with small sample sizes.

SteveDenham

StatDave · Posted 10-11-2024 10:51 AM

Actually, I don't think this will help - the degrees of freedom of the other interactions are absorbed into the 3-way so you end up with just as many parameters requiring estimation.
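
To make the counting concrete: if Season, Harvest, and Weeks have a, b, and c levels, the full factorial uses (a-1) + (b-1) + (c-1) + (a-1)(b-1) + (a-1)(c-1) + (b-1)(c-1) + (a-1)(b-1)(c-1) = abc - 1 degrees of freedom per logit, and the three-way term on its own also uses abc - 1 when every cell is occupied. The sketch below, assuming the variable names from the original post and the 'Hass' subset, specifies the two models side by side; it is an illustration, not a recommended analysis.

```sas
/* Full factorial: main effects, two-way and three-way terms */
proc logistic data=one;
  where Variety='Hass';
  class Season Harvest Weeks / param=glm;
  model Easyofpeeling = Season|Harvest|Weeks / link=glogit;
run;

/* Three-way term alone: it absorbs the degrees of freedom of
   the omitted lower-order terms, so the parameter count matches */
proc logistic data=one;
  where Variety='Hass';
  class Season Harvest Weeks / param=glm;
  model Easyofpeeling = Season*Harvest*Weeks / link=glogit;
run;
```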

palolix · Posted 10-11-2024 04:20 PM

Thank you very much for your input, Steve. I think the problem is due to unbalanced data for some varieties, because I was still getting the warnings even when including only the main effects in the model.

Thank you

Caroline

palolix · Posted 10-11-2024 04:17 PM

Thank you so much for your comprehensive feedback, StatDave! I simplified the model as much as I could in PROC LOGISTIC. I think I now know what the main issue is. I followed your advice and fit the model for each variety separately. For 'Hass' I don't get any warning because this variety was measured on every occasion (balanced data), but if I run it for another variety that has very unbalanced data, then I get these warnings:

There is possibly a quasi-complete separation of data points. The maximum likelihood
estimate may not exist.

The LOGISTIC procedure continues in spite of the above warning. Results shown are based
on the last maximum likelihood iteration. Validity of the model fit is questionable

So to me it seems like the main problem is the unbalanced data for some of the varieties, which were not harvested as consistently as Hass, the standard.

Question: Is it possible to use two WHERE statements in PROC LOGISTIC so that I can fit the model for each variety and season?

proc logistic data=one;
  where Variety='Hass';
  class Season Harvest Weeks / param=glm;
  model Easyofpeeling = Season Harvest Weeks / link=glogit;
run;

Thank you so much!

Caroline

StatDave · Posted 10-11-2024 09:01 PM

The lack of balance, meaning unequal numbers of observations in the various predictor combinations, is not itself a problem. The problem is the extreme case, in which some of the combinations have no observations. That is when "separation" occurs, resulting in some parameters being infinite, as I mentioned. In these cases, you might be able to fit the model with at least some interactions by using a penalized likelihood. This can be done by simply adding the FIRTH option in the MODEL statement.
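
As a rough sketch of where the option goes, assuming a binary response: Peelbin is a hypothetical 0/1 variable that does not appear in this thread, and the single interaction is included only to illustrate the idea.

```sas
proc logistic data=one;
  where Variety='Hass';
  class Season Harvest Weeks / param=glm;
  /* Peelbin is a hypothetical binary response (1 = event) */
  /* FIRTH requests penalized maximum likelihood estimation */
  model Peelbin(event='1') = Season Harvest Weeks Season*Harvest / firth;
run;
```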

Regarding your question - you don't need two WHERE statements because you can specify a single WHERE statement with multiple conditions, such as: where Variety='Hass' and Season=2022;
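
Placed in the PROC LOGISTIC step from the question, that might look like the sketch below. The value 2022 is only illustrative and assumes Season is numeric; a character Season would need a quoted value. Season is left out of the CLASS and MODEL statements here because it is constant within the subset.

```sas
proc logistic data=one;
  /* One WHERE statement, two conditions joined with AND */
  where Variety='Hass' and Season=2022;
  class Harvest Weeks / param=glm;
  model Easyofpeeling = Harvest Weeks / link=glogit;
run;
```

If two separate WHERE statements are written in the same step, the second replaces the first unless it is written as WHERE ALSO, so the single combined condition is the simpler choice.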

palolix · Posted 10-14-2024 11:45 AM

Thank you so much for your great suggestions, StatDave. Some varieties are only harvested in months 1, 3, 4, 6 and 8, unlike Hass, which is harvested almost every month, so fitting the model for each variety and season, and testing only main effects without interactions, solved the problem. So I learned the lesson of simplifying the model as much as I can. Also, the FIRTH option worked wonderfully for the binary response variables. Is there a similar option for multinomial data?

Thank you!

Caroline

StatDave · Posted 10-14-2024 01:00 PM

The FIRTH option is only available with binary response data. In sparse cases, another simplification besides removing model effects like interactions is to combine categories of any categorical variables. If the sparseness can be removed by doing this, then you might be able to estimate some interactions. How and in which variables to combine categories is a trial-and-error thing, but the biggest effect and place to start is with the multinomial response. Combining response categories to create fewer response levels does the most to reduce the number of parameters that must be estimated. If that isn't enough, then start combining categories in the predictors. Obviously the categories that have very few observations in some response levels are the ones to combine first.
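
A minimal sketch of combining response categories before refitting follows. It assumes Easyofpeeling is numeric with levels 1 through 5, which the thread does not state; the grouping and the names one_collapsed and Peel3 are hypothetical.

```sas
data one_collapsed;
  set one;
  /* Hypothetical grouping: 1-2 -> 1, 3 -> 2, 4-5 -> 3 */
  if missing(Easyofpeeling) then Peel3 = .;
  else if Easyofpeeling <= 2 then Peel3 = 1;
  else if Easyofpeeling = 3 then Peel3 = 2;
  else Peel3 = 3;
run;

/* Fewer response levels means fewer generalized logits,
   so fewer parameters to estimate for every model effect */
proc logistic data=one_collapsed;
  where Variety='Hass';
  class Season Harvest Weeks / param=glm;
  model Peel3 = Season Harvest Weeks / link=glogit;
run;
```

Which levels to merge should be guided by the cell counts, starting with the levels that have the fewest observations.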

palolix · Posted 10-14-2024 02:11 PM

That makes a lot of sense, I will do so. Thank you so much StatDave!
