I am building models whose dependent variable (1) is zero-inflated (or, in the econometric literature, "limited"); (2) is continuous in nature and (3) follows a lognormal distribution in the non-zero part.
I have reviewed the literature and found that a two-part model that maximizes a joint likelihood might be suitable. The problem is that this approach does not seem to be supported by any built-in SAS procedure, so I decided to formulate the joint likelihood myself and let SAS maximize it and report the results.
I will not go into the details of this joint likelihood, but in case someone needs to know, here it is briefly: an observation equal to zero contributes the probability that the sample point is zero, and a positive observation contributes the product of the probability that the sample point exceeds zero and the density of its value given that it exceeds zero.
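For concreteness, one common way to write a two-part likelihood of this kind is sketched below. The notation is an assumption made for illustration and is not taken from any particular reference: $p_i$ is the probability that observation $i$ is positive (here from a logistic model with coefficients $\alpha$), and the positive values are lognormal, so that $\ln y_i \sim N(x_i^\top\beta, \sigma^2)$.

```latex
L_i(\alpha,\beta,\sigma) =
\begin{cases}
1 - p_i, & y_i = 0,\\[4pt]
p_i \,\dfrac{1}{y_i\,\sigma\sqrt{2\pi}}
\exp\!\left(-\dfrac{(\ln y_i - x_i^\top\beta)^2}{2\sigma^2}\right), & y_i > 0,
\end{cases}
\qquad
p_i = \frac{1}{1+\exp(-x_i^\top\alpha)} .
```

The joint likelihood is then the product of $L_i$ over all observations.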
Is there any way I can supply my own likelihood function and have a built-in SAS procedure maximize it? References on this issue are also welcome.
Thank you!
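I think your 2 parts are:
- a logistic regression to P(y=0) and
- a gamma (or log-normal) error regression with log link to E(y | y>0)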
With PROC NLMIXED you can maximize the (log-)likelihood jointly.
However, the likelihood may well (in fact, will) separate into two independent pieces anyway, so you don't get improved parameter estimates as a result. The only advantage then is that you do it with one procedure call and you can estimate the combined function E(y), which includes the zeroes.
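A short sketch of why the separation happens, under the assumption that the zero part and the positive part share no parameters (the symbols are illustrative: $p_i$ is the probability of a positive value with coefficients $\alpha$, and $f$ is the lognormal density with parameters $\beta$ and $\sigma$):

```latex
\ell(\alpha,\beta,\sigma)
= \underbrace{\sum_{i}\Big[\mathbf{1}[y_i=0]\,\ln\!\big(1-p_i(\alpha)\big)
      + \mathbf{1}[y_i>0]\,\ln p_i(\alpha)\Big]}_{\text{depends on } \alpha \text{ only}}
\;+\;
\underbrace{\sum_{i:\,y_i>0} \ln f\!\left(y_i;\beta,\sigma\right)}_{\text{depends on } (\beta,\sigma) \text{ only}}
```

Because the two sums involve disjoint parameter sets, maximizing them separately gives the same estimates as maximizing the total. This would not hold for models in which the two parts share parameters or have correlated errors, such as selection models.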
Note that you can use PROC NLMIXED in the absence of random effects (only fixed effects is fine here).
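As a minimal sketch of what this could look like, assuming a hypothetical data set named claims with a nonnegative response y and a single covariate x (these names and the starting values are illustrative, not from the original question):

```sas
/* Two-part model: logistic for P(y>0), lognormal for y given y>0.   */
/* The data set CLAIMS and the variables Y and X are hypothetical.   */
proc nlmixed data=claims;
   parms a0=0 a1=0 b0=0 b1=0 logsig=0;

   pi = arcos(-1);                      /* the constant pi            */

   /* Part 1: probability of a positive response */
   eta_z = a0 + a1*x;
   p_pos = 1 / (1 + exp(-eta_z));

   /* Part 2: lognormal parameters; sigma kept positive via exp()     */
   mu    = b0 + b1*x;
   sigma = exp(logsig);

   /* Per-observation log likelihood */
   if y = 0 then ll = log(1 - p_pos);
   else ll = log(p_pos)
             - log(y) - logsig - 0.5*log(2*pi)
             - 0.5*((log(y) - mu)/sigma)**2;

   model y ~ general(ll);

   /* Unconditional mean E(y) = P(y>0)*exp(mu + sigma**2/2) at x = 1  */
   estimate 'E(y) at x=1'
      (1/(1 + exp(-(a0 + a1)))) * exp(b0 + b1 + 0.5*exp(2*logsig));
run;
```

The GENERAL() distribution in the MODEL statement lets you supply any log likelihood, and the ESTIMATE statement reports the combined mean with a standard error based on the delta method.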
There are other ways in SAS to maximize (any tailored) likelihood beyond PROC NLMIXED.
Koen
Thank you, Koen, for your detailed reply!
@sbxkoenk wrote:
- "Zero-inflated" response and "limited" response are not the same.
- Are you dealing with count data or continuous data (0 and above)?
The nomenclature "limited dependent variable", albeit strange and not that intuitive to me, comes from the econometric literature. For instance, I retrieved two books (Limited-Dependent and Qualitative Variables in Econometrics (cambridge.org) and Analysis of Panels and Limited Dependent Variable Models (cambridge.org)), skimmed through their contents and found that the so-called "limited dependent variables" are in fact sample-selected variables. On many occasions, they are zero-inflated as well. I avoided the term "zero-inflated" in the title so that people attracted by this phrase would not come straight into this post and tell me to use PROC GENMOD because it can model zero-inflated count data, without noticing that the variable I wish to model is continuous rather than discrete. The solution for the count case has already been given in SAS Help as well as in much other literature.
As I have said in the post and the paragraph above, the variable I wish to model is continuous.
@sbxkoenk wrote:
- Are you talking about zero-inflated models (gamma, lognormal, Poisson, negative binomial) or are you talking about (Gaussian) mixture models?
I am not sure of the exact definition of a "lognormal zero-inflated model", as I have not yet seen this phrase in the literature on zero-inflated models that I have come across. If it refers to a model for a variable that is zero-inflated and whose non-zero part follows a lognormal distribution, then that is the model I wish to build.
By the way, I am not sure about the definition of (Gaussian) mixture models. Do you mean finite mixture models whose dependent variables follow a finite mixture of (normal) distributions that can be built by PROC FMM? To the best of my knowledge, these models are not typically classified as zero-inflated models. Can they model zero-inflated data as well?
@sbxkoenk wrote: I think your 2 parts are:
- a logistic regression to P(y=0) and
- a gamma (or log-normal) error regression with log link to E(y | y>0)
I do not think the errors would still be log-normal if the link function is the natural log of the expectation of the dependent variable given that it exceeds zero, ln E(y | y>0). But that is a minor issue. Aside from that, what you have outlined is exactly what I want.
@sbxkoenk wrote: With PROC NLMIXED you can maximize the (log-)likelihood jointly.
However, the likelihood may well (in fact, will) separate into two independent pieces anyway, so you don't get improved parameter estimates as a result. The only advantage then is that you do it with one procedure call and you can estimate the combined function E(y), which includes the zeroes.
Thank you for the reminder! I have been reading a monograph (Regression Models: Censored, Sample Selected, or Truncated Data (Quantitative Applications in the So...) on zero-inflated continuous data, which are also termed "sample selected data". When explaining how to model a zero-inflated variable whose non-zero part follows a normal distribution, the author demonstrates why it is inappropriate not to maximize the joint likelihood: ordinary least squares estimators of the regression coefficients, whether computed on the non-zero portion or on the entire sample, are biased and/or inconsistent, except under rare conditions that are, in my opinion, hard to verify with either the data at hand or professional knowledge. In contrast, the estimator of the regression coefficients obtained by maximizing the joint likelihood is guaranteed to be unbiased and consistent. So I think it is the safer choice.
@sbxkoenk wrote: Note that you can use PROC NLMIXED in the absence of random effects (only fixed effects is fine here).
There are other ways in SAS to maximize (any tailored) likelihood beyond PROC NLMIXED.
Koen
Thank you for pointing out the fact that PROC NLMIXED can be of help! Could you please provide some details on the other ways of maximizing tailored likelihood functions that you mentioned?
Thanks!
See this note which describes models you could consider for a continuous response that is zero inflated.
Thank you so much! The topic of this note perfectly matches my need. I will go into its details to see if this encompasses the exact modeling method I wish to apply.
I have some long-standing questions regarding Tweedie models, as the explanation of the Tweedie distribution in SAS Help is too hard for me to understand. Moreover, I have searched extensively on the Internet and found little information on this distribution.
My questions are: are Tweedie models suitable for modeling all zero-inflated continuous variables? And are there any goodness-of-fit statistics for Tweedie models?
Thanks!
Yes, as stated in the note I referred to and in "Tweedie Distribution for Generalized Linear Models" in the Details section of the GENMOD documentation, the Tweedie distribution can be used for continuous response data with a mass of zero values. Unlike distributions with positive support like the gamma and log-normal, the Tweedie distribution supports zero values. I show a simple R-square-like statistic in the note for use as a goodness of fit measure and use it to compare the models presented there. The statistic is discussed further in the provided link.
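As a minimal sketch of fitting such a model, assuming a hypothetical data set named claims with a nonnegative response y and a covariate x (these names are illustrative and not from the note):

```sas
/* Tweedie regression with a log link; CLAIMS, Y and X are hypothetical */
proc genmod data=claims;
   model y = x / dist=tweedie link=log;
   /* Save the predicted means for later comparison with observed y */
   output out=pred pred=mu_hat;
run;
```

The predicted means in mu_hat are strictly positive even for observations where y is zero, since the log link models the distribution mean.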
One caveat about using Tweedie models for zero-inflated data, for friends in this Community who may not have time to read the note cited by @StatDave thoroughly: the note says that a Tweedie model "never produces a predicted mean exactly equal to zero". That potentially makes it a poor tool for prediction, because in reality there are of course zero-valued observations. After all, we are modeling zero-inflated data! Things are different if you are trying to explore the association between the dependent variable and the independent variables. If that is the objective, I think Tweedie models can be used.
Not sure I see the problem with prediction. While the distribution mean cannot be zero, individual realizations from the distribution can be zero. For instance, the PDF at zero for a Tweedie distribution with p=1.5 and mean=1 is nonzero:
data _null_; p=pdf('tweedie',0,1.5,1); put p=; run;
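To make the point mass at zero explicit, here is a small sketch that compares the PDF value with the closed form for a Tweedie distribution with 1 < p < 2, which is a compound Poisson-gamma distribution with P(Y=0) = exp(-lambda) and lambda = mu**(2-p) / (phi*(2-p)). The dispersion phi = 1 is an assumption chosen to match the default in the call above.

```sas
data _null_;
   p = 1.5; mu = 1; phi = 1;              /* phi = 1 is assumed        */
   lambda    = mu**(2-p) / (phi*(2-p));   /* Poisson rate, here 2      */
   p0_closed = exp(-lambda);              /* closed-form P(Y=0)        */
   p0_pdf    = pdf('tweedie', 0, p, mu, phi);
   put p0_closed= p0_pdf=;
run;
```

With these values lambda is 2, so the closed-form probability of a zero is exp(-2), roughly 0.135, even though the mean of the distribution is 1.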
When I mentioned the utility of Tweedie models in prediction, my emphasis was on the predicted mean. The fact that the predicted mean cannot be zero conflicts with the fact that the dependent variable is zero-inflated. The number you mentioned is the value of the probability density function of the Tweedie distribution at a given point.
In short, it is usually the model's predicted mean that is used for prediction.
For instance, suppose we are building a model for the amount of money an insurance company pays in reimbursements. When we say we want to predict this quantity, we mean that we wish to know the amount the company reimburses given the values of the predictors (i.e., the independent variables). This usually amounts to calculating the model's predicted mean at those predictor values. Because the predicted mean of a Tweedie model is always larger than zero, in this fictitious scenario we would claim that the amount of reimbursement is always larger than zero, since this is what the Tweedie model tells us through its predicted mean. But that is definitely not true. A lot of zeros are actually observed. That is why the data are termed zero-inflated.
However, if the researcher is interested in the factors associated with the dependent variable rather than in the expected value at given predictor values, then Tweedie models can be used. Following the scenario above, suppose we are interested in the relationship between the amount of money reimbursed and age, instead of trying to figure out the amount the company is expected to pay given that the person's age is, say, 69. In this case, the magnitude of the regression coefficient estimate as well as its sign (+ or -) can tell us about the relationship between the amount reimbursed and age.
This is another issue I would like to ask about. Is there any SAS solution for the same setting I described in the original post, but where the data were collected from complex surveys and hence require statistical methods tailored for such data (like the ones offered by the SURVEYMEANS procedure)?
Although not solved yet, I have made some progress on the question I raised in the original post and would like to share my findings.
(1) Theoretical issues: the non-zero part of my dependent variable follows a lognormal distribution. Theoretical discussions of maximum likelihood estimation for these models can be found in Amazon.com: Generalized Linear Models for Categorical and Continuous Limited Dependent Variables (Ch.... Moreover, heteroscedasticity in limited dependent variable models is a problem I encountered, and the authors of this book name it as an issue they discuss extensively. Truncated and Censored Samples: Theory and Applications (Statistics: A Series of Textbooks and Monog... also contains a brief summary of maximum likelihood estimators for limited dependent variable models whose dependent variable follows a lognormal distribution.
(2) Software implementation: Amazon.com: Maximum Likelihood Estimation and Inference: With Examples in R, SAS and ADMB: 978047009... is a monograph on maximum likelihood estimation (MLE) with special emphasis on software implementation, including issues of conducting MLE with SAS.