- Zero-Inflated Models


08-25-2010 11:19 AM

I am looking for a good approach to modelling economic data with many zeros followed by a logarithmic distribution. The common approach is to model the event (0 or >0) and the mean of those >0 separately. I know that a ZIP will work for count data...I am wondering if there is an equivalent for continuous data?

I have come across the use of Generalized Additive Models as a potential solution and something referred to as an exponentially compound Poisson process but, of course, prying these out of SAS will take me a lifetime.

Any suggestions?


Posted in reply to deleted_user

08-25-2010 12:59 PM

I am not sure what you mean by a "logarithmic" distribution. That could be any continuous distribution in which the response is positive (as far as I know).

As such, that could include a zero-inflated gamma. I addressed fitting a zero-inflated gamma model on SAS-L a couple of years ago. See the post at:

http://listserv.uga.edu/cgi-bin/wa?A2=ind0805A&L=sas-l&P=R20779

Note that the zero-inflated gamma (or a zero-inflated log-normal, or ...) has a likelihood which is identical to fitting a logit model for the probability of a zero response plus the (gamma, log-normal, ...) likelihood of the positive response. That is, the parameter estimates for the joint model would be identical to parameter estimates obtained by fitting a model for the probability of a zero value and also fitting a (gamma distribution) model using the observations with positive value. The only thing that you gain by fitting a zero-inflated model is the opportunity to compute a standard error for the mean (including zero values in the mean) which may or may not have any desirable statistical properties.
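The separation described above can be written out as a log-likelihood. This is only a sketch under assumed notation that does not appear in the original post: $\pi_i(\beta)$ is the probability of a zero response for observation $i$ (for example, from a logit model with parameters $\beta$), and $f(y;\theta)$ is the density chosen for the positive values (gamma, log-normal, ...) with parameters $\theta$.

```latex
\ell(\beta,\theta)
  = \underbrace{\sum_{i=1}^{n}\Bigl[\mathbf{1}(y_i = 0)\,\log \pi_i(\beta)
      + \mathbf{1}(y_i > 0)\,\log\bigl(1 - \pi_i(\beta)\bigr)\Bigr]}_{\text{binary part: depends only on } \beta}
  \;+\;
  \underbrace{\sum_{i:\,y_i > 0} \log f(y_i;\theta)}_{\text{positive part: depends only on } \theta}
```

As long as $\beta$ and $\theta$ share no parameters, maximizing the sum is the same as maximizing each term on its own, which is why the joint fit and the two separate fits give the same estimates.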


Posted in reply to deleted_user

08-25-2010 01:24 PM

Thanks for your response Dale,

What I mean is that my non-zero responses are in fact distributed on a logarithmic curve.

I will take a look at your link below...I have just come across a paper proposing the use of a probit - (log skew) normal model invoked through NLMIXED, which seems to be a good fit for what I am looking at except for the distribution.


Posted in reply to deleted_user

08-25-2010 01:40 PM

Looks like what I came across is similar to what you suggested.

As my NLMIXED skills are limited, I think I will take your last comment and stick with modelling both the event (y=$0 vs y>$0) and the means (if y>$0) separately...I was under the assumption that this joint model approach would allow for appropriate estimation of the mean by accounting for the difficulties brought on by the "many zeros" problem. Or perhaps I am not interpreting your comment correctly...


Posted in reply to deleted_user

08-25-2010 04:06 PM

Yes, you would get an appropriate estimate of the mean by using the joint model. But you can get the same mean estimate by modeling the zero probability and the positive values separately. Note that the full-data mean is P(Y>0)*(Ybar|Y>0), where P(Y>0) is the probability that Y is greater than zero and (Ybar|Y>0) is the mean for Y when restricted to the set of positive observations.
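As a quick check of the identity above, here is a minimal Python sketch without covariates. The data values are made up for illustration; with covariates, the two factors would instead come from the fitted binary and positive-value models.

```python
from statistics import mean

# Hypothetical dollar amounts with several exact zeros (illustrative only).
y = [0.0, 0.0, 1232.0, 119.0, 0.0, 989.0, 3411.0, 1121.0]

positives = [v for v in y if v > 0]

p_positive = len(positives) / len(y)   # estimate of P(Y > 0)
mean_positive = mean(positives)        # estimate of (Ybar | Y > 0)

# Full-data mean rebuilt from the two separately estimated parts.
two_part_mean = p_positive * mean_positive

# Plain sample mean over all observations, zeros included.
overall_mean = mean(y)

print(two_part_mean, overall_mean)
```

The two numbers agree up to floating-point rounding, which is the point made in the post: the mean itself is not where the difficulty lies.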

It is not estimation of the mean that is the problem. It is estimating the variance of the mean which is the problem. The solution employing NLMIXED will provide an estimate of the variance of the mean. But it is not clear that the estimate of the variance obtained for this sort of problem is appropriate.

You might want to employ a bootstrap approach to determine the distribution of the mean. That would be better than assuming that the mean is approximately normally distributed with a variance as estimated by the NLMIXED procedure.
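A bare-bones version of that bootstrap idea might look like the Python sketch below. It assumes independent observations and a simple percentile interval; the function name, the data, and the number of replicates are all hypothetical choices for illustration. For a blocked or repeated-measures design, whole blocks or subjects would need to be resampled instead of single records.

```python
import random
from statistics import mean

def bootstrap_mean_interval(y, n_boot=2000, alpha=0.05, seed=12345):
    """Percentile bootstrap interval for the mean of y (zeros included)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y)
    # Resample n observations with replacement and record each resample's mean.
    boot_means = sorted(mean(rng.choices(y, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical dollar amounts with several exact zeros (illustrative only).
y = [0.0, 0.0, 1232.0, 119.0, 0.0, 989.0, 3411.0, 1121.0]

lo, hi = bootstrap_mean_interval(y)
print(f"sample mean = {mean(y):.1f}, 95% percentile interval = ({lo:.1f}, {hi:.1f})")
```

Because the interval is read directly from the resampled means, it does not rely on the mean being approximately normal.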


Posted in reply to deleted_user

08-30-2010 08:49 AM

Sooo, what you are suggesting is that we have yet to come up with a generally acceptable approach to modelling continuous data with many zeros in the same way that the ZI group of approaches has done for count data?


Posted in reply to deleted_user

08-30-2010 04:32 PM

No, that is not what I am saying. ZIP and ZINB models both partition the zero values into some part that is attributable to the Poisson (or negative binomial) distribution and some part that is attributable to an extra-zeroes portion. Typically, one does not estimate the overall mean value taking into account the two different distributional components.

There is no such partitioning of zero values when you have a continuous, positive response along with some zero values. The zero values are known to be from a single distribution. So, there is no need to simultaneously estimate parameters of the two distributional components in order to disentangle the distributional parameters. You can just fit a regression model for whether the response is zero-valued (using all of the data) and also fit a separate regression model to the observations which have positive value (to get parameters of your "logarithmic" distribution).

But you want to extend the concept of these models to estimating a person-specific mean value that takes into account the zero probability model and the positive value expectation. This is not something that is typically done for the ZIP and ZINB models (to my knowledge).

The estimate of the expectation in the entire data including the zero values and the positive values can be easily obtained. I have already stated that. Regardless of whether you estimate the parameters of the two components simultaneously or whether you estimate the parameters of the two components in separate regressions, you can compute the expectation. But whether the estimated standard error of the expectation is a good statistic is something which I don't know. As I stated above, I don't believe that inferences about the expectation are necessarily part of a ZIP or ZINB model. This may be an area that requires further investigation. The standard error may be just fine. But I would not want to assume that it is OK without investigating the properties of the estimate of the SE.


Posted in reply to deleted_user

09-01-2010 01:51 PM

In my situation, I am interested in inferences about the difference in group means in a 2X2 factorial design (with 2 blocking factors just for good measure). To do so, I need to know the variance, and therefore the standard error, of my groups of interest. So, based on your comments, modelling two separate outcomes with PROC LOGISTIC and PROC MIXED is not the most appropriate approach, despite its simplicity...

I guess, using your reference above, I have two questions.

1) How do I come up with adequate starting values for the parameters in the PARMS statement?

2) Do I simply extend the logic for handling a blocked, factorial design (with repeated measures) as I would if this were proc mixed?

The more I look at NLMIXED, the scarier it gets...


Posted in reply to deleted_user

09-01-2010 03:29 PM

Typically, zero values are safe initial parameter estimates for most parameters. A zero value cannot be employed to initialize a variance. However, if the model is parameterized such that you don't estimate the variance directly, but instead parameterize the model to estimate the log of the variance (or, log of the square root of the variance), then a zero-value for the parameter which represents log(Variance) or log(SD) is a reasonable initial parameter.
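To illustrate why the log parameterization makes zero a workable starting value, here is a small Python sketch. It is not NLMIXED code; it uses a normal log-likelihood on hypothetical log-dollar values, and the function and variable names are invented for the example.

```python
import math

def normal_loglik(mu, log_sd, data):
    """Normal log-likelihood parameterized by log(SD) instead of SD."""
    sd = math.exp(log_sd)  # always strictly positive, whatever log_sd is
    return sum(
        -0.5 * math.log(2.0 * math.pi) - log_sd - 0.5 * ((x - mu) / sd) ** 2
        for x in data
    )

# Hypothetical positive dollar amounts, analyzed on the log scale.
log_amounts = [math.log(v) for v in (1232.0, 119.0, 989.0, 3411.0, 1121.0)]

# Starting every parameter at zero is legal here: log_sd = 0 means SD = 1.
start_ll = normal_loglik(mu=0.0, log_sd=0.0, data=log_amounts)
print(start_ll)
```

Had the function taken the SD itself, a starting value of zero would divide by zero in the squared term, which is the situation the post warns about.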

The blocking factors will introduce random effects into the model, right? You don't say whether those blocking factors are crossed or nested. The NLMIXED procedure cannot handle crossed random effects. The NLMIXED procedure has some ability to handle designs with nested random effects. But the number of levels of the nested blocking factor needs to be relatively small if you are to fit the nested design using the NLMIXED procedure. What exactly is your design?


Posted in reply to deleted_user

09-03-2010 12:04 PM

2X2 Factorial with 2 blocks...crossed.

So I guess I'll go back to my original thought...2 separate models


Posted in reply to deleted_user

10-12-2010 07:46 AM

Same topic, new issue.

Using the concepts previously stated, I can see how modelling 1) the binary responses (0 vs. >0) and 2) for those that are positive, the continuous responses, works well in most cases. In my case however, I have longitudinal data...here is my problem...

| obs | time1 | time2 | time3 |
|-----|-------|-------|-------|
| 1   | $1232 | $0    | $1121 |
| 2   | $119  | $989  | $0    |
| 3   | $0    | $0    | $3411 |

If I had only one response variable, both types of responses could be handled with the separate-models approach you suggested. However, although my binary (0 vs. >0) logistic model holds in the example above, I believe I would need to drop all three of these observations, as the entire point of modelling the 0s separately is so that I can obtain an accurate mean response of the continuous variables. Keeping those obs with 0 values goes against what I am trying to accomplish, while dropping them means a large loss of data.

Any thoughts?


Posted in reply to deleted_user

10-12-2010 08:56 AM

What I meant to say was....

If I had only one response variable, both types of responses could be handled with the separate-models approach you suggested. However, although my binary (0 vs. >0) logistic model holds in the example above, **for the continuous model I believe I would need to drop all three of these observations**, as the entire point of modelling the 0s separately is so that I can obtain an accurate mean response of the continuous variables. Keeping those obs with 0 values goes against what I am trying to accomplish, while dropping them means a large loss of data.


Posted in reply to deleted_user

11-19-2010 03:04 PM

This may be too late for you, but if you have SAS 9.22, check out the new (experimental) procedure called PROC SEVERITY. I have not used it yet, but it has great potential for dealing with "unusual" continuous distributions. Below is a quote from the User Guide:

The SEVERITY procedure estimates parameters of any arbitrary continuous probability distribution that is used to model magnitude (severity) of a continuous-valued event of interest. Some examples of such events are loss amounts paid by an insurance company and demand of a product as depicted by its sales. PROC SEVERITY is especially useful when the severity of an event does not follow typical distributions, such as the normal distribution, that are often assumed by standard statistical methods.

PROC SEVERITY provides a default set of probability distribution models that includes the Burr, exponential, gamma, generalized Pareto, inverse Gaussian (Wald), lognormal, Pareto, and Weibull distributions. In the simplest form, you can estimate the parameters of any of these distributions by using a list of severity values that are recorded in a SAS data set. The values can optionally be grouped by a set of BY variables. PROC SEVERITY computes the estimates of the model parameters, their standard errors, and their covariance structure by using the maximum likelihood method for each of the BY groups.
