Re: OLS model for large zero inflated dataset

daltonchris7720 · Posted 09-24-2019 09:23 PM

Hello SAS experts

In a couple of earlier threads I have been discussing modelling out-of-pocket costs for members of a health plan. There are a large number of zeros in this very large dataset (about 80% of 5.5 million observations), and the non-zero out-of-pocket costs are highly right-skewed (from a few dollars to $55,000).

So I've tried PROC HPFMM and PROC HPGENSELECT with Tweedie, gamma and ZINB distributions, although these are not really count data. The zeros are not a latent class either, as they don't come from a different process from the one that gives other members an out-of-pocket charge. So I don't think finite mixture models, hurdle models or zero-inflated models really work here.

Theoretically the Tweedie distribution should work, but interestingly plain ol' OLS with PROC HPREG seems to work best, with the lowest AIC of all these models by a long way. The raw residuals from the OLS model (attached) show the bimodal distribution quite well. Is this really valid, though, given the highly non-normal distribution of the outcome variable?

Modelling the out-of-pocket cost as binary (zero vs. non-zero) with PROC HPLOGISTIC works well, but I was trying to get a model that predicts the actual out-of-pocket amount, hence the attempts with a continuous outcome.

Thoughts appreciated.

Regards

Chris

PGStats · Posted 09-25-2019 12:26 AM

I wouldn't want to send you on a wild chase but how about a finite mixture of constant(0) and exponential?

PG

daltonchris7720 · Posted 09-26-2019 05:44 AM

Thanks for your help. See my reply to Dave.
Regards

StatDave · Posted 09-25-2019 10:24 AM

A zero-inflated gamma model, which can be done in PROC FMM, would also be a possibility. It would be appropriate for positive, right-skewed, continuous data with a point mass at zero.
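As a rough illustration of what such a model assumes, here is a minimal sketch in Python rather than SAS. It is not how PROC FMM is implemented; the function names `zig_loglik` and `zig_mean` and the shape/scale parameterisation are assumptions made for the example. A zero contributes the log of a probability, while a positive value contributes the log of the remaining probability plus a gamma log-density.

```python
import math

def zig_loglik(y, p_zero, shape, scale):
    """Log-likelihood of a point mass at zero mixed with a gamma(shape, scale).

    p_zero is the probability of an exact zero; positives follow the gamma.
    """
    total = 0.0
    for v in y:
        if v == 0:
            # Point mass: log of a probability
            total += math.log(p_zero)
        else:
            # Continuous part: log(1 - p_zero) plus the gamma log-density
            log_pdf = ((shape - 1) * math.log(v) - v / scale
                       - math.lgamma(shape) - shape * math.log(scale))
            total += math.log(1 - p_zero) + log_pdf
    return total

def zig_mean(p_zero, shape, scale):
    """Unconditional mean: P(y > 0) times the gamma mean (shape * scale)."""
    return (1 - p_zero) * shape * scale
```

Under these assumptions the predicted cost for a member is `zig_mean(...)`, that is, the probability of a non-zero charge multiplied by the expected charge given that it is non-zero.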

daltonchris7720 · Posted 09-26-2019 05:41 AM

Thanks both for your replies.

I tried the FMM models but got really weird predictions. Rick in an earlier post helped me with this.

He pointed out the following in the Details tab of the PROC HPFMM help information under the sub-tab "Log likelihood of the response distributions":

"While it is syntactically valid to mix a constant distribution with a continuous distribution, such as DIST=LOGNORMAL, such a mixture is not mathematically appropriate, because the constant log-likelihood is the log of a probability, while a continuous log-likelihood is the log of a probability density function. If you want to mix a constant distribution with a continuous distribution, you could model the constant as a very narrow continuous distribution, such as DIST=UNIFORM(c-delta, c+delta) for a small value delta. However, using PROC HPFMM to analyze such mixtures is sensitive to numerical inaccuracy and ultimately unnecessary. Instead, the following approach is mathematically equivalent and more numerically stable:
Estimate the mixing probability as the proportion of observations in the data set such that |y_i - c| < epsilon.
Estimate the parameters of the continuous distribution from the observations for which |y_i - c| >= epsilon."

Sorry the equations won't copy. The link is

http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_hpfmm_detail...

I wasn't sure how to code this suggestion. Using the uniform as they suggested didn't work either.
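For what it's worth, the two steps in the quoted documentation could be sketched roughly as below. This is Python rather than SAS, and only a sketch under assumptions: the function name `two_step_fit` is made up, the continuous part is assumed to be a gamma distribution, its parameters are estimated by the method of moments purely to keep the example short (the documentation does not prescribe an estimator), and the data are invented.

```python
from statistics import fmean, pvariance

def two_step_fit(y, c=0.0, eps=1e-8):
    """Two-step estimate for a point mass at c mixed with a gamma distribution.

    Step 1: mixing probability = share of observations with |y_i - c| < eps.
    Step 2: gamma shape/scale from the remaining observations
            (method of moments, an assumption made for brevity).
    """
    at_c = [v for v in y if abs(v - c) < eps]
    rest = [v for v in y if abs(v - c) >= eps]

    mixing_prob = len(at_c) / len(y)

    mean = fmean(rest)
    var = pvariance(rest, mu=mean)
    shape = mean ** 2 / var   # gamma mean = shape * scale
    scale = var / mean        # gamma variance = shape * scale**2
    return mixing_prob, shape, scale

# Invented illustration data: 80% zeros, two positive charges
costs = [0, 0, 0, 0, 0, 0, 0, 0, 10.0, 30.0]
mixing_prob, shape, scale = two_step_fit(costs)
```

In a regression setting the second step would presumably become a model fitted only to the rows with a non-zero charge, with covariates added, rather than a single pair of moment estimates.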

My interpretation of FMM models is that they are two-part models, with one process for getting a zero and another for getting a positive outcome, which is not what is happening here.

Anyway, I was really just wanting someone to say that OLS with PROC HPREG is completely wrong (which is what I think, but it seems to work), or that you could use it but...

Regards

Chris
