Re: OLS model for large zero inflated dataset

Posted 09-24-2019 09:23 PM
(1002 views)

Hello SAS experts

In a couple of earlier threads I have been discussing modelling out-of-pocket costs for members of a health plan. There are a large number of zeros in this very large dataset (about 80% of 5.5 million observations), and the non-zero out-of-pocket costs are highly right-skewed (from a few dollars to $55,000).

So I've tried PROC HPFMM and PROC HPGENSELECT with Tweedie, gamma and ZINB distributions, although these are not really count data. The zeros are not a latent class either, as they don't come from a different process to that of the members who do get an out-of-pocket charge. So I don't think finite mixture models, hurdle models or zero-inflated models really work here.

Theoretically the Tweedie distribution should work, but interestingly plain ol' OLS with PROC HPREG seems to work best, with the lowest AIC of all these models by a long way; the raw residuals from the OLS model (attached) show the bimodal distribution quite well. Is this really valid, though, given the highly non-normal distribution of the outcome variable?

Modelling out-of-pocket cost as a binary outcome with PROC HPLOGISTIC works well, but I was trying to get a model that predicts the actual out-of-pocket amount, hence the attempts with a continuous outcome.

Thoughts appreciated.

Regards

Chris

4 REPLIES

I wouldn't want to send you on a wild chase but how about a finite mixture of constant(0) and exponential?

PG

Thanks for your help. See my reply to Dave.

Regards

Thanks both for your replies.

I tried the FMM models but got really weird predictions. Rick in an earlier post helped me with this.

He pointed out the following in the Details tab of the PROC HPFMM help information under the sub-tab "Log likelihood of the response distributions":

"While it is syntactically valid to mix a constant distribution with a continuous distribution, such as DIST=LOGNORMAL, such a mixture is not mathematically appropriate, because the constant log-likelihood is the log of a probability, while a continuous log-likelihood is the log of a probability density function. If you want to mix a constant distribution with a continuous distribution, you could model the constant as a very narrow continuous distribution, such as DIST=UNIFORM(c-delta, c+delta) for a small value delta. However, using PROC HPFMM to analyze such mixtures is sensitive to numerical inaccuracy and ultimately unnecessary. Instead, the following approach is mathematically equivalent and more numerically stable:

Estimate the mixing probability as the proportion of observations in the data set such that |y_i - c|< epsilon.

Estimate the parameters of the continuous distribution from the observations for which |y_i - c|>=epsilon. "
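A sketch of why the two quoted steps give the same answer as fitting the mixture directly. The notation is assumed here and is not taken from the SAS documentation: $\pi$ is the mixing probability of the point mass at $c$, $f(y;\theta)$ is the density of the continuous component, and $n_0$ is the number of observations within $\epsilon$ of $c$. The log-likelihood separates into a part that involves only $\pi$ and a part that involves only $\theta$, so each can be maximised on its own:

```latex
\begin{aligned}
z_i &= \mathbf{1}\{|y_i - c| < \epsilon\}, \qquad n_0 = \sum_{i=1}^{n} z_i,\\
\ell(\pi,\theta) &= \sum_{i=1}^{n}\Big[ z_i \log \pi + (1 - z_i)\big(\log(1-\pi) + \log f(y_i;\theta)\big)\Big]\\
&= n_0 \log \pi + (n - n_0)\log(1-\pi) + \sum_{i:\, z_i = 0} \log f(y_i;\theta),\\
\hat{\pi} &= \frac{n_0}{n}, \qquad
\hat{\theta} = \arg\max_{\theta} \sum_{i:\, z_i = 0} \log f(y_i;\theta).
\end{aligned}
```

With $c = 0$ and the data described in the original post, $\hat{\pi}$ would simply be the observed share of zeros, about 0.80.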

Sorry the equations won't copy. The link is

I wasn't sure how to code this suggestion. Using the uniform as they suggested didn't work either.
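One way the two quoted steps might be coded is sketched below. It is in Python rather than SAS, it is intercept-only (no covariates), and it assumes a lognormal for the continuous part purely because the quoted passage mentions DIST=LOGNORMAL. The function names `fit_two_step` and `mixture_mean` and the toy data are hypothetical, not part of any SAS procedure.

```python
import math


def fit_two_step(y, c=0.0, eps=1e-8):
    """Two-step fit: point mass at c plus a lognormal (an assumed choice).

    Step 1: mixing probability = share of observations with |y - c| < eps.
    Step 2: lognormal MLE from the remaining observations only.
    """
    rest = [v for v in y if abs(v - c) >= eps]
    if not rest:
        raise ValueError("no observations away from c")
    if min(rest) <= 0:
        raise ValueError("lognormal part needs strictly positive values")

    # Step 1: proportion of observations sitting at the constant c.
    p_mass = (len(y) - len(rest)) / len(y)

    # Step 2: MLE of the lognormal parameters (variance uses divisor n).
    logs = [math.log(v) for v in rest]
    mu = sum(logs) / len(logs)
    sigma2 = sum((t - mu) ** 2 for t in logs) / len(logs)
    return {"p_mass": p_mass, "mu": mu, "sigma2": sigma2}


def mixture_mean(fit, c=0.0):
    """Expected value of the fitted mixture: p*c + (1 - p)*E[lognormal]."""
    lognormal_mean = math.exp(fit["mu"] + fit["sigma2"] / 2.0)
    return fit["p_mass"] * c + (1.0 - fit["p_mass"]) * lognormal_mean


# Toy data: 80% zeros, two positive values (made up for illustration).
toy = [0.0] * 8 + [math.e, math.e ** 3]
toy_fit = fit_two_step(toy)
toy_mean = mixture_mean(toy_fit)
```

The same split carries over to a regression setting in principle: one model for whether the outcome is zero, and a separate model for the positive amounts fitted to the non-zero rows only. Whether that split suits this data is a separate question from how to code it.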

My interpretation of FMM models is that they are two-part models, with one process for getting a zero and another for getting a positive outcome, which is not what is happening here.

Anyway, I was really just wanting someone to say either that OLS with PROC HPREG is completely wrong (which is what I think, even though it seems to work), or that you could use it but....

Regards

Chris
