daltonchris7720
Calcite | Level 5

Hello SAS experts

 

In a couple of earlier threads I have been discussing modelling out-of-pocket costs for members of a health plan. There are a large number of zeros in this very large dataset (about 80% of 5.5 million observations), and the non-zero out-of-pocket amounts are highly right-skewed (from a few dollars to $55,000).

 

So I've tried PROC HPFMM and PROC HPGENSELECT with Tweedie, gamma, or ZINB distributions, even though these are not really count data (which ZINB assumes). The zeros are not a latent class either, as they don't come from a different process than the one for members who do get an out-of-pocket charge. So I don't think finite mixture models, hurdle models, or zero-inflated models really work here.

 

Theoretically the Tweedie distribution should work, but interestingly plain ol' OLS with PROC HPREG seems to work best, with the lowest AIC of all these models by a long way. The raw residuals from the OLS model (attached) show the bimodal distribution quite well. Is this really valid, though, given the highly non-normal distribution of the outcome variable?
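One practical check on the OLS question, offered as an illustration rather than a verdict: a linear model fitted to zero-heavy, right-skewed costs can return negative fitted values, which are impossible for an out-of-pocket amount. The minimal Python sketch below uses made-up toy data and a single hypothetical predictor `x`; nothing in it comes from the actual health plan data or from PROC HPREG output.

```python
def ols_fit(x, y):
    """Closed-form simple least squares; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up toy data (assumption): costs are mostly zero with one large value.
x = [0, 1, 2, 3, 4]
y = [0, 0, 0, 0, 100]

intercept, slope = ols_fit(x, y)
fitted = [intercept + slope * xi for xi in x]

# Count fitted values that fall below zero, i.e. impossible costs.
n_negative = sum(f < 0 for f in fitted)
print(fitted, n_negative)
```

Counting negative predictions on the real data in the same way would show whether this matters in practice for the fitted HPREG model.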

 

Modelling out-of-pocket as a binary outcome with PROC HPLOGISTIC works well, but I was trying to get a model that predicts the actual out-of-pocket amount, hence the attempts with a continuous outcome.

 

Thoughts appreciated.

 

Regards

 

Chris

 

4 REPLIES
PGStats
Opal | Level 21

I wouldn't want to send you on a wild chase but how about a finite mixture of constant(0) and exponential?
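For a mixture of a point mass at zero and an exponential with no covariates, the maximum likelihood estimates have closed forms: the zero probability is the observed share of zeros, and the exponential rate is the reciprocal of the mean of the positive values. The Python sketch below illustrates that on made-up toy values; the function names are hypothetical, and a real analysis would let both parts depend on member characteristics.

```python
def fit_zero_exponential(costs):
    """Fit a point mass at 0 mixed with an exponential (no covariates)."""
    positives = [y for y in costs if y > 0]
    p_zero = 1 - len(positives) / len(costs)    # share of zeros
    mean_pos = sum(positives) / len(positives)  # exponential MLE: rate = 1 / mean
    return p_zero, 1.0 / mean_pos


def expected_cost(p_zero, rate):
    """Overall mean: P(positive) times the exponential mean."""
    return (1 - p_zero) / rate


# Made-up toy values (assumption): 80% zeros, two positive costs.
toy = [0, 0, 0, 0, 0, 0, 0, 0, 50, 150]
p_zero, rate = fit_zero_exponential(toy)
print(p_zero, rate, expected_cost(p_zero, rate))
```

The implied overall mean reproduces the sample mean of the toy data, which is a quick sanity check on the two estimates.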

PG
daltonchris7720
Calcite | Level 5
Thanks for your help. See my reply to Dave.
Regards
StatDave
SAS Super FREQ

A zero-inflated gamma model, which can be done in PROC FMM, would also be a possibility. It would be appropriate for positive, right-skewed, continuous data with a point mass at zero.

daltonchris7720
Calcite | Level 5

Thanks both for your replies.

I tried the FMM models but got really weird predictions. Rick in an earlier post helped me with this.

He pointed out the following in the Details tab of the PROC HPFMM help information under the sub-tab "Log likelihood of the response distributions":

"While it is syntactically valid to mix a constant distribution with a continuous distribution, such as DIST=LOGNORMAL, such a mixture is not mathematically appropriate, because the constant log-likelihood is the log of a probability, while a continuous log-likelihood is the log of a probability density function. If you want to mix a constant distribution with a continuous distribution, you could model the constant as a very narrow continuous distribution, such as DIST=UNIFORM(c-delta, c+delta) for a small value delta. However, using PROC HPFMM to analyze such mixtures is sensitive to numerical inaccuracy and ultimately unnecessary. Instead, the following approach is mathematically equivalent and more numerically stable:
1. Estimate the mixing probability as the proportion of observations in the data set such that |y_i - c| < epsilon.
2. Estimate the parameters of the continuous distribution from the observations for which |y_i - c| >= epsilon."

 

Sorry the equations won't copy. The link is

http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_hpfmm_detail...

I wasn't sure how to code this suggestion. Using the uniform as they suggested didn't work either.
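The two documented steps can be sketched outside PROC HPFMM. The Python example below is a minimal illustration under stated assumptions: the constant is c = 0, the continuous part is lognormal (the distribution the quoted passage names), there are no covariates, and the data are made-up toy values. The function names `two_step_fit` and `mixture_mean` are hypothetical, not part of any SAS procedure.

```python
import math


def two_step_fit(y, c=0.0, eps=1e-8):
    """Step 1: mixing probability; step 2: lognormal MLE on the remaining values."""
    rest = [v for v in y if abs(v - c) >= eps]
    mix_prob = 1 - len(rest) / len(y)       # proportion with |y_i - c| < eps
    logs = [math.log(v) for v in rest]      # assumes the remaining values are > 0
    mu = sum(logs) / len(logs)
    sigma2 = sum((l - mu) ** 2 for l in logs) / len(logs)  # MLE divides by n
    return mix_prob, mu, sigma2


def mixture_mean(mix_prob, mu, sigma2, c=0.0):
    """Mean of the mixture: constant part plus the lognormal mean."""
    return mix_prob * c + (1 - mix_prob) * math.exp(mu + sigma2 / 2)


# Made-up toy values (assumption): eight zeros and two positive costs.
toy = [0.0] * 8 + [math.exp(1.0), math.exp(3.0)]
mix_prob, mu, sigma2 = two_step_fit(toy)
print(mix_prob, mu, sigma2, mixture_mean(mix_prob, mu, sigma2))
```

With covariates, the same split would presumably mean one model for zero versus non-zero on all observations and a separate continuous model fitted only to the non-zero observations.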

My interpretation of FMM models is that they are two-part models, with one process for getting a zero and another for getting a positive outcome, which is not what is happening here.

Anyway, I was really just wanting someone to say either that OLS with PROC HPREG is completely wrong (which is what I think, although it seems to work), or that you could use it but....

Regards

Chris
