SSK_011523
Calcite | Level 5

Hi everyone, 

 

I am conducting a cost analysis using claims data that contain a large number of zeros. I would like to use a two-part model, but I am not familiar with the code. I have run the modified Park test, which indicates that a gamma distribution with a log link fits best.

As far as I know, the first step is to fit a logistic regression to a binary indicator (0 if the cost is zero, 1 if the cost is non-zero) and get predicted probabilities from this model. I am not sure what to do next once I have those predicted values.

 

I referred to this article - 

https://support.sas.com/resources/papers/proceedings15/3600-2015.pdf

 

Any help on this would be appreciated. Thanks!

 

 

3 REPLIES
StatDave
SAS Super FREQ

See this note on modeling continuous response data with many zeros. The Tweedie distribution is commonly used since it can accommodate positive data with many zeros. Also, as mentioned in the note, PROC HPCDM in SAS/ETS fits a compound model to loss data consisting of both probability and amount of loss.

SSK_011523
Calcite | Level 5

Hi 

 

Thank you so much for your response. 

 

I don't think I can use the Tweedie distribution when my data follow a gamma distribution (variance proportional to the square of the mean), as indicated by the modified Park test. Most studies analyzing health care costs have used a GLM (gamma distribution with log link) because it accommodates heteroskedasticity.

 

This article explains the two-part model on page 490 but doesn't provide SAS code: https://www.annualreviews.org/doi/pdf/10.1146/annurev-publhealth-040617-013517

 

 

 

 

StatDave
SAS Super FREQ

The Tweedie model also allows for heteroscedasticity - as with gamma, the variance is a function of the mean. It is a compound model combining Poisson and gamma - see the "Details: Tweedie Distribution for Generalized Linear Models" section of the GENMOD documentation. Also, the compound model fit by PROC HPCDM *is* a two-part model for frequency of loss and severity of loss - the frequency model can be a Poisson, negative binomial, or zero-inflated model, and the loss model is automatically selected as the best fitting among many continuous distributions, including gamma. See the Getting Started example in the PROC HPCDM documentation.
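
As a rough sketch of the Tweedie alternative, the GENMOD call could look like the following. The data set name claims and the variables cost, age, and female are hypothetical placeholders, not names from the original post. For reference, the Tweedie variance is phi * mu**p: a power p between 1 and 2 gives the compound Poisson-gamma case with a point mass at zero, while p = 2 is the gamma variance function.

```sas
/* Sketch only: CLAIMS, COST, AGE and FEMALE are hypothetical names */
proc genmod data=claims;
   /* Zero costs stay in the data; the Tweedie family allows a point mass at zero */
   model cost = age female / dist=tweedie link=log;
run;
```

The fitted model should report an estimate of the power parameter, which can be compared with the value near 2 that the modified Park test suggested for the positive costs.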

 

The paper you refer to doesn't seem to suggest a specific model. It suggests first fitting a logit or probit model to the binary response of zero versus positive cost, and then fitting a GLM with a chosen distribution to the positive cost responses. The first-part logit or probit model can be fit in PROC LOGISTIC or PROC GENMOD. The second-part GLM for the continuous response can be fit using PROC GENMOD. PROC SEVERITY in SAS/ETS can automatically select the best distribution (as is done with HPCDM). The ASSESS statement in GENMOD allows for testing the adequacy of the link function and the specified form of the model.
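
A minimal sketch of the two parts described above, under stated assumptions: the data set claims and the variables cost, age, and female are hypothetical, and scoring with PROC PLM is just one convenient way to get gamma predictions for every observation, not something the paper prescribes. The step after obtaining the predicted probabilities is to multiply them by the predicted mean from the positive-cost model, since E[cost | x] = Pr(cost > 0 | x) * E[cost | cost > 0, x].

```sas
/* Sketch only: CLAIMS, COST, AGE and FEMALE are hypothetical names */
data claims2;
   set claims;
   anycost = (cost > 0);          /* 1 if any cost, 0 otherwise */
run;

/* Part 1: logistic model for the probability of a positive cost */
proc logistic data=claims2;
   model anycost(event='1') = age female;
   output out=part1 p=p_any;      /* predicted Pr(cost > 0) for each row */
run;

/* Part 2: gamma GLM with log link, fit to positive costs only */
proc genmod data=claims2;
   where cost > 0;
   model cost = age female / dist=gamma link=log;
   store gammafit;                /* save the fit for scoring */
run;

/* Score all observations, including zero-cost ones, on the cost scale */
proc plm restore=gammafit;
   score data=part1 out=scored predicted=mu_pos / ilink;
run;

/* Combine the two parts into the unconditional expected cost */
data twopart;
   set scored;
   exp_cost = p_any * mu_pos;
run;
```

Neither procedure produces standard errors for the combined prediction exp_cost; bootstrapping both parts together is one common option.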

 

For count responses, it mentions the two-part hurdle model which can be fit with PROC FMM (see this note). A similar two-part count model is the zero-inflated model which can also be fit in FMM (see "Getting Started: Modeling Zero-Inflation" in the FMM documentation) and in PROC GENMOD and PROC GAMPL. See the "Details: Zero-inflated models" section in the GENMOD documentation. FMM could be used in a similar way, as shown for the hurdle and zero-inflated models, to fit a zero-inflated gamma model.
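
For the count-response case, a zero-inflated Poisson model in PROC GENMOD can be sketched as follows. The count variable visits and the covariates are hypothetical and only illustrate the syntax; they are not part of the original question, which concerns continuous costs.

```sas
/* Sketch only: CLAIMS, VISITS, AGE and FEMALE are hypothetical names */
proc genmod data=claims;
   /* Count part: Poisson mean with log link */
   model visits = age female / dist=zip link=log;
   /* Zero-inflation part: logit model for the probability of an excess zero */
   zeromodel age female / link=logit;
run;
```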
