Hello
I have some questions regarding the output from PROC FMM, which I am new to.
I am using claimAmount (a health insurance claim) as the outcome variable. Most of the values equal zero, so the dataset (a random sample of a larger dataset) is heavily zero-inflated. I'm trying the FMM procedure with a lognormal and a constant distribution, as in the code below.
proc fmm data=log_claim gconv=0;
   class policy gender Diag;
   model claimAmount = age policy gender Diag / dist=Lognormal;
   model claimAmount = / dist=constant;
   probmodel age policy gender Diag;
   output out=out pred xbeta;
run;
My questions are:
1. The predictions in the output dataset seem odd, which I think is because of the small number of non-zero claim amounts in this smaller random sample. Is this correct?
2. What do the estimates for the mixing probabilities actually mean, and how do we interpret them in relation to the lognormal estimates?
Thanks for any advice.
Regards
Chris
I believe the correct syntax for a zero-inflated model is to use
MODEL + / dist=constant;
as the second MODEL statement.
See the documentation example for the zero-inflated Poisson distribution, which also describes how to interpret the output from the procedure. In particular, the mixing probabilities are the estimated proportion of the density that is attributed to each distribution. For example, if the mixing probabilities are estimated by 0.9 and 0.1, that means that the mixture density is 0.9*LN(x) + 0.1*C_0(x), where LN is the lognormal density function and C_0(x) is the point-mass function at zero, which has the value 1 if x=0 and the value 0 otherwise.
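As a minimal sketch of that syntax, here is the original call with only the second MODEL statement changed and the PROBMODEL statement left out, so that two overall mixing probabilities are estimated. The data set and variable names are taken from the question; whether this mixture is appropriate for the data is a separate matter.

```sas
/* Sketch only: lognormal component plus a constant (zero) component. */
proc fmm data=log_claim;
   class policy gender Diag;
   /* Component 1: lognormal regression for the claim amount */
   model claimAmount = age policy gender Diag / dist=lognormal;
   /* Component 2: "+" adds a component for the same response */
   model + / dist=constant;
run;
```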
Thanks Ric.
I'm still getting odd predictions using the "+" in the second model statement and the estimates for the mixing probabilities show only component 1 (attached). Does this mean the model did not work or mix properly? Perhaps this is why the predictions seem odd.
Regards
Chris
I have two comments/questions:
1. Are you sure that you understand what the PROBMODEL statement is doing? I've never used it myself because I don't understand it. Unless you understand how to use that statement, I suggest that you delete the statement so that your model contains two mixing probabilities, one for the LN component and one for the zero component.
2. When you say you are getting "odd predictions," how are you judging them? Why do you think they are odd? If you are merely graphing the predicted values versus a continuous explanatory variable (such as Age), then the graph will look strange because of the presence of the CLASS variables. For an example and a discussion, see "Visualize multivariate regression models by slicing continuous variables."
Because you have used the PROBMODEL statement, you are saying that the mixing probabilities depend upon the estimated parameters from those effects (i.e., age, policy, gender, diag). Since you only have 2 components, and the sum of the mixing probabilities is 1, it is only necessary to know the estimates for the first component.
Those estimates do not, per se, have anything to do with the lognormal component estimates. They are simply estimates that enable you to calculate the mixing probabilities between the two components.
Can you be more specific about the predictions being odd?
To expand on Rob's comments, when you use the PROBMODEL statement, you no longer have two mixing probabilities. You get (potentially) one for every observation. (If you use only CLASS variables on the PROBMODEL statement, you get a set of mixing probabilities for each joint level.) Perhaps this is why you say the predictions are odd?
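To make the "one for every observation" point concrete, here is a sketch of the mixing probability. It assumes the default logit link for a two-component model. The symbols are notation introduced here, not names from the procedure output: z_i is the row of PROBMODEL effects for observation i and gamma is the vector of PROBMODEL estimates.

```latex
\pi_{i1} = \frac{\exp(z_i^{\top}\gamma)}{1 + \exp(z_i^{\top}\gamma)},
\qquad
\pi_{i2} = 1 - \pi_{i1}
```

Because z_i changes from one observation to the next, so does the mixing probability. This is also why the output lists estimates for component 1 only: the second probability follows from the first.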
Thanks Ric and Rob for your helpful advice.
Leaving out the PROBMODEL statement gives the overall mixing probabilities for the 2 components. I think I understand Rob's comment about what happens when we leave the PROBMODEL statement in: the mixing probabilities then depend on the effects, so they vary across observations.
The predictions are odd because you get quite large predicted values of claimAmount when it was zero in the data, but I guess this is how the model mixes the components, with higher probability from the lognormal component.
I have attached the output dataset for you. The linear predictors (Linp_1 and Linp_2) make sense, but the predictions do not.
Regards
Chris
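One possible reading of those predictions, offered as a sketch: assume the predicted value reported by the OUTPUT statement is the overall mixture mean on the data scale. Here pi_i is the mixing probability of the lognormal component, mu_i is its linear predictor, and sigma^2 is its scale parameter. These symbols are notation introduced here.

```latex
\mathrm{E}[Y_i] = \pi_i \exp\!\left(\mu_i + \tfrac{\sigma^2}{2}\right) + (1 - \pi_i)\cdot 0
```

Under that assumption the prediction is an expected value, so it is positive whenever pi_i > 0, even for an observation whose recorded claim is zero.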
I discussed this question with a colleague who knows much more about finite mixture models than I do. He pointed out a section of the documentation that discusses the appropriateness of mixing a constant (discrete) distribution with a continuous distribution such as the lognormal. In the section that describes the CONSTANT distribution, the doc says
Although it is syntactically valid to mix a constant distribution with a continuous distribution, such as DIST=LOGNORMAL, such a mixture is not mathematically appropriate, because the constant log likelihood is the logarithm of a probability, whereas a continuous log likelihood is the logarithm of a probability density function.
The doc goes on to describe a general approach that is "mathematically equivalent" and "numerically stable."
Regarding the PROBMODEL statement, my colleague said, "If the person asking the question is keen on using the PROBMODEL to distinguish the components, they can use a logistic model with effects to do so, and then model the non-zero responses with a log normal."
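A rough sketch of that two-part approach follows, assuming the data set and variable names from the original question. The names claims2, isZero, and logClaim are hypothetical, and fitting a normal model to log(claimAmount) stands in for the lognormal fit.

```sas
/* Sketch only: two-part model, zero vs. non-zero, then lognormal. */
data claims2;
   set log_claim;
   where not missing(claimAmount);
   isZero = (claimAmount = 0);          /* 1 if the claim is zero */
   if claimAmount > 0 then logClaim = log(claimAmount);
run;

/* Part 1: logistic model for the probability of a zero claim */
proc logistic data=claims2;
   class policy gender Diag / param=glm;
   model isZero(event='1') = age policy gender Diag;
run;

/* Part 2: normal model for log(claim), non-zero claims only */
proc glm data=claims2;
   where claimAmount > 0;
   class policy gender Diag;
   model logClaim = age policy gender Diag / solution;
run;
quit;
```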
Thanks again Ric.
I'll read this documentation and see if I can come up with something more robust.
I appreciate your interest in this issue, and it's very valuable having such expertise readily available.
Regards
Chris
Hi Ric
The link mentioned above seems to be broken; which document were you referring to?
Regards
Chris
Hmmm, strange. Sorry about that. Here is a link that ought to work
FMM Procedure: Log-Likelihood Functions for Response Distributions
Thanks Ric.
Got that.
Where it says
"Instead, the following approach is mathematically equivalent and more numerically stable:
Estimate the mixing probability P(Y=c) as the proportion of observations in the data set such that |y_i-c|<epsilon.
Estimate the parameters of the continuous distribution from the observations for which |y_i-c|>=epsilon"
How do we code that, or change my code above to do that?
(BTW, I'm doing all this for a university assignment, not for commercial purposes.)
Regards
Chris
Hi Ric
I should perhaps clarify that this university assignment is a "Work Placement Project", where we are expected to get outside help, as it is supposed to be an exercise in handling "real world data".
So consulting this community group is not "cheating" in any way.
Regards
Chris
What have you tried? The doc tells you exactly what you should do:
1. "Estimate the mixing probability P(Y=c) as the proportion of observations in the data set such that |y_i-c|<epsilon." This means count the number of observations for which the condition is true and divide by the total number of observations. I suggest that you create a binary indicator variable for the condition and compute the mean of the indicator variable.
2. "Estimate the parameters of the continuous distribution from the observations for which |y_i-c|>=epsilon" This says to use only the observations that do not satisfy the condition. In SAS, you can use a WHERE clause to filter the observations. If B is the binary indicator variable from (1), then
WHERE B=0
is a WHERE clause that restricts the observations to those that do not satisfy the condition.
If you have questions, I suggest you discuss the problem with your professor.
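The two steps above can be sketched as follows, assuming c = 0 and the data set and variable names from the original question. The macro variables c and eps, the tolerance value, and the data set name claims_ind are hypothetical choices for illustration.

```sas
/* Sketch only: the tolerance and names below are assumptions. */
%let c   = 0;       /* the constant value */
%let eps = 1e-8;    /* tolerance for "equal to c" */

/* Step 1: binary indicator B = 1 when |y_i - c| < epsilon */
data claims_ind;
   set log_claim;
   where not missing(claimAmount);
   B = (abs(claimAmount - &c) < &eps);
run;

/* The mean of B estimates the mixing probability P(Y = c) */
proc means data=claims_ind n mean;
   var B;
run;

/* Step 2: fit the lognormal to the remaining observations only */
proc fmm data=claims_ind;
   where B = 0;
   class policy gender Diag;
   model claimAmount = age policy gender Diag / dist=lognormal;
run;
```

With a single MODEL statement and no added components, PROC FMM fits a one-component model, so the second step is an ordinary lognormal regression on the non-zero claims.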
Thanks Ric.
My professor's suggestion was to consult widely (such as this site), as finite mixture models are somewhat unfamiliar to him too!
So I'll keep playing with this.
Thanks for your help so far.
Regards
Chris