06-15-2017 12:06 PM - edited 06-15-2017 12:09 PM
I have a dataset of yearly historical data containing independent variables and three dependent variables: claim occurrence (y/n), claim frequency (number of claims during the year, if any occurred), and average amount per claim (if any occurred). Claims occurred in only a small share of cases (6-7%). According to my internet research, claim frequency usually follows a Poisson distribution and claim amount a gamma distribution. However, this does not seem to hold in my case. I used the HP GLM node in Enterprise Miner 14.1 with several distributions for claim frequency (Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial) and a gamma distribution for claim amount, with a log link function in both cases. I also tried interaction and polynomial terms, as well as different selection procedures (backward and stepwise).
The resulting models do not seem good at all. The claim frequency models always predict a number close to zero (the highest is around 0.30; how can they even predict decimal values when the dependent variable is supposed to be an integer?), and the claim amount models predict very poorly; on average their predictions are lower than the real values.
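To make the decimal-prediction puzzle concrete: as far as I understand it, a Poisson GLM with a log link returns the expected count exp(x·β) rather than an integer outcome. Below is a minimal sketch in plain Python, not what the HP GLM node does internally. The simulated data, the assumed 7% claim rate, the intercept-only model, and the function names are all hypothetical and only for illustration.

```python
import math
import random

def simulate_claim_counts(n, rate, seed=0):
    """Draw n Poisson counts with mean `rate` (Knuth's multiplication method)."""
    rng = random.Random(seed)
    threshold = math.exp(-rate)
    counts = []
    for _ in range(n):
        k, p = 0, rng.random()
        while p > threshold:
            k += 1
            p *= rng.random()
        counts.append(k)
    return counts

def fit_intercept_only_poisson(counts, iters=25):
    """Newton-Raphson for beta0 in the model log(mu) = beta0.

    Assumes at least one non-zero count, otherwise the MLE does not exist.
    """
    n = len(counts)
    total = sum(counts)
    beta0 = 0.0
    for _ in range(iters):
        mu = math.exp(beta0)
        # Score is (total - n*mu), Hessian is (-n*mu); this is the Newton step.
        beta0 += (total - n * mu) / (n * mu)
    return beta0

counts = simulate_claim_counts(10_000, rate=0.07)  # assumed claim rate
beta0 = fit_intercept_only_poisson(counts)
print("sample mean of counts :", sum(counts) / len(counts))
print("model prediction exp(b):", math.exp(beta0))
```

In this sketch the prediction equals the sample mean of the counts, which is a small decimal whenever most rows are zero. If that reading is right, a prediction near zero is an expected number of claims, not a forecast that a fraction of a claim will occur.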
Could you please help me find what I am missing? Should I undersample before fitting the models in order to increase the share of claims? Am I setting something wrong in the HP GLM node? Should I abandon GLMs altogether and try different predictive models? I understand that I could use classification models to predict claim occurrence, but I have no idea what other models could be used to predict claim frequency (a number between 0 and 3) and claim amount. When I ignore the zero values, the remaining amounts follow a log-normal distribution; could I leverage that somehow? When the zero values are included, I can't identify a proper distribution; it's not gamma, as it usually is.
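To show what I mean by leveraging the log-normal shape, here is a rough sketch of a two-part calculation: a claim probability from some classifier, multiplied by a log-normal mean fitted only on the non-zero amounts. I am not sure this is the right approach. The function names and the example numbers are hypothetical, and the sketch ignores the independent variables entirely.

```python
import math
import statistics

def lognormal_mean(amounts):
    """Estimate E[amount | claim occurred], assuming positive amounts are log-normal.

    Uses E[X] = exp(mu + sigma^2 / 2), where mu and sigma^2 are the mean and
    variance of log(amount). Zero amounts are skipped.
    """
    logs = [math.log(a) for a in amounts if a > 0]
    mu = statistics.fmean(logs)
    sigma2 = statistics.pvariance(logs, mu)
    return math.exp(mu + sigma2 / 2.0)

def expected_cost(p_claim, amounts):
    """Two-part estimate: P(claim) times the expected amount given a claim."""
    return p_claim * lognormal_mean(amounts)

# Hypothetical numbers: a 7% claim probability and a handful of observed amounts.
observed = [0.0, 0.0, 50.0, 200.0, 0.0, 120.0]
print(expected_cost(0.07, observed))
```

One thing the sketch makes visible: exp(mu) alone is the median of a log-normal distribution, which lies below its mean, so back-transforming a model fitted on log(amount) without the sigma^2 / 2 term would predict too low on average.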
Any input would be much appreciated. Thanks a lot.