Re: Hosmer-Lemeshow Statistic too high

Payal · Posted 06-28-2012 02:41 PM

I have a dataset with about 10 independent variables and one dichotomous dependent variable. I have done most of the EDA on the dataset: removing extreme values, standardizing input variables, imputing missing values, testing for collinearity, etc. Regardless of how much I clean my data, my logit model keeps failing the HL goodness-of-fit test. The area under the ROC curve is good at .82, outliers were removed after I checked the leverage and displacement plots (among others), and the other association stats look pretty decent. I can't seem to figure out why HL is so bad. I even sorted the input dataset several different ways to see if the grouping was the culprit, to no avail. Any ideas?

PGStats · Posted 06-28-2012 03:09 PM

What does the "Partition for the Hosmer and Lemeshow Test" table look like? - PG

PG

Payal · Posted 06-28-2012 03:20 PM

Partition for the Hosmer and Lemeshow Test

                      RESPONSE = 1            RESPONSE = 0
Group     Total   Observed    Expected    Observed    Expected
    1    112864          0        0.32      112864    112863.7
    2    105659       1307     1357.58      104352    104301.4
    3    105655       3012     3226.71      102643    102428.3
    4    105653       5595     4837.57      100058    100815.4
    5    105654       5013     6759.97      100641    98894.03
    6    105656       6655     9208.02       99001    96447.98
    7    105635      11859    12363.08       93776    93271.92
    8    105655      18983    16635.13       86672    89019.87
    9    105656      29246    24974.59       76410    80681.41
   10     98452      45220    47534.53       53232    50917.47

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
 2990.1738     8        <.0001

That's the output from SAS; the column spacing may not have survived pasting. I have standardized the inputs within 3 standard deviations. The dataset is large, with lots of observations, but the response rate is quite low.
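The reported chi-square can be reproduced directly from the partition table. The sketch below uses Python rather than SAS, purely as an illustration. It assumes the usual Hosmer-Lemeshow form: the sum, over all groups and both response levels, of (observed - expected)^2 / expected, with g - 2 degrees of freedom for g groups. The function names are made up for this example, and because the expected counts in the table are rounded as printed, the result should agree with SAS only approximately.

```python
import math

# (group, observed events, expected events, observed non-events, expected non-events)
# Copied from the partition table; expected counts are rounded as printed.
PARTITION = [
    (1, 0, 0.32, 112864, 112863.7),
    (2, 1307, 1357.58, 104352, 104301.4),
    (3, 3012, 3226.71, 102643, 102428.3),
    (4, 5595, 4837.57, 100058, 100815.4),
    (5, 5013, 6759.97, 100641, 98894.03),
    (6, 6655, 9208.02, 99001, 96447.98),
    (7, 11859, 12363.08, 93776, 93271.92),
    (8, 18983, 16635.13, 86672, 89019.87),
    (9, 29246, 24974.59, 76410, 80681.41),
    (10, 45220, 47534.53, 53232, 50917.47),
]


def hl_statistic(rows):
    """Sum (O - E)^2 / E over both response levels in every group."""
    return sum(
        (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
        for _, o1, e1, o0, e0 in rows
    )


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


stat = hl_statistic(PARTITION)
df = len(PARTITION) - 2          # g - 2 degrees of freedom
p_value = chi2_sf_even_df(stat, df)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.4g}")
```

The statistic comes out close to the 2990.1738 that SAS reports, on 8 degrees of freedom, so the test result follows from the table itself rather than from any quirk of the grouping.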

PGStats · Posted 06-28-2012 05:04 PM

With that many observations, it is almost impossible to obtain a good fit to any distribution with real-life data; even the smallest discrepancy is easily detected. Looking at the table (a deficit of responses at low probabilities, a surplus at higher ones), my guess is that you would get a better fit with LINK=PROBIT. But don't expect a miracle. - PG

PG
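To see the pattern PG is reading off the table, it can help to list, for each group, the gap between observed and expected events and how much that group adds to the chi-square. This is a minimal Python sketch (not SAS) that uses only the counts posted in the partition table; `group_diagnostics` is a hypothetical helper name.

```python
# (group, observed events, expected events, observed non-events, expected non-events)
# Copied from the partition table; expected counts are rounded as printed.
PARTITION = [
    (1, 0, 0.32, 112864, 112863.7),
    (2, 1307, 1357.58, 104352, 104301.4),
    (3, 3012, 3226.71, 102643, 102428.3),
    (4, 5595, 4837.57, 100058, 100815.4),
    (5, 5013, 6759.97, 100641, 98894.03),
    (6, 6655, 9208.02, 99001, 96447.98),
    (7, 11859, 12363.08, 93776, 93271.92),
    (8, 18983, 16635.13, 86672, 89019.87),
    (9, 29246, 24974.59, 76410, 80681.41),
    (10, 45220, 47534.53, 53232, 50917.47),
]


def group_diagnostics(rows):
    """Return (group, observed - expected events, chi-square contribution) per group."""
    out = []
    for g, o1, e1, o0, e0 in rows:
        gap = o1 - e1
        contribution = (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
        out.append((g, gap, contribution))
    return out


diagnostics = group_diagnostics(PARTITION)
for g, gap, contribution in diagnostics:
    label = "surplus" if gap > 0 else "deficit"
    print(f"group {g:2d}: {gap:+10.2f} events ({label}), contributes {contribution:8.2f}")
```

A negative gap means the model predicted more events than were observed in that group. In this table, groups 4, 8 and 9 show a surplus of events and the remaining groups a deficit, with group 9 contributing the most to the statistic.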

Doc_Duke · Posted 06-28-2012 05:07 PM

Payal,

The problem is that you are confusing statistical significance with practical significance. When you have about 1 million observations, everything is "significant" because the confidence intervals are so narrow. P-values are pretty much meaningless with that sample size. You have to look at the data and determine what is "meaningful".

Doc Muhlbaier

Duke
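The sample-size point can be made concrete with a little arithmetic. If every observed and expected count in the partition table were divided by the same factor c, each term (O - E)^2 / E would also be divided by c, so for a fixed relative misfit the statistic is proportional to the number of observations. The Python sketch below (an illustration, not SAS output) takes the reported chi-square of 2990.1738 on 8 DF and asks what the same relative misfit would give at a hypothetical one-thousandth of the sample, roughly 1,000 observations. The shrink factor and the function names are assumptions made for this example.

```python
import math

REPORTED_CHI_SQUARE = 2990.1738  # from the SAS output posted in this thread
DF = 8


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def rescaled_p_value(chi_square, df, shrink_factor):
    """p-value if all counts were divided by shrink_factor, proportions unchanged."""
    return chi2_sf_even_df(chi_square / shrink_factor, df)


p_full = rescaled_p_value(REPORTED_CHI_SQUARE, DF, 1)
p_small = rescaled_p_value(REPORTED_CHI_SQUARE, DF, 1000)  # hypothetical n of about 1,000
print(f"full sample:          p = {p_full:.4g}")
print(f"1/1000 of the sample: p = {p_small:.2f}")
```

Under these assumptions the same proportional discrepancies that are overwhelmingly "significant" at a million observations would give a p-value above 0.9 at about a thousand, which is why the size of the misfit matters more here than the p-value.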
