Help using Base SAS procedures

Hosmer Lemeshow Statistic too high

Occasional Contributor
Posts: 10

Hosmer Lemeshow Statistic too high

I have a dataset with about 10 independent variables and one dichotomous dependent variable.  I have done most of the EDA on the dataset: removing extreme values, standardizing input variables, imputing missing values, testing for collinearity, etc.  Regardless of how much I clean my data, my logit model keeps failing the Hosmer-Lemeshow (HL) goodness-of-fit test.  The area under the ROC curve is good at 0.82, outliers were removed after I checked the leverage, displacement, etc. plots, and the other association stats look pretty decent.  I can't seem to figure out why HL is so bad.  I even sorted the input dataset several different ways to see if the grouping was the culprit, to no avail.  Any ideas?

Respected Advisor
Posts: 4,936

Re: Hosmer Lemeshow Statistic too high

What does the "Partition for the Hosmer and Lemeshow Test" table look like?

PG
Occasional Contributor
Posts: 10

Re: Hosmer Lemeshow Statistic too high

                              Partition for the Hosmer and Lemeshow Test

                                             RESPONSE = 1            RESPONSE = 0
                    Group       Total    Observed    Expected    Observed    Expected

                        1      112864           0        0.32      112864    112863.7
                        2      105659        1307     1357.58      104352    104301.4
                        3      105655        3012     3226.71      102643    102428.3
                        4      105653        5595     4837.57      100058    100815.4
                        5      105654        5013     6759.97      100641    98894.03
                        6      105656        6655     9208.02       99001    96447.98
                        7      105635       11859    12363.08       93776    93271.92
                        8      105655       18983    16635.13       86672    89019.87
                        9      105656       29246    24974.59       76410    80681.41
                       10       98452       45220    47534.53       53232    50917.47


                               Hosmer and Lemeshow Goodness-of-Fit Test

                                  Chi-Square       DF     Pr > ChiSq

                                   2990.1738        8         <.0001

The spacing is off, but that's the output from SAS.  I have standardized the inputs within 3 standard deviations.  The dataset is robust, with lots of observations, but the response rate is quite low.

Respected Advisor
Posts: 4,936

Re: Hosmer Lemeshow Statistic too high

With that many observations, it is almost impossible to obtain a good fit to any distribution with real-life data; even the smallest discrepancy is easily detected. Looking at the table (a deficit of responses at low probabilities, a surplus at higher probabilities), I would guess that you would get a better fit with LINK=PROBIT. But don't expect a miracle :)
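
For anyone who wants to see where the statistic comes from, here is a minimal Python sketch (not SAS) that recomputes the HL chi-square from the partition table posted above. It assumes the usual HL form, summing (observed - expected)^2 / expected over events and non-events in each group, and takes the expected non-events as total minus expected events. The names `PARTITION` and `hl_contributions` are made up for this illustration.

```python
# Partition table copied from the SAS output posted in this thread:
# (group, total, observed events, expected events)
PARTITION = [
    (1, 112864, 0, 0.32),
    (2, 105659, 1307, 1357.58),
    (3, 105655, 3012, 3226.71),
    (4, 105653, 5595, 4837.57),
    (5, 105654, 5013, 6759.97),
    (6, 105656, 6655, 9208.02),
    (7, 105635, 11859, 12363.08),
    (8, 105655, 18983, 16635.13),
    (9, 105656, 29246, 24974.59),
    (10, 98452, 45220, 47534.53),
]


def hl_contributions(rows):
    """Return (group, observed - expected events, chi-square contribution)."""
    out = []
    for group, total, obs1, exp1 in rows:
        # Non-event counts follow from the group total (assumption: the
        # table's RESPONSE = 0 columns are total minus RESPONSE = 1).
        obs0 = total - obs1
        exp0 = total - exp1
        contrib = (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
        out.append((group, obs1 - exp1, contrib))
    return out


contribs = hl_contributions(PARTITION)
hl_stat = sum(c for _, _, c in contribs)

for group, diff, c in contribs:
    print(f"group {group:2d}: obs - exp = {diff:9.2f}   contribution = {c:8.2f}")
print(f"HL chi-square = {hl_stat:.2f}")
```

The total should land close to the 2990.17 that SAS reports; small differences come from the rounded expected counts in the printed table. The per-group contributions also show which groups drive the statistic.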

PG
Trusted Advisor
Posts: 2,116

Re: Hosmer Lemeshow Statistic too high

Payal,

The problem is that you are confusing statistical significance with practical significance.  When you have 1 million observations, everything is "significant" because the confidence intervals are so narrow.  P-values are pretty much meaningless with that sample size.  You have to look at the data and determine what is "meaningful".
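
A rough way to see the sample-size effect is a thought experiment in Python (not SAS): keep the observed and expected proportions from the posted partition table exactly as they are, but shrink every count by a constant factor. The chi-square statistic shrinks by the same factor, so the same relative misfit stops being "significant". This is only a sketch under stated assumptions: a real subsample would add sampling noise, and very small expected counts (group 1 here) strain the chi-square approximation. `hl_statistic` and `chi2_sf_even_df` are hypothetical helper names; the tail probability uses the closed form that holds only for even degrees of freedom (DF = 8 in the output above).

```python
import math

# (total, observed events, expected events) per group, from the SAS output
PARTITION = [
    (112864, 0, 0.32),
    (105659, 1307, 1357.58),
    (105655, 3012, 3226.71),
    (105653, 5595, 4837.57),
    (105654, 5013, 6759.97),
    (105656, 6655, 9208.02),
    (105635, 11859, 12363.08),
    (105655, 18983, 16635.13),
    (105656, 29246, 24974.59),
    (98452, 45220, 47534.53),
]


def hl_statistic(rows, scale=1.0):
    """HL chi-square after multiplying every count by `scale`."""
    stat = 0.0
    for total, obs1, exp1 in rows:
        total, obs1, exp1 = total * scale, obs1 * scale, exp1 * scale
        obs0, exp0 = total - obs1, total - exp1
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0
    series = 1.0
    # exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


for scale in (1.0, 0.01, 0.001):
    n = sum(t for t, _, _ in PARTITION) * scale
    stat = hl_statistic(PARTITION, scale)
    p = chi2_sf_even_df(stat, 8)
    print(f"N ~ {n:>10,.0f}   HL chi-square = {stat:9.2f}   p = {p:.4g}")
```

At full size the p-value is effectively zero, while at roughly a thousand observations the identical pattern of observed versus expected rates would pass the test comfortably. That is the sense in which the p-value here says more about N than about whether the misfit matters.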

Doc Muhlbaier

Duke
