Payal
Calcite | Level 5

I have a dataset with about 10 independent variables and one dichotomous dependent variable.  I have done most of the EDA on the dataset: removing extreme values, standardizing input variables, imputing missing values, testing for collinearity, etc.  No matter how much I clean my data, my logit model keeps failing the Hosmer-Lemeshow (HL) goodness-of-fit test.  The ROC area is good at .82, outliers were removed after I checked the leverage, displacement, etc. plots, and the other association statistics look pretty decent.  I can't figure out why HL is so bad.  I even sorted the input dataset several different ways to see if the grouping was the culprit, to no avail.  Any ideas?

4 REPLIES
PGStats
Opal | Level 21

What does the "Partition for the Hosmer and Lemeshow Test" table look like? - PG

Payal
Calcite | Level 5

                              Partition for the Hosmer and Lemeshow Test

                                             RESPONSE = 1            RESPONSE = 0
                    Group       Total    Observed    Expected    Observed    Expected

                        1      112864           0        0.32      112864    112863.7
                        2      105659        1307     1357.58      104352    104301.4
                        3      105655        3012     3226.71      102643    102428.3
                        4      105653        5595     4837.57      100058    100815.4
                        5      105654        5013     6759.97      100641    98894.03
                        6      105656        6655     9208.02       99001    96447.98
                        7      105635       11859    12363.08       93776    93271.92
                        8      105655       18983    16635.13       86672    89019.87
                        9      105656       29246    24974.59       76410    80681.41
                       10       98452       45220    47534.53       53232    50917.47


                               Hosmer and Lemeshow Goodness-of-Fit Test

                                  Chi-Square       DF     Pr > ChiSq

                                   2990.1738        8         <.0001

The spacing is off, but that's the output from SAS.  I have standardized the inputs to within 3 standard deviations.  The dataset is large, with lots of observations, but the response rate is quite low.

PGStats
Opal | Level 21

With that many observations it is almost impossible to obtain a good fit to any distribution with real-life data; even the smallest discrepancy is easily detected. Looking at the table (a lack of responses at low probabilities, a surplus at higher probabilities), my guess is that you would get a better fit with LINK=PROBIT. But don't expect a miracle :-) - PG
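To see where the misfit sits, the statistic can be rebuilt by hand from the partition table. This is only a sketch in Python rather than SAS: the counts are copied from the table posted above, the expected non-event count is taken as total minus expected events (an assumption that ignores the rounding in the printed output), and the names `groups` and `hl_contributions` are made up for the illustration.

```python
# Partition table from the post: (group total, observed events, expected events)
groups = [
    (112864,     0,     0.32),
    (105659,  1307,  1357.58),
    (105655,  3012,  3226.71),
    (105653,  5595,  4837.57),
    (105654,  5013,  6759.97),
    (105656,  6655,  9208.02),
    (105635, 11859, 12363.08),
    (105655, 18983, 16635.13),
    (105656, 29246, 24974.59),
    ( 98452, 45220, 47534.53),
]


def hl_contributions(rows):
    """Per-group Hosmer-Lemeshow terms: (O1-E1)^2/E1 + (O0-E0)^2/E0."""
    terms = []
    for total, obs1, exp1 in rows:
        obs0 = total - obs1  # observed non-events
        exp0 = total - exp1  # expected non-events (assumed, see lead-in)
        terms.append((obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0)
    return terms


contrib = hl_contributions(groups)
hl_stat = sum(contrib)

for i, ((total, obs1, exp1), term) in enumerate(zip(groups, contrib), start=1):
    print(f"group {i:>2}: observed rate {obs1 / total:.4f}, "
          f"expected rate {exp1 / total:.4f}, "
          f"obs - exp {obs1 - exp1:+10.2f}, contribution {term:8.2f}")

# Should land close to the 2990.17 that SAS reports on 8 DF.
print(f"HL chi-square = {hl_stat:.2f}")
```

The sign of the obs - exp column shows which groups are over- or under-predicted, and the contribution column shows which groups drive the chi-square.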

Doc_Duke
Rhodochrosite | Level 12

Payal,

The problem is that you are confusing statistical significance with practical significance.  When you have 1 million observations, everything is "significant" because the confidence intervals are so narrow.  P-values are pretty much meaningless with that sample size.  You have to look at the data and decide what is "meaningful".
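A rough way to illustrate the sample-size effect: keep the observed and expected proportions from the posted partition table fixed and pretend the same miscalibration had been seen with fewer observations. This is a hypothetical sketch in Python, not something SAS produces. The scaled counts are not real data, the function name `hl_stat` is invented here, and the chi-square approximation itself becomes shaky once the expected counts get very small.

```python
# Partition table from the post: (group total, observed events, expected events)
groups = [
    (112864,     0,     0.32),
    (105659,  1307,  1357.58),
    (105655,  3012,  3226.71),
    (105653,  5595,  4837.57),
    (105654,  5013,  6759.97),
    (105656,  6655,  9208.02),
    (105635, 11859, 12363.08),
    (105655, 18983, 16635.13),
    (105656, 29246, 24974.59),
    ( 98452, 45220, 47534.53),
]

# Upper 5% point of a chi-square distribution with 8 degrees of freedom.
CRIT_8DF_05 = 15.507


def hl_stat(rows, scale=1.0):
    """HL chi-square after multiplying every count by `scale`.

    Scaling keeps the observed and expected rates in each group unchanged;
    only the (hypothetical) number of observations differs.
    """
    stat = 0.0
    for total, obs1, exp1 in rows:
        t, o1, e1 = total * scale, obs1 * scale, exp1 * scale
        o0, e0 = t - o1, t - e1
        stat += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return stat


n_full = sum(total for total, _, _ in groups)

for n in (n_full, 100_000, 10_000, 5_000):
    stat = hl_stat(groups, n / n_full)
    verdict = "reject" if stat > CRIT_8DF_05 else "do not reject"
    print(f"n = {n:>9,}: HL chi-square = {stat:9.2f} -> {verdict} at the 5% level")
```

Under this assumption the statistic is simply proportional to n, so the same pattern of misfit that gives about 2990 with roughly a million rows gives about 28 with 10,000 rows and about 14 with 5,000, which is below the 5% critical value on 8 DF.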

Doc Muhlbaier

Duke
