I have a dataset with about 10 independent variables and one dichotomous dependent variable. I have done most of the EDA on the dataset: removing extreme values, standardizing input variables, imputing missing values, testing for collinearity, etc. Regardless of how much I clean my data, my logit model keeps failing the Hosmer-Lemeshow (HL) goodness-of-fit test. The ROC is good at 0.82, outliers were removed after I checked the leverage, displacement, etc. plots, and the other association statistics look pretty decent. I can't figure out why HL is so bad. I even sorted the input dataset several different ways to see if the grouping was the culprit, to no avail. Any ideas?
What does the "Partition for the Hosmer and Lemeshow Test" table look like? - PG
Partition for the Hosmer and Lemeshow Test

                     RESPONSE = 1             RESPONSE = 0
Group     Total   Observed    Expected    Observed    Expected
    1    112864          0        0.32      112864    112863.7
    2    105659       1307     1357.58      104352    104301.4
    3    105655       3012     3226.71      102643    102428.3
    4    105653       5595     4837.57      100058    100815.4
    5    105654       5013     6759.97      100641    98894.03
    6    105656       6655     9208.02       99001    96447.98
    7    105635      11859    12363.08       93776    93271.92
    8    105655      18983    16635.13       86672    89019.87
    9    105656      29246    24974.59       76410    80681.41
   10     98452      45220    47534.53       53232    50917.47

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
 2990.1738     8        <.0001
That's the output from SAS, though the spacing may be off. I have standardized the inputs to within 3 standard deviations. The dataset is large, with lots of observations, but the response rate is quite low.
With that many observations it is almost impossible to get real-life data to fit any distribution well; even the smallest discrepancy is easily detected. Looking at the table (too few responses at low predicted probabilities, a surplus at higher ones), my guess is that you would get a better fit with LINK=PROBIT. But don't expect a miracle. - PG
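To see where the misfit comes from, the HL statistic can be recomputed by hand from the partition table. The sketch below is a minimal illustration in Python rather than SAS; the names `PARTITION` and `hl_contributions` are made up for this example, and the expected non-event counts are taken as total minus expected events, which matches the table up to rounding.

```python
# Rows copied from the SAS partition table:
# (group, total, observed events, expected events)
PARTITION = [
    (1, 112864, 0, 0.32),
    (2, 105659, 1307, 1357.58),
    (3, 105655, 3012, 3226.71),
    (4, 105653, 5595, 4837.57),
    (5, 105654, 5013, 6759.97),
    (6, 105656, 6655, 9208.02),
    (7, 105635, 11859, 12363.08),
    (8, 105655, 18983, 16635.13),
    (9, 105656, 29246, 24974.59),
    (10, 98452, 45220, 47534.53),
]


def hl_contributions(rows):
    """Return (group, observed - expected events, chi-square contribution)."""
    result = []
    for group, total, obs1, exp1 in rows:
        obs0 = total - obs1  # observed non-events
        exp0 = total - exp1  # expected non-events (assumed: total - expected events)
        contrib = (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
        result.append((group, obs1 - exp1, contrib))
    return result


contribs = hl_contributions(PARTITION)
hl_stat = sum(c for _, _, c in contribs)

for group, diff, contrib in contribs:
    print(f"group {group:2d}: obs - exp = {diff:9.2f}   contribution = {contrib:8.2f}")
print(f"HL chi-square = {hl_stat:.2f} on {len(PARTITION) - 2} df")
```

The total agrees with the 2990.17 reported by SAS up to rounding of the printed expected counts. Groups 5, 6, and 9 account for most of it, and the sign of observed minus expected flips between groups, which is the pattern to look at when judging whether a different link function might help.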
Payal,
The problem is that you are confusing statistical significance with practical significance. With 1 million observations, everything is "significant" because the confidence intervals are so narrow. P-values are pretty much meaningless at that sample size. You have to look at the data and decide what is "meaningful".
Doc Muhlbaier
Duke
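The sample-size point can be made concrete with a thought experiment: keep the observed and expected proportions from the partition table exactly as they are, but pretend the dataset were 100 or 1000 times smaller. The Python sketch below does that by scaling every count by the same factor. This is a hypothetical rescaling, not a real subsample (a real subsample would add sampling noise), and `ROWS`, `hl_statistic`, and `CRITICAL_05_DF8` are names invented for the illustration.

```python
# (total, observed events, expected events) per group, from the partition table
ROWS = [
    (112864, 0, 0.32),
    (105659, 1307, 1357.58),
    (105655, 3012, 3226.71),
    (105653, 5595, 4837.57),
    (105654, 5013, 6759.97),
    (105656, 6655, 9208.02),
    (105635, 11859, 12363.08),
    (105655, 18983, 16635.13),
    (105656, 29246, 24974.59),
    (98452, 45220, 47534.53),
]

# 5% critical value of a chi-square distribution with 8 degrees of freedom
CRITICAL_05_DF8 = 15.507


def hl_statistic(rows, scale=1.0):
    """HL chi-square after multiplying every count by `scale` (hypothetical)."""
    stat = 0.0
    for total, obs1, exp1 in rows:
        n, o1, e1 = total * scale, obs1 * scale, exp1 * scale
        o0, e0 = n - o1, n - e1  # non-events: observed and expected
        stat += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return stat


n_total = sum(total for total, _, _ in ROWS)
for scale in (1.0, 0.01, 0.001):
    stat = hl_statistic(ROWS, scale)
    verdict = "reject" if stat > CRITICAL_05_DF8 else "do not reject"
    print(f"N = {n_total * scale:10.0f}   HL = {stat:9.2f}   {verdict} at 5%")
```

Because the statistic is proportional to the scale factor, the same relative misfit that yields a chi-square near 2990 at about 1.06 million observations would yield roughly 3 at about a thousand observations, which is below the 5% critical value. The size of the discrepancy has not changed, only the power to detect it.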