I have a dataset with about 10 independent variables and one dichotomous dependent variable. I have done most of the EDA on the dataset: removing extreme values, standardizing input variables, imputing missing values, testing for collinearity, etc. No matter how much I clean my data, my logit model keeps failing the Hosmer-Lemeshow (HL) goodness-of-fit test. The ROC is good at .82, outliers were removed after I checked the leverage, displacement, and other diagnostic plots, and the other association stats look pretty decent. I can't figure out why HL is so bad. I even sorted the input dataset several different ways to see if the grouping was the culprit, to no avail. Any ideas?
What does the "Partition for the Hosmer and Lemeshow Test" table look like? - PG
Partition for the Hosmer and Lemeshow Test

                    RESPONSE = 1            RESPONSE = 0
Group     Total   Observed    Expected   Observed    Expected
    1    112864          0        0.32     112864    112863.7
    2    105659       1307     1357.58     104352    104301.4
    3    105655       3012     3226.71     102643    102428.3
    4    105653       5595     4837.57     100058    100815.4
    5    105654       5013     6759.97     100641    98894.03
    6    105656       6655     9208.02      99001    96447.98
    7    105635      11859    12363.08      93776    93271.92
    8    105655      18983    16635.13      86672    89019.87
    9    105656      29246    24974.59      76410    80681.41
   10     98452      45220    47534.53      53232    50917.47

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
 2990.1738     8        <.0001
Apologies if the spacing is off; that's the output from SAS. I have standardized the inputs to within 3 standard deviations. The dataset is robust, with lots of observations, but the response rate is quite low.
With that many observations, it is almost impossible to get a good fit to any distribution with real-life data; even the smallest discrepancy is easily detected. Looking at the table (fewer responses than expected at low probabilities, a surplus at higher ones), my guess is that you would get a better fit with LINK=PROBIT. But don't expect a miracle. - PG
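To see which groups drive the misfit, the HL chi-square can be recomputed from the partition table by hand. Below is a minimal Python sketch (not SAS), using only the counts posted above; the names `partition` and `group_contribution` are made up for illustration, and the statistic is taken as the usual Pearson sum of (observed - expected)^2 / expected over both response columns.

```python
# Counts copied from the "Partition for the Hosmer and Lemeshow Test" table.
# Columns: group, observed 1s, expected 1s, observed 0s, expected 0s.
partition = [
    (1, 0, 0.32, 112864, 112863.7),
    (2, 1307, 1357.58, 104352, 104301.4),
    (3, 3012, 3226.71, 102643, 102428.3),
    (4, 5595, 4837.57, 100058, 100815.4),
    (5, 5013, 6759.97, 100641, 98894.03),
    (6, 6655, 9208.02, 99001, 96447.98),
    (7, 11859, 12363.08, 93776, 93271.92),
    (8, 18983, 16635.13, 86672, 89019.87),
    (9, 29246, 24974.59, 76410, 80681.41),
    (10, 45220, 47534.53, 53232, 50917.47),
]


def group_contribution(obs1, exp1, obs0, exp0):
    """Pearson chi-square contribution of one group to the HL statistic."""
    return (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0


# Contribution of each group, keyed by group number.
contributions = {
    g: group_contribution(o1, e1, o0, e0) for g, o1, e1, o0, e0 in partition
}
hl_total = sum(contributions.values())

for g, o1, e1, _, _ in partition:
    # "over-predicted" means the model expected more events than were observed.
    direction = "over-predicted" if o1 < e1 else "under-predicted"
    share = 100.0 * contributions[g] / hl_total
    print(f"group {g:2d}: {contributions[g]:8.2f} ({share:4.1f}%)  events {direction}")

print(f"HL chi-square = {hl_total:.2f} on {len(partition) - 2} df")
```

The total should agree with the SAS figure up to rounding of the posted expected counts. The per-group shares show that a handful of middle and upper groups account for most of the statistic, which is the pattern a different link function would have to absorb.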
Payal,
The problem is that you are confusing statistical significance with practical significance. With 1 million observations, everything is "significant" because the confidence intervals are so narrow. P-values are pretty much meaningless at that sample size. You have to look at the data and decide what is "meaningful".
Doc Muhlbaier
Duke
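One way to make this concrete is a thought experiment: keep the observed and expected response rates in each group exactly as posted, but pretend the sample were much smaller. The Python sketch below does that (it is an illustration, not a resampling of the real data; the function names and the `scale` parameter are hypothetical). It takes the RESPONSE = 0 counts as total minus the RESPONSE = 1 counts, and uses the closed-form chi-square tail that holds for even degrees of freedom.

```python
import math

# (group total, observed 1s, expected 1s) from the HL partition table.
groups = [
    (112864, 0, 0.32),
    (105659, 1307, 1357.58),
    (105655, 3012, 3226.71),
    (105653, 5595, 4837.57),
    (105654, 5013, 6759.97),
    (105656, 6655, 9208.02),
    (105635, 11859, 12363.08),
    (105655, 18983, 16635.13),
    (105656, 29246, 24974.59),
    (98452, 45220, 47534.53),
]


def hl_stat(scale=1.0):
    """HL chi-square if every count were multiplied by `scale` (same rates)."""
    total = 0.0
    for n, obs1, exp1 in groups:
        n, obs1, exp1 = n * scale, obs1 * scale, exp1 * scale
        obs0, exp0 = n - obs1, n - exp1  # non-events as total minus events
        total += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return total


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form needs an even df."""
    if df % 2 != 0:
        raise ValueError("closed form only valid for even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Practical size of the misfit: largest gap between observed and expected rate.
max_gap = max(abs(obs1 - exp1) / n for n, obs1, exp1 in groups)
print(f"largest observed-vs-expected rate gap: {100 * max_gap:.1f} percentage points")

df = len(groups) - 2
for scale in (1.0, 0.1, 0.01, 0.001):
    stat = hl_stat(scale)
    print(f"scale {scale:>5}: HL = {stat:9.2f}, p = {chi2_sf_even_df(stat, df):.4f}")
```

With identical rates, the statistic shrinks in proportion to the sample size, so the same calibration gaps that are overwhelmingly "significant" at about a million rows would pass the test comfortably at about a thousand. Whether gaps of that size matter is a subject-matter question, not something the p-value answers.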