SlutskyFan
Obsidian | Level 7

I have developed several predictive (logit) models, all with sample sizes of roughly 10-20K or more. Typically my model results are positive: the global null hypothesis (all coefficients = 0) is rejected, there is no multicollinearity, the coefficients are significant, the area under the ROC curve (AROC) is high, and the percentage of correct predictions is high for both binary groups (0's and 1's). However, when I conduct the Hosmer-Lemeshow test (in base SAS), it is significant most of the time, indicating lack of fit.

I've read in several places (see the links and references below, though few are 'hard' references) that "As the sample size gets large, the H-L statistic can find smaller and smaller differences between observed and model-predicted values to be significant," leading to an erroneous conclusion about model fit. Do you agree? Do you have any stronger references than the ones I have? If you agree, then how large is too large when it comes to sample size? Also, since SAS Enterprise Miner doesn't include the HL test (I typically use a code node to run it in EM), I'm curious whether people care much about it. In fact, SAS tech support told me it was not included as an option in EM because the HL test is a holdover from the past and basically doesn't fit the data mining paradigm (i.e., large data sets) that SAS EM is built upon. Thanks in advance.
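For context, here is a minimal sketch of the kind of call I run in base SAS; the dataset and predictor names are placeholders, and the links and references follow below:

proc logistic data=mydata;
   /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
   model y(event='1') = x1 x2 x3 / lackfit;
run;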

NCSU Faculty Web Site, the STATA stat listserv, a quality forum (p. 5), another academic web page, and

JOURNAL OF PALLIATIVE MEDICINE, Volume 12, Number 2, 2009.

Prediction of Pediatric Death in the Year after Hospitalization: A Population-Level Retrospective Cohort Study. Chris Feudtner, M.D., Ph.D., M.P.H., Kari R. Hexem, M.P.H., Mayadah Shabbout, M.S., James A. Feinstein, M.D., Julie Sochalski, Ph.D., R.N., and Jeffery H. Silber, M.D., Ph.D.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656437/

"‘The Hosmer-Lemeshow test detected astatistically significant degree of miscalibration in both models, due to the extremely large sample size of the models, as the differences between the observed and expected values within each group are relatively small."


1 ACCEPTED SOLUTION
SlutskyFan
Obsidian | Level 7

Replying to my own post, I've found the following to be interesting and relevant to my original question. My sample sizes fall within the ranges used in their simulations.

Crit Care Med. 2007 Sep;35(9):2052-6.

Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited.

Kramer AA, Zimmerman JE.

http://www.ncbi.nlm.nih.gov/pubmed/17568333

MEASUREMENTS AND MAIN RESULTS:

Data sets of 5,000, 10,000, and 50,000 patients were replicated 1,000 times. Logistic regression models were evaluated for each simulated data set. This process was initially carried out under conditions of perfect fit (observed mortality = predicted mortality; standardized mortality ratio = 1.000) and repeated with an observed mortality that differed slightly (0.4%) from predicted mortality. Under conditions of perfect fit, the Hosmer-Lemeshow test was not influenced by the number of patients in the data set. In situations where there was a slight deviation from perfect fit, the Hosmer-Lemeshow test was sensitive to sample size. For populations of 5,000 patients, 10% of the Hosmer-Lemeshow tests were significant at p < .05, whereas for 10,000 patients 34% of the Hosmer-Lemeshow tests were significant at p < .05. When the number of patients matched contemporary studies (i.e., 50,000 patients), the Hosmer-Lemeshow test was statistically significant in 100% of the models.
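To make their setup concrete, here is a rough sketch of that kind of simulation in base SAS. This is not the authors' code: the true model, the seed, the replicate count, and the additive form of the 0.4% miscalibration are my own assumptions.

%let nrep = 100;            /* replicates per scenario (the paper used 1,000) */

data sim;
   call streaminit(2007);
   do rep = 1 to &nrep;
      do i = 1 to 10000;                  /* one of the paper's sample sizes */
         x = rand('normal');
         p_true = logistic(-2 + 0.8*x);   /* probability from the true model */
         p_obs  = min(p_true + 0.004, 1); /* slight (0.4%) miscalibration */
         y = rand('bernoulli', p_obs);
         output;
      end;
   end;
run;

/* Fit each replicate; capture the H-L test from the LackFitChiSq ODS table */
ods exclude all;
ods output LackFitChiSq=hl;
proc logistic data=sim;
   by rep;
   model y(event='1') = x / lackfit;
run;
ods exclude none;

/* Share of replicates in which the H-L test rejects at the 0.05 level */
proc sql;
   select mean(ProbChiSq < 0.05) as reject_rate from hl;
quit;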

CONCLUSIONS:

Caution should be used in interpreting the calibration of predictive models developed using a smaller data set when applied to larger numbers of patients. A significant Hosmer-Lemeshow test does not necessarily mean that a predictive model is not useful or suspect. While decisions concerning a mortality model's suitability should include the Hosmer-Lemeshow test, additional information needs to be taken into consideration. This includes the overall number of patients, the observed and predicted probabilities within each decile, and adjunct measures of model calibration.
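In line with their last point, a simple adjunct calibration check in base SAS is to compare observed and predicted event rates within each decile of predicted risk (dataset and variable names below are placeholders):

proc logistic data=mydata;
   model y(event='1') = x1 x2 / lackfit;
   output out=preds p=phat;      /* predicted probabilities */
run;

proc rank data=preds out=ranked groups=10;
   var phat;
   ranks decile;                 /* 0-9: deciles of predicted risk */
run;

/* Observed event rate vs. mean predicted probability in each decile */
proc means data=ranked n mean maxdec=4;
   class decile;
   var y phat;
run;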


