SlutskyFan
Obsidian | Level 7

I have developed several predictive (logit) models, all with sample sizes of roughly 10-20K or more. Typically my model results are positive: the global null hypothesis (all coefficients = 0) is rejected, there is no multicollinearity, the coefficients are significant, the area under the ROC curve (AROC) is high, and the percentage of correct predictions is high for both binary groups (0's and 1's). However, when I conduct the Hosmer-Lemeshow test (in base SAS), it is significant most of the time, indicating lack of fit.

I've read in several places (see the links and references below, though few are 'hard' references) that "As the sample size gets large, the H-L statistic can find smaller and smaller differences between observed and model-predicted values to be significant," leading to an erroneous conclusion about model fit. Do you agree? Do you have any stronger references than the ones I have? If you agree, then how large is too large when it comes to sample size? Also, since SAS Enterprise Miner doesn't include the HL test (I typically use a code node to run it in EM), I'm curious whether people care much about it. In fact, SAS tech support told me it was not included as an option in EM because the HL test is a holdover from the past and basically doesn't fit the data mining paradigm (i.e., large data sets) that SAS EM is built upon. Thanks in advance.
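For context, here is a minimal sketch of the kind of call I run in base SAS; the dataset and predictor names are placeholders, and the links and references follow below:

proc logistic data=mydata;
   /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
   model y(event='1') = x1 x2 x3 / lackfit;
run;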

NCSU Faculty Web Site, the STATA stat listserv, a quality forum (p. 5), another academic web page, and

JOURNAL OF PALLIATIVE MEDICINE, Volume 12, Number 2, 2009.

Prediction of Pediatric Death in the Year after Hospitalization: A Population-Level Retrospective Cohort Study. Chris Feudtner, M.D., Ph.D., M.P.H., Kari R. Hexem, M.P.H., Mayadah Shabbout, M.S., James A. Feinstein, M.D., Julie Sochalski, Ph.D., R.N., and Jeffery H. Silber, M.D., Ph.D.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656437/

"‘The Hosmer-Lemeshow test detected astatistically significant degree of miscalibration in both models, due to the extremely large sample size of the models, as the differences between the observed and expected values within each group are relatively small."


1 ACCEPTED SOLUTION
SlutskyFan
Obsidian | Level 7

Replying to my own post, I've found the following to be interesting and relevant to my original question. My sample sizes fall within the ranges used in their simulations.

Crit Care Med. 2007 Sep;35(9):2052-6.

Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited.

Kramer AA, Zimmerman JE.

http://www.ncbi.nlm.nih.gov/pubmed/17568333

MEASUREMENTS AND MAIN RESULTS:

Data sets of 5,000, 10,000, and 50,000 patients were replicated 1,000 times. Logistic regression models were evaluated for each simulated data set. This process was initially carried out under conditions of perfect fit (observed mortality = predicted mortality; standardized mortality ratio = 1.000) and repeated with an observed mortality that differed slightly (0.4%) from predicted mortality. Under conditions of perfect fit, the Hosmer-Lemeshow test was not influenced by the number of patients in the data set. In situations where there was a slight deviation from perfect fit, the Hosmer-Lemeshow test was sensitive to sample size. For populations of 5,000 patients, 10% of the Hosmer-Lemeshow tests were significant at p < .05, whereas for 10,000 patients 34% of the Hosmer-Lemeshow tests were significant at p < .05. When the number of patients matched contemporary studies (i.e., 50,000 patients), the Hosmer-Lemeshow test was statistically significant in 100% of the models.
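To make their setup concrete, here is a rough sketch of that kind of simulation in base SAS. This is not the authors' code: the true model, the seed, the replicate count, and the additive form of the 0.4% miscalibration are my own assumptions.

%let nrep = 100;            /* replicates per scenario (the paper used 1,000) */

data sim;
   call streaminit(2007);
   do rep = 1 to &nrep;
      do i = 1 to 10000;                  /* one of the paper's sample sizes */
         x = rand('normal');
         p_true = logistic(-2 + 0.8*x);   /* probability from the true model */
         p_obs  = min(p_true + 0.004, 1); /* slight (0.4%) miscalibration */
         y = rand('bernoulli', p_obs);
         output;
      end;
   end;
run;

/* Fit each replicate; capture the H-L test from the LackFitChiSq ODS table */
ods exclude all;
ods output LackFitChiSq=hl;
proc logistic data=sim;
   by rep;
   model y(event='1') = x / lackfit;
run;
ods exclude none;

/* Share of replicates in which the H-L test rejects at the 0.05 level */
proc sql;
   select mean(ProbChiSq < 0.05) as reject_rate from hl;
quit;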

CONCLUSIONS:

Caution should be used in interpreting the calibration of predictive models developed using a smaller data set when applied to larger numbers of patients. A significant Hosmer-Lemeshow test does not necessarily mean that a predictive model is not useful or suspect. While decisions concerning a mortality model's suitability should include the Hosmer-Lemeshow test, additional information needs to be taken into consideration. This includes the overall number of patients, the observed and predicted probabilities within each decile, and adjunct measures of model calibration.
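In line with their last point, a simple adjunct calibration check in base SAS is to compare observed and predicted event rates within each decile of predicted risk (dataset and variable names below are placeholders):

proc logistic data=mydata;
   model y(event='1') = x1 x2 / lackfit;
   output out=preds p=phat;      /* predicted probabilities */
run;

proc rank data=preds out=ranked groups=10;
   var phat;
   ranks decile;                 /* 0-9: deciles of predicted risk */
run;

/* Observed event rate vs. mean predicted probability in each decile */
proc means data=ranked n mean maxdec=4;
   class decile;
   var y phat;
run;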


