In a previous post I discussed the use of the ROC curve for model assessment. The ROC curve measures discrimination, that is, how well a model separates the classes. Sometimes we are more interested in calibration: the accurate stratification of individuals into higher or lower risk categories, or risk strata. In other words, sometimes we want to take predicted probabilities and divide our scored data set into groups, each with a different average predicted probability. A well-calibrated model makes it possible to sort observations into strata or segments whose actual average outcome rate is very close to the average predicted rate for the group. For example, the results below seem to indicate a well-calibrated model for a binary response Y ∈ {0, 1}:
One might remark: these results look great, but are they statistically significant? The Hosmer-Lemeshow (HL) test is a natural way to test for model calibration, and tests that very hypothesis. Without getting into great detail, the HL test is a chi-squared based test that divides your data set into deciles based on scores or predicted values and compares the observed response rates to the expected response rates (determined by your model's predicted probability of response). The HL test is therefore a test for lack of fit: a significant result means the observed and predicted outcomes differ by more than chance would explain, which is evidence of poor calibration.
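To make the mechanics concrete, here is a minimal Python sketch of the statistic as described above. It is an illustration under stated assumptions, not a reference implementation: the function names are my own, the groups are formed as equal-sized bins of the sorted scores (one common convention), and the p-value uses the usual g - 2 degrees of freedom. To avoid any dependencies, the chi-squared tail probability uses a closed form that only holds for an even number of degrees of freedom; `scipy.stats.chi2.sf` would be the general-purpose choice.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df, p_value) for 0/1 outcomes y and scores p.

    Assumptions: equal-sized groups of the sorted scores, and every group
    has a mean predicted probability strictly between 0 and 1.
    """
    pairs = sorted(zip(p, y))     # order observations by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        expected = sum(p_i for p_i, _ in chunk)   # expected number of events
        observed = sum(y_i for _, y_i in chunk)   # observed number of events
        # (O - E)^2 / (n * pbar * (1 - pbar)), with E = n * pbar
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n_g))
    df = groups - 2
    return stat, df, chi2_sf_even_df(stat, df)

# Made-up toy data: four score levels, ten observations each.
toy_p = [0.2] * 10 + [0.4] * 10 + [0.6] * 10 + [0.8] * 10
toy_y = ([1] * 4 + [0] * 6) + ([1] * 4 + [0] * 6) + ([1] * 6 + [0] * 4) + ([1] * 8 + [0] * 2)
print(hosmer_lemeshow(toy_y, toy_p, groups=4))
```

A small p-value here would be read as evidence of lack of fit; a large one only means the test found no evidence of miscalibration, not that calibration is proven.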
There are criticisms of the HL test, primarily related to its low power in small samples, its sensitivity to how the groups are formed, and its oversensitivity in large samples. Below are some examples of this last criticism:
JOURNAL OF PALLIATIVE MEDICINE
Volume 12, Number 2, 2009
Prediction of Pediatric Death in the Year after Hospitalization: A Population-Level Retrospective Cohort Study. Chris Feudtner, M.D., Ph.D., M.P.H., Kari R. Hexem, M.P.H., Mayadah Shabbout, M.S., James A. Feinstein, M.D., Julie Sochalski, Ph.D., R.N., and Jeffery H. Silber, M.D., Ph.D.
‘The Hosmer-Lemeshow test detected a statistically significant degree of miscalibration in both models, due to the extremely large sample size of the models, as the differences between the observed and expected values within each group are relatively small.’ N = 600,000
Crit Care Med. 2007 Sep;35(9):2052-6.
Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited.
'Caution should be used in interpreting the calibration of predictive models developed using a smaller data set when applied to larger numbers of patients. A significant Hosmer-Lemeshow test does not necessarily mean that a predictive model is not useful or suspect. While decisions concerning a mortality model's suitability should include the Hosmer-Lemeshow test, additional information needs to be taken into consideration. This includes the overall number of patients, the observed and predicted probabilities within each decile, and adjunct measures of model calibration.'
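The pattern in both excerpts follows from simple arithmetic. Each group contributes n_g (o_g - p_g)^2 / (p_g (1 - p_g)) to the HL statistic, so for a fixed gap between observed and predicted rates the statistic grows in proportion to the sample size. The sketch below uses entirely hypothetical numbers: ten equal-sized deciles, predicted rates from 0.05 to 0.50, and an observed rate that is half a percentage point higher in every decile. The value 15.507 is the 5% critical value of a chi-squared distribution with 8 degrees of freedom.

```python
CRITICAL_8DF_05 = 15.507  # 5% critical value, chi-squared with 8 df

def hl_stat_from_rates(n_per_group, predicted, observed):
    """HL statistic from group-level rates, assuming equal group sizes."""
    return sum(
        n_per_group * (o - p) ** 2 / (p * (1 - p))
        for p, o in zip(predicted, observed)
    )

# Hypothetical decile-level rates: same small miscalibration everywhere.
predicted = [0.05 * k for k in range(1, 11)]
observed = [p + 0.005 for p in predicted]

results = {}
for n_total in (1_000, 100_000, 600_000):
    stat = hl_stat_from_rates(n_total / 10, predicted, observed)
    results[n_total] = stat
    print(f"N = {n_total:>7}: HL = {stat:8.2f}  significant at 5%: {stat > CRITICAL_8DF_05}")
```

The miscalibration is identical in every row; only N changes, and with it the verdict of the test.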
Here we also have a cogent discussion of the HL test and its faults in the face of large sample sizes:
"A disadvantage of this goodness of fit measure is that it is a significance test, with all the limitations this entails. Like other significant tests it only tells us whether the model fits or not, and does not tell us anything about the extent of the fit. Similarly, like other significance tests, it is strongly influenced by the sample size (sample size and effect size both determine significance), and in large samples, such as the PISA dataset we are using here, a very small difference will lead to significance. As the sample size gets large, the H-L statistic can find smaller and smaller differences between observed and model-predicted values to be significant. Small sample sizes are also problematic, however, as, being a Chi Square test we can’t have too many groups (more than 10%) with predicted frequencies of less than five."
What should we do in the face of large sample sizes and hypersensitive HL tests? Here is one suggestion:
From: NATIONAL QUALITY FORUM National Voluntary Consensus Standards for Patient Outcomes Measure Summary.
“Because of the very large sample sizes studied here, a statistically significant Hosmer-Lemeshow statistic is not considered informative with respect to calibration. Although the HL statistic is uninformative, model calibration could still be assessed graphically. This could be done by comparing observed vs. predicted event rates within deciles of predicted risk.”
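As a starting point for that kind of comparison, here is a minimal Python sketch that builds the observed vs. predicted table by decile of predicted risk. The function names and the equal-sized binning are my own assumptions rather than anything prescribed by the sources quoted above, and the code assumes there are at least as many observations as groups. The resulting rows can be printed as below or handed to any plotting tool, with mean predicted rate on one axis and observed rate on the other.

```python
def calibration_table(y, p, groups=10):
    """Rows of (group, n, mean predicted rate, observed rate), lowest risk first."""
    pairs = sorted(zip(p, y))     # order observations by predicted probability
    n = len(pairs)
    rows = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        mean_pred = sum(p_i for p_i, _ in chunk) / n_g
        obs_rate = sum(y_i for _, y_i in chunk) / n_g
        rows.append((g + 1, n_g, mean_pred, obs_rate))
    return rows

def print_calibration_table(rows):
    """Print the table with the observed-minus-predicted gap per group."""
    print(f"{'group':>6} {'n':>6} {'predicted':>10} {'observed':>10} {'gap':>8}")
    for group, n_g, pred, obs in rows:
        print(f"{group:>6} {n_g:>6} {pred:>10.3f} {obs:>10.3f} {obs - pred:>+8.3f}")

# Made-up toy data: two risk levels, five observations each.
toy_p = [0.2] * 5 + [0.8] * 5
toy_y = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0]
print_calibration_table(calibration_table(toy_y, toy_p, groups=2))
```

Unlike the HL p-value, the gap column shows how large the miscalibration is in each stratum, which is the quantity that matters when the sample is very large.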