'The c statistic also describes how well models can rank order cases and noncases, but is not a function of the actual predicted probabilities. For example, a model that assigns all cases a value of 0.52 and all noncases a value of 0.51 would have perfect discrimination, although the probabilities it assigns may not be helpful.'
'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'
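The 0.52 versus 0.51 example in the first quotation can be checked numerically. The sketch below is only an illustration: the cohort of 10 cases and 90 noncases is an assumption made for the example, and `c_statistic` is a hypothetical helper that computes the c statistic by brute-force pair counting, not a function taken from the paper.

```python
def c_statistic(labels, scores):
    """Share of (case, noncase) pairs in which the case gets the higher score.

    Tied pairs count as half, the usual convention for the c statistic.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    noncase_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not noncase_scores:
        raise ValueError("need at least one case and one noncase")
    concordant = 0.0
    for case in case_scores:
        for noncase in noncase_scores:
            if case > noncase:
                concordant += 1.0
            elif case == noncase:
                concordant += 0.5
    return concordant / (len(case_scores) * len(noncase_scores))


# Assumed cohort for illustration: 10 cases and 90 noncases (10% event rate).
labels = [1] * 10 + [0] * 90
# Scores from the quoted example: 0.52 for every case, 0.51 for every noncase.
scores = [0.52 if y == 1 else 0.51 for y in labels]

c_stat = c_statistic(labels, scores)
mean_predicted = sum(scores) / len(scores)
observed_rate = sum(labels) / len(labels)

print(f"c statistic:         {c_stat:.3f}")
print(f"mean predicted risk: {mean_predicted:.3f}")
print(f"observed event rate: {observed_rate:.3f}")
```

Under these assumptions the c statistic is 1.0, yet the average predicted risk is about 0.51 while only 10% of the cohort actually has the event: perfect rank ordering alongside badly calibrated probabilities.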
The paper goes on to demonstrate that there is in fact a tradeoff between model discrimination (as measured by the ROC curve) and calibration.
In this context, we may prefer a metric that is based on calibration, like the Hosmer-Lemeshow test. That test, however, is often criticized: it is sensitive to how the groups/categories are composed, has low power in small samples, and is hypersensitive and misleading in large samples.
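The sample-size criticism can be made concrete with a small sketch. It assumes ten risk groups whose observed event rate is always one percentage point above the predicted risk, and it feeds the same miscalibration into the usual Hosmer-Lemeshow sum at two sample sizes. The helper names, the risk values, and the one-point gap are all hypothetical; 15.507 is the 5% critical value of a chi-square distribution with 8 degrees of freedom (ten groups minus two).

```python
def hosmer_lemeshow_statistic(groups):
    """Sum over groups of (observed - expected)^2 / (n * p * (1 - p)).

    groups: iterable of (n, observed_events, mean_predicted_risk).
    """
    total = 0.0
    for n, observed, p in groups:
        expected = n * p
        total += (observed - expected) ** 2 / (n * p * (1.0 - p))
    return total


def make_groups(n_per_group, gap=0.01):
    """Ten hypothetical risk groups with predicted risk 0.05, 0.10, ..., 0.50.

    The observed rate is fixed at predicted + gap, so the miscalibration is
    identical no matter how large the groups are (event counts are left as
    floats for simplicity).
    """
    risks = [0.05 * k for k in range(1, 11)]
    return [(n_per_group, n_per_group * (p + gap), p) for p in risks]


CHI2_CRITICAL_8DF_05 = 15.507  # 5% critical value, chi-square with 8 df

small = hosmer_lemeshow_statistic(make_groups(100))
large = hosmer_lemeshow_statistic(make_groups(100_000))

print(f"100 per group:     HL = {small:.2f}, rejects: {small > CHI2_CRITICAL_8DF_05}")
print(f"100,000 per group: HL = {large:.2f}, rejects: {large > CHI2_CRITICAL_8DF_05}")
```

Because the statistic grows in proportion to the group size when the gap is held fixed, the same one-point miscalibration goes undetected in the small sample and is flagged as highly significant in the large one.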
Here is one proposed solution (discussed previously):
From: NATIONAL QUALITY FORUM National Voluntary Consensus Standards for Patient Outcomes Measure Summary. (link)
“Because of the very large sample sizes studied here, a statistically significant Hosmer-Lemeshow statistic is not considered informative with respect to calibration. Although the HL statistic is uninformative, model calibration could still be assessed graphically. This could be done by comparing observed vs. predicted event rates within deciles of predicted risk.”
The Assessment Score Rankings and Assessment Score Distribution tables from SAS Enterprise Miner are helpful in this regard.
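The decile comparison described in the quotation can also be sketched directly in code. This is a minimal illustration on simulated data, not a reproduction of the SAS Enterprise Miner tables: `calibration_by_decile` is a hypothetical helper, and the simulated risks and outcomes are assumptions chosen so that the model is well calibrated by construction.

```python
import random


def calibration_by_decile(outcomes, predicted, n_groups=10):
    """Return (group, n, mean predicted risk, observed event rate) rows.

    Records are sorted by predicted risk and split into n_groups nearly
    equal-sized bins (deciles by default).
    """
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # can happen only if there are fewer records than groups
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((g + 1, len(chunk), mean_pred, obs_rate))
    return rows


# Simulated data (an assumption for illustration): each outcome is drawn
# with exactly its predicted probability, so calibration is good by design.
random.seed(0)
predicted = [random.random() * 0.5 for _ in range(5000)]
outcomes = [1 if random.random() < p else 0 for p in predicted]

table = calibration_by_decile(outcomes, predicted)

print("decile     n  predicted  observed")
for group, size, mean_pred, obs_rate in table:
    print(f"{group:6d} {size:5d}  {mean_pred:9.3f}  {obs_rate:8.3f}")
```

Plotting the observed column against the predicted column gives the graphical check: a well-calibrated model stays close to the 45-degree line, and a systematic gap in particular deciles shows where the predicted risks are too high or too low.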
Nancy R. Cook, Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation. 2007; 115: 928-935