Tuesday, April 2, 2013

Is the ROC curve a good metric for model calibration?

I previously discussed the use of the ROC curve as a tool for model assessment, particularly as a metric for discrimination. I stated that this metric (particularly the area under the ROC curve, or c-statistic) is used increasingly in the machine learning community and is preferred over other measures of fit like precision or the F1 score because it evaluates model performance across all considered cutoff values rather than at an arbitrarily chosen cutoff (Bradley, 1997). I still prefer this metric over one based on an arbitrary cutoff (like percentage of correct predictions, precision, recall, or the F1 score). However, if the goal is to use your predictive model to stratify your scored data into groups (as in a market segmentation application or this example), then the ROC curve may not be the best metric. The metric we are actually after is one that assesses model calibration (as discussed here). In the article 'Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction' we get several criticisms of using the ROC curve in this context:

'The c statistic also describes how well models can rank order cases and noncases, but is not a function of the actual predicted probabilities. For example, a model that assigns all cases a value of 0.52 and all noncases a value of 0.51 would have perfect discrimination, although the probabilities it assigns may not be helpful.'
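
The scenario just quoted is easy to verify directly. Below is a minimal sketch (names are my own) using the Mann-Whitney form of the c-statistic, which equals the probability that a randomly chosen case outranks a randomly chosen noncase:

```python
def c_statistic(scores_cases, scores_noncases):
    """Mann-Whitney form of the c-statistic (area under the ROC curve)."""
    wins = 0.0
    for c in scores_cases:
        for n in scores_noncases:
            if c > n:
                wins += 1.0      # case ranked above noncase
            elif c == n:
                wins += 0.5      # ties count half
    return wins / (len(scores_cases) * len(scores_noncases))

cases = [0.52] * 50     # every case assigned probability 0.52
noncases = [0.51] * 50  # every noncase assigned probability 0.51

print(c_statistic(cases, noncases))  # 1.0 -- perfect discrimination
```

Every case outranks every noncase, so the c-statistic is a perfect 1.0, even though predicted probabilities of 0.52 versus 0.51 tell us almost nothing about absolute risk.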

'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'

The paper goes on to demonstrate that there is, in fact, a tradeoff between model discrimination (as measured by the ROC curve) and calibration.

In this context, we may prefer a metric based on calibration, like the Hosmer-Lemeshow test, but that test is often criticized for its sensitivity to how the groups are composed, its low power in small samples, and its hypersensitive, misleading behavior in large samples.
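
For concreteness, here is a minimal sketch of the standard grouped chi-square form of the Hosmer-Lemeshow statistic, assuming `y` is a list of 0/1 outcomes and `p` a parallel list of predicted probabilities (function and variable names are my own):

```python
def hosmer_lemeshow(y, p, n_groups=10):
    """Chi-square statistic over equal-size groups of sorted predicted risk."""
    pairs = sorted(zip(p, y))                  # sort by predicted probability
    size = len(pairs) // n_groups
    stat = 0.0
    for g in range(n_groups):
        lo = g * size
        hi = (g + 1) * size if g < n_groups - 1 else len(pairs)
        chunk = pairs[lo:hi]
        n = len(chunk)
        observed = sum(yy for _, yy in chunk)  # observed event count
        expected = sum(pp for pp, _ in chunk)  # expected event count
        pbar = expected / n                    # mean predicted risk in group
        stat += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
    return stat  # compared against a chi-square with n_groups - 2 df
```

Note that the value depends directly on where the group boundaries fall (changing `n_groups` changes the statistic), which is one source of the sensitivity criticized above.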

Here is one proposed solution (discussed previously):

From: NATIONAL QUALITY FORUM National Voluntary Consensus Standards for Patient Outcomes Measure Summary. (link)

“Because of the very large sample sizes studied here, a statistically significant Hosmer-Lemeshow statistic is not considered informative with respect to calibration. Although the HL statistic is uninformative, model calibration could still be assessed graphically. This could be done by comparing observed vs. predicted event rates within deciles of predicted risk.”

The Assessment Score Rankings and Assessment Score Distribution tables from SAS Enterprise Miner are helpful in this regard.
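
The decile comparison described above can be sketched in a few lines, again assuming `y` (0/1 outcomes) and `p` (predicted probabilities) as parallel lists; names are my own, and the output roughly mirrors what those assessment tables report:

```python
def calibration_by_decile(y, p, n_groups=10):
    """Observed vs. mean predicted event rate within groups of predicted risk."""
    pairs = sorted(zip(p, y))                  # sort by predicted probability
    size = len(pairs) // n_groups
    rows = []
    for g in range(n_groups):
        lo = g * size
        hi = (g + 1) * size if g < n_groups - 1 else len(pairs)
        chunk = pairs[lo:hi]
        n = len(chunk)
        predicted = sum(pp for pp, _ in chunk) / n  # mean predicted rate
        observed = sum(yy for _, yy in chunk) / n   # observed event rate
        rows.append((g + 1, round(predicted, 3), round(observed, 3)))
    return rows  # (group, predicted rate, observed rate) per risk stratum
```

For a well-calibrated model the two columns track each other closely across deciles; large gaps in particular strata flag exactly the kind of miscalibration the c-statistic can hide.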

Additional References:
Nancy R. Cook, Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation. 2007; 115: 928-935
