Wednesday, September 30, 2020

Calibration, Discrimination, and Ethics

Classification models with binary or categorical outcomes are often assessed based on the c-statistic, or area under the ROC curve.

This metric ranges between 0 and 1 and summarizes a model's ability to rank observations. For example, if a model is developed to predict the probability of default, the area under the ROC curve can be interpreted as the probability that a randomly chosen observation from the observed default class will be ranked higher (based on the model's predicted probability) than a randomly chosen observation from the observed non-default class (Provost and Fawcett, 2013). This metric is not without criticism and should not be used as the sole criterion for model assessment in all cases. As argued by Cook (2007):

'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'
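
To make the ranking interpretation of the area under the ROC curve concrete, here is a minimal sketch in plain Python. It compares every (default, non-default) pair and reports the share of pairs in which the default case received the higher score, counting ties as half. The function name `auc_by_pairs` and the scores are hypothetical, made up purely for illustration.

```python
from itertools import product


def auc_by_pairs(scores_pos, scores_neg):
    """Share of (positive, negative) pairs in which the positive case
    gets the higher score; tied pairs count as half a win."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities of default (illustrative only).
default_scores = [0.9, 0.7, 0.4]           # observed defaults
non_default_scores = [0.6, 0.3, 0.2, 0.1]  # observed non-defaults

auc = auc_by_pairs(default_scores, non_default_scores)
print(f"AUC by pairwise comparison: {auc:.3f}")
```

In this toy data the default case outranks the non-default case in 11 of the 12 possible pairs. Note that only the ordering of the scores matters here, not their actual values.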

Calibration is an alternative metric for model assessment. Calibration measures the agreement between observed and predicted risk, or the closeness of the model's predicted probability to the underlying probability in the population under study. Both discrimination and calibration are included in the National Quality Forum’s Measure Evaluation Criteria. However, many have noted that calibration is largely underutilized by practitioners in the data science and predictive modeling communities (Walsh et al., 2017; Van Calster et al., 2019). Models that perform well on the basis of discrimination (area under the ROC curve) may not perform well on the basis of calibration (Cook, 2007). In fact, a model with a lower area under the ROC curve could actually be better calibrated than a model with a higher one (Van Calster et al., 2019). This raises ethical concerns, because poorly calibrated predictions can, in application, result in decisions that lead to over- or underutilization of resources (Van Calster et al., 2019).

Others have argued there are ethical considerations as well:

“Rigorous calibration of prediction is important for model optimization, but also ultimately crucial for medical ethics. Finally, the amelioration and evolution of ML methodology is about more than just technical issues: it will require vigilance for our own human biases that makes us see only what we want to see, and keep us from thinking critically and acting consistently.” (Levy, 2020)

Van Calster et al. (2019), Walsh et al. (2017), and Steyerberg et al. (2010) provide guidance on ways of assessing model calibration.
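
As a rough illustration of why discrimination and calibration can disagree, the sketch below (plain Python, hypothetical data) scores the same observations with two sets of predictions. The second set is just the first set squared, which is an assumption chosen because it keeps the ranking intact while shifting every probability downward. The helper names are made up for this example, and the two checks shown (mean predicted minus observed rate, and a two-bin comparison) are only simple stand-ins for the fuller methods described in the papers cited above.

```python
def auc_by_pairs(scores_pos, scores_neg):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


def auc(probs, outcomes):
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    return auc_by_pairs(pos, neg)


def mean_gap(probs, outcomes):
    """Mean predicted probability minus observed event rate."""
    return sum(probs) / len(probs) - sum(outcomes) / len(outcomes)


def binned_calibration(probs, outcomes, n_bins=2):
    """Sort by prediction, split into bins of (nearly) equal size, and
    return (mean predicted, observed rate) for each bin."""
    pairs = sorted(zip(probs, outcomes))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[b * size:end]
        rows.append((sum(p for p, _ in chunk) / len(chunk),
                     sum(y for _, y in chunk) / len(chunk)))
    return rows


# Hypothetical outcomes (1 = event) and predictions, illustrative only.
outcomes = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
probs_a = [0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.65, 0.7, 0.9, 0.95]
probs_b = [p ** 2 for p in probs_a]  # same ordering, lower probabilities

auc_a, auc_b = auc(probs_a, outcomes), auc(probs_b, outcomes)
gap_a, gap_b = mean_gap(probs_a, outcomes), mean_gap(probs_b, outcomes)

print(f"AUC:      A={auc_a:.3f}  B={auc_b:.3f}")
print(f"Mean gap: A={gap_a:+.3f}  B={gap_b:+.3f}")
print("Bins A:", binned_calibration(probs_a, outcomes))
print("Bins B:", binned_calibration(probs_b, outcomes))
```

Both sets of predictions rank the observations identically, so their area under the ROC curve is the same, yet the squared predictions understate the observed event rate on average. A discrimination metric alone would not reveal that difference.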

Frank Harrell provides a great discussion about choosing the correct metrics for model assessment, along with a wealth of resources.


Matrix of Confusion. Drew Griffin Levy, PhD. GoodScience, Inc. Accessed 9/22/2020.

Nancy R. Cook, Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation. 2007; 115: 928-935

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. Foster Provost and Tom Fawcett. O’Reilly. CA. 2013.

Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138. doi:10.1097/EDE.0b013e3181c30fb2

Colin G. Walsh, Kavya Sharman, George Hripcsak, Beyond discrimination: A comparison of calibration methods and clinical usefulness of predictive models of readmission risk, Journal of Biomedical Informatics, Volume 76, 2017, Pages 9-18, ISSN 1532-0464.

Van Calster, B., McLernon, D.J., van Smeden, M. et al. Calibration: the Achilles heel of predictive analytics. BMC Med 17, 230 (2019).
