Wednesday, September 30, 2020

Calibration, Discrimination, and Ethics

Classification models with binary and categorical outcomes are often assessed based on the c-statistic or area under the ROC curve.

This metric ranges between 0 and 1 and summarizes model performance in terms of its ability to rank observations. For example, if a model is developed to predict the probability of default, the area under the ROC curve can be interpreted as the probability that a randomly chosen observation from the observed default class will be ranked higher (based on the model's predicted probability) than a randomly chosen observation from the observed non-default class (Provost and Fawcett, 2013). This metric is not without criticism and should not be used as the exclusive criterion for model assessment in all cases. As argued by Cook (2007):
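This ranking interpretation can be checked directly: the AUC equals the proportion of (default, non-default) pairs that the model orders correctly, counting ties as half. A minimal sketch with simulated scores (all data here is invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# y: 1 = default, 0 = non-default; p: model-predicted probabilities
y = rng.integers(0, 2, size=500)
p = np.clip(0.3 * y + rng.uniform(0, 0.7, size=500), 0, 1)

# AUC computed the usual way
auc = roc_auc_score(y, p)

# AUC as the probability that a random default is ranked above
# a random non-default (ties count as one half)
pos, neg = p[y == 1], p[y == 0]
diffs = pos[:, None] - neg[None, :]
concordant = (diffs > 0).sum() + 0.5 * (diffs == 0).sum()
rank_prob = concordant / (len(pos) * len(neg))

print(round(auc, 4), round(rank_prob, 4))
```

The two numbers agree, which is the Mann-Whitney equivalence behind the interpretation above.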

'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'

Calibration is an alternative metric for model assessment. Calibration measures the agreement between observed and predicted risk, or the closeness of the model's predicted probability to the underlying probability in the population under study. Both discrimination and calibration are included in the National Quality Forum's Measure Evaluation Criteria. However, many have noted that calibration is largely underutilized by practitioners in the data science and predictive modeling communities (Walsh et al., 2017; Van Calster et al., 2019). Models that perform well on the basis of discrimination (area under the ROC curve) may not perform well based on calibration (Cook, 2007). In fact, a model with a lower AUC could actually be better calibrated than a model with a higher AUC (Van Calster et al., 2019). This can raise ethical concerns, as a lack of calibration in predictive models can, in application, result in decisions that lead to over- or under-utilization of resources (Van Calster et al., 2019).
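The gap between discrimination and calibration is easy to demonstrate: any strictly monotone transformation of a model's scores leaves the ranking, and therefore the AUC, unchanged, but it can wreck calibration. A hypothetical sketch on simulated data, using the Brier score as a rough calibration-sensitive measure:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(1)
true_p = rng.uniform(0, 1, 2000)   # underlying event probabilities
y = rng.binomial(1, true_p)        # observed outcomes

p_calibrated = true_p              # predictions match the underlying risk
p_inflated = true_p ** 0.25        # same ranking, systematically overconfident

# Discrimination is identical: a monotone transform preserves the ROC curve
auc_a = roc_auc_score(y, p_calibrated)
auc_b = roc_auc_score(y, p_inflated)

# Calibration is not: the inflated model has a much worse Brier score
brier_a = brier_score_loss(y, p_calibrated)
brier_b = brier_score_loss(y, p_inflated)

print(auc_a == auc_b, round(brier_a, 3), round(brier_b, 3))
```

Judged by AUC alone the two models are indistinguishable, yet the second would systematically overstate risk in application.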

Others have argued there are ethical considerations as well:

“Rigorous calibration of prediction is important for model optimization, but also ultimately crucial for medical ethics. Finally, the amelioration and evolution of ML methodology is about more than just technical issues: it will require vigilance for our own human biases that makes us see only what we want to see, and keep us from thinking critically and acting consistently.” (Levy, 2020)

Van Calster et al. (2019), Walsh et al. (2017), and Steyerberg et al. (2010) provide guidance on ways of assessing model calibration.

Frank Harrell provides a great discussion about choosing the correct metrics for model assessment, along with a wealth of resources, here.


Matrix of Confusion. Drew Griffin Levy, PhD. GoodScience, Inc. Accessed 9/22/2020

Nancy R. Cook, Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation. 2007; 115: 928-935

Provost, Foster and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O’Reilly. CA. 2013.

Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138. doi:10.1097/EDE.0b013e3181c30fb2

Colin G. Walsh, Kavya Sharman, George Hripcsak, Beyond discrimination: A comparison of calibration methods and clinical usefulness of predictive models of readmission risk, Journal of Biomedical Informatics, Volume 76, 2017, Pages 9-18, ISSN 1532-0464.

 Van Calster, B., McLernon, D.J., van Smeden, M. et al. Calibration: the Achilles heel of predictive analytics. BMC Med 17, 230 (2019).

Wednesday, September 2, 2020

Blocking and Causality

In a previous post I discussed block randomized designs. 

Duflo et al (2008) describe this in more detail:

"Since the covariates to be used must be chosen in advance in order to avoid specification searching and data mining, they can be used to stratify (or block) the sample in order to improve the precision of estimates. This technique (¯rst proposed by Fisher (1926)) involves dividing the sample into groups sharing the same or similar values of certain observable characteristics. The randomization ensures that treatment and control groups will be similar in expectation. But stratification is used to ensure that along important observable dimensions this is also true in practice in the sample....blocking is more efficient than controlling ex post for these variables, since it ensures an equal proportion of treated and untreated unit within each block and therefore minimizes variance."

They also elaborate on blocking when you are interested in subgroup analysis:

"Apart from reducing variance, an important reason to adopt a stratified design is when the researchers are interested in the effect of the program on specific subgroups. If one is interested in the effect of the program on a sub-group, the experiment must have enough power for this subgroup (each sub-group constitutes in some sense a distinct experiment). Stratification according to those subgroups then ensure that the ratio between treatment and control units is determined by the experimenter in each sub-group, and can therefore be chosen optimally. It is also an assurance for the reader that the sub-group analysis was planned in advance."

Dijkman et al (2009) discuss subgroup analysis in blocked or stratified designs in more detail:

"When stratification of randomization is based on subgroup variables, it is more likely that treatment assignments within subgroups are balanced, making each subgroup a small trial. Because randomization makes it likely for the subgroups to be similar in all aspects except treatment, valid inferences about treatment efficacy within subgroups are likely to be drawn. In post hoc subgroup analyses, the subgroups are often incomparable because no stratified randomization is performed. Additionally, stratified randomization is desirable since it forces researchers to define subgroups before the start of the study."

Both of these accounts are consistent with each other in thinking of randomization within subgroups as creating a mini trial in which causal inferences can be drawn. But I think the key thing to consider is that they are referring to comparisons made WITHIN subgroups, not necessarily BETWEEN subgroups.

Gerber and Green discuss this in one of their chapters on the analysis of block randomized experiments:

"Regardless of whether one controls for blocks using weighted regression or regression with indicators for blocks, they key principle is to compare treatment and control subjects within blocks, not between blocks."

When we start to compare treatment and control units BETWEEN blocks or subgroups, we are essentially interpreting covariates, and this cannot be done with a causal interpretation. Gerber and Green discuss an example related to differences in the performance of Hindu vs. Muslim schools.

"it could just be that religion is a marker for a host of unmeasured attributes that are correlated with educational outcomes. The set of covariates included in an experimental analysis need not be a complete list of factors that affect outcomes: the fact that some factors are left out or poorly measured is not a source of bias when the aim is to measure the average treatment effect of the random intervention. Omitted variables and mismeasurement, however, can lead to sever bias if the aim is to draw causal inferences about the effects of covariates. Causal interpretation of the covariates encounters all of the threats to inference associated with analysis of observational data."

In other words, these kinds of comparisons face the same challenges related to interpreting control variables in a regression in an observational setting (see Keele et al., 2020).

But why doesn't randomization within religion allow us to make causal statements about these comparisons? Let's think about a different example. Suppose we wanted to measure treatment effects for some kind of educational intervention, and we were interested in subgroup differences in the outcome between public and private high schools. We could randomly assign treatments and controls within the public school population and do the same within the private school population. We know the overall treatment effect would be unbiased, because school type would be perfectly balanced (instead of balanced just on average, as in a completely randomized design), and we would expect all other important confounders to be balanced between treatments and controls on average.

We also know that within the group of private schools the treatment and controls should at least on average be balanced for certain confounders (median household income, teacher's education/training/experience, and perhaps an unobservable confounder related to student motivation). 

We could say the same thing about comparisons WITHIN the subgroup of public schools. But there is no reason to believe that the treated students in private schools would be comparable to the treated students in public schools because there is no reason to expect that important confounders would be balanced when making the comparisons. 

Assume we are looking at differences in first-semester college GPA. Maybe within the private subgroup we find that treated students on average have a first-semester college GPA that is .25 points higher than the comparable control group. But within the public school subgroup, this difference was only .10. We can say that there is a difference in outcomes of .15 points between groups, but can we say this is causal? Is the difference really related to school type, or is school type really a proxy for income, teacher quality, or motivation? If we increased motivation or income in the public schools, would that make up the difference?

We might do better if our design originally stratified on all of these important confounders, like income and teacher education. Then we could compare students in both public and private schools with similar family incomes and teachers with similar credentials. But there is no reason to believe that student motivation would be balanced. We can't block or stratify on an unobservable confounder. Again, as Gerber and Green state, we find ourselves in a world that borders between experimental and non-experimental methods. Simply put, the subgroups defined by any particular covariate that itself is not or cannot be randomly assigned may have different potential outcomes. What we can say from these results is that school type predicts the outcome but does not necessarily cause it.
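A small simulation can make this concrete. Here school type is not randomized, an unobserved "motivation" variable differs by school type, and the treatment effect depends on motivation rather than on school type itself. All coefficients and distributions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# School type is NOT randomized; unobserved motivation differs by type
private = rng.integers(0, 2, n)
motivation = rng.normal(0.5 * private, 1.0)  # private students more motivated (assumed)

# Treatment IS randomized, independently of school type and motivation
treat = rng.integers(0, 2, n)

# First-semester GPA: the treatment effect grows with motivation,
# and school type has NO direct causal role in the effect
gpa = (2.5 + 0.2 * motivation
       + (0.10 + 0.15 * motivation) * treat
       + rng.normal(0, 0.3, n))

effects = {}
for label, mask in [("private", private == 1), ("public", private == 0)]:
    effects[label] = (gpa[mask & (treat == 1)].mean()
                      - gpa[mask & (treat == 0)].mean())
    print(label, round(effects[label], 3))
```

Each within-group contrast is a valid experimental estimate (about 0.175 for private and 0.10 for public in expectation), but the gap between them is generated entirely by motivation. Attributing that gap causally to school type would be exactly the mistake Gerber and Green warn about.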

Gerber and Green expound on this idea:

"Subgroup analysis should be thought of as exploratory or descriptive analysis....if the aim is simply to predict when treatment effects will be large, the researcher need not have a correctly specified causal model that explains treatment effects (see to explain or predict)....noticing that treatment effects tend to be large in some groups and absent from others can provide important clues about why treatments work. But resist the temptation to think subgroup differences establish the causal effect of randomly varying one's subgroup attributes."


Dijkman B, Kooistra B, Bhandari M; Evidence-Based Surgery Working Group. How to work with a subgroup analysis. Can J Surg. 2009;52(6):515-522. 

Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. “Using Randomization in Development Economics Research: A Toolkit.” T. Schultz and John Strauss, eds., Handbook of Development Economics. Vol. 4. Amsterdam and New York: North Holland.

Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton

Keele, L., Stevenson, R., & Elwert, F. (2020). The causal interpretation of estimated associations in regression models. Political Science Research and Methods, 8(1), 1-13. doi:10.1017/psrm.2019.31