In a previous post I discussed block randomized designs.
Duflo et al (2008) describe this in more detail:
"Since the covariates to be used must be chosen in advance in order to avoid specification searching and data mining, they can be used to stratify (or block) the sample in order to improve the precision of estimates. This technique (¯rst proposed by Fisher (1926)) involves dividing the sample into groups sharing the same or similar values of certain observable characteristics. The randomization ensures that treatment and control groups will be similar in expectation. But stratification is used to ensure that along important observable dimensions this is also true in practice in the sample....blocking is more efficient than controlling ex post for these variables, since it ensures an equal proportion of treated and untreated unit within each block and therefore minimizes variance."
They also elaborate on blocking when you are interested in subgroup analysis:
"Apart from reducing variance, an important reason to adopt a stratified design is when the researchers are interested in the effect of the program on specific subgroups. If one is interested in the effect of the program on a sub-group, the experiment must have enough power for this subgroup (each sub-group constitutes in some sense a distinct experiment). Stratification according to those subgroups then ensure that the ratio between treatment and control units is determined by the experimenter in each sub-group, and can therefore be chosen optimally. It is also an assurance for the reader that the sub-group analysis was planned in advance."
Dijkman et al (2009) discuss subgroup analysis in blocked or stratified designs in more detail:
"When stratification of randomization is based on subgroup variables, it is more likely that treatment assignments within subgroups are balanced, making each subgroup a small trial. Because randomization makes it likely for the subgroups to be similar in all aspects except treatment, valid inferences about treatment efficacy within subgroups are likely to be drawn. In post hoc subgroup analyses, the subgroups are often incomparable because no stratified randomization is performed. Additionally, stratified randomization is desirable since it forces researchers to define subgroups before the start of the study."
Both of these accounts seem very much consistent with each other in terms of thinking about randomization within subgroups creating a mini trial where causal inferences can be drawn. But I think the key thing to consider is they are referring to comparisons made WITHIN sub groups and not necessarily BETWEEN subgroups.
Gerber and Green discuss this in one of their chapters on analysis of block randomized experiments :
"Regardless of whether one controls for blocks using weighted regression or regression with indicators for blocks, they key principle is to compare treatment and control subjects within blocks, not between blocks."
When we start to compare treatment and control units BETWEEN blocks or subgroups we are essentially interpreting covariates and this cannot be done with a causal interpretation. Green and Gerber discuss an example related to differences in the performance of Hindu vs. Muslim schools.
"it could just be that religion is a marker for a host of unmeasured attributes that are correlated with educational outcomes. The set of covariates included in an experimental analysis need not be a complete list of factors that affect outcomes: the fact that some factors are left out or poorly measured is not a source of bias when the aim is to measure the average treatment effect of the random intervention. Omitted variables and mismeasurement, however, can lead to sever bias if the aim is to draw causal inferences about the effects of covariates. Causal interpretation of the covariates encounters all of the threats to inference associated with analysis of observational data."
In other words, these kinds of comparisons face the the same challenges related to interpreting control variables in a regression in an observational setting (see Keele, 2020).
But why doesn't randomization within religion allow us to make causal statements about these comparisons? Let's think about a different example. Suppose we wanted to measure treatment effects for some kind of educational intervention and we were interested in subgroup differences in the outcome between public and private high schools. We could randomly assign treatments and controls within the public school population and do the same within the private school population. We know overall treatment effects would be unbiased because the school type would be perfectly balanced (instead of balanced just on average in a completely random design) and we would expect all other important confounders to be balanced between treatments and controls on average.
We also know that within the group of private schools the treatment and controls should at least on average be balanced for certain confounders (median household income, teacher's education/training/experience, and perhaps an unobservable confounder related to student motivation).
We could say the same thing about comparisons WITHIN the subgroup of public schools. But there is no reason to believe that the treated students in private schools would be comparable to the treated students in public schools because there is no reason to expect that important confounders would be balanced when making the comparisons.
Assume we are looking at differences in first semester college GPA. Maybe within the private subgroup we find that treated treated students on average have a first semester college GPA that is .25 points higher the comparable control group. But within the public school subgroup, this differences was only .10. We can say that there is a difference in outcomes of .15 points between groups but can we say this is causal? Is the difference really related to school type or is school type really a proxy for income, teacher quality, or motivation? If we increased motivation or income in the public schools would that make up the difference? We might do better if our design originally stratified on all of these important confounders like income and teacher education. Then we could compare students in both public and private schools with similar family incomes and teachers of similar credentials. But...there is no reason to believe that student motivation would be balanced. We can't block or stratify on an unobservable confounder. Again, as Gerber and Green state, we find ourselves in a world that borders between experimental and non-experimental methods. Simply, the subgroups defined by any particular covariate that itself is not or cannot be randomly assigned may have different potential outcomes. What we can say from these results is that school type predicts the outcome but does not necessarily cause it.
Gerber and Green expound on this idea:
"Subgroup analysis should be thought of as exploratory or descriptive analysis....if the aim is simply to predict when treatment effects will be large, the researcher need not have a correctly specified causal model that explains treatment effects (see to explain or predict)....noticing that treatment effects tend to be large in some groups and absent from others can provide important clues about why treatments work. But resist the temptation to think subgroup differences establish the causal effect of randomly varying one's subgroup attributes."
Dijkman B, Kooistra B, Bhandari M; Evidence-Based Surgery Working Group. How to work with a subgroup analysis. Can J Surg. 2009;52(6):515-522.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. “Using Randomization in Development Economics Research: A Toolkit.” T. Schultz and John Strauss, eds., Handbook of Development Economics. Vol. 4. Amsterdam and New York: North Holland.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton
Keele, L., Stevenson, R., & Elwert, F. (2020). The causal interpretation of estimated associations in regression models. Political Science Research and Methods, 8(1), 1-13. doi:10.1017/psrm.2019.31