This understanding of matching often gets lost among practitioners, as is evident in attempts to use statistical significance tests (like t-tests) to assess baseline differences in covariates between treatment and control groups. This is often (mistakenly) done to (1) determine which variables to match on and (2) determine whether appropriate balance has been achieved after matching.

Stuart (2010) discusses this:

*"Although common, hypothesis tests and p-values that incorporate information on the sample size (e.g., t-tests) should not be used as measures of balance, for two main reasons (Austin, 2007; Imai et al., 2008). First, balance is inherently an in-sample property, without reference to any broader population or super-population. Second, hypothesis tests can be misleading as measures of balance, because they often conflate changes in balance with changes in statistical power. Imai et al. (2008) show an example where randomly discarding control individuals seemingly leads to increased balance, simply because of the reduced power."*

Imai et al. (2008) elaborate. Using simulation they demonstrate that:

*"The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics that are used in applied research. For example, the same simulation applied to the Kolmogorov–Smirnov test shows that its p-value monotonically increases as we randomly drop more control units. This is because a smaller sample size typically produces less statistical power and hence a larger p-value"*

and

*"from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant in this context"*
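To make the power argument concrete, here is a minimal sketch of the kind of experiment Imai et al. describe. It is not their simulation: the group sizes, the 0.3 mean shift, the seed and all function names are assumptions chosen for illustration. It holds a treated group fixed, randomly discards controls, and tracks both the standardized mean difference (with a pooled-SD denominator) and the absolute Welch t-statistic, averaged over repeated random draws.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    """Sample variance (n - 1 denominator)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def std_diff(treated, control):
    """Standardized mean difference; its formula has no sample-size term."""
    pooled_sd = math.sqrt((var(treated) + var(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def welch_t(treated, control):
    """Welch t-statistic; its standard error shrinks as groups grow."""
    se = math.sqrt(var(treated) / len(treated) + var(control) / len(control))
    return (mean(treated) - mean(control)) / se


def drop_controls_experiment(keep_sizes=(2000, 500, 100, 10), reps=300, seed=0):
    """Randomly keep `keep` controls and average the two diagnostics over reps."""
    rng = random.Random(seed)
    # Hypothetical data: the covariate is shifted by 0.3 SD in the treated group.
    treated = [rng.gauss(0.3, 1.0) for _ in range(500)]
    controls = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    results = {}
    for keep in keep_sizes:
        diffs, abs_ts = [], []
        for _ in range(reps):
            kept = rng.sample(controls, keep)  # random discarding, no matching
            diffs.append(std_diff(treated, kept))
            abs_ts.append(abs(welch_t(treated, kept)))
        results[keep] = (mean(diffs), mean(abs_ts))
    return results


results = drop_controls_experiment()
for keep, (d, t) in results.items():
    print(f"controls kept={keep:5d}  mean std. diff={d:.3f}  mean |t|={t:.2f}")
```

Under these assumptions, the average standardized difference should stay close to its full-sample value at every sample size, since discarding controls at random does nothing to balance. The average |t| should shrink as controls are dropped, which is what a larger p-value would reflect.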

Austin (2009) devotes an entire paper to balance diagnostics for propensity score matching, recommending absolute standardized differences as an alternative to significance tests.
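For reference, one common way to write the standardized difference is sketched below. The notation is my own assumption: $\bar{x}_T, \bar{x}_C$ are the covariate means in the treated and control groups, $s_T^2, s_C^2$ the sample variances, and $\hat{p}_T, \hat{p}_C$ the sample proportions for a binary covariate.

$$
d_{\text{continuous}} = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\dfrac{s_T^2 + s_C^2}{2}}},
\qquad
d_{\text{binary}} = \frac{\hat{p}_T - \hat{p}_C}{\sqrt{\dfrac{\hat{p}_T(1-\hat{p}_T) + \hat{p}_C(1-\hat{p}_C)}{2}}}
$$

Neither expression contains the sample size. Compare that with the two-sample t-statistic in the simple case of equal group sizes $n$:

$$
t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\dfrac{s_T^2}{n} + \dfrac{s_C^2}{n}}} = d_{\text{continuous}} \cdot \sqrt{\frac{n}{2}}
$$

So for a fixed amount of imbalance $d$, the test statistic (and with it the p-value) moves with $n$ alone, which is the conflation of balance and power that the quotes above warn about.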

OK, so based on this view of matching as a data pre-processing step in an observational setting, using hypothesis tests and p-values to assess balance doesn't seem to make sense. But what about randomized controlled trials and randomized field trials? In those cases, randomization is used to achieve balance up front instead of matching after the fact. Even better, we hope to achieve balance on unobservable confounders that we could never measure or match on. But randomization isn't always perfect in this regard, especially in smaller samples. So we may still want to investigate covariate balance between treatment and control groups in this setting in order to (1) identify potential issues with randomization and (2) statistically control for any chance imbalances.

Altman (1985) discusses the implications of using significance tests to assess balance in randomized clinical trials:

*"Randomised allocation in a clinical trial does not guarantee that the treatment groups are comparable with respect to baseline characteristics. It is common for differences between treatment groups to be assessed by significance tests but such tests only assess the correctness of the randomisation, not whether any observed imbalances between the groups might have affected the results of the trial. In particular, it is quite unjustified to conclude that variables that are not significantly differently distributed between groups cannot have affected the results of the trial."*

*"The possible effect of imbalance in a prognostic factor is considered, and it is shown that non‐significant imbalances can exert a strong influence on the observed result of the trial, even when the risk associated with the factor is not all that great."*

Even though this was in the context of an RCT and not an observational study, this seems to parallel the simulation results from Imai et al. (2008). For some reason, Altman made me chuckle when I read this:

*"Putting these two ideas together, performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a procedure is clearly absurd."*

More recent discussions include Egbewale (2015) and Pocock et al. (2002), who found that nearly 50% of practitioners were still employing significance testing to assess covariate balance in randomized trials.

So if using significance tests for balance assessment in matched and randomized studies is so 1985... why are we still doing it?

**References:**

Altman, D.G. (1985), Comparability of Randomised Groups. Journal of the Royal Statistical Society: Series D (The Statistician), 34: 125-136. doi:10.2307/2987510

Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statist. Med. 2009; 28:3083–3107

Austin PC. The performance of different propensity score methods for estimating marginal odds ratios. Stat Med. 2007 Jul 20; 26(16):3078-94.

Egbewale BE. Statistical issues in randomised controlled trials: a narrative synthesis. Asian Pacific Journal of Tropical Biomedicine. 2015;5(5):354-359. ISSN 2221-1691

Ho, Daniel E. and Imai, Kosuke and King, Gary and Stuart, Elizabeth A., Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, Vol. 15, pp. 199-236, 2007, Available at SSRN: https://ssrn.com/abstract=1081983

Imai K, King G, Stuart EA. Misunderstandings among experimentalists and observationalists in causal inference. Journal of the Royal Statistical Society Series A. 2008;171(2):481–502.

Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21(19):2917-2930. doi:10.1002/sim.1296

Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25(1):1-21. doi:10.1214/09-STS313

Thoemmes, F. J. & Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate Behavioral Research, 46(1), 90-118.

Zhang Z, Kim HJ, Lonjon G, Zhu Y; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Balance diagnostics after propensity score matching.