Wednesday, September 30, 2020

Calibration, Discrimination, and Ethics

Classification models with binary and categorical outcomes are often assessed based on the c-statistic or area under the ROC curve.

This metric ranges between 0 and 1 and provides a summary of model performance in terms of its ability to rank observations. For example, if a model is developed to predict the probability of default, the area under the ROC curve can be interpreted as the probability that a randomly chosen observation from the observed default class will be ranked higher (based on model predictions or probability) than a randomly chosen observation from the observed non-default class (Provost and Fawcett, 2013). This metric is not without criticism and should not be used as the exclusive criterion for model assessment in all cases. As argued by Cook (2007):

'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'

Calibration is an alternative metric for model assessment. Calibration measures the agreement between observed and predicted risk, or the closeness of model-predicted probability to the underlying probability of the population under study. Both discrimination and calibration are included in the National Quality Forum's Measure Evaluation Criteria. However, many have noted that calibration is largely underutilized by practitioners in the data science and predictive modeling communities (Walsh et al., 2017; Van Calster et al., 2019). Models that perform well on the basis of discrimination (area under the ROC curve) may not perform well based on calibration (Cook, 2007). In fact, a model with a lower ROC score could actually calibrate better than a model with a higher ROC score (Van Calster et al., 2019). This can lead to ethical concerns, as lack of calibration in predictive models can, in application, result in decisions that lead to over- or under-utilization of resources (Van Calster et al., 2019).
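As a minimal sketch of how the two metrics measure different things, the snippet below computes both discrimination (area under the ROC curve) and calibration (Brier score and a binned calibration curve) for the same model. The data are simulated with scikit-learn and all variable names are illustrative, not taken from any of the studies cited here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve

# simulated binary outcome (e.g., default vs. non-default)
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]

# discrimination: probability a random event is ranked above a random non-event
auc = roc_auc_score(y_te, p)

# calibration: agreement between predicted and observed risk
brier = brier_score_loss(y_te, p)
obs_rate, pred_prob = calibration_curve(y_te, p, n_bins=10)

print(f"AUC = {auc:.3f}, Brier score = {brier:.3f}")
```

A model can rank observations well (high AUC) while its predicted probabilities run systematically high or low; plotting obs_rate against pred_prob is what reveals that gap, which the AUC alone never shows.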

Others have argued there are ethical considerations as well:

“Rigorous calibration of prediction is important for model optimization, but also ultimately crucial for medical ethics. Finally, the amelioration and evolution of ML methodology is about more than just technical issues: it will require vigilance for our own human biases that makes us see only what we want to see, and keep us from thinking critically and acting consistently.” (Levy, 2020)

Van Calster et al. (2019), Walsh et al. (2017), and Steyerberg et al. (2010) provide guidance on ways of assessing model calibration.

Frank Harrell provides a great discussion about choosing the correct metrics for model assessment, along with a wealth of resources.


Levy, Drew Griffin. Matrix of Confusion. GoodScience, Inc. Accessed 9/22/2020.

Nancy R. Cook, Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation. 2007; 115: 928-935

Provost, Foster, and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O'Reilly. 2013.

Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138. doi:10.1097/EDE.0b013e3181c30fb2

Walsh, Colin G., Kavya Sharman, and George Hripcsak. Beyond discrimination: A comparison of calibration methods and clinical usefulness of predictive models of readmission risk. Journal of Biomedical Informatics, Volume 76, 2017, Pages 9-18. ISSN 1532-0464.

Van Calster, B., McLernon, D.J., van Smeden, M. et al. Calibration: the Achilles heel of predictive analytics. BMC Med 17, 230 (2019).

Wednesday, September 2, 2020

Blocking and Causality

In a previous post I discussed block randomized designs. 

Duflo et al (2008) describe this in more detail:

"Since the covariates to be used must be chosen in advance in order to avoid specification searching and data mining, they can be used to stratify (or block) the sample in order to improve the precision of estimates. This technique (first proposed by Fisher (1926)) involves dividing the sample into groups sharing the same or similar values of certain observable characteristics. The randomization ensures that treatment and control groups will be similar in expectation. But stratification is used to ensure that along important observable dimensions this is also true in practice in the sample....blocking is more efficient than controlling ex post for these variables, since it ensures an equal proportion of treated and untreated unit within each block and therefore minimizes variance."

They also elaborate on blocking when you are interested in subgroup analysis:

"Apart from reducing variance, an important reason to adopt a stratified design is when the researchers are interested in the effect of the program on specific subgroups. If one is interested in the effect of the program on a sub-group, the experiment must have enough power for this subgroup (each sub-group constitutes in some sense a distinct experiment). Stratification according to those subgroups then ensure that the ratio between treatment and control units is determined by the experimenter in each sub-group, and can therefore be chosen optimally. It is also an assurance for the reader that the sub-group analysis was planned in advance."

Dijkman et al (2009) discuss subgroup analysis in blocked or stratified designs in more detail:

"When stratification of randomization is based on subgroup variables, it is more likely that treatment assignments within subgroups are balanced, making each subgroup a small trial. Because randomization makes it likely for the subgroups to be similar in all aspects except treatment, valid inferences about treatment efficacy within subgroups are likely to be drawn. In post hoc subgroup analyses, the subgroups are often incomparable because no stratified randomization is performed. Additionally, stratified randomization is desirable since it forces researchers to define subgroups before the start of the study."

Both of these accounts seem very much consistent with each other in terms of thinking about randomization within subgroups as creating a mini trial where causal inferences can be drawn. But I think the key thing to consider is that they are referring to comparisons made WITHIN subgroups and not necessarily BETWEEN subgroups.

Gerber and Green discuss this in one of their chapters on the analysis of block randomized experiments:

"Regardless of whether one controls for blocks using weighted regression or regression with indicators for blocks, the key principle is to compare treatment and control subjects within blocks, not between blocks."

When we start to compare treatment and control units BETWEEN blocks or subgroups, we are essentially interpreting covariates, and covariates cannot be given a causal interpretation. Gerber and Green discuss an example related to differences in the performance of Hindu vs. Muslim schools.

"it could just be that religion is a marker for a host of unmeasured attributes that are correlated with educational outcomes. The set of covariates included in an experimental analysis need not be a complete list of factors that affect outcomes: the fact that some factors are left out or poorly measured is not a source of bias when the aim is to measure the average treatment effect of the random intervention. Omitted variables and mismeasurement, however, can lead to severe bias if the aim is to draw causal inferences about the effects of covariates. Causal interpretation of the covariates encounters all of the threats to inference associated with analysis of observational data."

In other words, these kinds of comparisons face the same challenges as interpreting control variables in a regression in an observational setting (see Keele, 2020).

But why doesn't randomization within religion allow us to make causal statements about these comparisons? Let's think about a different example. Suppose we wanted to measure treatment effects for some kind of educational intervention and we were interested in subgroup differences in the outcome between public and private high schools. We could randomly assign treatments and controls within the public school population and do the same within the private school population. We know overall treatment effects would be unbiased because school type would be perfectly balanced (instead of balanced just on average, as in a completely randomized design), and we would expect all other important confounders to be balanced between treatments and controls on average.

We also know that within the group of private schools the treatment and controls should at least on average be balanced for certain confounders (median household income, teacher's education/training/experience, and perhaps an unobservable confounder related to student motivation). 

We could say the same thing about comparisons WITHIN the subgroup of public schools. But there is no reason to believe that the treated students in private schools would be comparable to the treated students in public schools because there is no reason to expect that important confounders would be balanced when making the comparisons. 

Assume we are looking at differences in first-semester college GPA. Maybe within the private subgroup we find that treated students on average have a first-semester college GPA that is .25 points higher than the comparable control group. But within the public school subgroup, this difference was only .10. We can say that there is a difference in outcomes of .15 points between groups, but can we say this is causal? Is the difference really related to school type, or is school type really a proxy for income, teacher quality, or motivation? If we increased motivation or income in the public schools, would that make up the difference? We might do better if our design originally stratified on all of these important confounders like income and teacher education. Then we could compare students in both public and private schools with similar family incomes and teachers with similar credentials. But...there is no reason to believe that student motivation would be balanced. We can't block or stratify on an unobservable confounder. Again, as Gerber and Green state, we find ourselves in a world that borders between experimental and non-experimental methods. Simply, the subgroups defined by any particular covariate that itself is not or cannot be randomly assigned may have different potential outcomes. What we can say from these results is that school type predicts the outcome but does not necessarily cause it.
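The school example can be sketched in a small simulation (all numbers and names invented for illustration): within each school type, randomization licenses a causal within-block contrast, and the overall treatment effect is a block-share weighted average of those contrasts. The .15 gap between the block-level effects is exactly the part that cannot be read causally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10000

# hypothetical blocks (school type) with different baselines and treatment effects
block = rng.choice(["private", "public"], size=n)
treated = rng.integers(0, 2, size=n)
effect = np.where(block == "private", 0.25, 0.10)   # true within-block effects
baseline = np.where(block == "private", 3.0, 2.7)   # school type proxies confounders
gpa = baseline + effect * treated + rng.normal(0, 0.5, size=n)
df = pd.DataFrame({"block": block, "treated": treated, "gpa": gpa})

# valid: within-block treatment-control contrasts, combined with block-share weights
effects = {
    name: g.loc[g.treated == 1, "gpa"].mean() - g.loc[g.treated == 0, "gpa"].mean()
    for name, g in df.groupby("block")
}
weights = df["block"].value_counts(normalize=True)
ate = sum(effects[b] * weights[b] for b in effects)
print({k: round(v, 2) for k, v in effects.items()}, "ATE:", round(ate, 2))
```

The private-vs-public gap in effects (about .15 here) mixes the treatment with everything school type stands in for: income, teacher quality, motivation. Only the within-block contrasts and their weighted average inherit the causal warrant of randomization.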

Gerber and Green expound on this idea:

"Subgroup analysis should be thought of as exploratory or descriptive analysis....if the aim is simply to predict when treatment effects will be large, the researcher need not have a correctly specified causal model that explains treatment effects (see to explain or predict)....noticing that treatment effects tend to be large in some groups and absent from others can provide important clues about why treatments work. But resist the temptation to think subgroup differences establish the causal effect of randomly varying one's subgroup attributes."


Dijkman B, Kooistra B, Bhandari M; Evidence-Based Surgery Working Group. How to work with a subgroup analysis. Can J Surg. 2009;52(6):515-522. 

Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. “Using Randomization in Development Economics Research: A Toolkit.” T. Schultz and John Strauss, eds., Handbook of Development Economics. Vol. 4. Amsterdam and New York: North Holland.

Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton

Keele, L., Stevenson, R., & Elwert, F. (2020). The causal interpretation of estimated associations in regression models. Political Science Research and Methods, 8(1), 1-13. doi:10.1017/psrm.2019.31

Friday, August 28, 2020

Blocked Designs

When I first learned about randomized complete block designs as an undergraduate, it was just another set of computations to memorize for the test (this was before I understood statistics as a way of thinking, not a box of tools). However, blocking is an important way to think about your experiment.

In Steel and Torrie's well known experimental design text, they discuss:

"in many situations it is known beforehand that certain experimental units, if treated alike, will behave differently....designs or layouts can be constructed so that the portion of variability attributed to the recognized source can be measured and thus excluded from the experimental error." 

In other words, blocking improves the precision of estimates in randomized designs. In experimental research, blocking often implies randomly assigning treatment and control groups within blocks (or strata) defined by a set of observed pre-treatment covariates. By guaranteeing that treatment and control groups are balanced on the blocking covariates, we eliminate the chance that differences in those covariates will impact inferences.

With a large enough sample size and successfully implemented randomization, we expect treatment and control units to be 'balanced' at least on average across covariate values. However, it is always wise to assess covariate balance after randomization to ensure that this is the case. 

One argument for blocking is to prevent such chance imbalances outright. In cases where randomization is deemed to be successfully implemented, treatment and control units will have similar covariate values on average or in expectation. But with block randomization, treatment and control units are guaranteed to be balanced across the blocking covariates.
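A minimal sketch of how block (stratified) assignment delivers that guarantee, with invented data and names: within each block, half the units are assigned to treatment, so the blocking covariate is balanced by construction rather than only in expectation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
units = pd.DataFrame({"school_type": rng.choice(["public", "private"], size=200)})

def block_randomize(df, block_col, rng):
    """Assign half of each block to treatment (odd blocks leave the extra unit in control)."""
    assign = pd.Series(0, index=df.index, name="treated")
    for _, g in df.groupby(block_col):
        shuffled = rng.permutation(g.index.to_numpy())
        assign.loc[shuffled[: len(shuffled) // 2]] = 1
    return assign

units["treated"] = block_randomize(units, "school_type", rng)

# each block splits (up to one unit) evenly into treated and control
print(units.groupby(["school_type", "treated"]).size())
```

Under complete randomization the treated share of, say, private schools would vary from draw to draw; here it is pinned at one half within every block.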

Blocking vs. Matching and Regression

It is common practice, if we find imbalances or differences in certain covariates or control variables, to 'control' for this after the fact, often using linear regression. Gerber and Green discuss blocking extensively. They claim that for experiments with more than 100 observations, the gains in precision from block randomization over a completely randomized design (with possible regression adjustments for imbalances) become negligible (citing Rosenberger and Lachin, 2002). However, they caution that having to resort to regression with controls introduces the temptation to interpret control variables causally in ways that are inappropriate (see also Keele, 2020).

In observational settings where randomization does not occur, we often try to mimic the covariate balance we would get in a randomized experiment through matching or regression. But there are important differences. Regression and matching create comparisons where covariate values are the same across treatment and control units in expectation or 'on average' for observable and measurable variables, but not necessarily for unobservable confounders. Randomization ensures that, on average, we get balanced comparisons even for unobservable and unmeasurable characteristics. King and Nielsen are critical of propensity score matching, claiming that it attempts to mimic a completely randomized design when we should be striving for observational methods that target blocked randomized designs.

"The weakness of PSM comes from its attempts to approximate a completely randomized experiment, rather than, as with other matching methods, a more efficient fully blocked randomized experiment. PSM is thus uniquely blind to the often large portion of imbalance that can be eliminated by approximating full blocking with other matching methods."
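The imbalance that full blocking eliminates can be illustrated with a small simulation (a sketch of the general idea, not King and Nielsen's own analysis): under complete randomization a binary covariate is balanced only in expectation, while blocked assignment balances it by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100)  # one binary covariate

def imbalance_complete():
    # complete randomization: 50 treated, 50 control, ignoring x
    t = rng.permutation(np.repeat([0, 1], 50))
    return abs(x[t == 1].mean() - x[t == 0].mean())

def imbalance_blocked():
    # blocked randomization: split each level of x evenly
    t = np.zeros_like(x)
    for v in (0, 1):
        idx = rng.permutation(np.flatnonzero(x == v))
        t[idx[: len(idx) // 2]] = 1
    return abs(x[t == 1].mean() - x[t == 0].mean())

comp = np.mean([imbalance_complete() for _ in range(2000)])
blocked = np.mean([imbalance_blocked() for _ in range(2000)])
print(f"mean |imbalance|: complete={comp:.3f}, blocked={blocked:.3f}")
```

The residual imbalance under complete randomization is the "often large portion" the quote refers to; blocking on x drives it essentially to zero before any analysis is run.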


Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton

Keele, L., Stevenson, R., & Elwert, F. (2020). The causal interpretation of estimated associations in regression models. Political Science Research and Methods, 8(1), 1-13. doi:10.1017/psrm.2019.31

Gary King and Richard Nielsen. 2019. “Why Propensity Scores Should Not Be Used for Matching.” Political Analysis, 27, 4. Copy at

Imai K, King G, Stuart EA. Misunderstandings among experimentalists and observationalists in causal inference. Journal of the Royal Statistical Society Series A. 2008;171(2):481–502.

Principles and Procedures of Statistics: A Biometrical Approach. Robert George Douglas Steel, James Hiram Torrie, David A. Dickey. McGraw-Hill. 1997.

Sunday, July 26, 2020

Assessing Balance for Matching and RCTs

Assessing the balance between treatment and control groups across control variables is an important part of propensity score matching. It's heuristically an attempt to 'recreate' a situation similar to a randomized experiment where all subjects are essentially the same except for the treatment (Thoemmes and Kim, 2011). Matching itself should not be viewed so much as an estimation technique, but as a pre-processing step to ensure that members assigned to treatment and control groups have similar covariate distributions 'on average' (Ho et al., 2007).

This understanding of matching often gets lost among practitioners, as is evident in attempts to use statistical significance testing (like t-tests) to assess baseline differences in covariates between treatment and control groups. This is often (mistakenly) done as a means to (1) determine which variables to match on and (2) determine whether appropriate balance has been achieved after matching.

Stuart (2010) discusses this:

"Although common, hypothesis tests and p-values that incorporate information on the sample size (e.g., t-tests) should not be used as measures of balance, for two main reasons (Austin, 2007; Imai et al., 2008). First, balance is inherently an in-sample property, without reference to any broader population or super-population. Second, hypothesis tests can be misleading as measures of balance, because they often conflate changes in balance with changes in statistical power. Imai et al. (2008) show an example where randomly discarding control individuals seemingly leads to increased balance, simply because of the reduced power."

Imai et al. (2008) elaborate. Using simulation they demonstrate that:

"The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics that are used in applied research. For example, the same simulation applied to the Kolmogorov–Smirnov test shows that its p-value monotonically increases as we randomly drop more control units. This is because a smaller sample size typically produces less statistical power and hence a larger p-value"


"from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant in this context"

Austin (2009) has a paper devoted completely to balance diagnostics for propensity score matching (absolute standardized differences are recommended as an alternative to using significance tests).
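The phenomenon Imai et al. describe can be reproduced in a few lines (a sketch with simulated data, not their actual simulation): hold a fixed, genuine imbalance between groups constant while randomly discarding control units. The t-test p-value drifts upward as power falls, while the standardized difference Austin recommends stays roughly constant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# a fixed, real imbalance of about 0.2 standard deviations
treated = rng.normal(0.2, 1.0, size=500)
control = rng.normal(0.0, 1.0, size=5000)

mean_p = {}
for keep in (5000, 1000, 200):
    ps, smds = [], []
    for _ in range(200):  # average over repeated random subsamples
        sub = rng.choice(control, size=keep, replace=False)
        ps.append(stats.ttest_ind(treated, sub).pvalue)
        smds.append(abs(treated.mean() - sub.mean())
                    / np.sqrt((treated.var() + sub.var()) / 2))
    mean_p[keep] = np.mean(ps)
    print(f"n_control={keep:5d}  mean p={mean_p[keep]:.4f}  mean |SMD|={np.mean(smds):.2f}")
```

The p-value rises only because power falls; the underlying imbalance never improves. That is why Stuart, Imai et al., and Austin all point to sample-size-free measures such as the absolute standardized difference.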

OK, so based on this view of matching as a data pre-processing step in an observational setting, using hypothesis tests and p-values to assess balance doesn't seem to make sense. But what about randomized controlled trials and randomized field trials? In those cases randomization is used as a means to achieve balance outright instead of matching after the fact in an observational setting. Even better, we hope to achieve balance on unobservable confounders that we could never measure or match on. But sometimes randomization isn't perfect in this regard, especially in smaller samples. So we still may want to investigate treatment and control covariate balance in this setting in order to (1) identify potential issues with randomization and (2) statistically control for any chance imbalances.

Altman (1985) discusses the implication of using significance tests to assess balance in randomized clinical trials:

"Randomised allocation in a clinical trial does not guarantee that the treatment groups are comparable with respect to baseline characteristics. It is common for differences between treatment groups to be assessed by significance tests but such tests only assess the correctness of the randomisation, not whether any observed imbalances between the groups might have affected the results of the trial. In particular, it is quite unjustified to conclude that variables that are not significantly differently distributed between groups cannot have affected the results of the trial."

"The possible effect of imbalance in a prognostic factor is considered, and it is shown that non‐significant imbalances can exert a strong influence on the observed result of the trial, even when the risk associated with the factor is not all that great."

Even though this was in the context of an RCT and not an observational study, this seems to parallel the simulation results from Imai et al. (2008). For some reason, Altman made me chuckle when I read this:

"Putting these two ideas together, performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a procedure is clearly absurd."

More recent discussions include Egbewale (2015) and also Pocock et al. (2002), who found that nearly 50% of practitioners were still employing significance testing to assess covariate balance in randomized trials.

So if using significance tests for balance assessment in matched and randomized studies is so 1985....why are we still doing it?


Altman, D.G. (1985), Comparability of Randomised Groups. Journal of the Royal Statistical Society: Series D (The Statistician), 34: 125-136. doi:10.2307/2987510

Austin, PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statist. Med. 2009; 28:3083-3107.

Austin, PC. The performance of different propensity score methods for estimating marginal odds ratios. Stat Med. 2007 Jul 20; 26(16):3078-94.

Egbewale, Bolaji Emmanuel. Statistical issues in randomised controlled trials: a narrative synthesis. Asian Pacific Journal of Tropical Biomedicine. Volume 5, Issue 5, 2015, Pages 354-359. ISSN 2221-1691.

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, Vol. 15, pp. 199-236, 2007.

Imai K, King G, Stuart EA. Misunderstandings among experimentalists and observationalists in causal inference. Journal of the Royal Statistical Society Series A. 2008;171(2):481–502.

Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21(19):2917-2930. doi:10.1002/sim.1296

Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25(1):1-21. doi:10.1214/09-STS313

Thoemmes, F. J. & Kim, E. S. (2011). A systematic review of propensity score methods in the social  sciences. Multivariate Behavioral Research, 46(1), 90-118.

Zhang Z, Kim HJ, Lonjon G, Zhu Y; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Balance diagnostics after propensity score matching. Ann Transl Med. 2019.

Wednesday, May 6, 2020

The Value of Business Experiments Part 3: Strategy and Alignment

In previous posts I have discussed the value proposition of business experiments from both a classical and behavioral economic perspective. This series of posts has been greatly influenced by Jim Manzi's book 'Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society.' Midway through the book Manzi highlights three important things that experiments in business can do:

1) They provide precision around the tactical implementation of strategy
2) They provide feedback on the performance of a strategy which allows for refinements to be driven by evidence
3) They help achieve organizational and strategic alignment

Manzi explains that within any corporation there are always silos and subcultures advocating competing strategies with perverse incentives and agendas in pursuit of power and control. How do we know who is right and which programs or ideas are successful, considering the many factors that could be influencing any outcome of interest? Manzi describes any environment where the number of causes of variation is enormous as an environment with 'high causal density.' We can claim to address this with a data-driven culture, but what does that mean? Modern companies in a digital age with AI and big data are drowning in data. This makes it easy to adorn rhetoric in advanced analytical frameworks. Because data seldom speaks, anyone can speak for the data through wily data storytelling.

As Jim Manzi and Stefan Thomke discuss in Harvard Business Review:

"business experiments can allow companies to look beyond correlation and investigate causality....Without it, executives have only a fragmentary understanding of their businesses, and the decisions they make can easily backfire."

In complex environments with high causal density, we don't know enough about the nature and causes of human behavior, decisions, and causal paths from actions to outcomes to list them all and measure and account for them even if we could agree how to measure them. This is the nature of decision making under uncertainty. But, as R.A. Fisher taught us with his agricultural experiments, randomized tests allow us to account for all of these hidden factors (Manzi calls them hidden conditionals). Only then does our data stand a chance to speak truth.

Having causal knowledge helps identify more informed and calculated risks vs. risks taken on the basis of gut instinct, political motivation, or overly optimistic data-driven correlational pattern finding analytics.

Experiments add incremental knowledge and value to business. No single experiment is going to be a 'killer app' that by itself will generate millions in profits. But in aggregate the knowledge created by experiments probably offers the greatest strategic value across an enterprise compared to any other analytic method.

As Luke Froeb writes in Managerial Economics, A Problem Solving Approach (3rd Edition):

"With the benefit of hindsight, it is easy to identify successful strategies (and the reasons for their success) or failed strategies (and the reasons for their failures). It's much more difficult to identify successful or failed strategies before they succeed or fail."

Business experiments offer the opportunity to test strategies early on a smaller scale to get causal feedback about potential success or failure before fully committing large amounts of irrecoverable resources. This takes the concept of failing fast to a whole new level.

Achieving the greatest value from business experiments requires leadership commitment. It also demands a culture that is genuinely open to learning through a blend of trial and error, data-driven decision making, and the infrastructure necessary for implementing enough tests and iterations to generate the knowledge necessary for rapid learning and innovation. The result is a corporate culture that allows an organization to formulate, implement, and modify strategy faster and more tactically than others.

See also:
The Value of Business Experiments: The Knowledge Problem
The Value of Business Experiments Part 2: A Behavioral Economics Perspective
Statistics is a Way of Thinking, Not a Box of Tools

Tuesday, April 21, 2020

The Value of Business Experiments Part 2: A Behavioral Economic Perspective

In my previous post I discussed the value proposition of business experiments from a classical economic perspective. In this post I want to view this from a behavioral economic perspective. From this point of view business experiments can prove to be invaluable with respect to challenges related to overconfidence and decision making under uncertainty.

Heuristic Data Driven Decision Making and Data Story Telling

In a fast-paced environment, decisions are often made quickly and based on gut instinct. Progressive companies have tried as much as possible to leverage big data and analytics to become data-driven organizations. Ideally, leveraging data would help to override the biases, gut instincts, and ulterior motives that may stand behind a scientific hypothesis or business question. One of the many things we have learned from behavioral economics is that humans tend to over-interpret data into unreliable patterns that lead to incorrect conclusions. Francis Bacon recognized this over 400 years ago:

"the human understanding is of its own nature prone to suppose the existence of more order and regularity in the world than it finds" 

Decision makers can be easily duped by big data, ML, AI, and various BI tools into thinking that their data is speaking to them. As Jim Manzi and Stefan Thomke state in Harvard Business Review, in the absence of formal randomized testing:

"executives end up misinterpreting statistical noise as causation—and making bad decisions"

Data seldom speaks, and when it does it is often lying. This is the impetus behind the introduction of what became the scientific method. The true art and science of data science is teasing out the truth, or what version of truth can be found in the story being told. I think this is where field experiments are most powerful and create the greatest value in the data science space. 

Decision Making Under Uncertainty, Risk Aversion, and The Dunning-Kruger Effect

Kahneman (in Thinking Fast and Slow) makes an interesting observation in relation to managerial decision making. Very often managers reward peddlers of even dangerously misleading information while disregarding or even punishing merchants of truth. Confidence in a decision is often based more on the coherence of a story than the quality of information that supports it. Those that take risks based on bad information, when it works out, are often rewarded. To quote Kahneman:

"a few lucky gambles can crown a reckless leader with a halo of prescience and boldness"

As Kahneman discusses in Thinking Fast and Slow, those that take the biggest risks are not necessarily any less risk averse; they are often simply less aware of the risks they are actually taking. This leads to overconfidence, a lack of appreciation for uncertainty, and a culture where a solution based on pretended knowledge is often preferred and even rewarded. It's easy to see how the Dunning-Kruger effect would dominate. This feeds a vicious cycle that leads to collective blindness toward risk and uncertainty. It leads to taking risks that should be avoided in many cases, and prevents others from considering better but perhaps less audacious risks. Field experiments can help facilitate taking more educated gambles. Thinking through an experimental design (engaging Kahneman's system 2) provides a structured way of thinking about business problems and how to truly leverage data to solve them. And the data we get from experimental results can be interpreted causally. Identification of causal effects from an experiment helps us distinguish whether outcomes are likely due to a business decision, as opposed to blindly trusting gut instincts, luck, or the noisy patterns we might find in the data.

Just as rapid cycles of experiments in a business setting can aid in the struggle with the knowledge problem, they also provide an objective and structured way of thinking about our data and the conclusions we can reach from it, while avoiding as much as possible some of these behavioral pitfalls. A business culture that supports risk taking coupled with experimentation will come to value a tested solution over pretended knowledge. That's valuable.

See also:

Monday, April 20, 2020

The Value of Business Experiments and the Knowledge Problem

Why should firms leverage randomized business experiments? With recent advancements in computing power and machine learning, why can't they simply base all of their decisions on historical observational data? Statisticians, econometricians, and others may have a simple answer: experiments are often the best (the gold standard) way of answering causal questions. I certainly can't argue against answering causal questions (just read this blog). However, here I want to focus on a number of more fundamental reasons that experiments are necessary in business settings, from the perspective of both classical and behavioral economics:

1) The Knowledge Problem
2) Behavioral Biases
3) Strategy and Tactics

In this post I want to discuss the value of business experiments from more of a neoclassical economic perspective. The fundamental problem of economics, society, and business is the knowledge problem. In his famous 1945 American Economic Review article The Use of Knowledge in Society, Hayek argues:

"the economic problem of society is not merely a problem of how to allocate 'given resources' is a problem of the utilization of knowledge which is not given to anyone in its totality."

A really good parable explaining the knowledge problem is the essay I, Pencil by Leonard E. Read. The fact that no one person possesses the necessary information to make something that seems so simple as a basic number 2 pencil captures the essence of the knowledge problem.

If you remember your principles of economics, you know that the knowledge problem is solved by a spontaneous order guided by prices, which reflect tradeoffs based on the disaggregated, incomplete, and imperfect knowledge and preferences of millions (billions) of individuals. Prices serve the dual function of providing information and the incentives to act on that information. It is through this information creation and coordinating process that prices help solve the knowledge problem.

Prices solve the problem of calculation that Hayek alluded to in his essay, and they coordinate all of the activities discussed in I, Pencil. The knowledge problem explains how market economies work and why socially planned economies have historically failed to allocate resources effectively, resulting in shortages, surpluses, and collapse.

In Living Economics: Yesterday, Today, and Tomorrow, Peter J. Boettke discusses the knowledge problem in the context of firms and the work of economist Murray Rothbard:

"firms cannot vertically integrate without facing a calculation problem....vertical integration elminates the external market for producer goods."

In essence, and this seems consistent with Coase, as firms integrate to eliminate transaction costs they also eliminate the markets which generate the prices that solve the knowledge problem! In a way, firms could be viewed as little islands with socially planned economies in a sea of market competition. As Luke Froeb masterfully illustrates in his text Managerial Economics: A Problem Solving Approach (3rd Ed), decisions within firms in effect create regulations, taxes, and subsidies that destroy wealth-creating transactions. Managers should make decisions that consummate the most wealth-creating transactions (or at least do their best not to destroy, discourage, or prohibit them).

So how do we solve the knowledge problem in firms without the information-creating and coordinating role of prices? Whenever mistakes are made, Luke Froeb provides a problem-solving algorithm that asks:

1) Who is making the bad decision?
2) Do they have enough information to make a good decision?
3) Do they have the incentive to make a good decision?

In essence, in the absence of prices, we must try to answer the same questions that prices often resolve. We can leverage business experiments to address the second question above: experiments can provide important causal information for decision making. While I would never argue that data science, advanced analytics, artificial intelligence, or any field experiment could ever solve the knowledge problem, I will argue that business experiments become extremely valuable because of the knowledge problem within firms.
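To make the causal logic concrete, here is a minimal sketch of why randomization matters. It simulates a hypothetical business experiment (the scenario, conversion rates, and sample sizes are all illustrative assumptions, not real data): because customers are randomly assigned to the two arms, a simple difference in mean outcomes is an unbiased estimate of the average treatment effect, something observational data generally cannot guarantee.

```python
import random
import statistics

random.seed(42)

# Hypothetical example: a new checkout page (treatment) is assumed to
# raise conversion probability from 10% to 12%. These numbers are
# purely illustrative assumptions.
n = 5000  # customers randomly assigned to each arm
control = [1 if random.random() < 0.10 else 0 for _ in range(n)]
treatment = [1 if random.random() < 0.12 else 0 for _ in range(n)]

# Because assignment was randomized, the difference in mean conversion
# rates is an unbiased estimate of the average treatment effect.
p_c = statistics.mean(control)
p_t = statistics.mean(treatment)
ate = p_t - p_c

# A rough standard error for the difference in two proportions.
se = (p_c * (1 - p_c) / n + p_t * (1 - p_t) / n) ** 0.5

print(f"estimated lift: {ate:.3f} (+/- {1.96 * se:.3f} at ~95% confidence)")
```

The point of the sketch is not the arithmetic but the design: randomization severs the link between who receives the treatment and everything else that might drive the outcome, which is exactly the information a manager lacks when prices are absent.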

Going back to I, Pencil and Hayek's essay, the knowledge problem is solved through the spontaneous coordination of multitudes of individual plans via markets. Through a trial and error process, where feedback is given through prices, the plans that do the best job of coordinating people's choices are adopted. Within firms, by contrast, there are often only a few plans (embodied in various strategies and tactics) compared to the multitude in the market. But as discussed in Jim Manzi's book Uncontrolled, firms can mimic this trial and error process through iterative experimentation interspersed with theory and subject matter expertise. Experiments help establish causal facts, but it takes theory and subject matter expertise to understand which facts are relevant.

In essence, while experiments don't perfectly emulate the same kind of evolutionary feedback mechanisms prices deliver in market competition, an iterative test and learn culture within a business may provide the best strategy for dealing with the knowledge problem. And that is one of many ways that business experiments are able to contribute value.

See also:

Statistics is a Way of Thinking, Not a Box of Tools