Wednesday, September 2, 2020

Blocking and Causality

In a previous post I discussed block randomized designs. 

Duflo et al (2008) describe this in more detail:

"Since the covariates to be used must be chosen in advance in order to avoid specification searching and data mining, they can be used to stratify (or block) the sample in order to improve the precision of estimates. This technique (¯rst proposed by Fisher (1926)) involves dividing the sample into groups sharing the same or similar values of certain observable characteristics. The randomization ensures that treatment and control groups will be similar in expectation. But stratification is used to ensure that along important observable dimensions this is also true in practice in the sample....blocking is more efficient than controlling ex post for these variables, since it ensures an equal proportion of treated and untreated unit within each block and therefore minimizes variance."

They also elaborate on blocking when you are interested in subgroup analysis:

"Apart from reducing variance, an important reason to adopt a stratified design is when the researchers are interested in the effect of the program on specific subgroups. If one is interested in the effect of the program on a sub-group, the experiment must have enough power for this subgroup (each sub-group constitutes in some sense a distinct experiment). Stratification according to those subgroups then ensure that the ratio between treatment and control units is determined by the experimenter in each sub-group, and can therefore be chosen optimally. It is also an assurance for the reader that the sub-group analysis was planned in advance."

Dijkman et al (2009) discuss subgroup analysis in blocked or stratified designs in more detail:

"When stratification of randomization is based on subgroup variables, it is more likely that treatment assignments within subgroups are balanced, making each subgroup a small trial. Because randomization makes it likely for the subgroups to be similar in all aspects except treatment, valid inferences about treatment efficacy within subgroups are likely to be drawn. In post hoc subgroup analyses, the subgroups are often incomparable because no stratified randomization is performed. Additionally, stratified randomization is desirable since it forces researchers to define subgroups before the start of the study."

Both of these accounts are very much consistent with each other in thinking of randomization within subgroups as creating a mini trial where causal inferences can be drawn. But I think the key thing to consider is that they are referring to comparisons made WITHIN subgroups and not necessarily BETWEEN subgroups.

Gerber and Green discuss this in one of their chapters on the analysis of block randomized experiments:

"Regardless of whether one controls for blocks using weighted regression or regression with indicators for blocks, they key principle is to compare treatment and control subjects within blocks, not between blocks."

When we start to compare treatment and control units BETWEEN blocks or subgroups we are essentially interpreting covariates, and covariates cannot be given a causal interpretation. Gerber and Green discuss an example related to differences in the performance of Hindu vs. Muslim schools.

"it could just be that religion is a marker for a host of unmeasured attributes that are correlated with educational outcomes. The set of covariates included in an experimental analysis need not be a complete list of factors that affect outcomes: the fact that some factors are left out or poorly measured is not a source of bias when the aim is to measure the average treatment effect of the random intervention. Omitted variables and mismeasurement, however, can lead to sever bias if the aim is to draw causal inferences about the effects of covariates. Causal interpretation of the covariates encounters all of the threats to inference associated with analysis of observational data."

In other words, these kinds of comparisons face the same challenges as interpreting control variables in a regression in an observational setting (see Keele, 2020).

But why doesn't randomization within religion allow us to make causal statements about these comparisons? Let's think about a different example. Suppose we wanted to measure treatment effects for some kind of educational intervention and we were interested in subgroup differences in the outcome between public and private high schools. We could randomly assign treatments and controls within the public school population and do the same within the private school population. We know the overall treatment effect estimate would be unbiased because school type would be perfectly balanced (instead of balanced just on average, as in a completely randomized design), and we would expect all other important confounders to be balanced between treatments and controls on average.
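
To make this concrete, here is a minimal sketch of what within-stratum random assignment could look like in code. The data are hypothetical: the school_type column, sample size, and 50/50 assignment split are all assumptions for illustration, not anyone's actual design.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)

# hypothetical student-level data; school_type is the blocking variable
students = pd.DataFrame({
    "student_id": range(1, 2001),
    "school_type": rng.choice(["public", "private"], size=2000, p=[0.7, 0.3]),
})

def assign_within_blocks(df, block_col, p_treat=0.5):
    """Randomly assign a treatment indicator separately within each block (stratum)."""
    def assign(block):
        n = len(block)
        n_treat = int(round(n * p_treat))
        labels = np.array([1] * n_treat + [0] * (n - n_treat))
        rng.shuffle(labels)
        return block.assign(treated=labels)
    return df.groupby(block_col, group_keys=False).apply(assign)

students = assign_within_blocks(students, "school_type")

# the share treated is (nearly) identical within each school type by construction
print(students.groupby("school_type")["treated"].mean())
```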

We also know that within the group of private schools the treatment and control units should, at least on average, be balanced for certain confounders (median household income, teachers' education/training/experience, and perhaps an unobservable confounder related to student motivation).

We could say the same thing about comparisons WITHIN the subgroup of public schools. But there is no reason to believe that the treated students in private schools would be comparable to the treated students in public schools because there is no reason to expect that important confounders would be balanced when making the comparisons. 

Assume we are looking at differences in first semester college GPA. Maybe within the private school subgroup we find that treated students on average have a first semester college GPA that is .25 points higher than the comparable control group. But within the public school subgroup, this difference was only .10. We can say that there is a difference in outcomes of .15 points between groups, but can we say this is causal? Is the difference really related to school type, or is school type really a proxy for income, teacher quality, or motivation? If we increased motivation or income in the public schools would that make up the difference? We might do better if our design had originally stratified on all of these important confounders like income and teacher education. Then we could compare students in both public and private schools with similar family incomes and teachers with similar credentials. But...there is no reason to believe that student motivation would be balanced. We can't block or stratify on an unobservable confounder. Again, as Gerber and Green state, we find ourselves in a world that borders between experimental and non-experimental methods. Simply put, the subgroups defined by any particular covariate that itself is not or cannot be randomly assigned may have different potential outcomes. What we can say from these results is that school type predicts the outcome but does not necessarily cause it.
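
A rough simulation can make the distinction concrete. The sketch below hard-codes the hypothetical effects from the example above (.25 in private schools, .10 in public schools) and a confounded baseline gap; the variable names and numbers are made up for illustration. The WITHIN-subgroup differences recover the randomized effects, while the BETWEEN-subgroup contrast of roughly .15 is descriptive, not the causal effect of school type.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# hypothetical data: school type, a treatment randomized within school type,
# and a simulated first-semester GPA with a larger effect in private schools
school = rng.choice(["public", "private"], size=n, p=[0.7, 0.3])
treated = rng.integers(0, 2, size=n)
gpa = (
    2.8
    + 0.2 * (school == "private")                        # baseline gap driven by confounders
    + np.where(school == "private", 0.25, 0.10) * treated
    + rng.normal(0, 0.5, n)
)
df = pd.DataFrame({"school": school, "treated": treated, "gpa": gpa})

# WITHIN-subgroup effects: treated minus control mean, separately by school type
within = df.groupby(["school", "treated"])["gpa"].mean().unstack()
within["effect"] = within[1] - within[0]
print(within["effect"])

# BETWEEN-subgroup contrast (~0.15): descriptive, not the causal effect of school type
print(within.loc["private", "effect"] - within.loc["public", "effect"])
```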

Gerber and Green expound on this idea:

"Subgroup analysis should be thought of as exploratory or descriptive analysis....if the aim is simply to predict when treatment effects will be large, the researcher need not have a correctly specified causal model that explains treatment effects (see to explain or predict)....noticing that treatment effects tend to be large in some groups and absent from others can provide important clues about why treatments work. But resist the temptation to think subgroup differences establish the causal effect of randomly varying one's subgroup attributes."

References

Dijkman B, Kooistra B, Bhandari M; Evidence-Based Surgery Working Group. How to work with a subgroup analysis. Can J Surg. 2009;52(6):515-522. 

Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. “Using Randomization in Development Economics Research: A Toolkit.” T. Schultz and John Strauss, eds., Handbook of Development Economics. Vol. 4. Amsterdam and New York: North Holland.

Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton

Keele, L., Stevenson, R., & Elwert, F. (2020). The causal interpretation of estimated associations in regression models. Political Science Research and Methods, 8(1), 1-13. doi:10.1017/psrm.2019.31

Friday, August 28, 2020

Blocked Designs

When I first learned about randomized complete block designs as an undergraduate, they were just another set of computations to memorize for the test (this was before I understood statistics as a way of thinking, not a box of tools). However, blocking is an important way to think about your experiment.

In Steel and Torrie's well-known experimental design text, they write:

"in many situations it is known beforehand that certain experimental units, if treated alike, will behave differently....designs or layouts can be constructed so that the portion of variability attributed to the recognized source can be measured and thus excluded from the experimental error." 

In other words, blocking improves the precision of estimates in randomized designs. In experimental research, blocking implies randomly assigning treatments and controls within blocks (or strata) defined by a set of observed pre-treatment covariates. By guaranteeing that treatment and control units are alike on the blocking covariates, we eliminate the chance that differences in those covariates will impact our inferences.

With a large enough sample size and successfully implemented randomization, we expect treatment and control units to be 'balanced' at least on average across covariate values. However, it is always wise to assess covariate balance after randomization to ensure that this is the case. 

One argument for blocking is to prevent such chance imbalances. In cases where randomization is successfully implemented, treatment and control units will have similar covariate values on average, or in expectation. But with block randomization, treatment and control units are guaranteed to be balanced on the blocking covariates.
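
One standard way to analyze a blocked design (in the spirit of the Gerber and Green quote above about weighting within blocks) is to estimate the difference in means within each block and then combine the block-level estimates, weighting by block size. A minimal sketch with simulated data follows; the block labels, sample size, and simulated effect of 0.5 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1200

# hypothetical blocked experiment: 'block' is a pre-treatment stratum and the
# outcome level differs across blocks, which is exactly what blocking absorbs
df = pd.DataFrame({
    "block": rng.choice(["A", "B", "C"], size=n),
    "treated": rng.integers(0, 2, size=n),
})
df["y"] = 1.0 + 0.5 * df["treated"] + 2.0 * (df["block"] == "C") + rng.normal(0, 1, n)

def block_weighted_ate(data, y="y", treat="treated", block="block"):
    """Weighted average of within-block differences in means (weights = block shares)."""
    effects = data.groupby(block).apply(
        lambda g: g.loc[g[treat] == 1, y].mean() - g.loc[g[treat] == 0, y].mean()
    )
    weights = data[block].value_counts(normalize=True).reindex(effects.index)
    return (effects * weights).sum()

print(block_weighted_ate(df))  # should land close to the simulated effect of 0.5
```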

Blocking vs. Matching and Regression

It is common practice, if we find imbalances in certain covariates or control variables, to 'control' for them after the fact, often using linear regression. Gerber and Green discuss blocking extensively. They claim, however, that for experiments with sample sizes of more than 100 observations, the gains in precision from block randomization over a completely randomized design (with possible regression adjustment using controls for imbalances) become negligible (citing Rosenberger and Lachin, 2002). However, they caution that having to resort to regression with controls introduces the temptation to interpret control variables causally in ways that are inappropriate (see also Keele, 2020).

In observational settings where randomization does not occur, we often try to mimic the covariate balance we would get in a randomized experiment through matching or regression. But there are important differences. Regression and matching create comparisons where covariate values are the same across treatment and control units in expectation, or 'on average,' for observable and measurable variables but not necessarily for unobservable confounders. Randomization ensures on average that we get balanced comparisons even for unobservable and unmeasurable characteristics. King and Nielsen are critical of propensity score matching, claiming it attempts to mimic a completely randomized design when we should be striving for observational methods that target a fully blocked randomized design.

"The weakness of PSM comes from its attempts to approximate a completely randomized experiment, rather than, as with other matching methods, a more efficient fully blocked randomized experiment. PSM is thus uniquely blind to the often large portion of imbalance that can be eliminated by approximating full blocking with other matching methods."


References:

Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton

Keele, L., Stevenson, R., & Elwert, F. (2020). The causal interpretation of estimated associations in regression models. Political Science Research and Methods, 8(1), 1-13. doi:10.1017/psrm.2019.31

Gary King and Richard Nielsen. 2019. “Why Propensity Scores Should Not Be Used for Matching.” Political Analysis, 27, 4. Copy at https://j.mp/2ovYGsW

Imai K, King G, Stuart EA. Misunderstandings among experimentalists and observationalists in causal inference. Journal of the Royal Statistical Society Series A. 2008;171(2):481–502.

Steel, Robert G.D., James H. Torrie, and David A. Dickey. 1997. Principles and Procedures of Statistics: A Biometrical Approach. McGraw-Hill.

Sunday, July 26, 2020

Assessing Balance for Matching and RCTs

Assessing the balance between treatment and control groups across control variables is an important part of propensity score matching. It is, heuristically, an attempt to 'recreate' a situation similar to a randomized experiment where all subjects are essentially the same except for the treatment (Thoemmes and Kim, 2011). Matching itself should not be viewed so much as an estimation technique, but as a pre-processing step to ensure that members assigned to treatment and control groups have similar covariate distributions 'on average' (Ho et al., 2007).
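
As a rough sketch of what matching as pre-processing looks like, the code below simulates confounded observational data, fits a logistic regression propensity score, and does greedy 1:1 nearest-neighbor matching without replacement. This is just one of many possible matching approaches, and everything here (variable names, coefficients, sample size) is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# simulated observational data where treatment probability depends on covariates
x1, x2 = rng.normal(size=n), rng.normal(size=n)
treat = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * x1 - 0.5 * x2))))
df = pd.DataFrame({"x1": x1, "x2": x2, "treat": treat})

# step 1: estimate propensity scores with a logistic regression
ps = LogisticRegression().fit(df[["x1", "x2"]], df["treat"])
df["pscore"] = ps.predict_proba(df[["x1", "x2"]])[:, 1]

# step 2: greedy 1:1 nearest-neighbor matching on the propensity score, without replacement
controls = df[df["treat"] == 0].copy()
matched_idx = []
for _, row in df[df["treat"] == 1].iterrows():
    j = (controls["pscore"] - row["pscore"]).abs().idxmin()
    matched_idx.append(j)
    controls = controls.drop(j)

# the matched sample is the 'pre-processed' data handed off to the outcome analysis
matched = pd.concat([df[df["treat"] == 1], df.loc[matched_idx]])
print(matched.groupby("treat")[["x1", "x2"]].mean())  # covariate means should be closer
```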

This understanding of matching often gets lost among practitioners, and it is evident in attempts to use statistical significance testing (like t-tests) to assess baseline differences in covariates between treatment and control groups. This is often done as a means to (1) determine which variables to match on and (2) determine whether appropriate balance has been achieved after matching.

Stuart (2010) discusses this:

"Although common, hypothesis tests and p-values that incorporate information on the sample size (e.g., t-tests) should not be used as measures of balance, for two main reasons (Austin, 2007; Imai et al., 2008). First, balance is inherently an in-sample property, without reference to any broader population or super-population. Second, hypothesis tests can be misleading as measures of balance, because they often conflate changes in balance with changes in statistical power. Imai et al. (2008) show an example where randomly discarding control individuals seemingly leads to increased balance, simply because of the reduced power."

Imai et al. (2008) elaborate. Using simulation they demonstrate that:

"The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics that are used in applied research. For example, the same simulation applied to the Kolmogorov–Smirnov test shows that its p-value monotonically increases as we randomly drop more control units. This is because a smaller sample size typically produces less statistical power and hence a larger p-value"

and

"from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant in this context"

Austin (2009) has a paper devoted completely to balance diagnostics for propensity score matching (absolute standardized differences are recommended as an alternative to using significance tests).
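
A minimal sketch of that kind of diagnostic is below: the absolute standardized mean difference for a single covariate, which unlike a p-value does not depend on sample size. The 0.1 threshold mentioned in the docstring is a common rule of thumb, not a hard cutoff, and the usage example uses made-up data.

```python
import numpy as np

def abs_standardized_mean_diff(x_treat, x_control):
    """Absolute standardized mean difference: |mean difference| / pooled SD.

    Unlike a t-test p-value, this balance measure does not shrink or grow with
    sample size. A common rule of thumb flags values above roughly 0.1.
    """
    x_treat, x_control = np.asarray(x_treat), np.asarray(x_control)
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return np.abs(x_treat.mean() - x_control.mean()) / pooled_sd

# hypothetical usage for a single covariate measured in both groups
rng = np.random.default_rng(5)
print(abs_standardized_mean_diff(rng.normal(0.2, 1, 300), rng.normal(0.0, 1, 300)))
```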

OK, so based on this view of matching as a data pre-processing step in an observational setting, using hypothesis tests and p-values to assess balance may not make sense. But what about randomized controlled trials and randomized field trials? In those cases randomization is used as a means to achieve balance outright, instead of matching after the fact in an observational setting. Even better, we hope to achieve balance on unobservable confounders that we could never measure or match on. But sometimes randomization isn't perfect in this regard, especially in smaller samples. So we may still want to investigate treatment and control covariate balance in this setting in order to (1) identify potential issues with randomization and (2) statistically control for any chance imbalances.

Altman (1985) discusses the implication of using significance tests to assess balance in randomized clinical trials:

"Randomised allocation in a clinical trial does not guarantee that the treatment groups are comparable with respect to baseline characteristics. It is common for differences between treatment groups to be assessed by significance tests but such tests only assess the correctness of the randomisation, not whether any observed imbalances between the groups might have affected the results of the trial. In particular, it is quite unjustified to conclude that variables that are not significantly differently distributed between groups cannot have affected the results of the trial."

"The possible effect of imbalance in a prognostic factor is considered, and it is shown that non‐significant imbalances can exert a strong influence on the observed result of the trial, even when the risk associated with the factor is not all that great."

Even though this was in the context of an RCT and not an observational study, this seems to parallel the simulation results from Imai et al. (2008). Altman seems indignant about the practice:

"Putting these two ideas together, performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a
procedure is clearly absurd."

More recent discussions include Egbewale (2015) and Pocock et al. (2002), who found that nearly 50% of practitioners were still employing significance testing to assess covariate balance in randomized trials.

So if using significance tests for balance assessment in matched and randomized studies is so 1985....why are we still doing it?

References:

Altman, D.G. (1985), Comparability of Randomised Groups. Journal of the Royal Statistical Society: Series D (The Statistician), 34: 125-136. doi:10.2307/2987510

Austin, PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statist. Med. 2009; 28:3083-3107.

Austin, PC. The performance of different propensity score methods for estimating marginal odds ratios. Stat Med. 2007 Jul 20; 26(16):3078-94.

Egbewale, Bolaji Emmanuel. Statistical issues in randomised controlled trials: a narrative synthesis. Asian Pacific Journal of Tropical Biomedicine. Volume 5, Issue 5, 2015, Pages 354-359. ISSN 2221-1691.

Ho, Daniel E. and Imai, Kosuke and King, Gary and Stuart, Elizabeth A., Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, Vol. 15, pp. 199-236, 2007, Available at SSRN: https://ssrn.com/abstract=1081983

Imai K, King G, Stuart EA. Misunderstandings among experimentalists and observationalists in causal inference. Journal of the Royal Statistical Society Series A. 2008;171(2):481–502.

Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21(19):2917-2930. doi:10.1002/sim.1296

Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25(1):1-21. doi:10.1214/09-STS313

Thoemmes, F. J. & Kim, E. S. (2011). A systematic review of propensity score methods in the social  sciences. Multivariate Behavioral Research, 46(1), 90-118.

Zhang Z, Kim HJ, Lonjon G, Zhu Y; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Balance diagnostics after propensity score matching.

Wednesday, May 6, 2020

Experimentation and Causal Inference: Strategy and Innovation

Knowledge is the most important resource in a firm and the essence of organizational capability, innovation, value creation, strategy, and competitive advantage. Causal knowledge is no exception. In previous posts I have discussed the value proposition of experimentation and causal inference from both mainline and behavioral economic perspectives. This series of posts has been greatly influenced by Jim Manzi's book 'Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society.' Midway through the book Manzi highlights three important things that experimentation and causal inference in business settings can do:

1) Precision around the tactical implementation of strategy
2) Feedback on the performance of a strategy and refinements driven by evidence
3) Achievement of organizational and strategic alignment

Manzi explains that within any corporation there are always silos and subcultures advocating competing strategies with perverse incentives and agendas in pursuit of power and control. How do we know who is right and which programs or ideas are successful, considering the many factors that could be influencing any outcome of interest? Manzi describes any environment where the number of causes of variation is enormous as an environment with 'high causal density.' We can claim to address this with a data driven culture, but what does that mean? How do we know what is, and isn't, supported by data? Modern companies in a digital age with AI and big data are drowning in data. This makes it easy to adorn rhetoric in advanced analytical frameworks. Because data seldom speaks, anyone can speak for the data through wily data storytelling. Decision makers fail to make the distinction between just having data and having evidence to support good decisions.

As Jim Manzi and Stefan Thomke discuss in Harvard Business Review:

"business experiments can allow companies to look beyond correlation and investigate causality....Without it, executives have only a fragmentary understanding of their businesses, and the decisions they make can easily backfire."

Without experimentation and causal inference, there is no way to connect the things we do with the value created. In complex environments with high causal density, we don't know enough about the nature and causes of human behavior, decisions, and causal paths from actions to outcomes to list them all and measure and account for them even if we could agree how to measure them. This is the nature of decision making under uncertainty. But, as R.A. Fisher taught us with his agricultural experiments, randomized tests allow us to account for all of these hidden factors (Manzi calls them hidden conditionals). Only then does our data stand a chance to speak truth. Experimentation and causal inference don't provide perfect information, but they are the only means by which we can begin to say that we have data and evidence to inform the tactical implementation of our strategy as opposed to pretending that we do based on correlations alone. As economist F.A. Hayek once said:

"I prefer true but imperfect knowledge, even if it leaves much undetermined and unpredictable, to a pretense of exact knowledge that is likely to be false"

In Dual Transformation: How to Reposition Today's Business While Creating the Future, the authors discuss the importance of experimentation and causal inference as a way to navigate uncertainty in causally dense environments in what they refer to as transformation B:

“Whenever you innovate, you can never be sure about the assumptions on which your business rests. So, like a good scientist, you start with a hypothesis, then design an experiment. Make sure the experiment has clear objectives (why are you running it and what do you hope to learn). Even if you have no idea what the right answer is, make a prediction. Finally, execute in such a way that you can measure the prediction, such as running a so-called A/B test in which you vary a single factor."

Experiments aren't just tinkering and trying new things. While these are helpful to innovation, just tinkering and observing still leaves you speculating about what really works and is subject to all the same behavioral biases and pitfalls of big data previously discussed.

List and Gneezy address this in The Why Axis:

"Many businesses experiment and often...businesses always tinker...and try new things...the problem is that businesses rarely conduct experiments that allow a comparison between a treatment and control group...Business experiments are research investigations that give companies the opportunity to get fast and accurate data regarding important decisions."

Three things distinguish experimentation and causal inference from just tinkering:

1) Separation of signal from noise (statistical inference)
2) Connecting cause and effect  (causal inference)
3) Clear signals on business value that follows from 1 & 2 above

Having causal knowledge helps identify more informed and calculated risks vs. risks taken on the basis of gut instinct, political motivation, or overly optimistic and behaviorally biased data-driven correlational pattern finding analytics. 

Experimentation and causal inference add incremental knowledge and value to business. No single experiment is going to be a 'killer app' that by itself will generate millions in profits. But in aggregate the knowledge created by experimentation and causal inference probably offers the greatest strategic value across an enterprise compared to any other analytic method.

As discussed earlier, experimentation and causal inference create value by helping manage the knowledge problem within firms. It's worth repeating List and Gneezy:

"We think that businesses that don't experiment and fail to show, through hard data, that their ideas can actually work before the company takes action - are wasting their money....every day they set suboptimal prices, place adds that do not work, or use ineffective incentive schemes for their work force, they effectively leave millions of dollars on the table."

As Luke Froeb writes in Managerial Economics, A Problem Solving Approach (3rd Edition):

"With the benefit of hindsight, it is easy to identify successful strategies (and the reasons for their success) or failed strategies (and the reason for their failures). It's much more difficult to identify successful or failed strategies before they succeed or fail."

Again from Dual Transformation:

"Explorers recognize they can't know the right answer, so they want to invest as little as possible in learning which of their hypotheses are right and which ones are wrong"

Experimentation and causal inference offer the opportunity to test strategies early on a smaller scale to get causal feedback about potential success or failure before fully committing large amounts of irrecoverable resources. They allow us to fail smarter and learn faster. Experimentation and causal inference play a central role in product development, strategy, and innovation across a range of industries and companies like Harrah's casinos, Capital One, Petco, Publix, State Farm, Kohl's, Wal-Mart, and Humana who have been leading in this area for decades in addition to new ventures like Amazon and Uber. 

"At Uber Labs, we apply behavioral science insights and methodologies to help product teams improve the Uber customer experience. One of the most exciting areas we’ve been working on is causal inference, a category of statistical methods that is commonly used in behavioral science research to understand the causes behind the results we see from experiments or observations...Teams across Uber apply causal inference methods that enable us to bring richer insights to operations analysis, product development, and other areas critical to improving the user experience on our platform." - From: Using Causal Inference to Improve the Uber User Experience (link)

Economist Joshua Angrist says of his students who have gone on to work for companies like Amazon: "when I ask them what are they up to they say...we're running experiments."

Achieving the greatest value from experimentation and causal inference requires leadership commitment. It also demands a culture that is genuinely open to learning through a blend of trial and error, data driven decision making informed by theory and experiments, and the infrastructure necessary for implementing enough tests and iterations to generate the knowledge necessary for rapid learning and innovation. It requires business leaders, strategists, and product managers to think about what they are trying to achieve and to ask causal questions to get there (vs. data scientists sitting in an ivory tower dreaming up models or experiments of their own). The result is a corporate culture that allows an organization to formulate, implement, and modify strategy faster and more tactfully than others.

See also:
Experimentation and Causal Inference: The Knowledge Problem
Experimentation and Causal Inference: A Behavioral Economics Perspective
Statistics is a Way of Thinking, Not a Box of Tools

Tuesday, April 21, 2020

Experimentation and Causal inference: A Behavioral Economic Perspective

In my previous post I discussed the value proposition of experimentation and causal inference from a mainline economic perspective. In this post I want to view this from a behavioral economic perspective. From this point of view experimentation and causal inference can prove to be invaluable with respect to challenges related to overconfidence and decision making under uncertainty.

Heuristic Data Driven Decision Making and Data Story Telling

In a fast-paced environment, decisions are often made quickly and based on gut instinct. Progressive companies have tried as much as possible to leverage big data and analytics to be data driven organizations. Ideally, leveraging data would help to override the biases, gut instincts, and ulterior motives that may stand behind a scientific hypothesis or business question. One of the many things we have learned from behavioral economics is that humans tend to over interpret data into unreliable patterns that lead to incorrect conclusions. Francis Bacon recognized this over 400 years ago:

"the human understanding is of its own nature prone to suppose the existence of more order and regularity in the world than it finds" 

Anyone can tell a story with data. And with lots of data, a good data storyteller can tell a story to support any decision they want, good or bad. Decision makers can be easily duped by big data, ML, AI, and various BI tools into thinking that their data is speaking to them. As Jim Manzi and Stefan Thomke state in Harvard Business Review, in the absence of experimentation and causal inference

"executives end up misinterpreting statistical noise as causation—and making bad decisions"

Data seldom speaks, and when it does it is often lying. This is the impetus behind the introduction of what became the scientific method. The true art and science of data science is teasing out the truth, or what version of truth can be found in the story being told. I think this is where experimentation and causal inference are most powerful and create the greatest value in the data science space. John List and Uri Gneezy discuss this in their book 'The Why Axis.' 

"Big data is important, but it also suffers from big problems. The underlying approach relies heavily on correlations, not causality. As David Brooks has noted, 'A zillion things can correlate with each other depending on how you structure of the data and what you compare....because our work focuses on field experiments to infer causal relationships, and because we think hard about these causal relationships of interest before generating the data we go well beyond what big data could ever deliver."

Decision Making Under Uncertainty, Risk Aversion, and The Dunning-Kruger Effect

Kahneman (in Thinking Fast and Slow) makes an interesting observation in relation to managerial decision making. Very often managers reward peddlers of even dangerously misleading information (data charlatans) while disregarding or even punishing merchants of truth. Confidence in a decision is often based more on the coherence of a story than the quality of information that supports it. Those that take risks based on bad information, when it works out, are often rewarded. To quote Kahneman:

"a few lucky gambles can crown a reckless leader with a Halo of prescience and boldness"

The essence of good decision science is understanding and seriously recognizing risk and uncertainty. As Kahneman discusses in Thinking Fast and Slow, those that take the biggest risks are not necessarily any less risk averse; they are often simply less aware of the risks they are actually taking. This leads to overconfidence, a lack of appreciation for uncertainty, and a culture where a solution based on pretended knowledge is often preferred and even rewarded. It's easy to see how the Dunning-Kruger effect would dominate. This feeds a vicious cycle that leads to collective blindness toward risk and uncertainty. It leads to taking risks that should be avoided in many cases, and prevents others from considering smarter calculated risks. Thinking through an experimental design (engaging Kahneman's system 2) provides a structured way of thinking about business problems and all the ways our biases and the data can fool us. In this way experimentation and causal inference can ensure a better informed risk appetite to support decision making.

Just as rapid cycles of experiments in a business setting can aid in the struggle with the knowledge problem, experimentation and causal inference can aid us in our struggles with biased decision making and biased data.  Data alone doesn't make good decisions because good decisions require something outside the data. Good decision science leverages experimentation and causal inference that brings theory and subject matter expertise together with data so we can make better informed business decisions in the face of our own biases and the biases in data.

A business culture that supports risk taking, coupled with experimentation and causal inference, will come to value real knowledge over pretended knowledge. That's valuable.


Monday, April 20, 2020

Experimentation and Causal Inference Meet the Knowledge Problem

Why should firms leverage experimentation and causal inference? With recent advancements in computing power and machine learning, why can't they simply base all of their decisions on predictions or historical patterns discovered in the data using AI?  Perhaps statisticians and econometricians and others have a simple answer. The kinds of learnings that will help us understand the connections between decisions and the value we create require understanding causality. This requires something that may not be in the data to begin with. Experimentation and causal inference may be the best (if not the only) way of answering these questions. In this series of posts I want to focus on a number of fundamental reasons that experimentation and causal inference are necessary in business settings from the perspective of both mainline and behavioral economics:

Part 1: The Knowledge Problem
Part 2:  Behavioral Biases
Part 3:  Strategy and Tactics

In this post I want to discuss the value of experimentation and causal inference from a basic economic perspective. The fundamental problem of economics, society, and business is the knowledge problem. In his famous 1945 American Economic Review article The Use of Knowledge in Society, Hayek argues:

"the economic problem of society is not merely a problem of how to allocate 'given resources'....it is a problem of the utilization of knowledge which is not given to anyone in its totality."

A really good parable explaining the knowledge problem is the essay I, Pencil by Leonard E. Read. The fact that no one person possesses the necessary information to make something that seems so simple as a basic number 2 pencil captures the essence of the knowledge problem.

If you remember your principles of economics, you know that the knowledge problem is solved by prices which reflect tradeoffs based on the disaggregated incomplete and imperfect knowledge and preferences of millions (billions) of individuals. Prices serve both the function of providing information and the incentives to act on that information. It is through this information creation and coordinating process that prices help solve the knowledge problem. Prices solve the problem of calculation that Hayek alluded to in his essay, and they are what coordinate all of the activities discussed in I, Pencil. 

In Living Economics: Yesterday, Today, and Tomorrow, Peter J. Boettke discusses the knowledge problem in the context of firms and the work of economist Murray Rothbard:

"firms cannot vertically integrate without facing a calculation problem....vertical integration eliminates the external market for producer goods."

Coase also recognized that as firms integrate to eliminate transaction costs, they also eliminate the markets which generate the prices that solve the knowledge problem! This tradeoff has to be managed well or firms go out of business. In a way, firms can be viewed as little islands with socially planned economies in a sea of market competition. As Luke Froeb masterfully illustrates in his text Managerial Economics: A Problem Solving Approach (3rd Ed), decisions within firms in effect create regulations, taxes, and subsidies that destroy wealth creating transactions. Managers should make decisions that consummate the most wealth creating transactions (or at least do their best not to destroy, discourage, or prohibit wealth creating transactions).

So how do we solve the knowledge problems in firms without the information creating and coordinating role of prices? Whenever mistakes are made, Luke Froeb provides this problem solving algorithm that asks:

1) Who is making the bad decision?
2) Do they have enough information to make a good decision?
3) Do they have the incentive to make a good decision?

In essence, in the absence of prices, we must try to answer the same questions that market processes often resolve. And we can leverage experimentation and causal inference to address each of the questions above:

How do we know a decision was good or bad to begin with? 
How do we get the information to make a good decision? 
What incentives or nudges work best to motivate good decision making? 

What does failure to solve the knowledge problem in firms look like in practical terms? Failure to consummate wealth creating transactions implies money left on the table - but experimentation and causal inference can help us figure out how to reclaim some of these losses. List and Gneezy address this in The Why Axis:

"We think that businesses that don't experiment and fail to show, through hard data, that their ideas can actually work before the company takes action - are wasting their money....every day they set suboptimal prices, place adds that do not work, or use ineffective incentive schemes for their work force, they effectively leave millions of dollars on the table."

Going back to I, Pencil and Hayek's essay, the knowledge problem is solved through the spontaneous coordination of multitudes of individual plans via markets. Through a trial and error process where feedback is given through prices, the plans that do the best job coordinating people's choices are adopted. Within firms there are often only a few plans compared to the market, and these take the form of various strategies and tactics. But as discussed in Jim Manzi's book Uncontrolled, firms can mimic this trial and error feedback process through iterative experimentation.

While experimentation and causal inference cannot perfectly emulate the same kind of evolutionary feedback mechanisms prices deliver in market competition, an iterative test and learn culture within a business may provide the best strategy for dealing with the knowledge problem. And that is one of many ways that experimentation and causal inference can create value.

Monday, April 6, 2020

Statistics is a Way of Thinking, Not Just a Box of Tools

If you have taken very many statistics courses you may have gotten the impression that statistics is mostly a mixed bag of computations and rules for conducting hypothesis tests, making predictions, or creating forecasts. While this isn't necessarily wrong, it could leave you with the opinion that statistics is mostly just a box of tools for solving problems. Statistics absolutely provides us with important tools for understanding the world, but to think of statistics as 'just tools' can have some pitfalls (besides the most common pitfall of having a hammer and viewing every problem as a nail).

For one, there is a huge gap between the theoretical 'tools' and real world application. This gap is filled with critical thinking, judgment calls, and various social norms, practices, and expectations that differ from field to field, business to business, and stakeholder to stakeholder. The art and science of statistics is often about filling this gap. That's quite a bit more than 'just tools.'

The proliferation of open source programming languages (like R and Python) and point-and-click automated machine learning solutions (like DataRobot and H2O.ai) might give the impression that after you have done your homework framing the business problem and doing the data and feature engineering, all that is left is hyper-parameter tuning and plugging and playing with a number of algorithms until the 'best' one is found. It might reduce to a mechanical (and sometimes time consuming, if not using automated tools) exercise. The fact that a lot of this work can in fact be automated probably contributes to the 'toolbox' mentality when thinking about the much broader field of statistics as a whole. In The Book of Why, Judea Pearl provides an example explaining why statistical inference (particularly causal inference) problems can't be reduced to easily automated mechanical exercises:

"path analysis doesn't lend itself to canned programs......path analysis requires scientific thinking as does every exercise in causal inference. Statistics, as frequently practiced, discourages it and encourages "canned" procedures instead. Scientists will always prefer routine calculations on data to methods that challenge their scientific knowledge."

Indeed, a routine practice that takes a plug and play approach with 'tools' can be problematic in many cases of statistical inference. A good example is simply plugging GLM models into a difference-in-differences context, or combining matching with difference-in-differences. While we can get these approaches to 'play well together' under the correct circumstances, it's not as simple as calling the packages and running the code. Viewing methods of statistical inference and experimental design as just a box of tools to be applied to data could leave one open to the plug and play fallacy. There are times you might get by with using a flathead screwdriver to tighten a Phillips head screw, but we need to understand that inferential methods are not so easily substituted, even if the fit looks snug enough on the surface.
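
For context, the canonical linear difference-in-differences setup is just an interaction regression; the sketch below uses simulated two-period data with made-up variable names and effect sizes. The caution above is that swapping in a nonlinear GLM or a matched sample changes what that interaction term identifies, so the substitution requires thought, not just a different function call.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 4000

# simulated two-period data: 'treat' marks the treated group, 'post' the second period
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
df["y"] = (
    1.0 + 0.5 * df["treat"] + 0.3 * df["post"]
    + 0.4 * df["treat"] * df["post"]          # the simulated effect picked up by the interaction
    + rng.normal(0, 1, n)
)

# canonical linear difference-in-differences: the coefficient on treat:post is the DiD estimate
did = smf.ols("y ~ treat * post", data=df).fit()
print(did.params["treat:post"])
```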

Understanding the business problem and data storytelling are in fact two other areas of data science that would be difficult to automate. But don't let that fool you into thinking that the remainder of data science, including statistical inference, is simply a mechanical exercise that allows one to apply the 'best' algorithm to 'big data'. You might get by with that for the minority of use cases that require a purely predictive or pattern finding solution, but the remainder of the world's problems are not so tractable. Statistics is about more than data or the patterns we find in it. It's a way of thinking about the data.

"Causal Analysis is emphatically not just about data; in causal analysis we must incorporate some understanding of the process that produces the data and then we get something that was not in the data to begin with." - Judea Pearl, The Book of Why

Statistics is A Way of Thinking

In their well-known advanced textbook "Principles and Procedures of Statistics, A Biometrical Approach", Steel and Torrie push back on the attitude that statistics is just about computational tools:

"computations are required in statistics, but that is arithmetic, not mathematics nor statistics...statistics implies for many students a new way of thinking; thinking in terms of uncertainties of probabilities.....this fact is sometimes overlooked and users are tempted to forget that they have to think, that statistics cannot think for them. Statistics can however help research workers design experiments and objectively evaluate the resulting numerical data."

At the end of the day we are talking about leveraging data driven decision making to override biases and often gut instincts and ulterior motives that may stand behind a scientific hypothesis or business question.  Objectively evaluating numerical data as Steel and Torrie put it above. But what do we actually mean by data driven decision making? Mastering (if possible) statistics, inference, and experimental design is part of a lifelong process of understanding and interpreting data to solve applied problems in business and the sciences. It's not just about conducting your own analysis and being your own worst critic, but also about interpreting, criticizing, translating and applying the work of others. Biologist and geneticist Kevin Folta put this well once in a Talking Biotech podcast:

"I've trained for 30 years to be able to understand statistics and experimental design and interpretation...I'll decide based on the quality of the data and the experimental design....that's what we do."

In 'Uncontrolled' Jim Manzi states:

"observing a naturally occurring event always leaves open the possibility of confounded causes...though in reality no experimenter can be absolutely certain that all causes have been held constant the conscious and rigorous attempt to do so is the crucial distinction between an experiment and an observation."

Statistical inference and experimental design provide us with a structured way to think about real world problems and the data we have to solve them while avoiding as much as possible the gut based data story telling that intentional or not, can sometimes be confounded and misleading. As Francis Bacon once stated:

"what is in observation loose and vague is in information deceptive and treacherous"

Statistics provides a rigorous way of thinking that moves us from mere observation to useful information.

*UPDATE: Kevin Gray wrote a very good article that really gets at the spirit of a lot of what I wanted to convey in this post.

https://www.linkedin.com/pulse/statistical-thinking-nutshell-kevin-gray/

See also:

To Explain or Predict

Applied Econometrics