## Monday, April 6, 2020

### Statistics is a Way of Thinking, Not Just a Box of Tools

If you have taken many statistics courses, you may have gotten the impression that the subject is mostly a mixed bag of computations and rules for conducting hypothesis tests, making predictions, or creating forecasts. While this isn't necessarily wrong, it could leave you with the opinion that statistics is mostly just a box of tools for solving problems. Statistics absolutely provides us with important tools for understanding the world, but thinking of statistics as 'just tools' has some pitfalls (besides the most common pitfall of having a hammer and viewing every problem as a nail).

For one, there is a huge gap between the theoretical 'tools' and real-world application. This gap is filled with critical thinking, judgment calls, and various social norms, practices, and expectations that differ from field to field, business to business, and stakeholder to stakeholder. The art and science of statistics is often about filling this gap. That takes a good deal more than 'just tools.'

The proliferation of open-source programming languages (like R and Python) and point-and-click automated machine learning solutions (like DataRobot and H2O.ai) might give the impression that once you have done your homework in framing the business problem and in data and feature engineering, all that is left is hyper-parameter tuning and plugging and playing with a number of algorithms until the 'best' one is found. The work might seem to reduce to a mechanical (and, without automated tools, sometimes time-consuming) exercise. The fact that a lot of this work can in fact be automated probably contributes to the 'toolbox' mentality when thinking about the much broader field of statistics as a whole. In The Book of Why, Judea Pearl provides an example explaining why statistical inference (particularly causal inference) problems can't be reduced to easily automated mechanical exercises:

"path analysis doesn't lend itself to canned programs......path analysis requires scientific thinking as does every exercise in causal inference. Statistics, as frequently practiced, discourages it and encourages "canned" procedures instead. Scientists will always prefer routine calculations on data to methods that challenge their scientific knowledge."

Indeed, a routine practice that takes a plug-and-play approach with 'tools' can be problematic in many cases of statistical inference. A good example is simply plugging GLM models into a difference-in-differences context, or combining matching with difference-in-differences. While we can get these approaches to 'play well together' under the correct circumstances, it's not as simple as calling the packages and running the code. Viewing methods of statistical inference and experimental design as just a box of tools to be applied to data could leave one open to the plug-and-play fallacy. There are times you might get by with using a flathead screwdriver to tighten up a Phillips head screw, but we need to understand that inferential methods are not so easily substituted, even if it looks like a snug enough fit on the surface.
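To make the GLM example a little more concrete, here is a minimal sketch of one well-known reason a logit model does not simply drop into a difference-in-differences design. In a saturated logit model with group, period, and interaction terms, the interaction coefficient equals the difference-in-differences of the log-odds of the cell means, which is not the same quantity as the difference-in-differences of the probabilities. The four cell probabilities below are hypothetical numbers chosen for illustration, and the helper names (`logit`, `did`) are my own, not from any package.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


# Hypothetical outcome rates for each group/period cell (illustrative only).
# Both groups rise by 10 percentage points, so trends are parallel on the
# probability scale and the effect on that scale is zero.
cells = {
    ("control", "pre"): 0.10,
    ("control", "post"): 0.20,
    ("treated", "pre"): 0.50,
    ("treated", "post"): 0.60,
}


def did(cell_means, transform=lambda p: p):
    """Difference-in-differences of the (optionally transformed) cell means."""
    treated_change = (transform(cell_means[("treated", "post")])
                      - transform(cell_means[("treated", "pre")]))
    control_change = (transform(cell_means[("control", "post")])
                      - transform(cell_means[("control", "pre")]))
    return treated_change - control_change


# DiD on the probability scale (what a linear probability model targets).
did_probability = did(cells)

# DiD on the log-odds scale (what the saturated logit interaction equals).
did_log_odds = did(cells, logit)

print(f"DiD in probabilities: {did_probability:.4f}")
print(f"DiD in log-odds:      {did_log_odds:.4f}")
```

With these made-up numbers the probability-scale estimate is essentially zero while the log-odds interaction is clearly negative, so reading the logit interaction coefficient as 'the treatment effect' would mislead. Which scale the parallel trends assumption is plausible on is a judgment call the software cannot make for you.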

Understanding the business problem and data storytelling are in fact two other areas of data science that would be difficult to automate. But don't let that fool you into thinking that the remainder of data science, including statistical inference, is simply a mechanical exercise that allows one to apply the 'best' algorithm to 'big data'. You might get by with that for the minority of use cases that require a purely predictive or pattern-finding solution, but the remainder of the world's problems are not so tractable. Statistics is about more than data or the patterns we find in it. It's a way of thinking about the data.

"Causal Analysis is emphatically not just about data; in causal analysis we must incorporate some understanding of the process that produces the data and then we get something that was not in the data to begin with." - Judea Pearl, The Book of Why

### Statistics is a Way of Thinking

In their well-known advanced textbook "Principles and Procedures of Statistics: A Biometrical Approach", Steel and Torrie push back on the attitude that statistics is just about computational tools:

"computations are required in statistics, but that is arithmetic, not mathematics nor statistics...statistics implies for many students a new way of thinking; thinking in terms of uncertainties of probabilities.....this fact is sometimes overlooked and users are tempted to forget that they have to think, that statistics cannot think for them. Statistics can however help research workers design experiments and objectively evaluate the resulting numerical data."

At the end of the day, we are talking about leveraging data-driven decision making to override the biases, gut instincts, and ulterior motives that may stand behind a scientific hypothesis or business question. This is what Steel and Torrie mean above by objectively evaluating numerical data. But what do we actually mean by data-driven decision making? Mastering (if possible) statistics, inference, and experimental design is part of a lifelong process of understanding and interpreting data to solve applied problems in business and the sciences. It's not just about conducting your own analysis and being your own worst critic, but also about interpreting, criticizing, translating, and applying the work of others. Biologist and geneticist Kevin Folta once put this well in a Talking Biotech podcast:

"I've trained for 30 years to be able to understand statistics and experimental design and interpretation...I'll decide based on the quality of the data and the experimental design....that's what we do."

In 'Uncontrolled', Jim Manzi states:

"observing a naturally occurring event always leaves open the possibility of confounded causes...though in reality no experimenter can be absolutely certain that all causes have been held constant the conscious and rigorous attempt to do so is the crucial distinction between an experiment and an observation."
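Manzi's distinction can be illustrated with a small simulation. This is only a sketch under assumptions I am making up for illustration: a single confounder `z` that drives both the outcome and, in the observational setting, who ends up treated, with a true treatment effect fixed at 2. The function name and all parameter values are hypothetical.

```python
import random
from statistics import fmean

TRUE_EFFECT = 2.0  # assumed true causal effect of treatment on the outcome


def naive_estimate(randomized: bool, n: int = 20_000, seed: int = 0) -> float:
    """Difference in mean outcomes between treated and untreated units."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # confounder, e.g. underlying customer engagement
        if randomized:
            # Experiment: a coin flip assigns treatment, independent of z.
            t = rng.random() < 0.5
        else:
            # Observation: units with high z are more likely to be treated.
            t = z + rng.gauss(0, 1) > 0
        # Outcome depends on the treatment AND on the confounder.
        y = TRUE_EFFECT * t + 3.0 * z + rng.gauss(0, 1)
        (treated if t else control).append(y)
    return fmean(treated) - fmean(control)


observational = naive_estimate(randomized=False)
experimental = naive_estimate(randomized=True)

print(f"True effect:            {TRUE_EFFECT:.2f}")
print(f"Observational estimate: {observational:.2f}")
print(f"Experimental estimate:  {experimental:.2f}")
```

The same comparison of means is run on both data sets. Under these assumptions, the observational comparison mixes the effect of the confounder into the estimate, while random assignment breaks the link between `z` and treatment. The arithmetic is identical in both cases; what differs is how the data came to be.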

Statistical inference and experimental design provide us with a structured way to think about real-world problems and the data we have to solve them, while avoiding as much as possible the gut-based data storytelling that, intentional or not, can sometimes be confounded and misleading. As Francis Bacon once stated:

"what is in observation loose and vague is in information deceptive and treacherous"

Statistics provides a rigorous way of thinking that moves us from mere observation to useful information.

*UPDATE: Kevin Gray wrote a very good article that really gets at the spirit of a lot of what I wanted to convey in this post.