Econometric Sense: December 2019

Monday, December 16, 2019

Some Recommended Podcasts and Episodes on AI and Machine Learning

Something I have been interested in for some time now is both is the convergence of big data and genomics and the convergence of causal inference and machine learning.

I am a big fan of the Talking Biotech Podcast which allows me to keep up with some of the latest issues and research in biotechnology and medicine. A recent episode related to AI and machine learning covered a lot of topics that resonated with me.

There was excellent discussion on the human element involved in this work, and the importance of data data prep/feature engineering (the 80% of work that has to happen before the ML/AI can do its job) and the challenges of non-standard 'omics' data. Also the potential biases that researchers and developers can inadvertently introduce in this process. Much more including applications of machine learning and AI in this space and best ways to stay up to speed on fast changing technologies without having to be a heads down programmer.

I've been in a data science role since 2008 and have transitioned from SAS to R to python. I've been able to keep up within the domain of causal inference to the extent possible, but I keep up with broader trends I am interested in via podcasts like Talking Biotech. Below is a curated list of my favorites related to data science with a few of my favorite episodes highlighted.

1) Casual Inference - This is my new favorite podcast by two biostatisticians covering epidemiology/biostatistics/causal inference - and keeping it casual.

Fairness in Machine Learning with Sherri Rose | Episode 03 - http://casualinfer.libsyn.com/fairness-in-machine-learning-with-sherri-rose-episode-03

This episode was the inspiration for my post: When Wicked Problems Meet Biased Data.

2) Data Skeptic

3) Super Data Science

#131 - The One Purpose of Data Science https://www.superdatascience.com/podcast/podcast-one-purpose-data-science-truth-analytics

#051 Understanding the Newest Big Data Technology Buzz Terms https://www.superdatascience.com/podcast/sds-051-understanding-newest-big-data-technology-buzz-terms

#093 Evolutionary Programming -

https://www.superdatascience.com/podcast/podcast-evolutionary-programming

4) This Week in Machine Learning

#266 - Can we trust scientific discoveries made using machine learning

https://twimlai.com/twiml-talk-266-can-we-trust-scientific-discoveries-made-using-machine-learning-with-genevera-allen/

#288 Automated ML for RNA Design https://twimlai.com/twiml-talk-288-automated-ml-for-rna-design-with-danny-stoll/

5) O'Reilly Data Show

How social science research can inform the design of AI systems https://www.oreilly.com/radar/podcast/how-social-science-research-can-inform-the-design-of-ai-systems/

6) Talk Python to Me

7) Bioinformatics Chat

#37 Causality and potential outcomes with Irineo Cabreros - https://bioinformatics.chat/potential-outcomes

8) EconTalk

Jim Manzi 'Uncontrolled' - https://www.econtalk.org/manzi-on-knowledge-policy-and-uncontrolled/

Jim Manzi - Oregon Medicaid Experiment - https://www.econtalk.org/jim-manzi-on-the-oregon-medicaid-study-experimental-evidence-and-causality/

Susan Athey - Big Data and Causality https://www.econtalk.org/susan-athey-on-machine-learning-big-data-and-causation/

Josh Angrist - Econometrics and Causation https://www.econtalk.org/joshua-angrist-on-econometrics-and-causation/

Andrew Gelman - Social Science, Small Samples, and the Garden of Forking Paths https://www.econtalk.org/andrew-gelman-on-social-science-small-samples-and-the-garden-of-the-forking-paths/

James Heckman - Facts, Evidence, and the State of Econometrics https://www.econtalk.org/james-heckman-on-facts-evidence-and-the-state-of-econometrics/

See also:

Data Cleaning

Got Data? Probably not like your econometrics textbook!

In God we trust, all others show me your code.

Data Science, 10% inspiration, 90% perspiration

The Internet of Things, Big Data, and John Deere

Big Ag Meets Big Data (Part 1 & Part 2)

Big Data- Causality and Local Expertise are Key in Agronomic Applications
BMC Proceedings: A comparison of random forests, boosting and support vector machines for genomic selection

Wednesday, December 11, 2019

When Wicked Problems Meet Biased Data

In "Dissecting racial bias in an algorithm used to manage the health of populations" (Science, Vol 366 25 Oct. 2019) the authors discuss inherent racial bias in widely adopted algorithms in healthcare. In a nutshell these algorithms use predicted cost as a proxy for health status. Unfortunately, in healthcare, costs can proxy for other things as well:

"Black patients generate lesser medical expenses, conditional on health, even when we account for specific comorbidities. As a result, accurate prediction of costs necessarily means being racially biased on health."

So what happened? How can it be mitigated? What can be done going forward?

In data science, there are some popular frameworks for solving problems. One widely known approach is the CRISP-DM framework. Alternatively, in The Analytics Lifecycle Toolkit a similar process is proposed:

(1) - Problem Framing
(2) - Data Sense Making
(3) - Analytics Product Development
(4) - Results Activation

The wrong turn in Albuquerque here may have been at the corner of problem framing and data understanding or data sense making.

The authors state:

"Identifying patients who will derive the greatest benefit from these programs is a challenging causal inference problem that requires estimation of individual treatment effects. To solve this problem health systems make a key assumption: Those with the greatest care needs will benefit the most from the program. Under this assumption, the targeting problem becomes a pure prediction public policy problem."

The distinctions between 'predicting' and 'explaining' have been made in the literature by multiple authors in the last two decades. The problem with this substitution has important implications. To quote Galit Shmueli:

"My thesis is that statistical modeling, from the early stages of study design and data collection to data usage and reporting, takes a different path and leads to different results, depending on whether the goal is predictive or explanatory."

Almost a decade before, Leo Brieman encouraged us to think outside the box when solving problems by considering multiple approaches:

"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed."

A number of data analysts today may not be cognizant of the differences in predictive vs explanatory modeling and statistical inference. It may not be clear to them how that impacts their work. This could be related to background, training, or the kinds of problems they have worked on given their experience. It is also important that we don't compartmentalize so much that we miss opportunities to approach our problem from a number of different angles (Leo Breiman's 'straight jacket') This is perhaps what happened in the Science article, once the problem was framed as a predictive modeling problem other modes of thinking may have shut down even if developers were aware of all of these distinctions.

The take away is that we think differently when doing statistical inference/explaining vs. predicting or doing machine learning. Making the substitution of one for the other impacts the way we approach the problem (things we care about, things we consider vs. discount etc.) and this impacts the data preparation, modeling, and interpretation.

For instance, in the Science article, after framing the problem as a predictive modeling problem, a pivotal focus became the 'labels' or target for prediction.

"The dilemma of which label to choose relates to a growing literature on 'problem formulation' in data science: the task of turning an often amorphous concept we wish to predict into a concrete variable that can be predicted in a given dataset."

As noted in the paper 'labels are often measured with errors that reflect structural inequalities.'

Addressing the issue with label choice can come with a number of challenges briefly alluded to in the article:

1) deep understanding of the domain - i.e subject matter expertise
2) identification and extraction of relevant data - i.e. data engineering and data governance
3) capacity to iterate and experiment - i.e. understanding causality, testing and measurement strategy

Data science problems in healthcare are wicked problems defined by interacting complexities with social, economic, and biological dimensions that transcend simply fitting a model to data. Expertise in a number of disciplines is required.

Bias in Risk Adjustment

In the Science article, the specific example was in relation to predictive models targeting patients for disease management programs. However, there are a number of other predictive modeling applications where these same issues can be prevalent in the healthcare space.

In Fair Regression for Health Care Spending, Sherri Rose and Anna Zink discuss these challenges in relation to popular regression based risk adjustment applications. Aligning with the analytics lifecycle discussed above, they point out there are several places where issues of bias can be addressed including pre-processing, model fitting, and post processing stages of analysis. In this article they focus largely on the modeling stage leveraging a number of constrained and penalized regression algorithms designed to optimize fairness. This work looks really promising, but the authors point out a number of challenges related to scalability and optimizing fairness across a number of metrics or groups.

Toward Causal AI and ML

Previously I referenced Galit Shmueli's work that discussed how differently we approach and think about predictive vs explanatory modeling. In the Book of Why, Judea Pearl discusses causal inferential thinking:

"Causal Analysis is emphatically not just about data; in causal analysis we must incorporate some understanding of the process that produces the data and then we get something that was not in the data to begin with."

There is currently a lot of work fusing machine learning and causal inference that could create more robust learning algorithms. For example, Susan Athey's work with causal forests, Leon Bottou's work related to causal invariance, and Elias Barenboim's work on the data fusion problem. This work, including the kind of work mentioned before related to fair regression will help inform the next generation of predictive modeling, machine learning, and causal inference models in the healthcare space that hopefully will represent a marked improvement over what is possible today.

However, we can't wait half a decade or more while the theory is developed and adopted by practitioners. In the Science article, the authors found alternative metrics for targeting disease management programs besides total costs that calibrate much more fairly across groups. Bridging the gap in other areas will require a combination of awareness of these issues and creativity throughout the analytics product lifecycle. As the authors conclude:

"careful choice can allow us to enjoy the benefits of algorithmic predictions while minimizing the risks."

References and Additional Reading:

This paper was recently discussed on the Casual Inference podcast.

Annual Review of Public Health 2020 41:1

Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1, 206–215 (2019) doi:10.1038/s42256-019-0048-x

Breiman, Leo. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statist. Sci. 16 (2001), no. 3, 199--231. doi:10.1214/ss/1009213726. https://projecteuclid.org/euclid.ss/1009213726

Shmueli, G., "To Explain or To Predict?", Statistical Science, vol. 25, issue 3, pp. 289-310, 2010.

Fair Regression for Health Care Spending. Anna Zink, Sherri Rose. arXiv:1901.10566v2 [stat.AP]