Sunday, February 24, 2019

The Multiplicity of Data Science

There was a really good article on LinkedIn some time ago regarding how Airbnb classifies its data science roles:

"The Analytics track is ideal for those who are skilled at asking a great question, exploring cuts of the data in a revealing way, automating analysis through dashboards and visualizations, and driving changes in the business as a result of recommendations. The Algorithms track would be the home for those with expertise in machine learning, passionate about creating business value by infusing data in our product and processes. And the Inference track would be perfect for our statisticians, economists, and social scientists using statistics to improve our decision making and measure the impact of our work."

I think this helps tremendously to clarify thinking in this space.

Sunday, February 17, 2019

Was It Meant to Be? OR Sometimes Playing Match Maker Can Be a Bad Idea: Matching with Difference-in-Differences

Previously I discussed the unique aspects of modeling claims and addressing those with generalized linear models. I followed that with a discussion of the challenges of using difference-in-differences in the context of GLM models and some ways to deal with this. In this post I want to dig into what some folks are debating in terms of issues related to combining matching with DID. Laura Hatfield covered it well in a Twitter thread.


This was also picked up at the Incidental Economist, which gave a good summary of the key papers here.

You can find citations for the relevant papers below. I won't plagiarize what both Laura and the folks at the Incidental Economist have already explained very well. But, at the risk of oversimplifying the big picture, I'll try to summarize. Matching in a few special cases can improve the precision of the estimate in a DID framework, and occasionally reduces bias. Remember that matching on pre-period observables is not necessary for the validity of difference-in-difference models. There are cases when the treatment group is in fact determined by pre-period outcome levels; in those cases matching is necessary. At other times, if not careful, matching in DID introduces risks of regression to the mean…what Laura Hatfield describes as a 'bounce back' effect in the post period that can generate or inflate treatment effects when none really exist.
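To see the 'bounce back' mechanically, here is a minimal simulation sketch (entirely made-up data and parameters, numpy assumed; not a model of any actual study). When a group is selected on pre-period outcome *levels*, and some of that level is transient noise rather than a stable unit mean, the group rebounds toward its mean in the post period even with no treatment effect at all:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Each unit has a stable mean; observed outcomes add transient noise
mu = rng.normal(0.0, 1.0, n)
pre = mu + rng.normal(0.0, 1.0, n)
post = mu + rng.normal(0.0, 1.0, n)  # NO treatment effect anywhere

# Mimic matching/selecting a group on low pre-period outcome levels:
# some of those low values are transient noise, not low stable means
selected = pre < np.quantile(pre, 0.2)

# The selected group "bounces back" toward its stable mean in the post
# period, generating a spurious positive change even though nothing happened
bounce = post[selected].mean() - pre[selected].mean()
print(f"spurious pre-to-post change: {bounce:.2f}")  # noticeably above zero
```

The overall population shows no pre-to-post change at all; the artifact appears only in the group selected on noisy pre-period levels.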

Both the previous discussion on DID in a GLM context and combining matching with DID indicate the risks involved in just plug and play causal inference and the challenges of bridging the gap between theory and application.


Daw, J. R. and Hatfield, L. A. (2018), Matching and Regression to the Mean in Difference‐in‐Differences Analysis. Health Serv Res, 53: 4138-4156. doi:10.1111/1475-6773.12993

Daw, J. R. and Hatfield, L. A. (2018), Matching in Difference‐in‐Differences: between a Rock and a Hard Place. Health Serv Res, 53: 4111-4117. doi:10.1111/1475-6773.13017

Thursday, January 24, 2019

Modeling Claims with Linear vs. Non-Linear Difference-in-Difference Models

Previously I have discussed the issues with modeling claims costs. Typically medical claims exhibit non-negative highly skewed values with high zero mass and heteroskedasticity. The most commonly suggested approach to addressing these distributional concerns in the literature calls for the use of non-linear GLM models. However, as previously discussed (see here and here) there are challenges with using difference-in-difference models in the context of GLM models. So once again, the gap between theory and application presents challenges, tradeoffs, and compromises that need to be made by the applied econometrician.

In the past I have written about the accepted (although controversial in some circles) practice of leveraging linear probability models to estimate marginal effects in applied work when outcomes are dichotomous. But what about doing this in the context of claims analysis? In my original post regarding the challenges of using difference-in-differences with claims I speculated:

"So as Angrist and Pischke might ask, what is an applied guy to do? One approach even in the context of skewed distributions with high mass points (as is common in the healthcare econometrics space) is to specify a linear model. For count outcomes (utilization like ER visits or hospital admissions are often dichotomized and modeled by logit or probit models) you can just use a linear probability model. For skewed distributions with heavy mass points, dichotomization with a LPM may also be an attractive alternative."

 I have found that this advice is pretty consistent with the social norms and practices in the field.

In their analysis of the ACA Cantor, et al (2012) leverage linear probability models for difference-in-differences for healthcare utilization stating:

"Linear probability models are fit to produce coefficients that are direct estimates of the relevant policy impacts and are easily interpreted as percentage point changes in coverage outcomes. This approach has been applied in earlier evaluations of insurance market reforms (Buchmueller and DiNardo 2002; Monheit and Steinberg Schone 2004;  Levine, McKnight, and Heep 2011;  Monheit et al. 2011). It also avoids complications associated with estimation and interpretation of multiple interaction terms and their standard errors in logit or probit models (Ai and Norton 2003)."

Jhamb et al (2015) use LPMs for dichotomous outcomes as well as OLS models for counts in a DID framework.
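For concreteness, here is a minimal sketch of the LPM difference-in-differences approach described above, on simulated data (all numbers invented; statsmodels assumed). The interaction coefficient is the DID estimate and reads directly as a percentage-point change, which is exactly the interpretability argument Cantor et al. make:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),  # treatment group indicator
    "post": rng.integers(0, 2, n),   # post-period indicator
})

# Dichotomized utilization (e.g., any ER visit); true DID effect = 0.10
p = 0.20 + 0.05 * df["treat"] + 0.03 * df["post"] + 0.10 * df["treat"] * df["post"]
df["any_visit"] = rng.binomial(1, p)

# LPM: the interaction coefficient IS the DID estimate, in percentage points;
# use heteroskedasticity-robust standard errors (LPM errors are heteroskedastic)
m = smf.ols("any_visit ~ treat * post", data=df).fit(cov_type="HC1")
print(m.params["treat:post"])  # DID estimate, near the true 0.10 here
```

Running the same design through a logit would require the Ai and Norton (2003) corrections to interpret the interaction; the LPM sidesteps that entirely.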

Interestingly, Deb and Norton (2018) discuss an approach to address the challenges of DID in a GLM framework head on:

"Puhani argued, using the potential outcomes framework, that the treatment effect on the treated in the difference-in-difference regression equals the expected value of the dependent variable for the treatment group in the post period with treatment compared with the hypothetical expected value of the dependent variable for the treatment group in the post period if they had not received treatment. In nonlinear models, the treatment effect on the treated equals the difference in two predicted values. It always has the same sign as the coefficient on the interaction term. Because we estimate many nonlinear models using a difference-in-differences study design, we report the treatment effect on the treated in all tables of results."

In presenting their results they compare their GLM based approach to results from linear models of healthcare expenditures. While they argue the differences are substantial in supporting their approach, I did not find the OLS estimate (-$323.4) to be practically different from the second part (conditional on positive) of the two part GLM model (-$321.4), although the combined results from the two part model had large practical differences from OLS. It does not appear they compared a two-part GLM to a two-part linear model (which could be problematic if the first part OLS model gave probabilities greater than 1 or less than zero). In their paper they cited a number of authors using linear difference-in-differences to model claims; you will find them below.

See the references below for a number of examples (including those cited above).

Related: Linear Literalism and Fundamentalist Econometrics


Cantor JC, Monheit AC, DeLia D, Lloyd K. Early impact of the Affordable Care Act on health insurance coverage of young adults. Health Serv Res. 2012;47(5):1773-90.

Deb, P. and Norton, E.C. Modeling Health Care Expenditures and Use. Annual Review of Public Health 2018, 39:1, 489-505.

Buchmueller T, DiNardo J. “Did Community Rating Induce an Adverse Selection Death Spiral? Evidence from New York, Pennsylvania and Connecticut” American Economic Review. 2002;92(1):280–94.

Monheit AC, Cantor JC, DeLia D, Belloff D. “How Have State Policies to Expand Dependent Coverage Affected the Health Insurance Status of Young Adults?” Health Services Research. 2011;46(1 Pt 2):251–67

Amuedo-Dorantes C, Yaya ME. 2016. The impact of the ACA’s extension of coverage to dependents on young adults’ access to care and prescription drugs. South. Econ. J. 83:25–44

Barbaresco S, Courtemanche CJ, Qi Y. 2015. Impacts of the Affordable Care Act dependent coverage provision on health-related outcomes of young adults. J. Health Econ. 40:54–68

Jhamb J, Dave D, Colman G. 2015. The Patient Protection and Affordable Care Act and the utilization of health care services among young adults. Int. J. Health Econ. Dev. 1:8–25

Sommers BD, Buchmueller T, Decker SL, Carey C, Kronick R. 2013. The Affordable Care Act has led to significant gains in health insurance and access to care for young adults. Health Aff. 32:165–74

Modeling Healthcare Claims as a Dependent Variable

Healthcare claims present challenges to the applied econometrician. Claims costs typically exhibit a large number of zero values (high zero mass), extreme skewness, and heteroskedasticity. Below is a histogram depicting the distributional properties typical of claims data.
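Data with this shape is easy to simulate to get a feel for it (a sketch with made-up numpy parameters, not drawn from any real claims file). The mean sits far above the median, and the zero mass dominates:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 50_000

# High zero mass plus heavily right-skewed positive costs
is_zero = rng.random(n) < 0.6
claims = np.where(is_zero, 0.0, rng.lognormal(mean=8.0, sigma=1.6, size=n))

print(f"share zero: {(claims == 0).mean():.2f}")      # about 0.60
print(f"overall mean: {claims.mean():,.0f}")
print(f"overall median: {np.median(claims):,.0f}")    # 0 -- the zero mass dominates
print(f"positive-claim mean vs median: "
      f"{claims[claims > 0].mean():,.0f} vs {np.median(claims[claims > 0]):,.0f}")
```

Even restricted to positive claims, the mean is a multiple of the median, which is the skewness that motivates the log, GLM, and two-part approaches discussed below.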

The literature (see references below) addresses a number of approaches (i.e. log models, GLM, and two part models) often used for modeling claims data. However, without proper context the literature can leave one with a lot of unanswered questions, or several seemingly plausible answers to the same question.

The Department of Veterans Affairs runs a series of healthcare econometrics cyberseminars covering these topics. In particular, they have two video lectures devoted to modeling healthcare costs as a dependent variable.

Principles discussed include:

1) Despite what is taught in a lot of statistics classes about skewed data, in claims analysis we usually DO want to look at MEANS not MEDIANS.

2) Why logging claims and then running analysis on the logged data to deal with skewness is probably not the best practice in this context.

3) How adding a small constant number to zero values prior to logging can lead to estimates that are very sensitive to the choice of constant value.

4) Why in many cases it could be a bad idea to exclude ‘high cost claimants’ from an analysis without good reasons. This probably should not be an arbitrary routine practice.

5) When and why you may or may not prefer '2-part models'.
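Point 3 above is easy to demonstrate with a quick simulation (invented numbers, numpy assumed). Because the zeros contribute log(c), the mean of log(y + c) swings wildly with the arbitrary constant:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Claims-like data: ~70% zeros, skewed positives
y = np.where(rng.random(n) < 0.7, 0.0, rng.lognormal(7.0, 1.5, n))

# The mean of log(y + c) is dominated by log(c) at the zeros,
# so the "result" depends heavily on the constant chosen
for c in (0.01, 1.0, 100.0):
    print(c, round(np.log(y + c).mean(), 2))
```

Any downstream estimate built on the transformed outcome inherits this sensitivity, which is why the literature prefers GLM or two-part approaches over log(y + c).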

Note: Utilization data like ER visits, primary care visits, and hospital admissions are also typically non-negative and skewed with high mass points. Utilization can be modeled as counts using Poisson, negative binomial, or zero-inflated Poisson and zero-inflated negative binomial models in a GLM framework, although not discussed here.


Mullahy, John. "Much Ado About Two: Reconsidering Retransformation And The Two-Part Model In Health Econometrics," Journal of Health Economics, 1998, v17(3,Jun), 247-281.

Liu L, Cowen ME, Strawderman RL, Shih Y-CT. A Flexible Two-Part Random Effects Model for Correlated Medical Costs. Journal of health economics. 2010;29(1):110-123. doi:10.1016/j.jhealeco.2009.11.010.

Buntin, M.B. and Zaslavsky, A.M. Too much ado about two-part models and transformation? Comparing methods of modeling Medicare expenditures. Journal of Health Economics 23 (2004) 525–542.


Manning WG, Basu A, Mullahy J. Generalized modeling approaches to risk adjustment of skewed outcomes data. J Health Econ. 2005 May;24(3):465-88.

Mullahy, John. Econometric Modeling of Health Care Costs and Expenditures: A Survey of Analytical Issues and Related Policy Considerations. Medical Care. Vol. 47, No. 7, Supplement 1: Health Care Costing: Data, Methods, Future Directions (Jul., 2009), pp. S104-S108.

Griswold M, Parmigiani G, Potosky A, Lipscomb J. Analyzing Health Care Costs: A Comparison of Statistical Methods Motivated by Medicare Colorectal Cancer Charges. Biostatistics (2004), 1, 1, pp. 1–23.

Estimating log models: to transform or not to transform? Willard G. Manning and John Mullahy. Journal of Health Economics 20 (2001) 461–494

Angrist, J.D. Estimation of Limited Dependent Variable Models With Dummy Endogenous Regressors: Simple Strategies for Empirical Practice. Journal of Business & Economic Statistics January 2001, Vol. 19, No. 1.

Lachenbruch P. A. 2001. "Comparisons of two-part models with competitors" Statistics in Medicine, 20:1215–1234.

Lachenbruch P.A. 2001. “Power and sample size requirements for two-part models” Statistics in Medicine, 20:1235–1238.

Diehr, P., Yanez, D., Ash, A., Hornbrook, M., & Lin, D. Y. 1999. "Methods for analyzing health care utilization and costs." Annu. Rev. Public Health, 20:125–44.

Friday, December 21, 2018

Thinking About Confidence Intervals: Horseshoes and Hand Grenades

In a previous post, Confidence Intervals: Fad or Fashion, I wrote about Dave Giles' post on interpreting confidence intervals. A primary focus of these discussions was how confidence intervals are often misinterpreted. For instance, the two statements below are common mischaracterizations of CIs:

1) There's a 95% probability that the true value of the regression coefficient lies in the interval [a,b].
2) This interval includes the true value of the regression coefficient 95% of the time.

You can read the previous post or Dave's post for more details. But in re-reading Dave's post myself recently one statement had me thinking:

"So, the first interpretation I gave for the confidence interval in the opening paragraph above is clearly wrong. The correct probability there is not 95% - it's either zero or 100%! The second interpretation is also wrong. "This interval" doesn't include the true value 95% of the time. Instead, 95% of such intervals will cover the true value."

I like the way he put that: '95% of such intervals,' distinguishing this from a particular observed/calculated confidence interval. I think someone trained to think about CIs in the incorrect probabilistic way may have trouble getting at this. So how might we think about CIs in a way that is still useful, but doesn't get us tripped up with incorrect probability statements?

My favorite statistics text is DeGroot's Probability and Statistics. In the 4th edition they are very careful about explaining confidence intervals:

"Once we compute the observed values of a and b, the observed interval (a,b) is not so easy to interpret....Before observing the data we can be 95% confident that the random interval (A,B) will contain mu, but after observing the data, the safest interpretation is that (a,b) is simply the observed value of the random interval (A,B)"

While DeGroot is careful, it still may not be very intuitive. However, in Principles and Procedures of Statistics: A Biometrical Approach (Steel, Torrie, and Dickey) they present a more intuitive explanation.

"since mu will either be or not be in the interval, that is P=0 or 1, the probability will actually be a measure of confidence we placed in the procedure that led to the statement. This is like throwing a ring at a fixed post; the ring doesn't land in the same position or even catch on the post every time. However we are able to say that we can circle the post 9 times out of 10, or whatever the value should be for the measure of our confidence in our proficiency."

The ring tossing analogy seems to work pretty well. I'll customize it by using horseshoes instead. Yes, 95 out of 100 times you might throw a ringer (in the game of horseshoes, that is when the horseshoe circles the peg or stake when you toss it). You know this before you toss it. To use Dave Giles' language, *before* calculating a confidence interval we know that 95% of such intervals will cover the population parameter of interest. And after we toss the shoe, it either circles the peg or it doesn't; that is a 1 or a 0 in terms of probability. Similarly, *after* computing a confidence interval, the true mean or population parameter of interest is covered or not, with probability 0 or 100%.

This isn't perfect, but thinking of confidence intervals this way at least keeps us honest about making probability statements.
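The 'horseshoe' property is easy to check by simulation (a sketch with invented parameters, scipy/numpy assumed). Across many repeated samples, about 95% of the computed intervals cover the true mean; any single interval simply does or doesn't:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
mu, sigma, n, reps = 10.0, 2.0, 30, 10_000

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    # Standard t-based 95% interval for the mean from this one sample
    lo, hi = stats.t.interval(0.95, n - 1, loc=x.mean(), scale=stats.sem(x))
    covered += (lo <= mu <= hi)  # for THIS interval, coverage is simply 0 or 1

coverage = covered / reps
print(coverage)  # close to 0.95 -- a property of the procedure, not of any one interval
```

The 0.95 emerges only across the ensemble of tosses; each individual `(lo, hi)` is just the observed value of the random interval, exactly as DeGroot says.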

Going back to my previous post, I still like the description of confidence intervals Angrist and Pischke provide in Mastering 'Metrics, that is 'describing a set of parameter values consistent with our data.'

For instance if we run the regression:

y = b0 + b1X + e  to estimate  y = B0 + B1X + e

and get our parameter estimate b1 with a 95% confidence interval like (1.2, 1.8), we can say that our sample data is consistent with any population that has a B1 taking a value that falls in the interval. That implies there are a number of populations our data would be consistent with. Narrower intervals imply very similar populations, very similar values of B1, and speak to more precision in our estimate of B1.

I really can't make an analogy for hand grenades. It just gave me a title with a ring to it.

See also:
Interpreting Confidence Intervals
Bayesian Statistics Confidence Intervals and Regularization
Overconfident Confidence Intervals

Saturday, October 20, 2018

Power and Sample Size Analysis in Applied Econometrics

In applied work in econometrics I've done a limited amount of power and sample size analysis. Recently I was thinking about a conversation from an episode of the EconTalk podcast with Russ Roberts and John Ioannidis where the topic of power came up:

“though I was trained as a Ph.D., got a Ph.D. in economics at the U. of Chicago, I never heard that phrase, 'power,' applied to a statistical analysis. What we did--and I think what most economists, many economists, still do, is: we had a data set; we had something we wanted to discover and test or examine or explore, depending on the nature of the problem.”

That rings familiar to me. In eight years of attending talks and seminars in applied economics, what stands out are discussions of identification, endogeneity, standard errors etc. Not power or sample size. So I went back and looked at all of my copies of econometrics textbooks. These are well known and have been commonly used by masters and PhD graduate students in economics. Econometric Analysis by Greene, Econometric Analysis of Cross Section and Panel Data by Wooldridge,  A Course in Econometrics by Goldberger, A Guide to Econometrics by Kennedy, Using Econometrics by Studenmund. I even threw in Mastering 'Metrics and Mostly Harmless Econometrics by Angrist and Pischke.

While Wooldridge did discuss clustering and stratified sampling, most of the emphasis was placed on getting the correct standard errors and appropriate weighting. From my previous years of referencing these texts, as well as a cursory review again of the index and chapters of each one I could not find any treatment of power or sample size calculations.

So I thought, maybe this is something covered in prerequisite courses. Going back to the undergraduate level in economics I recall very little about this. Checking a popular text, Statistics for Business and Economics by Anderson, Sweeney, Williams, Camm, and Cochran, I did find a basic example in relation to power and sample sizes for a t-test. What about a graduate level prerequisite for econometrics? In my first year of graduate school I took a graduate level course in mathematical statistics (this was a course doing business under a research methods title) that used DeGroot's text Probability and Statistics. Definitely a lot about the concept of power in theory, but no emphasis on various calculations for sample size. The one textbook I own with treatment of this is Principles and Procedures of Statistics, A Biometrical Approach by Steel, Torrie, and Dickey. But that does not count, because that was the text used in my experimental design course in graduate school, not part of a standard econometrics curriculum.

I've come to the conclusion that power and sample size analysis may not be widely emphasized in graduate econometrics training across the board in all programs. It's not something missed in a lecture a decade ago. Similar to advanced specialized topics like spatial econometrics, details related to power and sample size analysis, survey design, stratified random sampling etc. are likely covered depending on one's specialty in the field and the program.

However,  it is evident that some economists do this kind of work.

For instance, here is an example from a paper with food economist Jayson Lusk:

"However, there are many economic problems where sample size directly affects a benefit or loss function. In these cases, sample size is an endogenous variable that should be considered jointly with other choice variables in an optimization problem. In this article we introduce an economic approach to sample size determination utilizing a Bayesian decision theoretic framework."

Healthcare economist Austin Frakt does this kind of work as well.

So why do we care about power and sample size and what is 'power'?

Jim Manzi, author of Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society, offers the following analogy in an EconTalk podcast:

“Well, the power in a statistical experiment, and I often use this analogy, is sort of like the magnification power on the microscope you probably used in high school biology. It has on the side, 4x, 8x, 16x, which is how many times it can increase the apparent size of a physical object. And the metaphor I'd use is, if I try and use a child's microscope to carefully observe a section of a leaf looking for an insect that's a little smaller than an ant, and I don't observe the ant, I can reliably say: I don't see the insect, and therefore there is no bug there. If I use that exact same microscope to try and find on that exact same piece of leaf, not a bug but a tiny microbe that's, you know, smaller than a speck of dust, I'll look at it and I'll say: it's all kind of fuzzy, I see a lot of squiggly things; I think that little squiggle might be something or it might not. I don't see the microbe, but I can't reliably say that therefore there is no microbe there, because trying to zoom in closer and closer to look for something that small, all I see is a bunch of fuzz. So my failure to see the microbe is a statement about the precision of my instrument, not about whether there's really a microbe on the leaf.”

So, if we have a sample that is ‘not sufficiently powered’ it is possible that we could fail to find a relationship between treatment and outcome, even if one actually exists. Equivalently, our estimated regression coefficient may not be statistically significant when a relationship actually does exist. Increasing sample size is one primary way to increase power in an experiment. So the question becomes how large does ‘n’ have to be to have a sample sufficiently powered to detect the effect of a treatment on an outcome (at some stated level of significance)?

So how do you do these calculations? If you can't find examples in your econometrics textbook (if you do find one let me know!) there are plenty of texts in the biostatistics genre that cover this. Principles and Procedures of Statistics, A Biometrical Approach by Steel, Torrie, and Dickey is one example that I started with. Cochran, W. (1977). Sampling Techniques, 3rd ed. is another often-cited source.
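For a basic two-sample t-test, the calculation itself is a one-liner in statsmodels (a sketch; the effect size, power, and alpha below are just conventional illustrative choices, not recommendations for any particular study):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# n per group to detect a "small" standardized effect (Cohen's d = 0.2)
# with 80% power at the 5% level, two-sided two-sample t-test
n_per_group = analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05)
print(round(n_per_group))  # roughly 393-394 per group

# Or flip it around: power achieved with only 100 per group for the same effect
power_at_100 = analysis.solve_power(effect_size=0.2, nobs1=100, alpha=0.05)
print(round(power_at_100, 2))  # well under the conventional 0.8
```

`solve_power` solves for whichever of effect size, n, power, or alpha is left as `None`, which makes it easy to explore the tradeoffs Manzi's microscope analogy describes.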

See also: Andrew Gelman on Econtalk discussing "what does not kill my statistical significance makes it stronger"

Sunday, July 29, 2018

Performance of Machine Learning Models on Time Series Data

In the past few years there has been an increased interest among economists in machine learning. For more discussion see here, here, here, here, here, here, here, and here. See also Mindy Mallory's recent post here.

While some folks like Susan Athey are beginning to develop the theory to understand how machine learning can contribute to causal inference, it has carved out a niche in the area of prediction. But what about time series analysis and forecasting?

That is a question taken up by authors this past March in an interesting paper (Statistical and Machine Learning forecasting methods: Concerns and ways forward). They took a good look at the performance of popular machine learning algorithms relative to traditional statistical time series approaches. The authors found that traditional approaches, including exponential smoothing and econometric time series approaches, outperformed algorithmic approaches from machine learning across a number of model specifications, algorithms, and time series data sources.

Below are some interesting excerpts and takeaways from the paper:

When I think of time series methods, I think of things like cointegration, stationarity, autocorrelation, seasonality, auto-regressive conditional heteroskedasticity etc. (I recommend Mindy Mallory's posts on time series here)

Hearing so much about the ability of some machine learning approaches (like deep learning) to mimic feature engineering, I wondered how well algorithmic approaches would handle these issues in time series applications. The authors looked at some of the previous literature in relation to this:

"In contrast to sophisticated time series forecasting methods, where achieving stationarity in both the mean and variance is considered essential, the literature of ML is divided with some studies claiming that ML methods are capable of effectively modelling any type of data pattern and can therefore be applied to the original data [62]. Other studies however, have concluded the opposite, claiming that without appropriate preprocessing, ML methods may become unstable and yield suboptimal results [28]."

One thing about this paper, as I read it, is that it does not take an adversarial or Luddite tone toward machine learning methods in favor of more traditional approaches. While they found challenges related to predictive accuracy, they seemed to proactively look deeper to understand why ML algorithms performed the way they did and how to make ML approaches better at time series.

One of the challenges with ML, even with cross-validation, was overfitting and confusion of signals, patterns, and noise in the data:

"An additional concern could be the extent of randomness in the series and the ability of ML models to distinguish the patterns from the noise of the data, avoiding over-fitting....A possible reason for the improved accuracy of the ARIMA models is that their parameterization is done through the minimization of the AIC criterion, which avoids over-fitting by considering both goodness of fit and model complexity."

They also recommend instances where ML methods may offer advantages:

"even though M3 might be representative of the reality when it comes to business applications, the findings may be different if nonlinear components are present, or if the data is being dominated by other factors. In such cases, the highly flexible ML methods could offer significant advantage over statistical ones"

It was interesting that basic exponential smoothing approaches outperformed much more complicated ML methods:

"the only thing exponential smoothing methods do is smoothen the most recent errors exponentially and then extrapolate the latest pattern in order to forecast. Given their ability to learn, ML methods should do better than simple benchmarks, like exponential smoothing."

However, the authors note it is often the case that smoothing methods can offer advantages over more complex econometric time series models as well (e.g., ARIMA, VAR, GARCH).

Toward the end of the paper the authors go on to discuss in detail the differences between the domains where we have seen a lot of success in machine learning (speech and image recognition, games, self-driving cars, etc.) and time series and forecasting applications.

In table 10 of the paper, they drill into some of these specific differences and discuss structural instabilities related to time series data, how the 'rules' change and how forecasts themselves can influence future values, and how this kind of noise might be hard for ML algorithms to capture.

This paper is definitely worth going through again and one to keep in mind if you are about to embark on an applied forecasting project.


Makridakis S, Spiliotis E, Assimakopoulos V (2018) Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS ONE 13(3): e0194889.

See also Paul Cuckoo's LinkedIn post on this paper.