Friday, April 19, 2019

Intent to Treat, Instrumental Variables and LATE Made Simple(er)

In randomized controlled trials (RCTs), issues related to non-compliance often arise. Subjects assigned to the treatment fail to comply, while in other cases subjects that were supposed to be in the control group actually receive treatment. One way to deal with non-compliance is through an intent-to-treat (ITT) framework.

Gupta (2011) describes ITT:

"ITT analysis includes every subject who is randomized according to randomized treatment assignment. It ignores noncompliance, protocol deviations, withdrawal, and anything that happens after randomization. ITT analysis is usually described as 'once randomized, always analyzed.'"

In Mastering Metrics, Angrist and Pischke describe intent-to-treat analysis:

"In randomized trials with imperfect compliance, when treatment assignment differs from treatment delivered, effects of random assignment...are called intention-to-treat (ITT) effects. An ITT analysis captures the causal effect of being assigned to treatment."

While treatment assignment is random, non-compliance is not! Therefore, if instead of using intent-to-treat comparisons we compared those actually treated to those untreated (sometimes termed an 'as treated' analysis), we would get biased results. When there is non-compliance, there is the likelihood that a relationship exists between potential outcomes and the actual treatment received. While the ITT approach gives an unbiased causal estimate of the treatment effect, it is often a diluted effect because of non-compliance issues and can provide an underestimate of the true effect (Angrist, 2006).

Angrist and Pischke discuss how instrumental variables can be used in the context of an RCT with non-compliance issues:

 "Instrumental variable methods allow us to capture the causal effect of treatment on the treated in spite of the nonrandom compliance decisions made by participants in experiments....Use of randomly assigned intent to treat as an instrumental variable for treatment delivered eliminates this source of selection bias." 

The purpose of this post is to build intuition related to how an instrumental variable (IV) approach differs from ITT, and how it is not biased by selection related to non-compliance issues in the same way that an 'as treated' analysis would be.

My goal is to demonstrate with a rather simple data set how IVs tease out the biases from non-compliance and give us only the impact of treatment on the compliers, also known as the local average treatment effect (LATE).

A great example of IV and ITT applied to health care can be found in Finkelstein et al. (2013 & 2014) - see The Oregon Medicaid Experiment, Applied Econometrics, and Causal Inference.

For another post walking through the basic mechanics of instrumental variables (IV) estimation using a toy data set see: A Toy IV Application.

Key Assumptions

Depending on how you frame it, there are about five key things (assumptions, if we want to call them that) we need to think about when leveraging instrumental variables - in plain language:

1) SUTVA (the stable unit treatment value assumption) - you can look that up, but basically it means no interactions or spillovers between the treatments and controls - my getting treated does not make a control case have a better or worse outcome as a result

2) Random Assignment - that is the whole context of the discussion above - the instrument (Z) or treatment assignment must be random

3) The Exclusion Restriction - Treatment assignment impacts outcome only through the treatment itself. It is the treatment that impacts the outcome. There is nothing about being in the randomly assigned treatment group that would cause your outcome to be higher or lower in and of itself, other than actually receiving the treatment.  Treatment assignment is ignorable. This is often represented as: Z -> D -> Y where Z is the instrument or random assignment, D is an indicator for actually receiving the treatment, and Y is the outcome.

4) Non-zero causal effect of Z on D - being assigned to the treatment group is highly correlated with actually receiving the treatment, i.e. when Z = 1 then D is usually 1 as well (if these were perfectly correlated, that would imply perfect compliance)

5) Monotonicity - we'll just call this an assumption of 'no defiers.' It means that there are no cases that always do the opposite of what their treatment assignment indicates, i.e. cases where D = 0 whenever Z = 1 AND D = 1 whenever Z = 0. Stated differently, we can't have cases that always get the treatment when assigned to the control group and never receive treatment when assigned to the treatment group.

Types of Non-Compliance

Given these assumptions, with monotonicity we end up with three different groups of people in our study:

Never Takers: those that refuse treatment regardless of treatment/control assignment.

Always Takers: those that get the treatment even if they are assigned to the control group.

Compliers: those that comply or receive treatment if assigned to a treatment group but do not receive treatment when assigned to control group.

The compliers are characterized as participants that receive treatment only as a result of random assignment. The estimated treatment effect for these folks is often very desirable and in an IV framework can give us an unbiased causal estimate of the treatment effect.  But how does this work?


I have to first recommend a great post titled '10 Things to Know About Local Average Treatment Effects.' Most of my post is based on those well thought out examples.

Just to level set: the context of this discussion going forward is an RCT with the outcome measured as Y and treatment assignment used as the instrument Z (this can be extended to other scenarios using other types of instruments). Actual receipt of treatment, or treatment status, is indicated by D, with D = 1 indicating receipt of treatment. So an ITT analysis would simply be a comparison of outcomes for folks randomly assigned to treatment (Z = 1) vs. those that were controls (Z = 0), regardless of compliance or non-compliance (determined by D). An 'as treated' analysis would be a comparison of everyone that received the treatment (D = 1) vs. those that did not (D = 0), regardless of randomization. This is a biased analysis. The IV or local average treatment effect (LATE) estimate is the difference in outcomes for compliers.

Going back to the original article by Angrist et al. (1996), it discusses IVs, LATEs, and the types of noncompliance as they relate to the assumptions we previously discussed. In that article they explain that the treatment status (D) of the always takers and never takers is invariant (uncorrelated) to random assignment Z. No matter what Z is, they are going to do what they are going to do. But we also know that Z (by the definition of compliance and assumption 4) is correlated with the treatment actually received, D, for the compliers.

Let's consider an RCT with one-sided non-compliance. In this case the controls are not able to receive the treatment by nature of the design, so there are no 'always takers' in this discussion. Below is a table summarizing a scenario like this, with 100 people randomly assigned to treatment (Z = 1) and 100 controls (Z = 0). (This can be extended to include always takers, and the post I mentioned before will walk through that scenario.)

                 Z = 1                      Z = 0
Never Takers     N = 20, D = 0, Y = 5       N = 20, D = 0, Y = 5
Compliers        N = 80, D = 1, Y = 25      N = 80, D = 0, Y = 20

For storytelling purposes, let's assume the 'treatment' is a weight loss program. We've got some really unmotivated folks (never takers) in both the treatment and control groups that just don't comply with the treatment. Let's say on average they all end up losing 5 pounds (Y = 5) regardless of the group they are in. On the other hand, we have more conscientious folks (the compliers) that will participate if randomly assigned to treatment. But they are motivated and healthy, and even in the absence of treatment their potential outcomes (weight loss) are pretty favorable: they are bound to lose 20 pounds even without the program.

As discussed before, we can see how, when there is non-compliance, a relationship likely exists between potential outcomes and the actual treatment received.

If we ignore treatment assignment, and just compare the average weight lost (Y) for those that received treatment to all of those that did not, we could run the following regression:

Y = β0 + β1 D + e      

with β1 = 10 (see the R code  that generates this data and these results)

We could calculate this by hand as: 25 - [(2/3)*20 + (1/3)*5] = 10
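The post's companion code is in R (not shown here); as a sketch, the same 'as treated' arithmetic in Python, using the group sizes from the table:

```python
# 'As treated': compare D = 1 to D = 0, ignoring random assignment Z.
# Treated: 80 compliers with Y = 25. Untreated: 40 never-takers
# (Y = 5) plus 80 untreated compliers (Y = 20).
treated_mean = 25.0
untreated_mean = (40 * 5 + 80 * 20) / 120   # = (2/3)*20 + (1/3)*5 = 15.0
as_treated = treated_mean - untreated_mean
print(as_treated)  # 10.0 -- inflated by selection bias
```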

We know that non-compliance biases this estimate.

The ITT estimate can be estimated as:

Y = β0 + β1 Z + e  

with β1 =  4

We can see from the data this is simply the difference in means between the treatment and control group: [.2*5 + .8*25] - [.2*5 + .8*20] = 21-17 = 4
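The ITT difference in means can be checked the same way (a Python sketch, with arm compositions taken from the table):

```python
# ITT: compare mean outcomes by random assignment Z, ignoring D.
# Each arm is 20% never-takers (Y = 5); compliers lose 25 pounds
# if assigned to treatment and 20 pounds if assigned to control.
treat_arm_mean = 0.2 * 5 + 0.8 * 25    # = 21.0
control_arm_mean = 0.2 * 5 + 0.8 * 20  # = 17.0
itt = treat_arm_mean - control_arm_mean
print(itt)  # 4.0 -- unbiased, but diluted by the 20% who never comply
```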

We know from the discussion above and can see from the data that this is greatly diluted by noncompliance. But because of randomization this is an unbiased estimate.

Finally, the IV or local average treatment effect (LATE) estimate is the difference in outcomes for compliers.

Because our example above is contrived, the outcomes for the compliers are made explicit in the table above. If you know exactly who the compliers are, the math is straightforward:

LATE = 25 - 20 = 5

You can also get the LATE by dividing the ITT effect by the share of compliers:

4/.8 = 5
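This ratio is the Wald estimator; a quick Python sketch using the shares from the table:

```python
# Wald / IV estimator: the ITT effect on the outcome divided by the
# ITT effect on take-up (here, the share of compliers).
itt_y = (0.2 * 5 + 0.8 * 25) - (0.2 * 5 + 0.8 * 20)  # = 4.0
itt_d = 0.8 - 0.0   # P(D=1 | Z=1) - P(D=1 | Z=0)
late = itt_y / itt_d
print(round(late, 6))  # 5.0
```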

In a previous post, I've described how an IV estimate teases out only that variation in our treatment D that is unrelated to selection bias and relates it to Y giving us an estimate for the treatment effect of D that is less biased.

We can view this through the lens of a 2SLS modeling strategy:

Stage 1: Regress D on Z to get the fitted values D*

D = β0 + β1 Z + e      (fitted values: D* = β0 + β1 Z)

β1 only picks up the variation in D that is related to Z (i.e. the quasi-experimental variation) and leaves all of the variation in D related to non-compliance and selection in the residual term. You can think of this as working like a filtering process.

Stage 2: Regress Y on D*

Y = β0 +βIV D* + e  

The second stage relates the quasi-experimental variation in D (the part driven by Z) to changes in our target Y.

We can see (from the R code below) that our estimate βIV   = 5.
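The two stages can also be sketched from scratch on the toy data. This is an illustrative Python reconstruction (the dataset and helper function here are mine, built from the table above; the post's own calculations are in R):

```python
# Reconstruct the table's 200 observations (one-sided non-compliance).
Z = [1] * 100 + [0] * 100
D = [0] * 20 + [1] * 80 + [0] * 100            # only treated-arm compliers get D = 1
Y = [5] * 20 + [25] * 80 + [5] * 20 + [20] * 80

def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

# Stage 1: regress D on Z, keep the fitted values D*
b1 = ols_slope(Z, D)
b0 = sum(D) / len(D) - b1 * (sum(Z) / len(Z))
D_star = [b0 + b1 * z for z in Z]

# Stage 2: regress Y on D* -- the IV / LATE estimate
beta_iv = ols_slope(D_star, Y)
print(round(beta_iv, 6))  # 5.0
```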

We can also get the same result (and correct standard errors) by using the ivreg function from the AER package in R:

library(AER)  # provides ivreg
summary(ivreg(y ~ D | Z, data = df))



Angrist, Joshua D., et al. "Identification of Causal Effects Using Instrumental Variables." Journal of the American Statistical Association, vol. 91, no. 434, 1996, pp. 444–455.

Angrist, J.D. "Instrumental Variables Methods in Experimental Criminological Research: What, Why and How." Journal of Experimental Criminology (2006) 2: 23–44.

"The Oregon Experiment--Effects of Medicaid on Clinical Outcomes," by Katherine Baicker, et al. New England Journal of Medicine, 2013; 368:1713-1722.

Medicaid Increases Emergency-Department Use: Evidence from Oregon's Health Insurance Experiment. Sarah L. Taubman, Heidi L. Allen, Bill J. Wright, Katherine Baicker, and Amy N. Finkelstein. Science. Published online 2 January 2014. DOI:10.1126/science.1246183

Gupta, S. K. (2011). Intention-to-treat concept: A review. Perspectives in Clinical Research, 2(3), 109–112.

Saturday, March 30, 2019

Abandoning Statistical Significance - Or - Two Ways to Sell Snake Oil

There was recently a very good article in Nature pushing back against statistical significance and dichotomizing thresholds for p-values (i.e. p < .05). This follows the ASA's statement on the interpretation of p-values.

I've blogged before about previous efforts to push back against p-values and proposals to focus on confidence intervals (which often just reframe the problem in other ways that get misinterpreted - see here, here, here, and here). And absolutely there are problems with p-hacking, failures to account for multiple comparisons and multiple testing, and gardens of forking paths.

To be fair, the authors in the Nature article state:

"We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis"

While I think the letter has the potential for more good than harm, I think in the minds of the wrong people it will actually embolden overconfidence in weak evidence. This could also be abused by others wanting to escape the safeguards of scientific rigor.

Andrew Gelman, one of those signing the letter, seems to have had some similar concerns, noted in his post "Retire Statistical Significance": The discussion. In his post he shares a number of statements in the article and how they could be misleading. We have to remember that statistics and inference can be hard. It's hard for PhDs that have spent their entire lives doing this stuff. It's hard for practitioners that have made their careers out of it. So it is important to consider the ways that these statements could be interpreted by others that are not as skilled in inference and experimental design as the authors and signatories.

Gelman states:

"the Comment is written with an undercurrent belief that there are zillions of true, important effects out there that we erroneously dismiss. The main problem is quite the opposite: there are zillions of nonsense claims of associations and effects that once they are published, they are very difficult to get rid of. The proposed approach will make people who have tried to cheat with massaging statistics very happy, since now they would not have to worry at all about statistics. Any results can be spun to fit their narrative. Getting entirely rid of statistical significance and preset, carefully considered thresholds has the potential of making nonsense irrefutable and invincible."

In addition Gelman says:

"statistical analysis at least has some objectivity and if the rules are carefully set before the data are collected and the analysis is run, then statistical guidance based on some thresholds (p-values, Bayes factors, FDR, or other) can be useful. Otherwise statistical inference is becoming also entirely post hoc and subjective"

An Illustration: Two Ways to Sell Snake Oil

So let me propose a fable. We will pretend that in this fictional world we don't have to worry about selection bias, unobserved heterogeneity, endogeneity, etc. Suppose there is a salesman with an elixir claiming it is a miracle breakthrough for weight loss. Suppose he has lots and lots of data, large sample sizes, and randomized controlled trials supporting its effectiveness. In fact, in all of his studies he finds that on average, consumers using the elixir lose weight, with highly statistically significant results (p < .001). Ignoring effect sizes (i.e. how much weight do people actually lose on average?), the salesman touts the precision of the results and sells lots and lots of elixir based on the significance of the findings.

I have personally witnessed this kind of thing in my work and it is sad to see results passed off this way to people untrained in statistical inference or experimental design. If the salesman were willing to confess that the estimates of the effects of taking the elixir were very precise but we are precisely measuring an average loss of about 1.5 pounds per year compared to controls - it would destroy his sales pitch! Yes, effect size matters!
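The arithmetic behind the fable is easy to check. Here is a sketch with made-up summary statistics: the 1.5-pound difference comes from the fable, while the standard deviation and sample size are assumptions chosen for illustration:

```python
import math

# Hypothetical summary statistics: a huge trial, a tiny effect.
n = 1_000_000   # subjects per arm (assumed)
diff = 1.5      # mean weight loss (pounds/year), elixir minus control
sd = 10.0       # std. dev. of weight change in each arm (assumed)

se = sd * math.sqrt(2 / n)   # standard error of the difference in means
z = diff / se                # test statistic for H0: no difference
p_two_sided = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

print(round(se, 4))      # 0.0141 -- the estimate is extremely precise...
print(p_two_sided)       # 0.0 (underflows): 'highly significant'
# ...but the effect is still only 1.5 pounds per year.
```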

So now the salesman reads our favorite article in Nature. He conducts a number of additional trials. This time he's going to focus only on the effect sizes from the studies. Looking only at effect sizes, he knows that a directional finding of 1.5 pounds per year isn't going to sell. So how large does the effect need to be to take his snake oil to market with data to support it? Is 2 pounds convincing? Or 3,4,5-10? Suppose his data show an average annual loss of weight near 10 pounds greater for those using the elixir vs. a control group. He goes to market with this claim. As he is making a pitch to a crowd of potential buyers, one savvy consumer gives him a critical review asking if his results were statistically significant. The salesman having read our favorite Nature article replies that mainstream science these days is more concerned with effect sizes than dichotomous notions of statistical significance. To the crowd this sounds like a sophisticated and informed answer so that day he sells his entire stock.

Eventually someone uncovers the actual research related to the elixir. They find that yes, on average most of those studies found an effect of about 10 pounds of annual weight loss. But the p-values associated with these estimates ranged from .25 to .40. What does this mean?

P-values tell us the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

Simplifying, we could say: if the elixir really is snake oil, a p-value equal to .25 tells us that there is a 25% probability that we would observe an average weight loss of 10 pounds or more by chance alone. People in the study would be likely to lose 10 pounds or more even if they did not take the elixir.

A p-value of .25 doesn't necessarily mean that the elixir is ineffective. That is sort of the point of the article in Nature. It just means that the evidence for rejecting the null hypothesis of zero effect is weak.

What if, instead of selling elixir, the salesman was taking bets with a two-headed coin? How would we catch him in the act? What if he flipped the coin two times and got two heads in a row (and we just lost $100 at $50/flip)? If we only considered the observed outcomes, and knew nothing about the distribution of coin flips (and completely ignored intuition), we might think this is evidence of cheating. After all, two heads in a row would be consistent with a two-headed coin. But I wouldn't be dialing my lawyer yet.

If we consider the simple probability distribution associated with tossing a fair coin, we would know that there is a 50% chance of flipping a normal coin and getting heads, and a 25% chance of flipping a normal coin twice and getting two heads in a row. This is roughly analogous to a p-value equal to .25. In other words, there is a good chance that if our con artist were using a fair coin he could in fact flip two heads in a row. This does not mean he is innocent; it just means that when we consider the distribution, variation, and probabilities associated with flipping coins, we just don't have the precision we need to know for sure. We could say the same thing about the evidence from our fable about weight loss, or any study with a p-value equal to .25.

What if our snake oil elixir salesman flipped his coin 4 times and got 4 heads in a row? The probability of 4 heads in a row is 6.25% if he has a fair coin. What about 5? Under the null hypothesis of a 'fair' coin, the probability of observing an event as extreme as 5 heads in a row is 3.125%. Do we think our salesman could be that lucky and get 4 or 5 heads in a row? Many people would have their doubts. When we get past whatever threshold is required to start having doubts about the null hypothesis, then intuitively we begin to feel comfortable rejecting the null hypothesis. As the article in Nature argues, this cutoff should not necessarily be 5% or p < .05. However, in this example the probabilities are analogous to having p-values of .0625 and .03125, which are in the vicinity of our traditional threshold of .05. I don't think reading the article in Nature should change your mind about this.
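These coin-flip probabilities are just powers of one half, as a one-line check confirms:

```python
# P(k heads in k tosses of a fair coin): the chance of a result at
# least this extreme if the salesman is NOT cheating.
for k in (2, 4, 5):
    print(k, 0.5 ** k)
# 2 -> 0.25, 4 -> 0.0625, 5 -> 0.03125
```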


We see with our fable that the pendulum could swing too far in either direction and lead to abusive behavior and questionable conclusions. Economist Noah Smith discussed the pushback against p-values a few years ago. He stated rightly that 'if people are doing science right, these problems won't matter in the long run.' Focusing only on effect size and ignoring distribution, variation, and uncertainty risks backsliding from the science that revolutionized the 20th century into the world of anecdotal evidence. Clearly the authors and signatories of the Nature article are not advocating this; in fact they stated that clearly several times in the article and in the excerpts I shared above. It is how this article gets interpreted and cited that matters most. As Gelman states:

"some rule is needed for the game to be fair. Otherwise we will get into more chaos than we have now, where subjective interpretations already abound. E.g. any company will be able to claim that any results of any trial on its product to support its application for licensing" 

Sunday, February 24, 2019

The Multiplicity of Data Science

There was a really good article on LinkedIn some time ago regarding how Airbnb classifies its data science roles:

"The Analytics track is ideal for those who are skilled at asking a great question, exploring cuts of the data in a revealing way, automating analysis through dashboards and visualizations, and driving changes in the business as a result of recommendations. The Algorithms track would be the home for those with expertise in machine learning, passionate about creating business value by infusing data in our product and processes. And the Inference track would be perfect for our statisticians, economists, and social scientists using statistics to improve our decision making and measure the impact of our work."

I think this helps tremendously to clarify thinking in this space.

Sunday, February 17, 2019

Was It Meant to Be? OR Sometimes Playing Match Maker Can Be a Bad Idea: Matching with Difference-in-Differences

Previously I discussed the unique aspects of modeling claims and addressing those with generalized linear models. I followed that with a discussion of the challenges of using difference-in-differences in the context of GLM models and some ways to deal with this. In this post I want to dig into what some folks are debating in terms of issues related to combining matching with DID. Laura Hatfield covers it well on Twitter:


Also, this was picked up at The Incidental Economist, which gave a good summary of the key papers.

You can find citations for the relevant papers below. I won't plagiarize what both Laura and the folks at The Incidental Economist have already explained very well. But, at the risk of oversimplifying the big picture, I'll try to summarize a bit. Matching in a few special cases can improve the precision of the estimate in a DID framework, and occasionally reduces bias. Remember that matching on pre-period observables is not necessary for the validity of difference-in-differences models. There are cases when the treatment group is in fact determined by pre-period outcome levels; in these cases matching is necessary. At other times, if not careful, matching in DID introduces risks of regression to the mean - what Laura Hatfield describes as a 'bounce back' effect in the post period that can generate or inflate treatment effects when they do not really exist.

Both the previous discussion on DID in a GLM context and combining matching with DID indicate the risks involved in just plug and play causal inference and the challenges of bridging the gap between theory and application.


Daw, J. R. and Hatfield, L. A. (2018), Matching and Regression to the Mean in Difference‐in‐Differences Analysis. Health Serv Res, 53: 4138-4156. doi:10.1111/1475-6773.12993

Daw, J. R. and Hatfield, L. A. (2018), Matching in Difference‐in‐Differences: between a Rock and a Hard Place. Health Serv Res, 53: 4111-4117. doi:10.1111/1475-6773.13017

Thursday, January 24, 2019

Modeling Claims with Linear vs. Non-Linear Difference-in-Difference Models

Previously I have discussed the issues with modeling claims costs. Typically medical claims exhibit non-negative, highly skewed values with high zero mass and heteroskedasticity. The most commonly suggested approach to addressing these distributional concerns in the literature calls for the use of non-linear GLM models. However, as previously discussed (see here and here), there are challenges with using difference-in-difference models in the context of GLM models. So once again, the gap between theory and application presents challenges, tradeoffs, and compromises that need to be made by the applied econometrician.

In the past I have written about the accepted (although controversial in some circles) practice of leveraging linear probability models to estimate marginal effects in applied work when outcomes are dichotomous. But what about doing this in the context of claims analysis? In my original post regarding the challenges of using difference-in-differences with claims I speculated:

"So as Angrist and Pischke might ask, what is an applied guy to do? One approach even in the context of skewed distributions with high mass points (as is common in the healthcare econometrics space) is to specify a linear model. For count outcomes (utilization like ER visits or hospital admissions are often dichotomized and modeled by logit or probit models) you can just use a linear probability model. For skewed distributions with heavy mass points, dichotomization with a LPM may also be an attractive alternative."

 I have found that this advice is pretty consistent with the social norms and practices in the field.

In their analysis of the ACA Cantor, et al (2012) leverage linear probability models for difference-in-differences for healthcare utilization stating:

"Linear probability models are fit to produce coefficients that are direct estimates of the relevant policy impacts and are easily interpreted as percentage point changes in coverage outcomes. This approach has been applied in earlier evaluations of insurance market reforms (Buchmueller and DiNardo 2002; Monheit and Steinberg Schone 2004;  Levine, McKnight, and Heep 2011;  Monheit et al. 2011). It also avoids complications associated with estimation and interpretation of multiple interaction terms and their standard errors in logit or probit models (Ai and Norton 2003)."

Jhamb et al (2015) use LPMs for dichotomous outcomes as well as OLS models for counts in a DID framework.

Interestingly, Deb and Norton (2018) discuss an approach to address the challenges of DID in a GLM framework head on:

"Puhani argued, using the potential outcomes framework, that the treatment effect on the treated in the difference-in-difference regression equals the expected value of the dependent variable for the treatment group in the post period with treatment compared with the hypothetical expected value of the dependent variable for the treatment group in the post period if they had not received treatment. In nonlinear models, the treatment effect on the treated equals the difference in two predicted values. It always has the same sign as the coefficient on the interaction term. Because we estimate many nonlinear models using a difference-in-differences study design, we report the treatment effect on the treated in all tables of results."

In presenting their results they compare their GLM based approach to results from linear models of healthcare expenditures. While they argue the differences are substantial in supporting their approach, I did not find the OLS estimate (-$323.4) to be practically different from the second part (conditional on positive) of the two part GLM model (-$321.4), although the combined results from the two part model had large practical differences from OLS. It does not appear they compared a two-part GLM to a two-part linear model (which could be problematic if the first part OLS model gave probabilities greater than 1 or less than zero). In their paper they cited a number of authors using linear difference-in-differences to model claims you will find below.

See the references below for a number of examples (including those cited above).

Related: Linear Literalism and Fundamentalist Econometrics


Cantor JC, Monheit AC, DeLia D, Lloyd K. Early impact of the Affordable Care Act on health insurance coverage of young adults. Health Serv Res. 2012;47(5):1773-90.

Modeling Health Care Expenditures and Use. Partha Deb and Edward C. Norton. Annual Review of Public Health 2018 39:1, 489-505.

Buchmueller T, DiNardo J. “Did Community Rating Induce an Adverse Selection Death Spiral? Evidence from New York, Pennsylvania and Connecticut” American Economic Review. 2002;92(1):280–94.

Monheit AC, Cantor JC, DeLia D, Belloff D. “How Have State Policies to Expand Dependent Coverage Affected the Health Insurance Status of Young Adults?” Health Services Research. 2011;46(1 Pt 2):251–67

Amuedo-Dorantes C, Yaya ME. 2016. The impact of the ACA’s extension of coverage to dependents on young adults’ access to care and prescription drugs. South. Econ. J. 83:25–44

Barbaresco S, Courtemanche CJ, Qi Y. 2015. Impacts of the Affordable Care Act dependent coverage provision on health-related outcomes of young adults. J. Health Econ. 40:54–68

Jhamb J, Dave D, Colman G. 2015. The Patient Protection and Affordable Care Act and the utilization of health care services among young adults. Int. J. Health Econ. Dev. 1:8–25

Sommers BD, Buchmueller T, Decker SL, Carey C, Kronick R. 2013. The Affordable Care Act has led to significant gains in health insurance and access to care for young adults. Health Aff. 32:165–74

Modeling Healthcare Claims as a Dependent Variable

Healthcare claims present challenges to the applied econometrician. Claims costs typically exhibit a large number of zero values (high zero mass), extreme skewness, and heteroskedasticity. Below is a histogram depicting the distributional properties typical of claims data.

The literature (see references below) addresses a number of approaches (i.e. log models, GLM, and two part models) often used for modeling claims data. However, without proper context the literature can leave one with a lot of unanswered questions, or several seemingly plausible answers to the same question.

The department of Veteran's Affairs runs a series of healthcare econometrics cyberseminars covering these topics. Particularly, they have two video lectures devoted to modeling healthcare costs as a dependent variable.

Principles discussed include:

1) Despite what is taught in a lot of statistics classes about skewed data, in claims analysis we usually DO want to look at MEANS not MEDIANS.

2) Why logging claims and then running analysis on the logged data to deal with skewness is probably not the best practice in this context.

3) How adding a small constant number to zero values prior to logging can lead to estimates that are very sensitive to the choice of constant value.

4) Why in many cases it could be a bad idea to exclude ‘high cost claimants’ from an analysis without good reasons. This probably should not be an arbitrary routine practice.

5)When and why you may or may not prefer ‘2-part models’

Note: Utilization data like ER visits, primary care visits, and hospital admissions are also typically non-negative and skewed with high mass points. Utilization can be modeled as counts using Poisson, negative binomial, or zero-inflated Poisson and zero-inflated negative binomial models in a GLM framework, although not discussed here.
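A quick simulation illustrates principles 1 and 5 above. This is a sketch with hypothetical parameters (the zero-mass share and lognormal parameters are assumptions, not real claims data), showing why the mean rather than the median drives totals, and how a two-part view decomposes the mean:

```python
import math
import random

random.seed(0)

# Simulated claims: ~60% zero mass plus a right-skewed lognormal
# positive part (hypothetical shape, chosen to mimic claims data).
claims = [0.0 if random.random() < 0.6 else random.lognormvariate(7, 1.5)
          for _ in range(10_000)]

mean_cost = sum(claims) / len(claims)
median_cost = sorted(claims)[len(claims) // 2]

# Principle 1: the median ignores the zero mass and the heavy right
# tail; total spend (and hence budgets) is driven by the mean.
print(median_cost)               # 0.0
print(mean_cost > median_cost)   # True

# Principle 5: a two-part decomposition of the mean,
# E[Y] = P(Y > 0) * E[Y | Y > 0]
positives = [c for c in claims if c > 0]
p_pos = len(positives) / len(claims)
mean_positive = sum(positives) / len(positives)
print(math.isclose(mean_cost, p_pos * mean_positive))  # True
```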


Mullahy, John. "Much Ado Abut Two: Reconsidering Retransformation And The Two-Part Model In Health Econometrics," Journal of Health Economics, 1998, v17(3,Jun), 247-281.

Liu L, Cowen ME, Strawderman RL, Shih Y-CT. A Flexible Two-Part Random Effects Model for Correlated Medical Costs. Journal of health economics. 2010;29(1):110-123. doi:10.1016/j.jhealeco.2009.11.010.

Buntin, Melinda Beeuwkes and Alan M. Zaslavsky. Too Much Ado about Two-Part Models and Transformation? Comparing Methods of Modeling Medicare Expenditures. Journal of Health Economics 23 (2004): 525-542.

Mihaylova B, Briggs A, O'Hagan A, Thompson SG. Review of Statistical Methods for Analysing Healthcare Resources and Costs. Health Economics 20 (2011): 897-916.

Manning WG, Basu A, Mullahy J. Generalized Modeling Approaches to Risk Adjustment of Skewed Outcomes Data. Journal of Health Economics. 2005;24(3):465-488.

Mullahy, John. Econometric Modeling of Health Care Costs and Expenditures: A Survey of Analytical Issues and Related Policy Considerations. Medical Care. Vol. 47, No. 7, Supplement 1: Health Care Costing: Data, Methods, Future Directions (Jul., 2009), pp. S104-S108.

Griswold M, Parmigiani G, Potosky A, Lipscomb J. Analyzing Health Care Costs: A Comparison of Statistical Methods Motivated by Medicare Colorectal Cancer Charges. Biostatistics (2004), 1, 1, pp. 1-23.

Manning, Willard G. and John Mullahy. Estimating Log Models: To Transform or Not to Transform? Journal of Health Economics 20 (2001): 461-494.

Angrist, J.D. Estimation of Limited Dependent Variable Models With Dummy Endogenous Regressors: Simple Strategies for Empirical Practice. Journal of Business & Economic Statistics January 2001, Vol. 19, No. 1.

Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for Analyzing Health Care Utilization and Costs. Annu Rev Public Health (1999) 20:125-144.
Lachenbruch P. A. 2001. “Comparisons of two-part models with competitors” Statistics in Medicine, 20:1215–1234.

Lachenbruch P.A. 2001. “Power and sample size requirements for two-part models” Statistics in Medicine, 20:1235–1238.


Friday, December 21, 2018

Thinking About Confidence Intervals: Horseshoes and Hand Grenades

In a previous post, Confidence Intervals: Fad or Fashion, I wrote about Dave Giles' post on interpreting confidence intervals. A primary focus of these discussions was how confidence intervals are often misinterpreted. For instance, the two statements below are common mischaracterizations of CIs:

1) There's a 95% probability that the true value of the regression coefficient lies in the interval [a,b].
2) This interval includes the true value of the regression coefficient 95% of the time.

You can read the previous post or Dave's post for more details. But in re-reading Dave's post recently, one statement had me thinking:

"So, the first interpretation I gave for the confidence interval in the opening paragraph above is clearly wrong. The correct probability there is not 95% - it's either zero or 100%! The second interpretation is also wrong. "This interval" doesn't include the true value 95% of the time. Instead, 95% of such intervals will cover the true value."

I like the way he put that: '95% of such intervals' distinguishes the procedure from a particular observed/calculated confidence interval. I think someone trained to think about CIs in the incorrect probabilistic way may have trouble getting at this. So how might we think about CIs in a way that is still useful, but doesn't get us tripped up with incorrect probability statements?

My favorite statistics text is DeGroot's Probability and Statistics. In the 4th edition, the authors are very careful about explaining confidence intervals:

"Once we compute the observed values of a and b, the observed interval (a,b) is not so easy to interpret....Before observing the data we can be 95% confident that the random interval (A,B) will contain mu, but after observing the data, the safest interpretation is that (a,b) is simply the observed value of the random interval (A,B)"

While DeGroot is careful, the explanation still may not be very intuitive. However, in Principles and Procedures of Statistics: A Biometrical Approach, Steel, Torrie, and Dickey present a more intuitive explanation:

"since mu will either be or not be in the interval, that is P=0 or 1, the probability will actually be a measure of confidence we placed in the procedure that led to the statement. This is like throwing a ring at a fixed post; the ring doesn't land in the same position or even catch on the post every time. However we are able to say that we can circle the post 9 times out of 10, or whatever the value should be for the measure of our confidence in our proficiency."

The ring-tossing analogy seems to work pretty well. I'll customize it by using horseshoes instead. Yes, 95 out of 100 times you might throw a ringer (in the game of horseshoes, that is when the horseshoe circles the peg or stake when you toss it). You know this before you toss it. And to use Dave Giles' language, *before* calculating a confidence interval we know that 95% of such intervals will cover the population parameter of interest. And, after we toss the shoe, it either circles the peg or it doesn't: a probability of 1 or 0. Similarly, *after* computing a confidence interval, the true mean or population parameter of interest is either covered or not, with probability 0 or 100%.
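The 'tosses' here are easy to simulate. The numpy sketch below repeats the interval-building procedure many times on simulated normal data (the true mean, spread, sample size, and normal-approximation interval are all assumptions for illustration) and tallies how often the resulting intervals cover the true mean:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 10.0, 2.0, 50, 5_000

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    # normal-approximation 95% interval for the mean
    half = 1.96 * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half <= mu <= x.mean() + half)

# The procedure covers mu close to 95% of the time, even though any single
# computed interval either covers mu or it doesn't
print(covered / reps)
```

This is exactly the 'ringer rate': a property of the tossing procedure over many throws, not of any one horseshoe lying on the ground.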

This isn't perfect, but thinking of confidence intervals this way at least keeps us honest about making probability statements.

Going back to my previous post, I still like the description of confidence intervals Angrist and Pischke provide in Mastering 'Metrics, that is 'describing a set of parameter values consistent with our data.'

For instance, if we run the regression:

y = b0 + b1X + e  to estimate the population model  y = B0 + B1X + e

and get our parameter estimate b1 with a 95% confidence interval like (1.2, 1.8), we can say that our sample data is consistent with any population in which B1 takes a value in that interval. That implies there are a number of populations our data would be consistent with. Narrower intervals imply very similar populations, very similar values of B1, and speak to more precision in our estimate of B1.

I really can't make an analogy for hand grenades. It just gave me a title with a ring to it.

See also:
Interpreting Confidence Intervals
Bayesian Statistics Confidence Intervals and Regularization
Overconfident Confidence Intervals