Thursday, January 24, 2019

Modeling Claims with Linear vs. Non-Linear Difference-in-Difference Models

Previously I have discussed the issues with modeling claims costs. Typically, medical claims exhibit non-negative, highly skewed values with high zero mass and heteroskedasticity. The approach most commonly suggested in the literature for addressing these distributional concerns calls for non-linear GLM models. However, as previously discussed (see here and here), there are challenges with using difference-in-differences in the context of GLM models. So once again, the gap between theory and application presents challenges, tradeoffs, and compromises that need to be made by the applied econometrician.

In the past I have written about the accepted (although controversial in some circles) practice of leveraging linear probability models to estimate marginal effects in applied work when outcomes are dichotomous. But what about doing this in the context of claims analysis? In my original post regarding the challenges of using difference-in-differences with claims I speculated:

"So as Angrist and Pischke might ask, what is an applied guy to do? One approach even in the context of skewed distributions with high mass points (as is common in the healthcare econometrics space) is to specify a linear model. For count outcomes (utilization like ER visits or hospital admissions are often dichotomized and modeled by logit or probit models) you can just use a linear probability model. For skewed distributions with heavy mass points, dichotomization with a LPM may also be an attractive alternative."

I have found that this advice is pretty consistent with the social norms and practices in the field.

In their analysis of the ACA, Cantor et al. (2012) leverage linear probability models in difference-in-differences analyses of healthcare utilization, stating:

"Linear probability models are fit to produce coefficients that are direct estimates of the relevant policy impacts and are easily interpreted as percentage point changes in coverage outcomes. This approach has been applied in earlier evaluations of insurance market reforms (Buchmueller and DiNardo 2002; Monheit and Steinberg Schone 2004;  Levine, McKnight, and Heep 2011;  Monheit et al. 2011). It also avoids complications associated with estimation and interpretation of multiple interaction terms and their standard errors in logit or probit models (Ai and Norton 2003)."

Jhamb et al. (2015) use LPMs for dichotomous outcomes as well as OLS models for counts in a DID framework.
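
As a rough illustration of the linear probability DID idea these papers rely on, here is a minimal sketch on simulated data. The function name `lpm_did`, the cell probabilities, and the sample sizes are all made up for illustration. It leans on the fact that in a saturated 2x2 model (y = b0 + b1*treat + b2*post + b3*treat*post) the OLS coefficients are just differences of cell means, so no regression library is needed. A real analysis would add covariates and appropriate (e.g. clustered) standard errors.

```python
import random
from statistics import mean

def lpm_did(rows):
    """rows: iterable of (treat, post, y) with y in {0, 1}.

    Returns the OLS coefficients of the saturated linear probability model
    y = b0 + b1*treat + b2*post + b3*treat*post. With only these dummies,
    the coefficients equal differences of the four cell means.
    """
    cells = {(t, p): [] for t in (0, 1) for p in (0, 1)}
    for treat, post, y in rows:
        cells[(treat, post)].append(y)
    m = {key: mean(values) for key, values in cells.items()}
    return {
        "intercept": m[(0, 0)],
        "treat": m[(1, 0)] - m[(0, 0)],
        "post": m[(0, 1)] - m[(0, 0)],
        # Interaction term: the difference-in-differences estimate.
        "did": (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)]),
    }

# Simulated dichotomized outcome (e.g. "any ER visit"); probabilities are assumptions.
random.seed(0)
TRUE_P = {(0, 0): 0.30, (0, 1): 0.32, (1, 0): 0.35, (1, 1): 0.30}
rows = [
    (t, p, int(random.random() < prob))
    for (t, p), prob in TRUE_P.items()
    for _ in range(20000)
]

est = lpm_did(rows)
print(f"DID estimate: {est['did'] * 100:.1f} percentage points")
```

The interaction coefficient reads directly as a percentage point change, which is the interpretability argument made in the Cantor et al. quote above.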

Interestingly, Deb and Norton (2018) discuss an approach to address the challenges of DID in a GLM framework head on:

"Puhani argued, using the potential outcomes framework, that the treatment effect on the treated in the difference-in-difference regression equals the expected value of the dependent variable for the treatment group in the post period with treatment compared with the hypothetical expected value of the dependent variable for the treatment group in the post period if they had not received treatment. In nonlinear models, the treatment effect on the treated equals the difference in two predicted values. It always has the same sign as the coefficient on the interaction term. Because we estimate many nonlinear models using a difference-in-differences study design, we report the treatment effect on the treated in all tables of results."

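To make the quoted argument concrete, here is one way to write it down. The notation is my own assumption, not Deb and Norton's: T is the treatment-group indicator, P is the post-period indicator, and g^{-1} is a strictly increasing inverse link.

```latex
% Assumed notation: T = treatment-group indicator, P = post-period indicator,
% g^{-1} = strictly increasing inverse link (e.g. exp for a log link).
\begin{align*}
E[Y \mid T, P, X] &= g^{-1}\!\left(\beta_0 + \beta_1 T + \beta_2 P + \beta_3 T P + X\gamma\right) \\
\tau(X) &= g^{-1}\!\left(\beta_0 + \beta_1 + \beta_2 + \beta_3 + X\gamma\right)
         - g^{-1}\!\left(\beta_0 + \beta_1 + \beta_2 + X\gamma\right) \\
\text{log link:}\quad \tau(X) &= e^{\beta_0 + \beta_1 + \beta_2 + X\gamma}\left(e^{\beta_3} - 1\right)
\end{align*}
```

Because g^{-1} is strictly increasing, the treatment effect on the treated is positive exactly when the interaction coefficient is positive, which is the sign property in the quote. Unlike the linear case, its magnitude depends on the covariates and the other coefficients.
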
In presenting their results, they compare their GLM-based approach to results from linear models of healthcare expenditures. While they argue that the differences are substantial and support their approach, I did not find the OLS estimate (-$323.4) to be practically different from the second part (conditional on positive) of the two-part GLM model (-$321.4), although the combined results from the two-part model did differ from OLS by a practically large amount. It does not appear that they compared a two-part GLM to a two-part linear model (which could be problematic if the first-part OLS model gave probabilities greater than one or less than zero). Their paper also cites a number of authors who use linear difference-in-differences to model claims.
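
To illustrate why the conditional-on-positive part and the combined two-part result can tell different stories, here is a minimal sketch using made-up cell-level costs. The numbers and the helpers `two_part_mean` and `did` are hypothetical, not from Deb and Norton. It uses the identity E[cost] = Pr(cost > 0) x E[cost | cost > 0] on raw cell means with no covariates, so it is a stylized illustration rather than a fitted two-part GLM.

```python
from statistics import mean

def two_part_mean(costs):
    """Return (Pr(cost > 0), E[cost | cost > 0], product of the two)."""
    positive = [c for c in costs if c > 0]
    p_any = len(positive) / len(costs)
    mean_pos = mean(positive) if positive else 0.0
    return p_any, mean_pos, p_any * mean_pos

def did(values):
    """Difference-in-differences of a dict keyed by (group, period)."""
    return ((values[("treated", "post")] - values[("treated", "pre")])
            - (values[("control", "post")] - values[("control", "pre")]))

# Hypothetical per-member claims costs in each group/period cell.
cells = {
    ("control", "pre"): [0, 0, 100, 300],
    ("control", "post"): [0, 0, 200, 400],
    ("treated", "pre"): [0, 0, 200, 600],
    ("treated", "post"): [0, 0, 0, 500],
}

parts = {key: two_part_mean(costs) for key, costs in cells.items()}
did_conditional = did({key: p[1] for key, p in parts.items()})  # second part only
did_combined = did({key: p[2] for key, p in parts.items()})     # both parts

print(f"DID, conditional on positive: {did_conditional:.1f}")  # 0.0
print(f"DID, combined two-part mean:  {did_combined:.1f}")     # -125.0
```

In this toy example the treatment changes only the probability of having any claims, so the second part alone shows no effect while the combined estimate does.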

See the references below for a number of examples (including those cited above).

Related: Linear Literalism and Fundamentalist Econometrics


Cantor JC, Monheit AC, DeLia D, Lloyd K. Early impact of the Affordable Care Act on health insurance coverage of young adults. Health Serv Res. 2012;47(5):1773-90.

Deb P, Norton EC. Modeling health care expenditures and use. Annu. Rev. Public Health. 2018;39(1):489-505.

Buchmueller T, DiNardo J. “Did Community Rating Induce an Adverse Selection Death Spiral? Evidence from New York, Pennsylvania and Connecticut” American Economic Review. 2002;92(1):280–94.

Monheit AC, Cantor JC, DeLia D, Belloff D. “How Have State Policies to Expand Dependent Coverage Affected the Health Insurance Status of Young Adults?” Health Services Research. 2011;46(1 Pt 2):251–67

Amuedo-Dorantes C, Yaya ME. 2016. The impact of the ACA’s extension of coverage to dependents on young adults’ access to care and prescription drugs. South. Econ. J. 83:25–44

Barbaresco S, Courtemanche CJ, Qi Y. 2015. Impacts of the Affordable Care Act dependent coverage provision on health-related outcomes of young adults. J. Health Econ. 40:54–68

Jhamb J, Dave D, Colman G. 2015. The Patient Protection and Affordable Care Act and the utilization of health care services among young adults. Int. J. Health Econ. Dev. 1:8–25

Sommers BD, Buchmueller T, Decker SL, Carey C, Kronick R. 2013. The Affordable Care Act has led to significant gains in health insurance and access to care for young adults. Health Aff. 32:165–74

Modeling Healthcare Claims as a Dependent Variable

Healthcare claims present challenges to the applied econometrician. Claims costs typically exhibit a large number of zero values (high zero mass), extreme skewness, and heteroskedasticity. Below is a histogram depicting the distributional properties typical of claims data.

The literature (see references below) addresses a number of approaches (e.g. log models, GLM, and two-part models) often used for modeling claims data. However, without proper context the literature can leave one with a lot of unanswered questions, or several seemingly plausible answers to the same question.

The Department of Veterans Affairs runs a series of healthcare econometrics cyberseminars covering these topics. In particular, they have two video lectures devoted to modeling healthcare costs as a dependent variable.

Principles discussed include:

1) Despite what is taught in a lot of statistics classes about skewed data, in claims analysis we usually DO want to look at MEANS not MEDIANS.

2) Why logging claims and then running analysis on the logged data to deal with skewness is probably not the best practice in this context.

3) How adding a small constant number to zero values prior to logging can lead to estimates that are very sensitive to the choice of constant value.

4) Why in many cases it could be a bad idea to exclude ‘high cost claimants’ from an analysis without good reasons. This probably should not be an arbitrary routine practice.

5) When and why you may or may not prefer ‘2-part models’
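
Principles 1 through 3 can be sketched with a toy example. The cost vector below is made up, and `backtransformed_log_mean` is a hypothetical helper that mimics the naive "log, average, exponentiate" workflow; it is not a method from the cyberseminars.

```python
import math
from statistics import mean, median

# Hypothetical annual claims costs: many zeros and one very high cost claimant.
costs = [0, 0, 0, 0, 50, 120, 300, 900, 4000, 60000]

def backtransformed_log_mean(values, constant):
    """Naive summary: exp(mean(log(y + c))) - c, where c is added to handle zeros."""
    mean_log = mean([math.log(y + constant) for y in values])
    return math.exp(mean_log) - constant

arithmetic_mean = mean(costs)   # ties directly to total spend: mean * n
typical_median = median(costs)  # ignores the tail that drives total spend
naive = {c: backtransformed_log_mean(costs, c) for c in (0.01, 1.0, 100.0)}

print(f"mean = {arithmetic_mean:.0f}, median = {typical_median:.0f}")
for constant, value in naive.items():
    print(f"constant = {constant:>6}: back-transformed log mean = {value:.1f}")
```

The mean times the number of members recovers total spend, which is why means matter for claims. The back-transformed log summary falls far below the arithmetic mean and moves substantially as the added constant changes.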

Note: Utilization data like ER visits, primary care visits, and hospital admissions are also typically non-negative and skewed with high mass points. Although not discussed here, utilization can be modeled as counts using Poisson, negative binomial, zero-inflated Poisson, or zero-inflated negative binomial models in a GLM framework.


Mullahy, John. "Much Ado About Two: Reconsidering Retransformation And The Two-Part Model In Health Econometrics," Journal of Health Economics, 1998, v17(3,Jun), 247-281.

Liu L, Cowen ME, Strawderman RL, Shih Y-CT. A Flexible Two-Part Random Effects Model for Correlated Medical Costs. Journal of health economics. 2010;29(1):110-123. doi:10.1016/j.jhealeco.2009.11.010.

Too much ado about two-part models and transformation? Comparing methods of modeling Medicare expenditures. Melinda Beeuwkes Buntin, Alan M. Zaslavsky. Journal of Health Economics 23 (2004) 525–542

Health Econ. 20: 897–916 (2011)

Manning WG, Basu A, Mullahy J. Generalized modeling approaches to risk adjustment of skewed outcomes data. J Health Econ. 2005 May;24(3):465-88.

Econometric Modeling of Health Care Costs and Expenditures: A Survey of Analytical Issues and Related Policy Considerations. John Mullahy. Medical Care. Vol. 47, No. 7, Supplement 1: Health Care Costing: Data, Methods, Future Directions (Jul., 2009), pp. S104-S108

Analyzing Health Care Costs: A Comparison of Statistical Methods Motivated by Medicare Colorectal Cancer Charges. Michael Griswold, Giovanni Parmigiani, Arnie Potosky, Joseph Lipscomb. Biostatistics (2004), 1, 1, pp. 1–23

Estimating log models: to transform or not to transform? Willard G. Manning and John Mullahy. Journal of Health Economics 20 (2001) 461–494

Angrist, J.D. Estimation of Limited Dependent Variable Models With Dummy Endogenous Regressors: Simple Strategies for Empirical Practice. Journal of Business & Economic Statistics January 2001, Vol. 19, No. 1.

Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. 1999. “Methods for analyzing health care utilization and costs” Annu. Rev. Public Health, 20:125–44.

Lachenbruch P.A. 2001. “Comparisons of two-part models with competitors” Statistics in Medicine, 20:1215–1234.

Lachenbruch P.A. 2001. “Power and sample size requirements for two-part models” Statistics in Medicine, 20:1235–1238.
