Healthcare claims present challenges to the applied econometrician. Claims costs typically exhibit a large number of zero values (high zero mass), extreme skewness, and heteroskedasticity. Below is a histogram depicting the distributional properties typical of claims data.
The literature (see references below) addresses a number of approaches (e.g., log models, GLMs, and two-part models) often used for modeling claims data. However, even after spending significant time over the years combing through the literature, one can be left with many unanswered questions, or with several seemingly plausible answers to the same question. It can be confusing.
The Department of Veterans Affairs runs a series of healthcare econometrics cyberseminars covering these topics. In particular, they have two video lectures devoted to modeling healthcare costs as a dependent variable.
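For orientation, the three approaches can be written down in standard textbook form. The notation is an assumption introduced here and is not taken from the lectures: y is the claims cost, x the covariates, β the coefficients, and ε the regression error.

```latex
% Log model: OLS on ln(y). Undefined at y = 0, and retransforming to
% dollars requires the error term, so exp(x*beta) alone is in general
% not the conditional mean.
\ln y = x\beta + \varepsilon
\quad\Longrightarrow\quad
E[y \mid x] = e^{x\beta}\, E\!\left[e^{\varepsilon} \mid x\right]

% GLM with a log link: models the mean on the dollar scale directly,
% so zeros are allowed and no retransformation is needed.
\ln E[y \mid x] = x\beta

% Two-part model: probability of any cost times expected cost given
% that cost is positive.
E[y \mid x] = \Pr(y > 0 \mid x)\; E[y \mid y > 0,\, x]
```

The first line is where heteroskedasticity causes trouble: if the distribution of ε varies with x, the retransformation factor varies with x as well.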
Principles discussed include:
1) Despite what is taught in a lot of statistics classes about skewed data, in claims analysis we usually DO want to look at MEANS not MEDIANS.
2) Why logging claims and then running analysis on the logged data to deal with skewness is probably not the best practice in this context.
3) How adding a small constant number to zero values prior to logging can lead to estimates that are very sensitive to the choice of constant value.
4) Why in many cases it could be a bad idea to exclude ‘high cost claimants’ from an analysis without good reasons. This probably should not be an arbitrary routine practice.
5) When and why you may or may not prefer ‘2-part models’.
Note: Utilization data such as ER visits, primary care visits, and hospital admissions are also typically non-negative and skewed, with a large mass point at zero. Although not discussed here, utilization can be modeled as counts using Poisson, negative binomial, zero-inflated Poisson, or zero-inflated negative binomial models in a GLM framework.
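Point 3 is easy to see in a small simulation. The sketch below is illustrative only: the data are simulated, and the group sizes, zero shares, and lognormal parameters are all hypothetical choices rather than values from the lectures. Two groups share the same distribution of positive costs and differ only in how many members have zero claims. The gap in mean log(cost + constant) is then compared across three constants, alongside the ratio of raw means.

```python
import math
import random
import statistics

def simulate_claims(n, share_zero, rng, mu=7.0, sigma=1.5):
    """Simulate n annual claim costs: a spike at zero plus a lognormal right tail."""
    return [0.0 if rng.random() < share_zero else rng.lognormvariate(mu, sigma)
            for _ in range(n)]

def log_gap(group_b, group_a, constant):
    """Difference in mean log(cost + constant) between two groups."""
    def mean_log(costs):
        return statistics.fmean(math.log(y + constant) for y in costs)
    return mean_log(group_b) - mean_log(group_a)

rng = random.Random(42)                      # fixed seed for reproducibility
group_a = simulate_claims(5000, 0.50, rng)   # hypothetical: 50% have no claims
group_b = simulate_claims(5000, 0.30, rng)   # hypothetical: 30% have no claims

# Benchmark on the dollar scale: ratio of raw mean costs.
mean_ratio = statistics.fmean(group_b) / statistics.fmean(group_a)

# Same data, same comparison, three different added constants.
gaps = {c: log_gap(group_b, group_a, c) for c in (0.01, 1.0, 100.0)}

print(f"ratio of raw means: {mean_ratio:.2f}")
for c, gap in gaps.items():
    print(f"constant={c:>6}: log gap={gap:.2f}, implied ratio={math.exp(gap):.2f}")
```

Because the zeros are mapped to log(constant), the estimated gap shrinks as the constant grows, even though the underlying data never change. The ratio of raw means does not depend on any such choice.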
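Before choosing among these count models, a quick descriptive check can indicate whether a plain Poisson model is plausible. The sketch below uses a small made-up vector of ER visit counts (hypothetical data, not from any source cited here). It compares the variance to the mean and the observed share of zeros to the share a Poisson distribution with the same mean would imply.

```python
import math
import statistics

def count_diagnostics(counts):
    """Descriptive checks for overdispersion and excess zeros in count data."""
    mean = statistics.fmean(counts)
    variance = statistics.variance(counts)   # sample variance
    return {
        "mean": mean,
        "dispersion": variance / mean,        # about 1 under a Poisson model
        "observed_zero_share": counts.count(0) / len(counts),
        "poisson_zero_share": math.exp(-mean),  # P(Y = 0) for Poisson(mean)
    }

# Hypothetical ER visit counts for 100 members.
er_visits = [0] * 70 + [1] * 15 + [2] * 8 + [5] * 4 + [12] * 3

diag = count_diagnostics(er_visits)
for name, value in diag.items():
    print(f"{name}: {value:.3f}")
```

A dispersion ratio well above 1 points toward a negative binomial model, and an observed zero share well above the Poisson-implied share is one informal signal that a zero-inflated variant may be worth considering. This is a descriptive screen only, not a formal test.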
References:
Mullahy, John. "Much Ado About Two: Reconsidering Retransformation And The Two-Part Model In Health Econometrics," Journal of Health Economics, 1998, v17(3,Jun), 247-281.
Liu L, Cowen ME, Strawderman RL, Shih Y-CT. A Flexible Two-Part Random Effects Model for Correlated Medical Costs. Journal of health economics. 2010;29(1):110-123. doi:10.1016/j.jhealeco.2009.11.010.
Buntin MB, Zaslavsky AM. Too much ado about two-part models and transformation? Comparing methods of modeling Medicare expenditures. Journal of Health Economics 23 (2004) 525–542.
Mihaylova B, Briggs A, O’Hagan A, Thompson SG. Review of statistical methods for analysing healthcare resources and costs. Health Econ. 20: 897–916 (2011).
Manning WG, Basu A, Mullahy J. Generalized modeling approaches to risk adjustment of skewed outcomes data. J Health Econ. 2005 May;24(3):465-88.
Mullahy, John. Econometric Modeling of Health Care Costs and Expenditures: A Survey of Analytical Issues and Related Policy Considerations. Medical Care. Vol. 47, No. 7, Supplement 1: Health Care Costing: Data, Methods, Future Directions (Jul., 2009), pp. S104-S108.
Griswold M, Parmigiani G, Potosky A, Lipscomb J. Analyzing Health Care Costs: A Comparison of Statistical Methods Motivated by Medicare Colorectal Cancer Charges. Biostatistics (2004), 1, 1, pp. 1–23.
Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics 20 (2001) 461–494.
Angrist, J.D. Estimation of Limited Dependent Variable Models With Dummy Endogenous Regressors: Simple Strategies for Empirical Practice. Journal of Business & Economic Statistics January 2001, Vol. 19, No. 1.
Lachenbruch P. A. 2001. “Comparisons of two-part models with competitors” Statistics in Medicine, 20:1215–1234.
Lachenbruch P.A. 2001. “Power and sample size requirements for two-part models” Statistics in Medicine, 20:1235–1238.
Diehr, P., Yanez, D., Ash, A., Hornbrook, M. & Lin, D. Y. 1999. “Methods for analyzing health care utilization and costs.” Annu. Rev. Public Health, 20:125–44.