The literature (see references below) covers a number of approaches (e.g., log models, GLMs, and two-part models) often used to model claims data. Without proper context, however, it can leave one with many unanswered questions, or with several seemingly plausible answers to the same question.

The Department of Veterans Affairs runs a series of healthcare econometrics cyberseminars covering these topics. In particular, the series includes two video lectures devoted to modeling healthcare costs as a dependent variable.

https://www.hsrd.research.va.gov/cyberseminars/series.cfm#hec3

Principles discussed include:

1) Despite what many statistics classes teach about skewed data, in claims analysis we usually DO want to look at MEANS, not MEDIANS.

2) Why logging claims and then analyzing the logged data to deal with skewness is probably not the best practice in this context.

3) How adding a small constant to zero values before logging can lead to estimates that are very sensitive to the choice of constant.

4) Why in many cases it could be a bad idea to exclude ‘high cost claimants’ from an analysis without good reason. This probably should not be an arbitrary, routine practice.

5) When and why you may or may not prefer ‘2-part models’.
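
Points 1, 2, 3 and 5 can be illustrated with a small simulation. This is only a minimal sketch on made-up data, not anything taken from the lectures: the 30% share of zero-cost members, the lognormal parameters, and the helper name `backtransformed_log_mean` are all assumptions chosen for illustration. It compares the arithmetic mean with the median, with a naive "log, average, exponentiate" estimate under different added constants, and with the simple two-part identity E[y] = P(y > 0) * E[y | y > 0].

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical claims data (an assumption, not real data): about 30% of
# members have zero cost, the rest follow a right-skewed lognormal.
costs = [
    0.0 if random.random() < 0.3 else random.lognormvariate(7.0, 1.5)
    for _ in range(10_000)
]

arithmetic_mean = statistics.fmean(costs)
median = statistics.median(costs)


def backtransformed_log_mean(values, constant):
    """Average log(value + constant), then exponentiate and subtract the constant."""
    mean_log = statistics.fmean(math.log(v + constant) for v in values)
    return math.exp(mean_log) - constant


# The same naive estimator under three arbitrary choices of constant.
naive_estimates = {
    c: backtransformed_log_mean(costs, c) for c in (0.01, 1.0, 100.0)
}

# Two-part decomposition of the mean: P(y > 0) * E[y | y > 0].
positive = [v for v in costs if v > 0]
share_positive = len(positive) / len(costs)
two_part_mean = share_positive * statistics.fmean(positive)

print(f"arithmetic mean: {arithmetic_mean:,.0f}")
print(f"median:          {median:,.0f}")
for constant, estimate in naive_estimates.items():
    print(f"log(y + {constant:g}) back-transformed: {estimate:,.0f}")
print(f"two-part mean:   {two_part_mean:,.0f}")
```

In a run like this the back-transformed estimates fall well below the arithmetic mean and move substantially as the constant changes, while the two-part decomposition reproduces the arithmetic mean. Only the mean multiplied by the number of members recovers total spend, which is one reason means tend to matter in claims work.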

Note: Utilization data such as ER visits, primary care visits, and hospital admissions are also typically non-negative and skewed with large mass points. Although not discussed here, utilization can be modeled as counts in a GLM framework using Poisson, negative binomial, zero-inflated Poisson, or zero-inflated negative binomial models.
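
As a rough illustration of why more than one count model is on that list, the sketch below simulates visit counts and runs two descriptive checks: whether the variance exceeds the mean (overdispersion) and whether there are more zeros than a Poisson distribution with the same mean would imply. It is a diagnostic on made-up data, not a model fit. The gamma-mixed Poisson data-generating process, its parameters, and the helper `draw_poisson` are assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible


def draw_poisson(rate):
    """Draw one Poisson count with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, random.random()
    while product > threshold:
        count += 1
        product *= random.random()
    return count


# Hypothetical visit counts (an assumption): each member has their own
# visit rate drawn from a gamma distribution with mean 1, which yields
# negative-binomial-like, overdispersed counts.
visits = [draw_poisson(random.gammavariate(0.5, 2.0)) for _ in range(10_000)]

mean_visits = statistics.fmean(visits)
variance_visits = statistics.variance(visits)

# Share of zeros observed vs. the share a Poisson with the same mean implies.
observed_zero_share = visits.count(0) / len(visits)
poisson_zero_share = math.exp(-mean_visits)

print(f"mean: {mean_visits:.2f}, variance: {variance_visits:.2f}")
print(f"zeros observed: {observed_zero_share:.1%}, Poisson-implied: {poisson_zero_share:.1%}")
```

A variance well above the mean is the usual signal that a plain Poisson model may be too restrictive. An excess of zeros relative to the Poisson-implied share is one reason analysts might consider the negative binomial or zero-inflated variants.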

