See also: Count Models with Offsets: Practical Applications using R
Principles of Count Model Regression
Often we want to model the impact of some
intervention or differences between groups in relation to an outcome
that is a count. Examples of counts may be related to medical care
utilization such as office visits or days in the hospital
or total hospital admissions.
"In all cases the data are concentrated on a few
small discrete values, say 0, 1 and 2; skewed to the left; and
intrinsically heteroskedastic with variance increasing with the mean.
These features motivate the application of special
methods and models for count regression." (Cameron and Trivedi, 1999).
Poisson regression is "the starting point for count data analysis, though it is often inadequate"
(Cameron and Trivedi, 1999). The "main focus is the effect of covariates on the frequency of an event, measured by non-negative integer values or counts" (Cameron and Trivedi, 1996).
From Cameron and Trivedi (1999): "The Poisson
regression model is derived from the Poisson distribution by
parameterizing the relation between the mean parameter μ and covariates
(regressors) x. The standard assumption is to use the exponential
mean parameterization"
E(Y|x) = μ = exp(xβ), where xβ = β0 + βx
Equivalently:
Log(μ) = xβ
Here β is the change in the log of the expected count
for a one-unit change in x. Equivalently, E(Y|x) changes
by a factor of exp(β). Often exp(β) is interpreted as an ‘incidence rate
ratio’, although in many cases ‘rate’ and
‘mean’ are used interchangeably (Williams, 2016).
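The multiplicative reading of exp(β) can be checked numerically. The sketch below is only an illustration: the coefficient values `beta0` and `beta` are hypothetical and do not come from any fitted model.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not from a fitted model).
beta0 = 0.5   # intercept on the log scale
beta = 0.3    # slope on the log scale

def expected_count(x):
    """Exponential mean parameterization: E(Y|x) = exp(beta0 + beta * x)."""
    return math.exp(beta0 + beta * x)

mu_at_2 = expected_count(2.0)
mu_at_3 = expected_count(3.0)

# A one-unit increase in x adds beta to log(mu) ...
log_difference = math.log(mu_at_3) - math.log(mu_at_2)
# ... which is the same as multiplying mu by exp(beta).
ratio = mu_at_3 / mu_at_2

print(f"change in log mean: {log_difference:.4f}")
print(f"ratio of means:     {ratio:.4f}")
print(f"exp(beta):          {math.exp(beta):.4f}")
```

The ratio of the two means matches exp(β) regardless of the starting value of x, which is why the same coefficient can be read either as a change in the log mean or as a multiplicative factor.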
Are we modeling mean counts or rates, and does it matter?
We might often think of our data as counts, but
count models assume that counts are always observed within time or
space. This gives rise to an implicit rate interpretation of counts. For
example, the probability mass function for a Poisson
process can be specified as:
P(Y = y | μ) = exp(-μ) * μ^y / y!
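As a quick numerical check of this mass function, the sketch below evaluates it for a hypothetical mean `mu = 1.5` and confirms that the probabilities sum to one and that the mean and variance both equal μ. The helper name `poisson_pmf` and the truncation at 60 terms are assumptions made for the illustration.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y | mu) = exp(-mu) * mu**y / y! for y = 0, 1, 2, ..."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

mu = 1.5  # hypothetical expected count per interval

# Truncate the infinite support at 60 terms; the remaining tail is negligible here.
probs = [poisson_pmf(y, mu) for y in range(60)]

total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"sum of probabilities: {total:.6f}")
print(f"mean:                 {mean:.6f}")
print(f"variance:             {variance:.6f}")
for y in range(4):
    print(f"P(Y = {y}) = {probs[y]:.4f}")
```

Because the variance equals the mean, a larger μ automatically implies a larger spread, which is the heteroskedasticity noted in the Cameron and Trivedi quote.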
The Poisson process described above gives us the
probability of ‘y’ events occurring during a given interval. The
parameter μ is the expected count or average number of occurrences in
the specified or implied interval. Often, but not always,
this interval is measured in units of time (e.g. ER visits per year); it can also be spatial (e.g.
total leaks in 1 mile of pipeline). So even if we think of our outcome
as counts, we are actually modeling a rate per some implicit interval of
observation, whether we think of that explicitly
or not. This is what gives us the incidence rate ratio (IRR)
interpretation of exponentiated coefficients. If we think of rates as:
Rate = count/t
where t is the time, interval of observation, or
exposure. If all participants are assumed to have a common t (for example, t = 1, or everyone is observed for the same period of time), this
is essentially like dividing by 1, and our rate is for all practical
purposes interpreted as a count with an implied rate.
As noted in Cameron and Trivedi (1996) count models
and duration (survival) models can be viewed as duals. If observed
units have different levels of exposure or duration or intervals of
observation this is accounted for in count models
through inclusion of an offset. Inclusion of an offset creates a model
that explicitly considers interval ‘tx’ where tx represents exposure time for individuals with covariate value x:
Log(μ/tx) = xβ
Here we are explicitly specifying a rate based on time ‘tx’.
Re-arranging terms we get:
Log(μ) – Log(tx) = xβ
Log(μ) = xβ + Log(tx)
The term Log(tx) is referred to as an ‘offset.’
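The rearrangement above implies μ = tx * exp(xβ): the offset enters with its coefficient fixed at 1, so the expected count scales in proportion to exposure while the rate stays the same. A minimal sketch, assuming hypothetical coefficients and exposure times:

```python
import math

def expected_count(x, exposure, beta0, beta):
    """Mean under log(mu) = beta0 + beta * x + log(exposure)."""
    return math.exp(beta0 + beta * x + math.log(exposure))

# Hypothetical values, for illustration only.
beta0, beta = -2.0, 0.4
x = 1.0

mu_10 = expected_count(x, 10.0, beta0, beta)  # observed for 10 time units
mu_20 = expected_count(x, 20.0, beta0, beta)  # observed for 20 time units

# Dividing the expected count by exposure recovers the rate exp(x * beta).
rate_10 = mu_10 / 10.0
rate_20 = mu_20 / 20.0

print(f"expected count, t = 10: {mu_10:.4f}")
print(f"expected count, t = 20: {mu_20:.4f}")
print(f"rate per unit time:     {rate_10:.4f} and {rate_20:.4f}")
```

Doubling the exposure doubles the expected count but leaves the rate per unit of time unchanged, which is the constant-rate assumption built into the offset.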
When should we include an offset?
Karen Grace-Martin at The Analysis Factor gives a
great explanation of modeling offsets or exposure in count models. Here is an excerpt:
"What this means theoretically is that by
defining an offset variable, you are only adjusting for the amount of
opportunity an event has.....A patient in for 20 days is twice as likely
to have an incident as a patient in for 10 days....There
is an assumption that the likelihood of events is not changing over
time."
In another post Karen states:
"It is often necessary to include an exposure or
offset parameter in the model to account for the amount of risk each
individual had to the event."
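To see what adjusting for opportunity does in practice, the sketch below compares two groups with unequal observation times. All data are hypothetical. It relies on a known property of Poisson regression with a single binary covariate: with a log-exposure offset, the maximum likelihood fit reproduces each group's total events divided by total exposure, so no iterative fitting is needed here. The `score` function is included only to confirm that these values solve the likelihood equations.

```python
import math

# Hypothetical records: (incident count, days observed, group); group 1 = intervention.
records = [
    (2, 10, 0), (5, 20, 0), (1, 5, 0), (4, 15, 0),
    (3, 30, 1), (2, 20, 1), (4, 40, 1), (1, 10, 1),
]

def group_totals(group):
    """Total events, total exposure, and number of units in one group."""
    rows = [(y, t) for y, t, g in records if g == group]
    events = sum(y for y, _ in rows)
    exposure = sum(t for _, t in rows)
    return events, exposure, len(rows)

events0, days0, n0 = group_totals(0)
events1, days1, n1 = group_totals(1)

# With the offset, the fitted group rates are events per day of exposure.
rate0 = events0 / days0
rate1 = events1 / days1
beta0 = math.log(rate0)
beta1 = math.log(rate1 / rate0)
irr_with_offset = math.exp(beta1)

# Ignoring exposure (every t treated as 1) compares mean counts per unit instead.
ratio_without_offset = (events1 / n1) / (events0 / n0)

def score(b0, b1):
    """Poisson score equations for log(mu) = b0 + b1 * group + log(days)."""
    s0 = sum(y - t * math.exp(b0 + b1 * g) for y, t, g in records)
    s1 = sum(g * (y - t * math.exp(b0 + b1 * g)) for y, t, g in records)
    return s0, s1

s0, s1 = score(beta0, beta1)

print(f"IRR with offset:       {irr_with_offset:.3f}")
print(f"ratio ignoring offset: {ratio_without_offset:.3f}")
print(f"score at the solution: {s0:.2e}, {s1:.2e}")
```

In this made-up data the intervention group is observed twice as long in total, so ignoring exposure gives a ratio of about 0.83 while the offset model gives about 0.42. In R, the corresponding model would typically be written as `glm(y ~ group + offset(log(days)), family = poisson)`, where `y`, `group`, and `days` are assumed column names.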
So if there are differences in exposure or
observation times for different observations relevant to the outcome of
interest then it makes sense to account for this by including offsets as
specified above. By explicitly specifying tx
we can account for differences in exposure time or observation periods
unique to each observation. Often the relevant interval of exposure may
be something other than time. Karen gives one example where it might
not make sense to include an offset or account
for time: the number of words a toddler can say. Another example
might be the number of words spelled correctly in a spelling bee. In fact,
in this case time may be endogenous: more words spelled correctly by a participant imply a
longer interval of observation, duration, or ‘exposure’. We would not
decide who is the better speller on the basis of time or a
rate such as total correct words per minute.
As all count models implicitly model rates, the implicit and most
relevant interval here would be the contest itself. In practical terms
this simply reverts to being a comparison of raw counts.
References:
Essentials of Count Data Regression. A. Colin
Cameron and Pravin K. Trivedi. (1999)
Count Data Models for Financial Data. A. Colin Cameron and Pravin K. Trivedi. (1996)
Models for Count Outcomes. Richard Williams, University of Notre Dame,
http://www3.nd.edu/~rwilliam/ . Last revised February 16, 2016
Econometric Analysis of Count Data. By Rainer Winkelmann. 2nd Edition.