Let's consider three count modeling scenarios and determine the appropriate modeling strategy.
Example 1: Suppose we are observing kids playing basketball during open gym. We have two groups of equal size, A and B. Suppose both groups play for 60 minutes, and kids in one group, A, score on average about 2.5 goals each while group B averages 5. In this case both groups of students engage in activity for the same amount of time, so there seems to be no need to include time as an offset. And it is clear that, for whatever reason, students in group B are better at scoring and therefore score more goals.
If we simulate count data to mimic this scenario (see the toy data at the end of this post), we might get descriptive statistics that look like this:
Table 1:
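To make the examples below reproducible, here is a minimal sketch of building the toy data (listed in full at the end of this post) as a data frame named counts, the name used in the model calls below, and computing the Example 1 descriptives (mean goals and goals per minute, with everyone observed for 60 minutes via TIME2):

# toy data keyed in from the listing at the end of this post
counts <- data.frame(
  COUNT = c(3,4,2,2,1,6,0,0,1,5,3,2,3,3,4,      # group A
            5,4,7,8,3,7,5,4,7,8,5,3,5,4,6),     # group B
  GROUP = rep(c("A","B"), each = 15),
  ID    = 1:30,
  TIME  = c(20,25,20,20,20,30,20,20,20,25,20,20,20,25,20,   # actual minutes played
            50,45,55,50,50,45,55,50,50,45,55,50,50,45,55),
  TIME2 = 60,                                   # example 1: both groups play 60 minutes
  TIME3 = rep(c(30,60), each = 15)              # example 2: A plays 30 minutes, B plays 60
)

# example 1 descriptives: mean goals and goals per minute by group
aggregate(cbind(goals = COUNT, goals_per_min = COUNT/TIME2) ~ GROUP,
          data = counts, FUN = mean)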
It is clear that for the period of observation (60 minutes) group B outscored A. Would we in practice discuss this in terms of rates? Total goals per 60-minute session? Or total goals per minute? In this case group A scores .0433 goals per minute vs. .09 for B. Again, we conclude based on rates that B is better at scoring goals. But most likely, despite the implicit or explicit view of rate, we would discuss these outcomes in a more practical sense: total goals for A vs. B.
We could model this difference with a Poisson regression:
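A minimal sketch of that model, using the counts data frame built above (the same call reappears in Example 2):

# Poisson regression of goal counts on group, no offset
fit1 <- glm(COUNT ~ GROUP, data = counts, family = poisson)
summary(fit1)
exp(coef(fit1)["GROUPB"])   # rate ratio for B vs. A, equal to the ratio of mean counts 5.4/2.6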
Table 2:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9555 0.1601 5.967 2.41e-09 ***
GROUPB 0.7309 0.1949 3.750 0.000177 ***
We can see from this that group B completes significantly more goals than A, at a 'rate' exp(.7309) ≈ 2.08 times that of A (roughly twice as many goals). This is basically what we get from a direct comparison of the average counts in the descriptives above.
But what if we wanted to be explicit about the interval of observation and include an offset? The way we incorporate rates into a Poisson model for counts is through the offset.
log(μ/tx) = xβ, where we are explicitly specifying a rate based on exposure time 'tx'.
Rearranging terms we get:
log(μ) - log(tx) = xβ
log(μ) = xβ + log(tx)
The term log(tx) becomes our 'offset.'
So we would do this by including log(time) as an offset in our R code:
summary(glm(COUNT ~ GROUP + offset(log(TIME2)), data = counts, family = poisson))
It turns out the estimate of .7309 for B vs. A is the same. Whether we directly compare the raw counts or run count models with or without offsets, we get the same result.
Table 3:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.1388 0.1601 -19.60 < 2e-16 ***
GROUPB 0.7309 0.1949 3.75 0.000177 ***
Example 2: Suppose again we are observing kids playing basketball during open gym. Let's still refer to them as groups A and B. Suppose after 30 minutes group A is forced to leave the court (maybe their section of the court is reserved for an art show). Before leaving they score an average of about 2.5 goals. Group B is allowed to play for 60 minutes, scoring an average of about 5 goals. This is an example where the two groups had different observation times or exposure times (i.e., playing time). It's plausible that if group A had continued to play longer they would have had more risk or opportunity to score more goals. It seems the only fair way to compare goal scoring for A vs. B is to consider court time, or exposure, i.e. the rate of goal completion.
If we use the same toy data as before (but assuming this different
scenario) we would get the following descriptives:
Table 4:
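As a sketch, these descriptives can be computed from the same toy data, now treating TIME3 (30 minutes of exposure for group A, 60 for group B) as the observation time:

# example 2 descriptives: mean goals and goals per minute of exposure (TIME3)
aggregate(cbind(goals = COUNT, goals_per_min = COUNT/TIME3) ~ GROUP,
          data = counts, FUN = mean)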
You can see that the difference in the rate of
goals scored is very small. Both teams are put on an ‘even’ playing
field when we consider rates of goal completion.
If we fail to consider exposure or the period of observation we run the following model:
summary(glm(COUNT ~ GROUP, data = counts, family = poisson))
The results will appear the same as in table 2
above. But what if we want to consider the differences in exposure or
observation time? In this case we would include an offset in our model
specification:
summary(glm(COUNT ~ GROUP + offset(log(TIME3)), data = counts, family = poisson))
Table 5
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.44569 0.16013 -15.273 <2e-16 ***
GROUPB 0.03774 0.19490 0.194 0.846
We can see from the results that when considering exposure (modeling with an offset) there is no significant difference between groups, although this could be an issue of low power and small sample size. Directionally, group B completes about 3.8% more goals per minute of exposure than A: exp(.0377) = 1.038 indicates that B completes 1.038 times as many goals as A, or (1.038 - 1)*100 = 3.8% more. We can get all of this from the descriptives by comparing the average 'rates' of goal completion for B vs. A. But the conclusion is all the same, and if we fail to consider rates or exposure in this case we get the wrong answer!
Example 3: Suppose we are again observing kids playing basketball during open gym with groups A and B. Except this time group A tires out after playing about 20 minutes or so and leaves the court after scoring 2.6 goals each on average. Group B perseveres another 30 minutes or so and scores a total of about 5 goals on average per student. In this instance there seem to be important differences between groups A and B in terms of drive and ambition that should not be equated away by accounting for time played or inclusion of an offset. Event success seems to drive time as much as time drives the event. In this instance, if we want to think of a 'rate', the rate is total goals scored per open gym session, not per minute of activity. The relevant interval is a single open gym session.
In this case time actually seems endogenous: it is confounded with the outcome, or confounded with other factors like effort and motivation that drive the outcome.
If we alter our simulated data from before to mimic this scenario we would generate the following descriptive statistics:
Table 6:
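A similar sketch for this scenario uses the TIME column (the actual minutes each kid played) as the exposure:

# example 3 descriptives: mean goals and goals per minute actually played (TIME)
aggregate(cbind(goals = COUNT, goals_per_min = COUNT/TIME) ~ GROUP,
          data = counts, FUN = mean)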
As discussed previously, this should be modeled without an offset, implying equal exposure/observation time, with the exposure being an entire open gym session. We can think of this as a model of counts, or an implied model of rates in terms of total goals per open gym session. In that case we get the same results as Table 2, indicating that group B scores more goals than A. It makes no sense in this case to include time as an offset or compare rates of goal completion between groups. But if we did model this with an offset (making this a model with an explicit specification of exposure being court time) then we would get the following:
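A sketch of that model, using the log of the TIME column (actual minutes played) as the offset:

# Poisson model treating actual playing time as the exposure
summary(glm(COUNT ~ GROUP + offset(log(TIME)), data = counts, family = poisson))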
Table 7:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.1203 0.1601 -13.241 <2e-16 ***
GROUPB -0.1054 0.1949 -0.541 0.589
In this case we find that when we model this explicitly, using playing time as exposure, we get a result indicating that group B completes fewer goals, or completes goals at a lower rate, than group A. This approach completely ignores the fact that group B persevered to play longer and ultimately complete more goals. Including an offset in this case most likely leads to the wrong conclusion.
Summary: When modeling outcomes that are counts, a rate is always implied by the nature of the probability mass function for a Poisson process. However, in practical applications we may not always think of our outcome as an explicit rate based on an explicit interval or exposure time. In some cases this distinction can be critical. When we want to explicitly consider differences in exposure, this is done through specification of an offset in our count model. Three examples were given using toy data where (1) modeling rates or including an offset made no difference in the result, (2) including an offset was required to obtain the correct conclusion, and (3) including an offset may lead to the wrong conclusion.
Conclusion: Counts always occur within some interval of time or space and therefore always have an implicit 'rate' interpretation. If counts are observed across different intervals in time or space for different observations, then those differences in exposure should be accounted for through the specification of an offset. Whether to include an offset really depends on answering two questions: (1) What is the relevant interval in time or space upon which our counts are based? (2) Is this interval different across our observations of counts?
References:
Essentials of Count Data Regression. A. Colin
Cameron and Pravin K. Trivedi. (1999)
Count Data Models for Financial Data. A. Colin Cameron and Pravin K. Trivedi. (1996)
Models for Count Outcomes. Richard Williams, University of Notre Dame,
http://www3.nd.edu/~rwilliam/ . Last revised February 16, 2016
Econometric Analysis of Count Data. By Rainer Winkelmann. 2nd Edition.
Notes: This ignores any discussion related to overdispersion or inflated zeros, which relate to other possible model specifications including negative binomial, zero-inflated Poisson (ZIP), or zero-inflated negative binomial (ZINB) models.
Simulated Toy Count Data:
COUNT GROUP ID TIME TIME2 TIME3
3 A 1 20 60 30
4 A 2 25 60 30
2 A 3 20 60 30
2 A 4 20 60 30
1 A 5 20 60 30
6 A 6 30 60 30
0 A 7 20 60 30
0 A 8 20 60 30
1 A 9 20 60 30
5 A 10 25 60 30
3 A 11 20 60 30
2 A 12 20 60 30
3 A 13 20 60 30
3 A 14 25 60 30
4 A 15 20 60 30
5 B 16 50 60 60
4 B 17 45 60 60
7 B 18 55 60 60
8 B 19 50 60 60
3 B 20 50 60 60
7 B 21 45 60 60
5 B 22 55 60 60
4 B 23 50 60 60
7 B 24 50 60 60
8 B 25 45 60 60
5 B 26 55 60 60
3 B 27 50 60 60
5 B 28 50 60 60
4 B 29 45 60 60
6 B 30 55 60 60